S1: Ch 14: Index Structures for Files
- Learning Goals:
- File organizations with additional access methods
- Single level indexes
- ...Primary, Secondary and Clustering
- Multi-level dynamic indexes
- ...B-trees, B+trees: properties, algorithms & data-structures
- Strategies for point queries: scan, index-search
- Strategies for range queries: scan, index-scan
- Using I/O cost models
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S2: Summary of Ch#5. Record Storage and File Organization
- Storage Hierarchy, Magnetic Disks
- Records and Files
- Mapping them to disk blocks
- File operations
- Organizing records in a file: unordered, ordered, hashed
- Different cost of Find(), FindNext()
- Additional data-structures used in dynamic/extendible Hashing
- Index = additional data-structure to speedup search
- Idea similar to card-catalogs (library), indexes in books
- Fast search of index via ordering index entries
- ...or multi-level index (search trees)
- Focus of Ch# 14
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S3: 14.1 Index: Basic Concepts and Classification
- What is an index? f : Indexing Field Value -- > disk page(s)
- Collection of < K(i), P(i) > (1NF)
- Distinct from datafile, though shares the indexing field
- Search procedure: (a) search index : K(i) -- > set of P(i)s
- ...(b) search contents of the Pages P(i)s in main memory
- Search time w/ index: log(bi) + |p(i)s| I/Os , bi = blocks in index
- 3 Types of Single-level indexes
- 1. Primary Index (Fig. 14.1, pp 458)
- Datafile ordered by indexing field = primary key
- nondense, i.e. #index-entries < #dat records
- One index entry per disk page of datafile
- Takes less space, faster search but slow insert
- ...Insert needs shifting, update of anchor records
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S4: Single Level Indexes (Contd.)
- 2. Clustered Index (Fig. 14.2, pp 460)
- Datafile ordered by indexing field
- Indexing field value NOT UNIQUE for each record
- One entry per distinct value of indexing field
- Q? Is it a dense index?
- Q? Are insert() easier than primary index?
- ...need shifting or free space per block (Fig. 14.3, pp 461)
- 3. Secondary Index (Fig. 14.4, pp 463)
- Datafile may not be ordered by indexing field
- Dense index, i.e. an index entry per data record
- Indexing field may be a key or a non-key field
- Q?How to handle duplicate values of indexing field?
- ...dense index, variable length index record, or
- ...extra level of indirection (see Fig. 14.5)
- Summary table 14.1, 14.2, pp. 465-6
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S5: Single Level Indexes: Exercises
- Q? How many indices of each type can be defined on a datafile?
- Heap, Hashed: Many secondary indexes
- Sortedfile: One primary/clustered, Many secondary indexes
- Fully inverted file = one secondary index per field
- Q? Which index provides most gain in search time ?
- Secondary: O(bd/2 - log(bi)); Primary: O(log(bd) - log(bi))
- Cost Model Ex. 1 & 2 (pp 459, 462): r = 30,000;
- ...B = 1 Kbyte; R = 100 byte; V = 9 byte; Pointer = 6 bytes
- ...Compute I/O Cost Saving from Primary Index.
- ...Compute I/O cost savings from a Secondary index (unordered datafile).
- ...Hints: calculate bfr(d), bd, #index entries, bfr(i), bi
- Q? What is the cost of FindNext() for different indexes?
- O(1) for all, since index is sorted.
- Table 14.1-2 (pp 465-6), give row/col. headers, ask for entries.
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S6: Multilevel Indexes (Fig. 14.6, pp 467)
- Concept: Speedup Step(a), i.e. index search
- By adding a 2nd level index to 1st level index file
- How many levels can a multilevel index have?
- ...log(bi, base = fo), where fo = fanout = bfr(i)
- Search procedure: level traversal (Alg. 14.1, pp 468)
- Search time: log(bi, base = fo)) vs. log(bi) vs. log(bd) vs. bd/2
- Commercial ISAM = 2 level primary index, ordered file
- Ex. computer #levels and search time for multilevel index to
- ...unordered datafile, dense secondary index, r = 300 million.
- ...B= 1 Kbyte; R= 100 byte; V= 9 byte; Pointer= 6 bytes
- Q? What kind of index is 2nd level index?
- ...Primary, Clustered or Secondary?
- Summary: Faster Search, But Slower Insert(), Delete()
- To reduce shifting = > leave freespace in blocks OR
- ...use overflow blocks
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S7: 14.3 Dynamic Multilevel Index: B-tree, B+tree
- Definitions: Tree, nodes, root, leafs, internal nodes, level (node)
- ...parent-child relationship, descendants, subtree(node)
- ...Fig. 14.7 (pp 469)
- Search Trees of order p (see Fig. 14.9/ pp 471)
- variation of multilevel indexes
- Node has (p-1) search values and p pointers
- ...in order < P1, K1, P2, K2, ..., K(q-1), Pq), q < = p
- ...where Pi = pointer to child / null; Ki = key value
- ...For simplicity, assume search values to be unique
- Properties (see Fig. 14.8, pp 470)
- ...Given key value can occur in a unique child(node)
- ...1. Within each node, K1 < K2 < ... < K(q-1)
- ...2. all X in subtree(Pi); K(i-1) < X < Ki for 1 < i < q
- ... X < Ki for i = 1; and K(i-1) < X for i = q
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S8: B-tree, B+tree (Fig. 14.10, 14.11 on pp 473, 475)
- B-tree = Search tree + Balanced + free space mangement
- Node = {P1, < K1, Pr1 > , ..., P(q-1), < K(q-1), Pr(q-1) > , Pq}
- ...Pr(i) = data pointer, Pi = tree pointer or data pointer
- Non-root nodes have ceil(p/2) < = q < = p (avg. 69% full)
- Root has > 2 pointers, unless it has no children.
- Leaf nodes have Pi = Null.
- Key values in nodes are UNIQUE
- B+tree is like B-tree, except for
- Leaf and non-leafs have different formats (Higher fan-out):
- ...Non-leaf node do not have Pr(i), Pi is always tree pointer
- ...Leaf nodes do not have Pi (which were Null anyway!)
- Leaf nodes are threaded (for sequential access)
- Key values NON-LEAF nodes are unique
- B+tree used in most commercial databases
- Q? Is B+tree an index file distinct from datafile?
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S9: Comparison + Search() on B+trees
- Q? Compare B-tree and B+-tree on fanout,
- ...Avg. #nodes, #entries, #pointers @ levels root, 1, 2, 3 for
- ...V= 9 bytes, P= 6 bytes, B= 512 bytes, 69% utilization
- fo (17,23), Level 2 (4912,11638), Level 3 (83520,267674)
- Search(key = k, Page), called with root-page (Alg. 14.2/pp 477)
- ...binary search on page within main memory
- ...Case K = Ki in the page = > Follow Pr(i)
- ...Case K in subtree(notNull Pi) = > Search(K, page(Pi))
- ...Case K in subtree(Null Pi) = > not in tree
- Time = depth of B+tree = log(r, base = fo), i.e. 2 to 4
- Ex. List strategies for point and range search using B+tree
- ...as primary (secondary) index.
- Primary: Find first record, scan datafile from there
- Secondary: Find first index-leaf, scan index-leafs
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S10: Insert(), Delete() with B+trees
- Complex recursive algorithms, Better use intuition!
- Insert(key = k, ...), see Alg. 14.3 (pp 478)
- Produce a legal B+tree compatible with following steps:
- 1. Search(key = k, ...) to locate right Leaf
- 2. Case Leaf has space = > insert in leaf
- 2. Case Leaf is full = > insert; split leaf into 2 pages
- 3. Parent(Leaf) is updated to add pointer to new leaf
- ...overflow = > recursion of {insert, split}up the level
- Ex. Fig. 14.12 (pp 480): Insert 8, 5, 1, 7, 3, 12, 9, 6
- ...insert(1) = > splits root
- ...insert(12) , insert(6) = > split propagates
- Deletion from a B+tree(order 3): 5, 12, 9
- Ex. 14.13 (pp 481)
- Ex. How many distinct B+ tree are possible for
- p = 3, depth < = 3
- Draw all the trees.
- Identify sequence of trees from inserting
- ...a sorted list of keys
- ...an unordered list of keys
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S11: 14.4 Index on Multiple Keys
- Motivation
- Star joins in data warehouses
- Applications - Geographic Information, multimedia
- keys are not linear
- Idea 1: 14.4.1 Linearize
- Order (linear) keys -Ex. (pp. 483)
- Idea 2: Partitioned Hashing
- hash-bucket-address = concat( h1(key1), ..., hn(keyN) )
- Useful for point query, not useful for range query
- Idea 3: Truly multi-dim index
- Grid Files (Fig. 14.14, pp. 484)
- Example of search
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)