Indexing | Sahithyan's Notes

Indexing and hashing is used to speed up data retrieval in databases. Instead of scanning an entire file, indexes provide fast access paths to records using keys. When a file is modified, every index on the file must be updated.

Search Key

An attribute or set of attributes used to look up records.

Index File

A separate data structure that stores pairs (index records) of the form:

(search-key, pointer to record)

Much smaller than the original file.

Index-Sequential File

Ordered sequential file with a primary index.

B+ Tree Index File

Dynamic, self-balancing tree structure for indexing. An alternative to indexed-sequential files.

Automatically reorganizes after inserts/deletes (no file rebuild). Avoids block overflows. Keeps all keys in sorted order. Excellent for range and equality queries.

Slightly higher space and update overhead. Still, dominant indexing method in modern DBMS.

Index Evaluation Metrics

When comparing index types, measure:

Access time: how quickly data is found.
Insertion/deletion time: update performance.
Space overhead: extra storage used by index.
Supported access types: exact-match or range-based.

Dense Index

One index entry per search-key value. Points directly to the record. Faster lookups. More space. More maintenance cost.

Sparse Index

Indexes only some search-key values. Can only be used when data is physically ordered on key.

Less space. Easier maintenance. Slow lookups.

To find a key $K$ :

Find the index entry with largest key $\lt K$ .
Scan sequentially in the file.

Ordered Index

Keys stored in sorted order.

Primary Index

Aka. clustering index. Order of index matches file order. Usually (but not always) based on the primary key. One per file. Sequential scan is efficient.

Secondary Index

Aka. non-clustering index. Based on non-ordering attribute(s). Can be many per file. Must be dense. Sequential scan is inefficient.

Hash Index

Keys mapped to buckets using a hash function.

Multilevel Index

When a sparse index is built from a index file. Named outer index and inner index respectively.

All indices must be updated during insert/delete.

Used when the inner index is too large to fit in memory.

Bitmap Index

A bitmap is an array of bits. Can only be used when the attribute has a small domain. One bitmap per attribute value.

Hashing

Static Hashing

Each record is stored in a bucket determined by a hash function:

h(key) → bucket\ address

Each bucket is typically one disk block.
Used for equality search (not range queries).
Example: h(DeptName) = sum(char codes) mod 10.

Bucket Overflow

Occurs when too few buckets or non-uniform data distribution.

Handling Overflow

Overflow chaining: overflow buckets linked in a list.
Called closed hashing.
Open hashing (without overflow buckets) is unsuitable for DBMS use.

Hash Function Design

A good hash function:

Distributes keys uniformly across buckets.
Minimizes overflow chains.
Avoids clustering of similar key values.

Poor hash functions map many keys to the same bucket → slow access.