Indexes: Accelerating Data Lookups¶
As tables grow to thousands or millions of records, query performance becomes critical. Indexes are data structures that enable fast lookups by specific attributes, dramatically reducing query times from scanning every row to near-instantaneous retrieval.
Think of an index like the index at the back of a textbook: instead of reading every page to find a topic, you look it up in the index and jump directly to the relevant pages. Database indexes work the same way—they create organized lookup structures that point directly to matching records.
Prerequisites¶
This chapter assumes familiarity with:
Primary Keys — Understanding unique entity identification
Foreign Keys — Understanding table relationships
Tables — Basic table declaration syntax
How Indexes Are Created in DataJoint¶
In DataJoint, indexes are created through three mechanisms:
| Mechanism | Index Type | Purpose |
|---|---|---|
| Primary key | Unique index (automatic) | Fast lookups by entity identifier |
| Foreign key | Secondary index (automatic) | Fast joins and referential integrity checks |
| Explicit declaration | Secondary index (manual) | Fast lookups by frequently queried attributes |
The first two mechanisms are automatic—every table has a primary key index, and foreign keys create indexes unless a suitable one already exists. The third mechanism gives you control over additional indexes for your specific query patterns.
Demonstrating Index Performance¶
Let’s create a table with many entries and measure the performance difference between indexed and non-indexed lookups.
import datajoint as dj
import random
schema = dj.Schema('indexes')Consider a mouse tracking scenario where each mouse has a lab-specific ID (primary key) and a separate tag ID issued by the animal facility:
@schema
class Mouse(dj.Manual):
definition = """
mouse_id : int # lab-specific ID
---
tag_id : int # animal facility ID
"""def populate_mice(table, n=200_000):
"""Insert random mouse records for testing."""
table.insert(
((random.randint(1, 1_000_000_000), random.randint(1, 1_000_000_000))
for i in range(n)),
skip_duplicates=True
)
populate_mice(Mouse())Mouse()Primary Key Lookup (Fast)¶
Searching by mouse_id uses the primary key index—this is extremely fast:
%%timeit -n6 -r3
# Fast: Uses the primary key index
(Mouse() & {'mouse_id': random.randint(0, 999_999)}).fetch()Non-Indexed Lookup (Slow)¶
Searching by tag_id requires scanning every row in the table—this is slow:
%%timeit -n6 -r3
# Slow: Requires a full table scan
(Mouse() & {'tag_id': random.randint(0, 999_999)}).fetch()Declaring Secondary Indexes¶
To speed up searches on non-primary-key attributes, you can declare secondary indexes explicitly in the table definition.
Syntax¶
Indexes are declared below the --- line in the table definition:
index(attr1, ..., attrN) # Regular index
unique index(attr1, ..., attrN) # Unique index (enforces uniqueness)Example: Adding a Unique Index¶
Since each mouse should have a unique tag_id, we can add a unique index:
@schema
class Mouse2(dj.Manual):
definition = """
mouse_id : int # lab-specific ID
---
tag_id : int # animal facility ID
unique index(tag_id)
"""populate_mice(Mouse2())Now both types of lookups are equally fast:
%%timeit -n6 -r3
# Fast: Uses the primary key index
(Mouse2() & {'mouse_id': random.randint(0, 999_999)}).fetch()%%timeit -n6 -r3
# Fast: Uses the secondary index on tag_id
(Mouse2() & {'tag_id': random.randint(0, 999_999)}).fetch()Composite Index Ordering¶
When a primary key (or index) contains multiple attributes, the order matters. The index can only be used efficiently when searching from the leftmost attribute.
This is analogous to searching in a dictionary that orders words alphabetically:
Searching by the first letters is easy (use the index)
Searching by the last letters requires scanning every word
Let’s demonstrate with a multi-attribute primary key:
@schema
class Rat(dj.Manual):
definition = """
lab_name : char(16) # name of the lab
rat_id : int unsigned # lab-specific rat ID
---
date_of_birth = null : date # birth date (optional)
"""def populate_rats(table):
"""Insert random rat records for testing."""
lab_names = ("Cajal", "Kandel", "Moser", "Wiesel")
dates = (None, "2024-10-01", "2024-10-02", "2024-10-03", "2024-10-04")
for date_of_birth in dates:
table.insert(
((random.choice(lab_names), random.randint(1, 1_000_000_000), date_of_birth)
for i in range(100_000)),
skip_duplicates=True
)
populate_rats(Rat)Rat()The primary key creates an index on (lab_name, rat_id). This means:
| Query Pattern | Uses Index? | Performance |
|---|---|---|
lab_name only | Yes | Fast |
lab_name + rat_id | Yes | Fast |
rat_id only | No | Slow (full scan) |
%%timeit -n2 -r10
# Fast: Uses the primary key index (both attributes)
(Rat() & {'rat_id': 300, 'lab_name': 'Cajal'}).fetch()%%timeit -n2 -r10
# Slow: rat_id is not first in the index, requires full table scan
(Rat() & {'rat_id': 300}).fetch()Adding Indexes for Common Query Patterns¶
If we frequently need to search by rat_id alone or by date_of_birth, we should add explicit indexes:
@schema
class Rat2(dj.Manual):
definition = """
lab_name : char(16) # name of the lab
rat_id : int unsigned # lab-specific rat ID
---
date_of_birth = null : date # birth date (optional)
index(rat_id) # enables fast lookup by rat_id alone
index(date_of_birth) # enables fast lookup by date
"""populate_rats(Rat2())%%timeit -n3 -r6
# Fast: Uses the secondary index on rat_id
(Rat2() & {'rat_id': 300}).fetch()%%timeit -n2 -r2
# Fast: Uses the secondary index on date_of_birth
len(Rat2 & 'date_of_birth = "2024-10-02"')String Pattern Matching and Indexes¶
Indexes on string columns follow similar rules. Pattern searches with LIKE can only use an index when the starting characters are specified:
%%timeit -n2 -r2
# Fast: Exact match uses the index
len(Rat & 'lab_name="Cajal"')%%timeit -n2 -r2
# Slow: Wildcard at start prevents index use
len(Rat & 'lab_name LIKE "%jal"')Viewing Table Indexes¶
Use the describe() method to see all indexes defined on a table:
Rat2.describe();Equivalent SQL Syntax¶
For reference, here’s how indexes are declared in standard SQL:
(DataJoint)
@schema
class Mouse(dj.Manual):
definition = """
mouse_id : int
---
tag_id : int
unique index(tag_id)
"""(Equivalent SQL)
CREATE TABLE mouse (
mouse_id INT NOT NULL,
tag_id INT NOT NULL,
PRIMARY KEY (mouse_id),
UNIQUE INDEX (tag_id)
);You can also add indexes to existing tables in SQL:
-- Add a regular index
CREATE INDEX idx_date ON rat (date_of_birth);
-- Add a unique index
CREATE UNIQUE INDEX idx_tag ON mouse (tag_id);
-- Remove an index
DROP INDEX idx_tag ON mouse;Quiz¶
# Check the table definition to see all indexes
Rat2.describe();Answer
Three indexes:
Primary key index on
(lab_name, rat_id)— automaticSecondary index on
rat_id— explicitSecondary index on
date_of_birth— explicit
Summary¶
Indexes are essential for query performance in tables with many records:
Primary keys automatically create unique indexes for fast entity lookups
Foreign keys automatically create secondary indexes for fast joins
Explicit indexes can be added for frequently queried non-key attributes
Composite index order matters — only leftmost attributes benefit from the index
Unique indexes enforce uniqueness in addition to speeding up lookups
# To re-run the notebook, drop the schema to create anew
# schema.drop()