Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Indexes

Indexes: Accelerating Data Lookups

As tables grow to thousands or millions of records, query performance becomes critical. Indexes are data structures that enable fast lookups by specific attributes, dramatically reducing query times from scanning every row to near-instantaneous retrieval.

Think of an index like the index at the back of a textbook: instead of reading every page to find a topic, you look it up in the index and jump directly to the relevant pages. Database indexes work the same way—they create organized lookup structures that point directly to matching records.

Prerequisites

This chapter assumes familiarity with:

  • Primary Keys — Understanding unique entity identification

  • Foreign Keys — Understanding table relationships

  • Tables — Basic table declaration syntax

How Indexes Are Created in DataJoint

In DataJoint, indexes are created through three mechanisms:

MechanismIndex TypePurpose
Primary keyUnique index (automatic)Fast lookups by entity identifier
Foreign keySecondary index (automatic)Fast joins and referential integrity checks
Explicit declarationSecondary index (manual)Fast lookups by frequently queried attributes

The first two mechanisms are automatic—every table has a primary key index, and foreign keys create indexes unless a suitable one already exists. The third mechanism gives you control over additional indexes for your specific query patterns.

Demonstrating Index Performance

Let’s create a table with many entries and measure the performance difference between indexed and non-indexed lookups.

import datajoint as dj
import random

schema = dj.Schema('indexes')

Consider a mouse tracking scenario where each mouse has a lab-specific ID (primary key) and a separate tag ID issued by the animal facility:

@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id : int  # lab-specific ID
    ---
    tag_id : int  # animal facility ID
    """
def populate_mice(table, n=200_000):
    """Insert random mouse records for testing."""
    table.insert(
        ((random.randint(1, 1_000_000_000), random.randint(1, 1_000_000_000)) 
         for i in range(n)), 
        skip_duplicates=True
    )

populate_mice(Mouse())
Mouse()

Primary Key Lookup (Fast)

Searching by mouse_id uses the primary key index—this is extremely fast:

%%timeit -n6 -r3

# Fast: Uses the primary key index
(Mouse() & {'mouse_id': random.randint(0, 999_999)}).fetch()

Non-Indexed Lookup (Slow)

Searching by tag_id requires scanning every row in the table—this is slow:

%%timeit -n6 -r3

# Slow: Requires a full table scan
(Mouse() & {'tag_id': random.randint(0, 999_999)}).fetch()

Declaring Secondary Indexes

To speed up searches on non-primary-key attributes, you can declare secondary indexes explicitly in the table definition.

Syntax

Indexes are declared below the --- line in the table definition:

index(attr1, ..., attrN)           # Regular index
unique index(attr1, ..., attrN)    # Unique index (enforces uniqueness)

Example: Adding a Unique Index

Since each mouse should have a unique tag_id, we can add a unique index:

@schema
class Mouse2(dj.Manual):
    definition = """
    mouse_id : int  # lab-specific ID
    ---
    tag_id : int  # animal facility ID
    unique index(tag_id)
    """
populate_mice(Mouse2())

Now both types of lookups are equally fast:

%%timeit -n6 -r3

# Fast: Uses the primary key index
(Mouse2() & {'mouse_id': random.randint(0, 999_999)}).fetch()
%%timeit -n6 -r3

# Fast: Uses the secondary index on tag_id
(Mouse2() & {'tag_id': random.randint(0, 999_999)}).fetch()

Composite Index Ordering

When a primary key (or index) contains multiple attributes, the order matters. The index can only be used efficiently when searching from the leftmost attribute.

This is analogous to searching in a dictionary that orders words alphabetically:

  • Searching by the first letters is easy (use the index)

  • Searching by the last letters requires scanning every word

Let’s demonstrate with a multi-attribute primary key:

@schema
class Rat(dj.Manual):
    definition = """
    lab_name : char(16)           # name of the lab
    rat_id : int unsigned         # lab-specific rat ID
    ---
    date_of_birth = null : date   # birth date (optional)
    """
def populate_rats(table):
    """Insert random rat records for testing."""
    lab_names = ("Cajal", "Kandel", "Moser", "Wiesel")
    dates = (None, "2024-10-01", "2024-10-02", "2024-10-03", "2024-10-04")
    for date_of_birth in dates:
        table.insert(
            ((random.choice(lab_names), random.randint(1, 1_000_000_000), date_of_birth) 
             for i in range(100_000)), 
            skip_duplicates=True
        )

populate_rats(Rat)
Rat()

The primary key creates an index on (lab_name, rat_id). This means:

Query PatternUses Index?Performance
lab_name onlyYesFast
lab_name + rat_idYesFast
rat_id onlyNoSlow (full scan)
%%timeit -n2 -r10

# Fast: Uses the primary key index (both attributes)
(Rat() & {'rat_id': 300, 'lab_name': 'Cajal'}).fetch()
%%timeit -n2 -r10

# Slow: rat_id is not first in the index, requires full table scan
(Rat() & {'rat_id': 300}).fetch()

Adding Indexes for Common Query Patterns

If we frequently need to search by rat_id alone or by date_of_birth, we should add explicit indexes:

@schema
class Rat2(dj.Manual):
    definition = """
    lab_name : char(16)           # name of the lab
    rat_id : int unsigned         # lab-specific rat ID
    ---
    date_of_birth = null : date   # birth date (optional)

    index(rat_id)                 # enables fast lookup by rat_id alone
    index(date_of_birth)          # enables fast lookup by date
    """
populate_rats(Rat2())
%%timeit -n3 -r6

# Fast: Uses the secondary index on rat_id
(Rat2() & {'rat_id': 300}).fetch()
%%timeit -n2 -r2

# Fast: Uses the secondary index on date_of_birth
len(Rat2 & 'date_of_birth = "2024-10-02"')

String Pattern Matching and Indexes

Indexes on string columns follow similar rules. Pattern searches with LIKE can only use an index when the starting characters are specified:

%%timeit -n2 -r2

# Fast: Exact match uses the index
len(Rat & 'lab_name="Cajal"')
%%timeit -n2 -r2

# Slow: Wildcard at start prevents index use
len(Rat & 'lab_name LIKE "%jal"')

Viewing Table Indexes

Use the describe() method to see all indexes defined on a table:

Rat2.describe();

Equivalent SQL Syntax

For reference, here’s how indexes are declared in standard SQL:

(DataJoint)

@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id : int
    ---
    tag_id : int
    unique index(tag_id)
    """

(Equivalent SQL)

CREATE TABLE mouse (
    mouse_id INT NOT NULL,
    tag_id INT NOT NULL,
    PRIMARY KEY (mouse_id),
    UNIQUE INDEX (tag_id)
);

You can also add indexes to existing tables in SQL:

-- Add a regular index
CREATE INDEX idx_date ON rat (date_of_birth);

-- Add a unique index
CREATE UNIQUE INDEX idx_tag ON mouse (tag_id);

-- Remove an index
DROP INDEX idx_tag ON mouse;

Quiz

# Check the table definition to see all indexes
Rat2.describe();

Summary

Indexes are essential for query performance in tables with many records:

  1. Primary keys automatically create unique indexes for fast entity lookups

  2. Foreign keys automatically create secondary indexes for fast joins

  3. Explicit indexes can be added for frequently queried non-key attributes

  4. Composite index order matters — only leftmost attributes benefit from the index

  5. Unique indexes enforce uniqueness in addition to speeding up lookups

# To re-run the notebook, drop the schema to create anew
# schema.drop()