Skip to article frontmatterSkip to article content

Design Alterations

Table indexes are data structures that allow fast lookups by an indexed attribute or combination of attributes.

In DataJoint, indexes are created by one of the three mechanisms:

  1. Primary key
  2. Foreign key
  3. Explicitly defined indexes

The first two mechanisms are obligatory. Every table has a primary key, which serves as an unique index. Therefore, restrictions by a primary key are very fast. Foreign keys create additional indexes unless a suitable index already exists.

Let’s test this principle. Let’s create a table with a 10,000 entries and compare lookup times:

import datajoint as dj
schema = dj.schema('dimitri_indexes')
schema.drop() # drop previous schema definition (if any) and create anew
schema = dj.schema('dimitri_indexes')
[2022-10-04 18:21:37,285][INFO]: Connecting dimitri@db.ust-data-sci.net:3306
[2022-10-04 18:21:37,668][INFO]: Connected dimitri@db.ust-data-sci.net:3306
Proceed to delete entire schema `dimitri_indexes`? [yes, No]: yes

Let’s say a mouse in the lab has a lab-specific ID but it also has a separate id issued by the animal facility.

@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id : int  # lab-specific ID
    ---
    tag_id : int  # animal facility ID
    """
import random
def populate_mice(table, n=200_000):
    """insert a bunch of mice"""
    table.insert(
        ((random.randint(1,1000_000_000), random.randint(1,1000_000_000)) 
         for i in range(n)), skip_duplicates=True)
populate_mice(Mouse())
Mouse()
Loading...
%%timeit -n6 -r3

# efficient! Uses the primary key
(Mouse() & {'mouse_id': random.randint(0, 999_999)}).fetch()
37.4 ms ± 1.83 ms per loop (mean ± std. dev. of 3 runs, 6 loops each)
%%timeit -n6 -r3

# inefficient! Requires a full table scan
(Mouse() & {'tag_id': random.randint(0, 999_999)}).fetch()
84.4 ms ± 11.8 ms per loop (mean ± std. dev. of 3 runs, 6 loops each)

The indexed searches are much faster!

To make searches faster on fields other than the primary key or a foreign key, you can add a secondary index explicitly.

Regular indexes are declared as index(attr1, ..., attrN) on a separate line anywhere in the table declration (below the primary key divide).

Indexes can be declared with unique constraint as unique index (attr1, ..., attrN).

Let’s redeclare the table with a unique index on tag_id.

@schema
class Mouse2(dj.Manual):
    definition = """
    mouse_id : int  # lab-specific ID
    ---
    tag_id : int  # animal facility ID
    unique index (tag_id)
    """
populate_mice(Mouse2())
Mouse2()
Loading...

Now both types of searches are equally efficient!

%%timeit -n6 -r3

#efficient! Uses the primary key
(Mouse2() & {'mouse_id': random.randint(0, 999_999)}).fetch()
36.9 ms ± 1.23 ms per loop (mean ± std. dev. of 3 runs, 6 loops each)
%%timeit -n6 -r3

#efficient! Uses the seconary index on tag_id
(Mouse2() & {'tag_id': random.randint(0, 999_999)}).fetch()
36.4 ms ± 273 µs per loop (mean ± std. dev. of 3 runs, 6 loops each)

Let’s now imagine that rats in the Rat table are identified by the combination of lab the lab_name and rat_id in each lab:

@schema
class Rat(dj.Manual):
    definition = """
    lab_name : char(16) 
    rat_id : int unsigned # lab-specific ID
    ---
    date_of_birth = null : date
    """
def populate_rats(table):
    lab_names = ("Cajal", "Kandel", "Moser", "Wiesel")
    for date_of_birth in (None, "2019-10-01", 
                          "2019-10-02", "2019-10-03", "2019-10-04"):
        table.insert((
            (random.choice(lab_names), random.randint(1, 1_000_000), date_of_birth) 
                   for i in range(100_000)), skip_duplicates=True)
populate_rats()
Rat()
Loading...

Note that dispite the fact that rat_id is in the index, search by rat_id alone are not helped by the index because it is not first in the index. This is similar to search for a word in a dictionary that orders words alphabetically. Searching by the first letters of a word is easy but searching by the last few letters of a word requires scanning the whole dictionary.

In this table, the primary key is a unique index on the combination (lab_id, rat_id). Therefore searches on these attributes or on lab_id alone are fast. But this index cannot help searches on rat_id alone:

%%timeit -n2 -r10

# inefficient!  Requires full table scan.
(Rat() & {'rat_id': 300}).fetch()
140 ms ± 32.1 ms per loop (mean ± std. dev. of 10 runs, 2 loops each)
%%timeit -n2 -r10

# efficient! Uses the primary key
(Rat() & {'rat_id': 300, 'lab_name': 'Cajal'}).fetch()
36.7 ms ± 981 µs per loop (mean ± std. dev. of 10 runs, 2 loops each)
%%timeit -n2 -r10

# inefficient! Requires a full table scan
len(Rat & {'rat_id': 500})
127 ms ± 21 ms per loop (mean ± std. dev. of 10 runs, 2 loops each)

Pattern searches in strings can benefit from an index when the starting characters are specified.

%%timeit -n2 -r2

# efficient! Uses the primary key
len(Rat & 'lab_name LIKE "Caj%"')
298 ms ± 20 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
%%timeit -n2 -r2

# inefficient! requires a full table scan
len(Rat & 'lab_name LIKE "%jal"')
489 ms ± 127 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

Similarly, searching by the date requires an inefficient full-table scan:

%%timeit -n3 -r6

len(Rat & 'date_of_birth > "2019-10-02"')
384 ms ± 67.9 ms per loop (mean ± std. dev. of 6 runs, 3 loops each)

To speed up searches by the rat_id and date_of_birth, we can explicit indexes to Rat:

@schema
class Rat2(dj.Manual):
    definition = """
    lab_name : char(16) 
    rat_id : int unsigned # lab-specific ID
    ---
    date_of_birth = null : date

    index(rat_id)
    index(date_of_birth)
    """
populate_rats(Rat2())
%%timeit -n3 -r6

# efficient!  uses index on rat_id
(Rat2() & {'rat_id': 300}).fetch()
37.6 ms ± 2.64 ms per loop (mean ± std. dev. of 6 runs, 3 loops each)
%%timeit -n2 -r2

# efficient! uses index on date_of_birth
len(Rat2 & 'date_of_birth > "2019-10-02"')
262 ms ± 41.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

Quiz: How many indexes does the table Rat have?

Rat.describe();

Answer

Three: primary key, rat_id, date_of_birth

Indexes in SQL

import json
import pymysql 

pymysql.install_as_MySQLdb()

with open('cred.json') as f:
    creds = json.load(f)

connection_string = "mysql://{user}:{password}@{host}".format(**creds)

%load_ext sql
%config SqlMagic.autocommit=True
%sql $connection_string
%%sql

create database dimitri_indexes
 * mysql://dimitri:***@db.ust-data-sci.net
(pymysql.err.ProgrammingError) (1007, "Can't create database 'dimitri_indexes'; database exists")
[SQL: create database dimitri_indexes]
(Background on this error at: https://sqlalche.me/e/14/f405)
%%sql

SHOW TABLES in dimitri_indexes;
Loading...
%%sql

CREATE TABLE dimitri_indexes.mouse(
mouse_id  int  NOT NULL,
tag_id int NOT NULL,
primary key(mouse_id)
)
 * mysql://dimitri:***@db.ust-data-sci.net
(pymysql.err.OperationalError) (1050, "Table 'mouse' already exists")
[SQL: CREATE TABLE dimitri_indexes.mouse(
mouse_id  int  NOT NULL,
tag_id int NOT NULL,
primary key(mouse_id)
)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
%%sql

drop table dimitri_indexes.mouse
 * mysql://dimitri:***@db.ust-data-sci.net
0 rows affected.
[]
%%sql

CREATE TABLE dimitri_indexes.mouse(
mouse_id  int  NOT NULL,
tag_id int NOT NULL,
primary key(mouse_id),
index (tag_id)
)
 * mysql://dimitri:***@db.ust-data-sci.net
0 rows affected.
[]
%%sql

CREATE UNIQUE INDEX mouse_idx ON dimitri_indexes.mouse (tag_id)
 * mysql://dimitri:***@db.ust-data-sci.net
0 rows affected.
[]
%%sql

DROP INDEX mouse_idx ON dimitri_indexes.mouse;
 * mysql://dimitri:***@db.ust-data-sci.net
0 rows affected.
[]