Table indexes are data structures that allow fast lookups by an indexed attribute or combination of attributes.
In DataJoint, indexes are created by one of the three mechanisms:
- Primary key
- Foreign key
- Explicitly defined indexes
The first two mechanisms are obligatory. Every table has a primary key, which serves as an unique index. Therefore, restrictions by a primary key are very fast. Foreign keys create additional indexes unless a suitable index already exists.
Let’s test this principle. Let’s create a table with a 10,000 entries and compare lookup times:
import datajoint as dj
schema = dj.schema('dimitri_indexes')
schema.drop() # drop previous schema definition (if any) and create anew
schema = dj.schema('dimitri_indexes')
[2022-10-04 18:21:37,285][INFO]: Connecting dimitri@db.ust-data-sci.net:3306
[2022-10-04 18:21:37,668][INFO]: Connected dimitri@db.ust-data-sci.net:3306
Proceed to delete entire schema `dimitri_indexes`? [yes, No]: yes
Let’s say a mouse in the lab has a lab-specific ID but it also has a separate id issued by the animal facility.
@schema
class Mouse(dj.Manual):
definition = """
mouse_id : int # lab-specific ID
---
tag_id : int # animal facility ID
"""
import random
def populate_mice(table, n=200_000):
"""insert a bunch of mice"""
table.insert(
((random.randint(1,1000_000_000), random.randint(1,1000_000_000))
for i in range(n)), skip_duplicates=True)
populate_mice(Mouse())
Mouse()
%%timeit -n6 -r3
# efficient! Uses the primary key
(Mouse() & {'mouse_id': random.randint(0, 999_999)}).fetch()
37.4 ms ± 1.83 ms per loop (mean ± std. dev. of 3 runs, 6 loops each)
%%timeit -n6 -r3
# inefficient! Requires a full table scan
(Mouse() & {'tag_id': random.randint(0, 999_999)}).fetch()
84.4 ms ± 11.8 ms per loop (mean ± std. dev. of 3 runs, 6 loops each)
The indexed searches are much faster!
To make searches faster on fields other than the primary key or a foreign key, you can add a secondary index explicitly.
Regular indexes are declared as index(attr1, ..., attrN)
on a separate line anywhere in the table declration (below the primary key divide).
Indexes can be declared with unique constraint as unique index (attr1, ..., attrN)
.
Let’s redeclare the table with a unique index on tag_id
.
@schema
class Mouse2(dj.Manual):
definition = """
mouse_id : int # lab-specific ID
---
tag_id : int # animal facility ID
unique index (tag_id)
"""
populate_mice(Mouse2())
Mouse2()
Now both types of searches are equally efficient!
%%timeit -n6 -r3
#efficient! Uses the primary key
(Mouse2() & {'mouse_id': random.randint(0, 999_999)}).fetch()
36.9 ms ± 1.23 ms per loop (mean ± std. dev. of 3 runs, 6 loops each)
%%timeit -n6 -r3
#efficient! Uses the seconary index on tag_id
(Mouse2() & {'tag_id': random.randint(0, 999_999)}).fetch()
36.4 ms ± 273 µs per loop (mean ± std. dev. of 3 runs, 6 loops each)
Let’s now imagine that rats in the Rat
table are identified by the combination of lab the lab_name
and rat_id
in each lab:
@schema
class Rat(dj.Manual):
definition = """
lab_name : char(16)
rat_id : int unsigned # lab-specific ID
---
date_of_birth = null : date
"""
def populate_rats(table):
lab_names = ("Cajal", "Kandel", "Moser", "Wiesel")
for date_of_birth in (None, "2019-10-01",
"2019-10-02", "2019-10-03", "2019-10-04"):
table.insert((
(random.choice(lab_names), random.randint(1, 1_000_000), date_of_birth)
for i in range(100_000)), skip_duplicates=True)
populate_rats()
Rat()
Note that dispite the fact that rat_id
is in the index, search by rat_id
alone are not helped by the index because it is not first in the index. This is similar to search for a word in a dictionary that orders words alphabetically. Searching by the first letters of a word is easy but searching by the last few letters of a word requires scanning the whole dictionary.
In this table, the primary key is a unique index on the combination (lab_id, rat_id)
. Therefore searches on these attributes or on lab_id
alone are fast. But this index cannot help searches on rat_id
alone:
%%timeit -n2 -r10
# inefficient! Requires full table scan.
(Rat() & {'rat_id': 300}).fetch()
140 ms ± 32.1 ms per loop (mean ± std. dev. of 10 runs, 2 loops each)
%%timeit -n2 -r10
# efficient! Uses the primary key
(Rat() & {'rat_id': 300, 'lab_name': 'Cajal'}).fetch()
36.7 ms ± 981 µs per loop (mean ± std. dev. of 10 runs, 2 loops each)
%%timeit -n2 -r10
# inefficient! Requires a full table scan
len(Rat & {'rat_id': 500})
127 ms ± 21 ms per loop (mean ± std. dev. of 10 runs, 2 loops each)
Pattern searches in strings can benefit from an index when the starting characters are specified.
%%timeit -n2 -r2
# efficient! Uses the primary key
len(Rat & 'lab_name LIKE "Caj%"')
298 ms ± 20 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
%%timeit -n2 -r2
# inefficient! requires a full table scan
len(Rat & 'lab_name LIKE "%jal"')
489 ms ± 127 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Similarly, searching by the date requires an inefficient full-table scan:
%%timeit -n3 -r6
len(Rat & 'date_of_birth > "2019-10-02"')
384 ms ± 67.9 ms per loop (mean ± std. dev. of 6 runs, 3 loops each)
To speed up searches by the rat_id
and date_of_birth
, we can explicit indexes to Rat
:
@schema
class Rat2(dj.Manual):
definition = """
lab_name : char(16)
rat_id : int unsigned # lab-specific ID
---
date_of_birth = null : date
index(rat_id)
index(date_of_birth)
"""
populate_rats(Rat2())
%%timeit -n3 -r6
# efficient! uses index on rat_id
(Rat2() & {'rat_id': 300}).fetch()
37.6 ms ± 2.64 ms per loop (mean ± std. dev. of 6 runs, 3 loops each)
%%timeit -n2 -r2
# efficient! uses index on date_of_birth
len(Rat2 & 'date_of_birth > "2019-10-02"')
262 ms ± 41.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Quiz: How many indexes does the table Rat
have?¶
Rat.describe();
Answer¶
Three: primary key, rat_id, date_of_birth
Indexes in SQL¶
import json
import pymysql
pymysql.install_as_MySQLdb()
with open('cred.json') as f:
creds = json.load(f)
connection_string = "mysql://{user}:{password}@{host}".format(**creds)
%load_ext sql
%config SqlMagic.autocommit=True
%sql $connection_string
%%sql
create database dimitri_indexes
* mysql://dimitri:***@db.ust-data-sci.net
(pymysql.err.ProgrammingError) (1007, "Can't create database 'dimitri_indexes'; database exists")
[SQL: create database dimitri_indexes]
(Background on this error at: https://sqlalche.me/e/14/f405)
%%sql
SHOW TABLES in dimitri_indexes;
%%sql
CREATE TABLE dimitri_indexes.mouse(
mouse_id int NOT NULL,
tag_id int NOT NULL,
primary key(mouse_id)
)
* mysql://dimitri:***@db.ust-data-sci.net
(pymysql.err.OperationalError) (1050, "Table 'mouse' already exists")
[SQL: CREATE TABLE dimitri_indexes.mouse(
mouse_id int NOT NULL,
tag_id int NOT NULL,
primary key(mouse_id)
)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
%%sql
drop table dimitri_indexes.mouse
* mysql://dimitri:***@db.ust-data-sci.net
0 rows affected.
[]
%%sql
CREATE TABLE dimitri_indexes.mouse(
mouse_id int NOT NULL,
tag_id int NOT NULL,
primary key(mouse_id),
index (tag_id)
)
* mysql://dimitri:***@db.ust-data-sci.net
0 rows affected.
[]
%%sql
CREATE UNIQUE INDEX mouse_idx ON dimitri_indexes.mouse (tag_id)
* mysql://dimitri:***@db.ust-data-sci.net
0 rows affected.
[]
%%sql
DROP INDEX mouse_idx ON dimitri_indexes.mouse;
* mysql://dimitri:***@db.ust-data-sci.net
0 rows affected.
[]