
DataJoint Specs


License

This DataJoint Specification document is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).


You are free to:

  • Share: copy and redistribute the material in any medium or format.

  • Adapt: remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:

  • Attribution: you must give appropriate credit, provide a link to the license, and indicate if changes were made.

  • ShareAlike: if you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

This license is available at: http://creativecommons.org/licenses/by-sa/4.0/


Introduction

Purpose of DataJoint Specs

The DataJoint Specs define the standards, conventions, and best practices for designing and managing DataJoint pipelines. These specifications ensure that all DataJoint implementations remain consistent, scalable, and interoperable across various scientific workflows and computing environments.

By following the DataJoint Specs, users and developers can:


Purpose of DataJoint

DataJoint provides scientists with enhanced capabilities to design, implement, and manage the data operations underpinning their research.

At its core, DataJoint treats data and computations jointly within a formal computational database, unifying data structures and analysis code. It extends the traditional relational data model so that some tables contain raw or curated input data, while others define computations and store their results. This makes it possible to precisely construct and express entire scientific workflows—data acquisition, transformation, analysis—as structured, executable pipelines.

In this architecture, the database becomes more than just a storage system; it evolves into a living pipeline that encodes the logic and lineage of scientific work:

DataJoint maintains the rigor of relational databases—ensuring data integrity, ACID-compliant transactions, and declarative query logic—but extends their scope to meet the demands of modern science:

This structured approach enables researchers to pursue ambitious goals, achieve engineered reliability, and build, refine, and share robust workflow systems.

By unifying computational logic and relational data modeling, DataJoint provides a foundation for building high-integrity, high-performance scientific data ecosystems.

Open-Source Development and the DataJoint Standard

DataJoint is defined by an open standard designed to ensure interoperability and compatibility across different implementations and scientific workflows. Aligned with Open Science principles, this standard focuses on the core components that form the foundation of any DataJoint pipeline, often visualized as the inner circle in the architectural diagram below.

DataJoint Architecture Diagram Diagram illustrating the core components defined by the DataJoint standard (inner circle) and the broader ecosystem integration points (outer circle).

These core components, covered by these specifications, include:

These DataJoint Specs primarily define this core framework and its principles, ensuring that pipelines built using different tools or platforms remain consistent and interoperable. Adherence to this standard is fundamental for achieving data integrity, automated computation, reproducibility, collaboration, and integrated multi-modal data handling.

The current reference implementation for this standard is the open-source DataJoint-Python library datajoint-python, available under the Apache 2.0 License. It provides the essential tools for research teams to build, manage, and integrate their own pipelines using the DataJoint standard.

While the standard defines the core, a fully operational scientific pipeline often integrates with a broader ecosystem (e.g., specific instruments, ELNs, job orchestrators, visualization tools – the outer circle in the diagram). These ecosystem integration points may include:

While these extensions build upon the core framework, they are generally outside the scope of the base DataJoint specification, allowing flexibility and choice in implementation. Various DIY implementations and commercial offerings may diverge in implementing such extensions.

The DataJoint Platform (not covered by the specs or open standard)

For research teams seeking a fully managed, turnkey solution, DataJoint Inc. offers its Data Operations Platform for Scientific Research. This commercial platform is built upon the open-source core and standards described previously but adds integrated components and services to simplify deployment and reduce operational overhead.

The platform provides a cohesive infrastructure designed to enhance operational excellence in data-intensive research, typically including:

The DataJoint Platform aims to provide a reliable, scalable, and secure environment for large-scale scientific workflows, deployable on-premise, in the cloud, or in a hybrid infrastructure.

Crucially, while leveraging the platform’s managed services, research teams retain full ownership and control over their pipeline code and scientific data. This ensures that choosing between a DIY approach with the open-source tools or partnering with DataJoint Inc. for the commercial platform maintains data sovereignty and allows for flexibility and interoperability, grounded in the common DataJoint standard.


Key Objectives and Design Principles

The DataJoint standard is designed around the following key objectives and principles to support robust and scalable scientific data workflows:

Terminology

DataJoint adopts familiar terms from relational database theory and defines additional terms specific to its computational framework.

| Term | Definition |
|---|---|
| Data Pipeline | A structured system for managing scientific data and computations, encompassing a relational database, code repository, and potentially object storage. It serves as the system of record for a scientific project, enabling structured, reproducible workflows. Also referred to as a DataJoint project or workflow. |
| Computational Database | The core concept where a relational database is extended to treat data and computations jointly, unifying data structures and analysis code to represent and automate entire workflows. |
| Schema | (a) A collection of related table definitions and integrity constraints within the database; (b) a namespace organizing related tables, typically corresponding to a Python module in the code repository. |
| Table | The core relational data structure representing entities or computations. Tables have named and typed attributes and store data as rows. See also Table Tier and Master/Part Table. |
| Attribute (Column/Field) | A named, typed element of a table definition, representing a specific property or data point. Always referenced by name. |
| Row (Record/Tuple) | A single entry (entity instance) in a table, comprising values for each attribute. Rows are uniquely identified by their primary key. |
| Primary Key | A designated set of attributes whose values uniquely identify each row within a table. |
| Foreign Key | A set of attributes in one table that refers to the primary key of another (parent) table, establishing a dependency and enforcing referential integrity. |
| Table Tier | The classification of a table (lookup, manual, imported, computed) indicating its role and how its data is populated (static reference, manual entry, automated import, automated computation). See Table Tiers. |
| Master Table / Part Table | A design pattern where a master table represents a primary entity, and one or more part tables (defined as nested classes) store dependent details. This ensures group integrity, meaning the master and its parts are inserted and deleted together atomically. See Master-Part Relationship. |
| make Method | A required method within Computed and Imported table classes that defines the computation or data import process for populating the table’s rows based on upstream data. See Computation. |
| Key Source | For Computed and Imported tables, the set of primary key values from upstream tables for which a computation or import needs to be performed. It identifies the pending tasks for the make method. |
| Query | A function operating on stored or derived data, typically defined via a Query Expression, resulting in a new derived table (relation). |
| Query Expression | A formal definition of a query expressed using DataJoint’s query operators (e.g., restriction &, join *, projection .proj()) acting on tables or other query expressions. |
| Fetch | The execution of a query expression on the database server and the transfer of the resulting data (rows) to the client application. |
| Transaction | A sequence of database operations executed as a single, atomic, consistent, isolated, and durable (ACID) unit. Ensures all operations within the sequence succeed or fail together. |
| Object (Attribute Type) | An attribute type (object) used to store references to large data entities (e.g., files, arrays) managed by DataJoint but typically stored externally in an Object Store rather than directly within the database table row (unlike the blob type). See Object Types. |
| Object Store | A storage backend (e.g., filesystem, S3, GCS) used in conjunction with the relational database to store large data objects referenced by attributes of type object. See Object Storage. |
| Custom Type / Type Adaptor | A user-defined mechanism to handle the conversion between specialized scientific data objects (e.g., specific file formats, complex data structures) and a supported underlying stored attribute type (e.g., blob, object, varchar). See Custom Types. |

Frequently Asked Questions

Is DataJoint an ORM?

Object-Relational Mapping (ORM) is a technique allowing developers to interact with relational databases through object-oriented programming, abstracting direct SQL queries. Popular Python ORMs include SQLAlchemy and Django ORM, often used in web development.

Although DataJoint shares certain ORM characteristics, it is primarily a computational database framework designed explicitly for scientific workflows. Unlike traditional ORMs, which focus mainly on simplifying database interactions for web applications, DataJoint explicitly defines data dependencies and computational relationships, ensuring data integrity, traceability, and reproducibility.

Thus, DataJoint can be considered an ORM specialized for scientific databases, purpose-built for structured experimental data and computational workflows.

Is DataJoint a Workflow Management System?

Not exactly. DataJoint provides robust capabilities for embedding computations within a relational database structure, managing derived data, and tracking explicit data dependencies. However, DataJoint itself does not handle scheduling, distributed execution, or orchestration of parallel computational tasks, which are typical roles of workflow management systems such as Apache Airflow or Nextflow. Instead, DataJoint complements these systems, formalizing data dependencies so that external workflow schedulers can effectively manage computational tasks.

Is DataJoint a Lakehouse?

DataJoint and lakehouses share some similar goals—such as integrating structured data management with scalable storage and computational capabilities. However, a lakehouse typically merges the flexibility of data lakes (handling raw, semi-structured data at scale) with the structured schemas and transactional guarantees of traditional databases.

In contrast, DataJoint focuses specifically on scientific data workflows, emphasizing rigorous schema definitions, explicit computational dependencies, and robust data integrity. While lakehouses offer flexible analytics on structured and unstructured data, DataJoint prioritizes precise data modeling, reproducibility, and computational traceability within structured scientific datasets.

Therefore, DataJoint complements lakehouse architectures but is tailored specifically for managing structured experimental data and computational pipelines in science.

Does DataJoint require SQL knowledge?

No, DataJoint does not require SQL knowledge, but understanding relational concepts can be helpful.

DataJoint provides a Python-based API that abstracts SQL, allowing users to define schemas, insert data, and query tables without writing SQL. Instead of composing SQL queries, users interact with the database using intuitive Python methods.

Examples comparing SQL vs. DataJoint

| SQL | DataJoint-Python |
|---|---|
| CREATE TABLE | Define tables as Python classes |
| INSERT INTO | .insert() method |
| SELECT * FROM ... | .fetch(), .proj(), .aggr() |
| JOIN | table1 * table2 |
| WHERE ... | table & condition |

Since DataJoint uses relational database backends, all data can be accessed through SQL as well.


Pipeline Design

A data pipeline within the DataJoint framework is a structured system designed to manage scientific data, its inherent dependencies, associated computations, and execution workflows.

Pipeline ≡ Python Package

A DataJoint pipeline SHALL be implemented as a dedicated Python package. Within this package, Python modules correspond to database schemas. This architectural convention ensures that data, computations, and their interdependencies are organized, traceable, and reproducible.

A DataJoint pipeline adheres to a Directed Acyclic Graph (DAG) structure, wherein:

Pipeline Design

Constraint: Acyclicity of Dependencies

Cyclical dependencies are strictly prohibited at multiple levels:

  • All referential constraints (foreign keys) within a single schema MUST form a DAG.

  • The dependencies between schemas (formed by foreign keys linking tables in different schemas, and by Python import statements between modules) MUST also collectively form a DAG.

This ensures a unidirectional flow of data and computational dependencies throughout the pipeline.

Database Schema ≡ Python Module

Each database schema in a DataJoint pipeline SHALL correspond to a distinct Python module. This module serves as a namespace for organizing a collection of logically related tables. Adherence to a one-to-one mapping between database schemas and Python modules promotes modularity and maintainability of the pipeline code.

Schema Design

Key principles of this mapping include:

This structured correspondence between the database organization and the codebase ensures synchronization, thereby facilitating reproducibility, collaboration, and long-term maintenance.

Database Table ≡ Python Class

In the DataJoint framework, each database table SHALL be represented by a Python class. A consistent naming convention MUST be followed to map Python class names to database table names:

| Construct | Convention |
|---|---|
| Python class names | Written in CamelCase |
| Database table names | Written in snake_case |
| Fully qualified Python class name | <module>.<ClassName> |
| Fully qualified database table name | <schema_name>.<table_name> |

Example Mapping:
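For instance (hypothetical names), a class ScanImage declared in the module imaging.py of the pipeline package corresponds to the database table imaging.scan_image. Its fully qualified Python class name is imaging.ScanImage, and its fully qualified database table name is imaging.scan_image.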

This naming convention ensures clarity, consistency, and unambiguous alignment between the Python implementation of the pipeline logic and the underlying relational database schema.

Table Tiers

Each table within a DataJoint pipeline SHALL be assigned to one of four predefined tiers. These tiers classify tables based on their role and the method by which their data are populated and maintained. In graphical representations of the pipeline (e.g., diagrams generated by dj.Diagram), distinct color codes are conventionally used to visually differentiate these tiers:

| Tier | Description | Color Code (in Diagrams) |
|---|---|---|
| lookup | Static reference data that is part of the schema definition (e.g., parameters, controlled vocabularies). | Gray |
| manual | Data manually entered from external sources, typically by users. | Green |
| imported | Data automatically ingested from external sources (e.g., raw data files, external databases). | Blue |
| computed | Data generated from upstream tables through automated computations. | Red |

Note on Color Codes: Standard GitHub Flavored Markdown (GFM) does not support the direct rendering of colored text. However, these color codes are a standard convention used in DataJoint’s graphical schema visualization tools (e.g., dj.Diagram outputs) and MAY be applied in other documentation formats that support rich text styling, such as HTML or LaTeX.

The classification of tables into these tiers provides a clear conceptual separation between manually curated data, externally sourced data, and computationally derived data. This distinction is fundamental to maintaining data integrity, provenance, and understanding the data flow within the pipeline.

Table Definition

The definition of a table in DataJoint specifies its structural and functional characteristics within the pipeline. Each table definition encompasses the following elements:

A common method for specifying these elements is through a multi-line string assigned to a class attribute named definition within the table’s Python class declaration.

Example: Basic Table Definition Structure

@schema # Decorator associating the class with a DataJoint schema object
class TableName(dj.Table): # Base class depends on the table tier
    definition = """
    # A descriptive comment for the table
    primary_key_attribute_1 : type       # Comment for attribute 1
    primary_key_attribute_2 : type       # Comment for attribute 2
    ---                                  # Primary key separator
    secondary_attribute_1   : type       # Comment for attribute 3
    secondary_attribute_2   : type = default_value # Attribute with a default
    -> ParentTable                       # Foreign key definition
    index (secondary_attribute_1)        # Secondary index definition
    """

While the definition string is a prevalent convention, alternative programmatic methods for defining table structures MAY be supported by specific DataJoint client implementations, provided they ultimately resolve to these fundamental elements.

Attribute Definition

Each attribute in a table definition SHALL be specified by its name, data type, an optional default value, and an optional comment. The syntax for defining an attribute within the definition string is:

attribute_name : type[ = default_value][ # comment]

Where:

  • attribute_name is the name of the attribute.

  • type is one of the supported attribute types (see Attribute Types).

  • default_value is an optional default applied when the attribute is omitted during insert; a default of null makes the attribute nullable.

  • comment is an optional human-readable description of the attribute.

Examples of Attribute Definitions:

experiment_id   : int unsigned       # Unique identifier for the experiment
recording_date  : date               # Date of data recording
notes           : varchar(1000) = '' # Optional notes, defaults to empty string
analysis_param  : float = null       # A nullable analysis parameter

Primary Key

Every table in a DataJoint pipeline MUST have a primary key. The primary key is a set of one or more attributes whose collective values uniquely identify each row in the table.

Example: Primary Key Definition

    definition = """
    subject_id    : varchar(32)   # Part 1 of the composite primary key
    session_id    : uint16        # Part 2 of the composite primary key
    ---                           # Primary key separator
    session_date  : date
    experimenter  : varchar(64)
    """

In this example, (subject_id, session_id) jointly form the primary key.

Schema Normalization

Tables within a DataJoint pipeline SHOULD be designed according to the principles of database normalization. The objective is to ensure that each table represents a single, well-defined entity type or relationship.

Adherence to normalization principles facilitates data consistency, reduces anomalies during data modification, and improves the overall clarity and maintainability of the pipeline schema.

Foreign Keys

A foreign key establishes a referential link between a “child” table (the table containing the foreign key definition) and a “parent” table (the table referenced by the foreign key). This link ensures referential integrity, meaning that rows in the child table can only reference existing rows in the parent table.

Foreign keys SHALL be defined on separate lines within the definition string of the child table, using the -> operator to point to the Python class name of the parent table.

Syntax for Foreign Key Definition

-> ParentClassName
-> [modifier[, modifier...]] ParentClassName
-> [modifier[, modifier...]] ParentClassName.proj(new_name1=old_name1, ...)

Where:

  • ParentClassName is the Python class name of the parent table (imported into the current module’s namespace if the parent is defined in another schema).

  • modifier is an optional keyword, such as nullable or unique, that alters the resulting constraint.

  • .proj(new_name1=old_name1, ...) renames the inherited primary key attributes of the parent as they appear in the child table.

Constraints:

Effects of a Foreign Key Definition:

  1. Attribute Inheritance: The primary key attributes of the ParentClassName are implicitly included in the attribute set of the child table if they are not already present. Their data types and properties (e.g., nullability, if modified) are derived from the parent’s primary key.

  2. Referential Constraint: A referential integrity constraint is established in the database, ensuring that values in the foreign key attributes of the child table must match existing primary key values in the ParentClassName.

  3. Implicit Index: An index is automatically created on the foreign key attributes in the child table to optimize join operations and referential integrity checks.
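A minimal sketch illustrating these rules, using hypothetical parent tables (Subject, Rig, and Experimenter, assumed to have primary keys subject_id, rig_id, and experimenter_id, respectively):

@schema
class Session(dj.Manual):
    definition = """
    -> Subject                           # inherits subject_id into the primary key
    session_idx : uint16                 # session number within the subject
    ---
    -> [nullable] Rig                    # optional reference; rig_id may be null
    -> Experimenter.proj(lead_id='experimenter_id')  # inherited attribute renamed to lead_id
    session_date : date
    """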

Indexes and Unique Constraints

Beyond the implicit index created for the primary key, tables MAY define additional secondary indexes to optimize query performance or enforce uniqueness constraints on non-primary key attributes or combinations thereof.

Example: Table with Secondary and Unique Indexes

@schema
class Person(dj.Manual):
    definition = """
    # Table representing a person with unique identifiers
    person_id        : uint32      # Unique person identifier (Primary Key)
    ---
    first_name       : varchar(50)
    last_name        : varchar(50)
    drivers_license  : varchar(20) = null
    cell_phone       : varchar(15) = null
    email            : varchar(100) = null

    index (last_name, first_name)  # Secondary index for faster name searches
    unique index (drivers_license) # Ensures driver's license is unique (if not null)
    unique index (cell_phone)      # Ensures cell phone is unique (if not null)
    unique index (email)           # Ensures email is unique (if not null)
    """

This definition facilitates efficient searches by (last_name, first_name). The unique indexes ensure data integrity for drivers_license, cell_phone, and email.

Note on Nullable Unique Constraints: A unique index on an attribute that permits NULL values allows multiple rows to contain NULL in that attribute, as NULL is generally not considered equal to another NULL for constraint purposes. The exact behavior MAY depend on the underlying database system (e.g., MySQL, PostgreSQL).

Lookup Tables

Lookup tables (dj.Lookup tier) store static or semi-static reference data that is considered an integral part of the pipeline’s schema definition rather than dynamic experimental data. Examples include controlled vocabularies, lists of experimental parameters, or standardized codes.

Example: ChemicalElement Lookup Table

import datajoint as dj

schema = dj.Schema('chemistry')  # schema object used by the @schema decorator below

@schema
class ChemicalElement(dj.Lookup):
    definition = """
    # Lookup table for chemical elements
    atomic_number : uint8       # Atomic number (Primary Key)
    ---
    symbol        : char(2)     # Chemical symbol
    name          : varchar(20) # Element name
    atomic_weight : decimal(7,4)  # Standard atomic weight
    """
    contents = [
        {'atomic_number': 1, 'symbol': 'H',  'name': 'Hydrogen',  'atomic_weight': 1.008},
        {'atomic_number': 2, 'symbol': 'He', 'name': 'Helium',    'atomic_weight': 4.0026},
        {'atomic_number': 3, 'symbol': 'Li', 'name': 'Lithium',   'atomic_weight': 6.94},
        # ... additional elements
    ]

Attribute Types

DataJoint defines a standardized, concise set of fundamental attribute types for use in table definitions. These types abstract the underlying SQL data types provided by supported relational database backends (e.g., PostgreSQL and MySQL), offering a consistent interface tailored for scientific data workflows. This specification prioritizes type names that are intuitive to users in scientific computing environments (e.g., uint8 rather than SQL’s TINYINT UNSIGNED).

Core Attribute Types

The following core attribute types SHALL be supported by DataJoint implementations:

| Category | Type | Description |
|---|---|---|
| UUID | uuid | Universally Unique Identifier (RFC 4122). Default values are not supported for uuid attributes. |
| Integers | int8, uint8, int16, uint16, int32, uint32, int64, uint64 | Standard signed and unsigned integer types of varying bit widths. |
| Floating-Point | float32, float64 | Single-precision (32-bit) and double-precision (64-bit) floating-point numbers. Note: NaN (Not a Number) behavior MAY vary by backend; e.g., MySQL does not natively support NaN in indexed FLOAT columns. |
| Decimal | decimal(M,N) | Fixed-point decimal number with a total precision of M digits and N digits after the decimal point (scale). |
| Character Strings | char(N), varchar(N) | Fixed-length (char) or variable-length (varchar) character strings, where N specifies the maximum length. |
| Enumeration | enum('val1', 'val2', ...) | A type restricted to a predefined set of allowed string values. |
| Date | date | Represents a calendar date (year, month, day) in ISO 8601 format (YYYY-MM-DD). A special default value of NOW MAY be used to set the current date upon insertion. |
| Time / Timestamp | timestamp | Represents a point in time, typically with microsecond precision, stored in UTC. Values SHOULD conform to ISO 8601 format. A special default value of NOW MAY be used to set the current timestamp upon insertion. |
| Binary Large Object | blob | Stores large binary data directly within the database row (inline storage). Suitable for moderately sized binary objects. |
| Object Reference | object | Stores a reference (e.g., a key or path) to an external data object managed by DataJoint but stored outside the primary database (e.g., in an object store or file system). See Object Types. |
| Custom Type | <adaptor_name> | A user-defined type managed by a Custom Type Adaptor, allowing for specialized storage and handling of complex data structures. |

Distinction: blob vs. object Attribute Types

The blob and object types both handle non-scalar data, but differ in their storage strategy and typical use cases:

| Type | Intended Use | Data Storage Location |
|---|---|---|
| blob | Raw binary data stored directly within the database table row. Suitable for relatively small to moderately sized binary data where inline storage is acceptable and efficient. | Database system (e.g., as BLOB or BYTEA columns in MySQL/PostgreSQL). |
| object | References to data entities stored externally to the primary database. Suitable for large files, datasets, or complex objects where external storage is preferred for scalability or management. | External storage systems (e.g., file systems, cloud object stores like S3/GCS/Azure Blob, network-attached storage). The database stores metadata and a reference key. |

Object Types (External Storage)

DataJoint pipelines often employ a hybrid storage model to manage large-scale scientific datasets efficiently. In this model:

This approach leverages the strengths of relational databases for data integrity and querying, while utilizing the scalability and cost-effectiveness of dedicated object storage solutions for bulk data.

How object Attributes Work

The object attribute type facilitates this hybrid model by enabling object-augmented schemas. Tables can include attributes of type object, which store references to data entities residing in an external store. These externally stored objects are:

The dj.Object Interface for Custom Object Handling

To enable DataJoint to manage custom external object types, users MAY define Python classes that subclass dj.Object. Such classes MUST implement a specific interface to handle the serialization, storage, retrieval, and verification of these objects.

The required methods for a dj.Object subclass are:

| Method Signature | Description |
|---|---|
| put(self, store, key: str) -> dict | Writes the instance’s data to the specified store (a configured object storage backend) under the given key (a unique identifier or path). Returns a dictionary of metadata (e.g., checksum, version, timestamp) to be stored in the database. |
| get(cls, store, key: str) -> "dj.Object" | A class method that reads data from the store using the key and reconstructs an instance of the dj.Object subclass. |
| get_meta(self) -> dict | Returns a dictionary of metadata about the object instance (e.g., size, checksum, version), typically reflecting the state as stored or to be stored. |
| verify(self, store, key: str) -> bool | Confirms that the object corresponding to key exists in the store and is valid (e.g., by checking its checksum or integrity). Returns True if valid, False otherwise. |
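The following is a minimal sketch of a dj.Object subclass implementing this interface. The LabeledMask class and the store.write/store.read helper calls are hypothetical illustrations of a storage backend API, not part of this specification.

import hashlib
import json

import datajoint as dj


class LabeledMask(dj.Object):
    # Hypothetical external object: a binary mask with a text label.

    def __init__(self, label: str, mask_bytes: bytes):
        self.label = label
        self.mask_bytes = mask_bytes

    def put(self, store, key: str) -> dict:
        # Write the payload to the configured object store under `key`
        # and return metadata to be recorded in the database.
        store.write(f"{key}/mask.bin", self.mask_bytes)  # assumed store API
        store.write(f"{key}/label.json", json.dumps({"label": self.label}).encode())
        return self.get_meta()

    @classmethod
    def get(cls, store, key: str) -> "LabeledMask":
        # Read the data back from the store and reconstruct the object.
        mask_bytes = store.read(f"{key}/mask.bin")  # assumed store API
        label = json.loads(store.read(f"{key}/label.json"))["label"]
        return cls(label, mask_bytes)

    def get_meta(self) -> dict:
        # Metadata recorded in the database alongside the object reference.
        return {
            "size": len(self.mask_bytes),
            "checksum": hashlib.sha256(self.mask_bytes).hexdigest(),
        }

    def verify(self, store, key: str) -> bool:
        # Confirm the stored copy exists and matches the recorded checksum.
        try:
            stored = store.read(f"{key}/mask.bin")  # assumed store API
        except Exception:
            return False
        return hashlib.sha256(stored).hexdigest() == self.get_meta()["checksum"]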

Metadata for External Objects

For each attribute of type object, DataJoint (or the custom dj.Object implementation) SHALL ensure that relevant metadata is stored within the relational database alongside the reference. This metadata typically includes:

Custom Types (Type Adaptors)

Custom types, also known as type adaptors, extend DataJoint’s native type system to allow seamless integration and management of diverse scientific data structures as if they were standard attributes stored directly in database columns. They provide a mechanism for converting between complex Python objects or specialized file formats and one of DataJoint’s Core Attribute Types suitable for database storage.

DataJoint projects MAY define and register a collection of custom type adaptors tailored to their specific data handling needs.

Functionality of Custom Types:

Key Benefits:

Implementation and Registration:

Type adaptors SHALL be implemented as Python classes subclassing from dj.CustomType. These classes MUST implement the following methods and properties:

  1. type_name (property): A string representing the unique name by which this custom type is referenced in table definitions (e.g., <my_custom_type>).

  2. stored_type (property): A string specifying the underlying DataJoint Core Attribute Type (e.g., blob, object, varchar(255), or another registered <adaptor_name>) used for actual storage in the database.

  3. put(self, user_object: object) -> object: Converts the user_object (the Python object provided during insertion) into the stored_type format. The key argument (representing the primary key of the record being inserted) MAY be optionally accepted by put if its values are needed for the conversion or storage process (e.g., for constructing object paths if stored_type is object).

  4. get(self, stored_object: object) -> object: Converts the stored_object (data retrieved from the database in stored_type format) back into the user-level Python object. The key argument MAY be optionally accepted by get if needed for reconstruction.

Custom type adaptors MUST be accessible (e.g., imported) and registered with the DataJoint client (e.g., via a plugin mechanism or explicit registration call like dj.register_type(MyCustomAdaptor)) before any schemas using these types are declared or activated. Revisions to type adaptors SHOULD maintain backward compatibility to ensure existing data remains accessible and correctly interpretable.

Example Type Adaptors:

| type_name | stored_type | Purpose |
|---|---|---|
| <dj_blob> | blob | Serializes arbitrary Python objects into a generic blob format, often used for backward compatibility with older DataJoint blob storage. |
| <zarr_array> | object | Manages a NumPy array stored externally as a Zarr dataset, referenced by a path/key. |
| <tif_image> | object | Manages an image stored externally as a TIFF file. |

Usage in Table Definition: Once registered, a custom type adaptor is used in a table definition string by its type_name.

@schema
class ProcessedData(dj.Computed):
    definition = """
    -> upstream.Analysis
    ---
    computed_matrix : <zarr_array>    # Uses the <zarr_array> custom type
    result_summary  : <dj_blob>       # Uses the <dj_blob> custom type
    """

During an insert operation, DataJoint will invoke the put method of the corresponding registered adaptor for attributes of custom types. During a fetch operation, the get method will be invoked.
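As an illustration, the sketch below outlines a hypothetical <json_dict> adaptor that stores a small Python dictionary as JSON text in a varchar column. It follows the dj.CustomType interface and the dj.register_type registration call named in this specification; the adaptor itself is not part of the standard.

import json

import datajoint as dj


class JsonDict(dj.CustomType):
    # Hypothetical adaptor: stores a small Python dict as JSON text.

    @property
    def type_name(self) -> str:
        return "<json_dict>"      # name used in table definitions

    @property
    def stored_type(self) -> str:
        return "varchar(4096)"    # underlying core attribute type

    def put(self, user_object: dict) -> str:
        # Convert the user-level object into its stored representation.
        return json.dumps(user_object, sort_keys=True)

    def get(self, stored_object: str) -> dict:
        # Convert the stored representation back into the user-level object.
        return json.loads(stored_object)


# Registration must occur before declaring any schema that uses the type:
dj.register_type(JsonDict())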

Master-Part Relationship

The master-part relationship is a design pattern in DataJoint used to enforce group integrity, ensuring that a primary entity (the “master”) and its dependent components (the “parts”) are managed as a single atomic unit. This is crucial for scenarios where multiple related records must always exist together or not at all.

Ensuring Group Integrity

Consider a scenario where an experimental trial (master) produces multiple simultaneously recorded data streams (parts). The master-part pattern ensures that the trial record and all its associated data stream records are inserted together, and if the trial record is deleted, all its data stream records are also automatically deleted.

Master-Part Pattern Definition

The fully qualified Python class name for a part table follows the format: module_name.MasterClassName.PartClassName. The corresponding database table name will also reflect this nested structure (e.g., <schema_name>.<master_table_name>__<part_table_name>).

A part table MUST always have a foreign key referencing its master table. To simplify this declaration, DataJoint provides the alias -> master within the definition string of a part table. This alias automatically establishes the foreign key relationship to the enclosing master table, inheriting its primary key attributes.

Restrictions on Master-Part Relationships

  1. No Nested Masters: A part table SHALL NOT itself serve as a master table for other part tables. The master-part relationship is limited to one level of nesting.

  2. Exclusive Master: A part table SHALL belong to exactly one master table. It cannot have multiple master tables.

  3. Hierarchical Data: For more complex hierarchical relationships extending beyond a single master-part level, multiple distinct master tables, each potentially with their own part tables, SHOULD be used, linked by standard foreign key relationships.

Example: Cell Segmentation with Master-Part

@schema
class Segmentation(dj.Computed):  # Master table
    definition = """
    -> acq.Image
    -> seg.SegmentationMethod
    ---
    segmented_image_object : <image_object_type> # Reference to the full segmented image
    region_count           : uint16              # Number of segmented regions
    """

    class Region(dj.Part):  # Part table, nested within Segmentation
        definition = """
        # Stores individual segmented regions for each Segmentation entry
        -> master             # Foreign key to the Segmentation master table
        region_idx : uint16   # Differentiates regions within the same segmentation
        ---
        region_pixels_object : <region_mask_object_type> # Object with pixel data for the region
        region_area          : float                     # Area of the region
        """

Key Characteristics of this Example:

Enforcement via Transaction Processing

The DataJoint framework guarantees the integrity of master-part relationships through its transaction handling:


Diagram

DataJoint comes with a formally defined diagramming notation implemented by the dj.Diagram class. The pipeline is visualized as a directed acyclic graph with nodes corresponding to classes and edges corresponding to the foreign key dependencies between them. Tables are grouped by their schemas, which in turn also form a DAG whose edges bundle all foreign keys from child schemas to parent schemas. The diagram is always depicted with data moving top-to-bottom (foreign keys referencing upward) or left-to-right (foreign keys referencing leftward).

dj.Diagram objects are graphs that support an algebra of graph operations: union, intersection, and dilation.

The nodes are colored according to their table tier.
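For illustration, assuming the DataJoint-Python client and hypothetical schema and table names, diagrams can be composed as follows:

import datajoint as dj

# Diagram of an entire schema (module): nodes are tables, colored by tier.
dj.Diagram(schema)

# Union of two schema diagrams.
dj.Diagram(schema) + dj.Diagram(other_schema)

# Dilation: expand the diagram around one table by a number of levels.
dj.Diagram(Segmentation) + 2   # include tables up to two foreign keys downstream
dj.Diagram(Segmentation) - 1   # include tables one foreign key upstream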


Object Storage

DataJoint integrates object storage into its relational database-driven pipelines to efficiently manage large scientific datasets. Object storage is used for two primary purposes:

  1. Object-Augmented Schemas – Storing large, unstructured data (e.g., images, time series) externally while keeping structured metadata in the database.

  2. Database Backup & Export – Providing a structured, shareable repository of pipeline data.

Object-Augmented Schemas

A DataJoint pipeline follows a hybrid storage model, where:

This scalable approach maintains fast querying, data integrity, and transactional consistency, while enabling flexible, distributed storage of large datasets.

Storage Backend Configuration

A DataJoint client is configured to access the storage backend associated with each database. Supported backends include:

DataJoint uses fsspec to ensure compatibility across multiple storage backends.

Example Structured Storage Pattern

A DataJoint project creates a structured hierarchical storage pattern:

📁 project_name/
├── datajoint.toml
├──📁 schema_name1/
├──📁 schema_name2/
├──📁 schema_name3/
│  ├── schema.py
│  ├── 📁 tables
│  │   ├── table1/key1-value1.parquet
│  │   ├── table2/key2-value2.parquet
│  │   ...
│  ├── 📁 objects
│  │   ├── table1-field1/key3-value3.zarr
│  │   ├── table1-field2/key3-value3.gif
...  ...

When using object storage, the corresponding keys might be:

s3://project_name/schema_name3/tables/table1/key1-value1.parquet
s3://project_name/schema_name3/objects/table1-field1/key3-value3.zarr

The organizational structure of stored objects is configurable, allowing partitioning based on primary key attributes.

Example configuration in datajoint.toml:

# Configuration for object storage
[object_storage]
partition_pattern = "subject{subject_id}/session{session_id}"

This structure dynamically replaces the placeholders {subject_id} and {session_id} with actual values.

Example Stored Objects with Partitioning:

s3://my-bucket/project_name/subject123/session45/schema_name3/objects/table1/key1-value1/image1.tiff
s3://my-bucket/project_name/subject123/session45/schema_name3/objects/table2/key2-value2/movie2.zarr

This structured naming approach allows:

Export, Sharing, and Backup/Restore

The structured project data repository serves as a comprehensive, publishable copy of the dataset, accessible by DataJoint and other tools.

The export method recreates the data repository in full at another location. DataJoint offers tools to restore the database from the file repository using the tabular data, supporting data publishing and pipeline cloning alongside the data.


Data Operations

Data operations alter the state of the data stored in the database.

Records (rows) in a stored table are considered immutable: each is inserted or deleted as a whole using insert and delete operators.

However, to allow correcting errors in existing data, the update1 operator is provided to perform deliberate corrections in the data by changing the values of individual attributes.

Insert comes in three flavors:

  1. insert1(record) for inserting an individual record

  2. insert(sequence) for batches of records

  3. insert(query_expression) for inserting query results

Insert1

The method Table.insert1(rec) inserts one record rec into one table.

The record to be inserted must be fully formed and comply with all data integrity constraints:

If the record violates any constraints, the operation fails, ensuring atomicity—either the record is successfully inserted, or nothing is changed.

A record value may be specified as an ordered tuple, in which case all fields must be present, or as a mapping (dict), in which case optional fields may be omitted.

Example:

# insert a student record
Student.insert1({
    'student_id': 1002,
    'first_name': 'Alice',
    'last_name': 'Johnson',
    'sex': 'F',
    'date_of_birth': '1998-03-12',
    'home_address': '1234 Elm Street',
    'home_city': 'Springfield',
    'home_state': 'IL',
    'home_zipcode': '62704',
    'home_phone': '(217)555-1234'
})

Batch Insert

The method Table.insert(seq) inserts a sequence of records. In this case, the entire sequence of records is inserted in a single transaction: if a single record fails to insert, the entire sequence fails. This makes the insert operator atomic.

Example:

Student.insert([
    {'student_id': 1000, 'first_name': 'Rebecca', 'last_name': 'Sanchez',
     'sex': 'F', 'date_of_birth': '1997-09-13',
     'home_address': '6604 Gentry Turnpike Suite 513',
     'home_city': 'Andreaport', 'home_state': 'MN',
     'home_zipcode': '29376', 'home_phone': '(250)428-1836'},
    {'student_id': 1001, 'first_name': 'Matthew', 'last_name': 'Gonzales',
     'sex': 'M', 'date_of_birth': '1997-05-17',
     'home_address': '1432 Jessica Freeway Apt. 545',
     'home_city': 'Frazierberg', 'home_state': 'NE',
     'home_zipcode': '60485', 'home_phone': '(699)755-6306x996'}
])

Query Insert

The method Table.insert(query_expression) inserts the result of a query expression. In this case, the query expression is treated similarly to the inserted sequence in batch insert and must meet the same requirements.

The remarkable property of query insert is that no data is fetched to the client: both the source query expression and the subsequent insert are executed server-side as a single atomic transaction.

Example:

# insert all possible majors for all students
StudentMajor.insert(Department.proj() * Student.proj())

Delete

The delete operation removes records from stored tables with cascading effects to all dependent records. Delete is often used in combination with the restriction operator to specify records to delete.

Example:

# delete all students
Student.delete()

# delete specific students
(Student & "student_id IN (500, 501, 503)").delete()

Delete cascades to all dependent records in the downstream tables after proper user confirmation.

Update

Update is used to change the values of secondary attributes in one record only. The Table.update1(rec) operator takes a mapping, rec, which must provide:

  1. The complete value of the primary key for the updated record

  2. The new values for all updated attributes

Example:

# Update email and cellphone for student id 1001
Student.update1({
    'student_id': 1001,
    'email': 'new_email@example.com',
    'cell_phone': '(555)123-4567'
})

Note that this formulation is incapable of updating primary key attributes. The record must already exist and only secondary attributes will be updated. Modifying the primary key requires deleting and reinserting the record.

Effect of Foreign Key Constraints

Foreign keys enforce referential integrity, which affects how insert, delete, and update1 operations behave:

  • Insert: a row can be inserted into a child table only if the referenced row already exists in the parent table; otherwise the insert fails.

  • Delete: deleting a row from a parent table cascades, removing all dependent rows in child tables (after user confirmation).

  • Update: update1 modifies only secondary attributes; if an updated attribute participates in a foreign key, the new value must still reference an existing row in the parent table.


Queries

A query is a server-side operation on stored or derived data that yields a new derived table. The results of a query are transferred to the client application through fetch operations.

Query Expressions

A query expression is the formal definition of a query. It is constructed by applying Query Operators to one or more input tables (or the results of other query expressions) to define an output table. Query expressions in DataJoint adhere to the following principles:

  1. Composable: Query expressions can be combined, allowing the output of one expression to serve as an input for another, enabling the construction of complex queries from simpler components.

  2. Declarative: Query expressions specify the desired data to be retrieved rather than the explicit sequence of computational steps to obtain it. The DataJoint client and underlying database engine are responsible for optimizing and executing the query.

  3. Relational: The result of any query expression is a well-formed relational table, possessing a defined set of attributes and a primary key, thus maintaining relational integrity.

Fetch Operations

Fetch operations execute a query expression on the database server and transfer the resulting data to the client application.

.fetch() Method

The .fetch() method retrieves all rows satisfying the query expression.

.fetch1() Method

The .fetch1() method retrieves a single row satisfying the query expression.

Iteration over Query Results

DataJoint query expression objects SHALL be iterable. Iteration allows for row-by-row processing of query results.
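The following sketch, assuming the DataJoint-Python client and the Student table from earlier examples, illustrates these fetch modes:

# Fetch all matching rows as a list of dictionaries.
rows = (Student & "home_state = 'TX'").fetch(as_dict=True)

# Fetch the values of a single attribute.
last_names = Student.fetch('last_name')

# Fetch exactly one row; raises an error unless exactly one row matches.
alice = (Student & {'student_id': 1002}).fetch1()

# Row-by-row iteration over query results.
for row in (Student & "home_state = 'TX'").fetch(as_dict=True):
    print(row['student_id'], row['last_name'])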

Query Operators

DataJoint provides a set of fundamental query operators for constructing query expressions. These operators act on tables or query expressions to produce new tables.

The primary query operators are:

  • Restriction (& and -)

  • Projection (.proj())

  • Join (*)

  • Aggregation (.aggr())

  • Union (+)

These operators can be combined. Standard Python operator precedence SHALL apply, and parentheses () MAY be used to explicitly control the order of evaluation.

Operator: Restriction (&, -)

The restriction operator filters the rows of a table based on specified conditions. It is expressed in two forms: A & condition (restriction) retains the rows of A that satisfy the condition, while A - condition (exclusion) retains the rows of A that do not.

Algebraic Closure under Restriction: The primary key and attribute set of the resulting table are identical to those of the operand A. The entity type is preserved.

Forms of Restriction Conditions:

  1. Restriction by a Condition String: Conditions are often expressed as SQL-like predicate strings.

    # Select students from California.
    ca_students = Student & "home_state = 'CA'"
    
    # Select students not from California.
    non_ca_students = Student - "home_state = 'CA'"
  2. Restriction by a Dictionary: Conditions can be specified as a dictionary mapping attribute names to values. These imply equality conditions.

    # Students from Texas (dictionary form).
    tx_students = Student & {'home_state': 'TX'}
  3. Restriction by a Sequence of Conditions (Logical OR): When a query is restricted by a sequence (e.g., a list) of condition strings, the conditions SHALL be combined using logical OR (disjunction).

    # Select students who are from outside California OR were born on or after January 1, 2010.
    selected_students = Student & ["home_state <> 'CA'", "date_of_birth >= '2010-01-01'"]
  4. Restriction by Conjunction (Logical AND): To combine conditions using logical AND (conjunction), conditions MAY be applied sequentially or by using a dj.AndList object.

    # Select young students from outside California (sequential application).
    young_non_ca_students = (Student
        & "home_state <> 'CA'"
        & "date_of_birth >= '2010-01-01'")
    
    # Equivalent using dj.AndList.
    young_non_ca_students_alt = Student & dj.AndList([
        "home_state <> 'CA'", "date_of_birth >= '2010-01-01'"
    ])
  5. Restriction by a Subquery: The result of another query expression can be used as a condition. The restriction acts as a semijoin (for &) or an anti-semijoin (for -).

    # Select students enrolled in course 'CS101'.
    cs101_enrollments = Enroll & "course_id = 'CS101'"
    students_in_cs101 = Student & cs101_enrollments
    
    # Select students not enrolled in course 'CS101'.
    students_not_in_cs101 = Student - cs101_enrollments
  6. Restriction by dj.Top (Limiting Results): The dj.Top operator limits the number of rows returned.

    # Select the first 5 students (order is undefined unless specified).
    top_5_students = Student & dj.Top(5)
    
    # Select the 5 students with the most recent birth dates (youngest).
    youngest_5_students = Student & dj.Top(5, order_by="date_of_birth DESC")

Operator: Projection (.proj())

The projection operator selects a subset of attributes from a table. It can also be used to rename attributes and compute new attributes derived from existing ones.

Algebraic Closure under Projection: The primary key of the resulting table is identical to that of the operand A. The entity type is preserved.
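For illustration, using the Student table from earlier examples (the computed attribute relies on a backend SQL expression):

# Keep only selected secondary attributes; the primary key is always retained.
student_names = Student.proj('first_name', 'last_name')

# Rename an attribute in the result.
renamed = Student.proj(birth_date='date_of_birth')

# Compute a new attribute from existing ones using a backend SQL expression.
with_age = Student.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')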

Operator: Join (*)

The join operator A * B combines rows from table A and table B. The operation first forms the Cartesian product of the rows and then restricts the result to rows where attributes common to both A and B satisfy Semantic Matching conditions.

Algebraic Closure under Join: The entity type and primary key of the join result depend on the functional dependencies between the operands A and B with respect to the join attributes.

  • If one operand is functionally dependent on the other across the join attributes (e.g., B’s attributes in the join are determined by A’s primary key), the resulting entity type is that of the determining operand (A), and its primary key is preserved as the result’s primary key.

  • If the operands are functionally independent with respect to the join, the resulting entity type represents a pairing of the two entities, and the primary key of the result is typically the union of the primary keys of A and B (excluding redundant attributes from the shared join columns).
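For illustration, assuming an Enroll table with a foreign key to Student (as in the restriction examples above) and a hypothetical term attribute:

# Pair each enrollment with the corresponding student's attributes;
# rows are matched on the shared, foreign-key-linked attribute student_id.
student_enrollments = Student * Enroll

# Joins compose with other operators, e.g., restricting one operand first.
fall_enrollments = Student * (Enroll & "term = 'Fall 2024'")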

Operator: Aggregation (.aggr())

The aggregation operator A.aggr(B, computed_attributes...) projects attributes from table A (similar to A.proj()) and, additionally, computes new attributes by applying aggregation functions to groups of data from table B. Rows in B are grouped according to the primary key of table A.

Algebraic Closure under Aggregation: The primary key and entity type of the resulting table A.aggr(B, ...) are identical to those of table A.
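For illustration, again assuming an Enroll table that references Student:

# Number of enrollments per student; the result keeps Student's primary key.
enrollment_counts = Student.aggr(Enroll, n_courses='count(*)')

# Projected attributes of the grouping table can be kept alongside the aggregate.
counts_with_names = Student.aggr(Enroll, 'last_name', n_courses='count(*)')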

Operator: Union (+)

The union operator A + B combines rows from table A and table B.

Algebraic Closure under Union: The primary key and entity type of the resulting table are identical to those of the operands A and B.
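For illustration, two restrictions of the same Student table share identical primary keys and attributes and can therefore be combined by union:

# Students who live in California or Texas.
ca_or_tx_students = (Student & "home_state = 'CA'") + (Student & "home_state = 'TX'")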

Semantic Matching

For binary operators that combine two tables A and B (e.g., join *, restriction by subquery A & B_query, aggregation A.aggr(B, ...), union A + B), a mechanism is required to match rows between A and B. DataJoint employs a rule termed semantic matching.

Semantic matching between rows of tables A and B is performed as an equality condition on all pairs of attributes that satisfy both:

  1. The attributes have the same name in both A and B.

  2. The attributes trace to the same original attribute definition through an uninterrupted chain of foreign keys.

This rule is more restrictive than the NATURAL JOIN in SQL, which typically only requires attributes to have the same name. The DataJoint approach prevents accidental matching of attributes that are homonymous but semantically distinct.

If tables A and B possess attributes with identical names that do not satisfy the second condition (i.e., they do not share the same lineage), these attributes are considered to “collide.” Such a semantic mismatch renders the binary operator invalid, and an error SHALL be raised. To resolve semantic mismatches, the projection operator (.proj()) MUST be applied to one or both operands to rename or remove the colliding attributes prior to the binary operation.
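As a hypothetical illustration, suppose Student and Department each define an unrelated attribute name. The expression Student * Department would raise a semantic-mismatch error; renaming one side resolves the collision:

# Rename the colliding attribute before joining.
result = Student * Department.proj(department_name='name')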

Algebraic Closure

A fundamental property of DataJoint’s query system is algebraic closure. This means that the result of any valid query expression is itself a well-formed relational table. This output table:

The principle of algebraic closure allows complex queries to be constructed by composing simpler query expressions, where the output of one operation serves as a valid input for the next. Each operator section above specifies how the primary key and entity type of its result are determined, preserving this closure.

Universal Sets (dj.U())

Universal sets, denoted by dj.U(...), are symbolic constructs representing the set of all possible values for a specified list of attributes, or a singular universal context if no attributes are specified. They are not directly fetchable tables but serve as operands in query expressions.

Applications of Universal Sets

  1. Projecting Unique Values: When restricted by an existing table, dj.U(<attributes>) combined with that table results in the distinct values of <attributes> present in the table.

    # Retrieve all unique last names from the Student table.
    unique_last_names = dj.U('last_name') & Student
    
    # Retrieve all unique full names from the Student table.
    unique_full_names = dj.U('first_name', 'last_name') & Student
  2. Universal Aggregation: dj.U() (with no arguments) serves as the grouping entity for aggregations that span all rows of a table, rather than grouping by the primary key of another table.

    # Count the total number of students.
    total_student_count = dj.U().aggr(Student, n_students='COUNT(student_id)')
    # The result is a table with one row and one attribute 'n_students'.
    # The primary key of this result is the empty set.
  3. Aggregation by arbitrary groupings: dj.U(<attributes>) creates a new grouping entity with an arbitrary primary key, for use in aggregations where no existing entity type provides the appropriate grouping.

    # count how many students were born in each year and month
    student_counts = dj.U('year_of_birth', 'month_of_birth').aggr(
        Student.proj(
            year_of_birth='YEAR(date_of_birth)',
            month_of_birth='MONTH(date_of_birth)'),
        n_students='COUNT(*)'
    )

    In this case, the rules of semantic matching are lifted.


Computation

DataJoint integrates computation as an intrinsic component of its data model. Certain tables, designated as auto-populated (i.e., dj.Imported or dj.Computed tiers), represent the results of computations rather than directly ingested data. The population of these tables is governed by methods defined within their corresponding class definitions.

Key characteristics of automated computation in DataJoint include:

This model is designed to ensure data consistency, integrity, and computational reproducibility throughout the data pipeline.

The make Method: Defining Computational Logic

Auto-populated tables (i.e., tables of tier dj.Imported or dj.Computed) MUST implement a method named make, which encapsulates the logic for populating a single entry (or a master entry and its associated part entries) in the table.

Method Signature:

def make(self, key, **make_opts) -> None:

The key argument is a dictionary representing a single primary key value from the table’s key source, identifying the specific entity for which the computation is to be performed. Optional make_opts MAY be passed to customize the behavior of the make method.

The execution of a make method for a given key typically involves three principal steps:

  1. Fetch Data: Retrieve necessary input data from upstream (parent) tables, restricted by the provided key.

  2. Compute Results: Process the fetched data to generate the required output values for the attributes of the current table and its part tables (if any).

  3. Insert Results: Store the computed results into the current table (and its part tables, if applicable) using the self.insert1() method (and self.Part.insert() for any Part tables). The inserted record MUST include the attributes from the input key along with the newly computed attributes.

Example: Computing an Average Signal

@schema
class SignalAverage(dj.Computed):
    definition = """
    -> RawSignal
    ---
    avg_signal: float
    """

    def make(self, key: dict) -> None:
        # Step 1: Fetch input data from upstream table.
        raw_signal_value = (RawSignal & key).fetch1("signal")

        # Step 2: Compute the result.
        average_signal_value = raw_signal_value.mean()

        # Step 3: Insert the computed result.
        self.insert1({**key, "avg_signal": average_signal_value})

This structured approach ensures that all computed data are traceable to their inputs and the specific computational logic applied.

Each invocation of the make method for a single key SHALL be executed within a single, ACID-compliant database transaction. This guarantees:

  • Atomicity: all inserts performed by one make call (including inserts into part tables) either commit together or leave the database unchanged.

  • Consistency and isolation: partially computed results never become visible to other processes, and the populated entry reflects the upstream data from which it was computed.

Handling Long-Running Computations

For computations that are expected to be long-running (potentially hours or days), maintaining an open database transaction for the entire duration of the make method can lead to operational issues, such as blocking other database operations or exceeding transaction timeout limits.

To address this, DataJoint supports a deferred transaction verification mechanism. This involves separating the data fetching and computation from the final data insertion and verification, minimizing the duration of the database transaction. This is achieved by implementing three distinct methods instead of a single make method:

  1. make_fetch(self, key: dict) -> object:

    • Retrieves all necessary input data from the database based on key.

    • This method executes outside the main insertion transaction.

    • The returned fetched_data object is passed to make_compute.

  2. make_compute(self, key: dict, fetched_data: object) -> object:

    • Performs the primary computation using the fetched_data.

    • This method executes outside any database transaction.

    • The returned computed_results object is passed to make_insert.

  3. make_insert(self, key: dict, fetched_data: object, computed_results: object) -> None:

    • This method executes within a minimal database transaction.

    • Prior to calling make_insert, DataJoint SHOULD re-fetch the input data (by calling make_fetch a second time) to verify that inputs have not been modified since the initial fetch.

    • If inputs are verified to be consistent, DataJoint calls make_insert.

    • If inputs have changed, the transaction SHOULD be rolled back and the computation cancelled, leaving the entry to be recomputed on a subsequent attempt.

The same example from above will appear as follows:

Example: Computing an Average Signal with short transaction time -- three-part make method

@schema
class SignalAverage(dj.Computed):
    definition = """
    -> RawSignal
    ---
    avg_signal: float
    """

    def make_fetch(self, key: dict) -> tuple:
        # Step 1: Fetch input data from upstream table.
        raw_signal_value = (RawSignal & key).fetch1("signal")
        return (raw_signal_value,)

    def make_compute(self, key: dict, fetched: tuple) -> tuple:
        raw_signal_value, = fetched
        # Step 2: Compute the result.
        average_signal_value = raw_signal_value.mean()
        return (average_signal_value,)

    def make_insert(self, key: dict, fetched: tuple, computed: tuple) -> None:
        average_signal_value, = computed
        # Step 3: Insert the computed result.
        self.insert1({**key, "avg_signal": average_signal_value})

An alternative implementation is to write make as a generator that yields the fetched data and the computed results. This achieves the same three-part separation more succinctly.

The example below illustrates this approach:

Example: Computing an Average Signal with short transaction time -- Generator Implementation

@schema
class SignalAverage(dj.Computed):
    definition = """
    -> RawSignal
    ---
    avg_signal: float
    """

    def make(self, key: dict) -> None:
        # Step 1: Fetch input data from upstream table.
        raw_signal_value = (RawSignal & key).fetch1("signal")
        computed = yield raw_signal_value,   # yield the fetched data and receive computed results

        if computed is None:
            # Step 2: Compute the result.
            average_signal_value = raw_signal_value.mean()
            computed = average_signal_value,
            yield computed  # yield the computed results

        # Step 3: Insert the computed result.
        average_signal_value, = computed
        self.insert1({**key, "avg_signal": average_signal_value})
        yield  # signals the end of computation

DataJoint’s internal logic will automatically call the three methods in the correct order, as illustrated below.

Pseudocode for Deferred Transaction Handling:

fetched = self.make_fetch(key)
computed = self.make_compute(key, fetched)

<begin transaction>
fetched_again = self.make_fetch(key)

if fetched != fetched_again:   # performs a deep comparison of the two tuples
    <rollback transaction>
else:
    self.make_insert(key, fetched, computed)
    <commit transaction>

This strategy minimizes the time database locks are held, at the cost of fetching the input data twice. It is recommended for computations whose compute time significantly exceeds the data-fetching time. It is the user’s responsibility to segregate the functionality into the three distinct parts (all fetches in make_fetch, all computations in make_compute, and all inserts in make_insert); future versions of this standard MAY enforce this segregation.

Key Source: Determining Entries for Computation

The key source for an auto-populated table defines the set of primary key values for which the make method needs to be executed. DataJoint automatically determines the key source.

Example: Key Source for a Computed Table Consider a computed table ProcessedImage that processes images recorded by different methods:

@schema
class ProcessedImage(dj.Computed):
    definition = """
    -> acq.Image
    -> ProcessingMethod
    ---
    processed_data: blob
    """

The key source in this case is:

acq.Image.proj() * ProcessingMethod.proj() - ProcessedImage

This expression identifies all combinations of Image and ProcessingMethod primary keys that do not yet have a corresponding entry in ProcessedImage.

Computed vs Imported Tables

Auto-populated tables are primarily categorized into two tiers: dj.Computed and dj.Imported. The distinction lies in the origin of their input data and the implications for reproducibility:

| Tier | Purpose | Reproducibility | Data Source |
|---|---|---|---|
| Computed (dj.Computed) | Fully reproducible computations | ✅ Guaranteed | Uses only pipeline data |
| Imported (dj.Imported) | Data ingested from external sources | ❌ Not guaranteed | Reads from external sources (e.g., instruments, APIs) |

While both tiers use the make (or make_fetch/make_compute/make_insert) method for populating entries, dj.Computed tables are designed to ensure that, given the same upstream pipeline data and the same computational code, the results are identical. dj.Imported tables introduce external dependencies, meaning reproducibility also relies on the stability and versioning of these external sources.

Summary of automated computing

By structuring computation within DataJoint pipelines, researchers can build efficient, reproducible, and scalable data workflows.