The populate operation is the engine of workflow automation in DataJoint.
While insert, delete, and update are operations for Manual tables, populate automates data entry for Imported and Computed tables based on the dependencies defined in the schema.
As introduced in Workflow Operations, the distinction between external and automatic data entry maps directly to table tiers:
| Table Tier | Data Entry Method |
|---|---|
| Lookup | contents property (part of schema) |
| Manual | insert from external sources |
| Imported | Automatic populate |
| Computed | Automatic populate |
This chapter shows how populate transforms the schema’s dependency graph into executable computations.
The Relational Workflow Model in Action¶
Recall that the Relational Workflow Model is built on four fundamental concepts:
Workflow Entity — Each table represents an entity type created at a specific workflow step
Workflow Dependencies — Foreign keys prescribe the order of operations
Workflow Steps — Distinct phases where entity types are created (manual or automated)
Directed Acyclic Graph (DAG) — The schema forms a graph structure ensuring valid execution sequences
The Relational Workflow Model defines a new class of databases: Computational Databases, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.
From Declarative Schema to Executable Pipeline¶
A DataJoint schema uses table tiers to distinguish different workflow roles:
| Tier | Color | Role in Workflow |
|---|---|---|
| Lookup | Gray | Static reference data and configuration parameters |
| Manual | Green | Data from external systems or human entry |
| Imported | Blue | Data acquired automatically from instruments or files |
| Computed | Red | Derived data produced by computational transformations |
Because dependencies are explicit through foreign keys, DataJoint’s populate() method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table’s make() method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving computational validity—the guarantee that all derived data remains consistent with its upstream dependencies.
This is the essence of workflow automation: each table advertises what it depends on, and populate() runs only the computations that are still missing.
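To make this concrete, here is a minimal declaration sketch of the blob-detection schema used in the case study later in this chapter. The schema name, the parameter and image attributes, and the n_blobs summary attribute are illustrative assumptions; only the table names and key attribute names come from the example.

```python
import datajoint as dj

schema = dj.Schema("blob_detection")  # schema name assumed for illustration


@schema
class BlobParamSet(dj.Lookup):
    definition = """
    blob_paramset : varchar(16)  # name of a detection configuration
    ---
    threshold : float            # illustrative parameter
    """
    # Lookup contents are part of the schema definition itself
    contents = [("default", 0.5), ("sensitive", 0.2)]


@schema
class Image(dj.Manual):
    definition = """
    image_id : int
    ---
    image : longblob             # NumPy array inserted from external sources
    """


@schema
class Detection(dj.Computed):
    # The primary key is fully inherited from the upstream tables, so
    # populate() can determine exactly which (image, paramset) pairs
    # still need to be computed.
    definition = """
    -> Image
    -> BlobParamSet
    ---
    n_blobs : int                # illustrative summary attribute
    """

    def make(self, key):
        ...  # computational logic; sketched in the sections below
```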
The populate Method¶
The populate() method is the engine of workflow automation. When called on a computed or imported table, it:
Identifies missing work — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table
Iterates over pending keys — For each missing key, calls the table's make() method
Wraps each make() in a transaction — Ensures atomicity: either all inserts succeed or none do
Handles errors gracefully — Failed jobs are logged but do not stop the remaining work
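The pseudocode below sketches the logic of these four steps; it is a simplified illustration, not DataJoint's actual implementation.

```python
def populate_sketch(table):
    """Illustration only: the control flow behind table.populate()."""
    # 1. Identify missing work: upstream keys minus keys already in the table
    pending = table.key_source - table.proj()
    # 2. Iterate over the pending keys
    for key in pending.fetch("KEY"):
        try:
            # 3. Wrap each make() call in a transaction for atomicity
            with table.connection.transaction:
                table.make(key)
        except Exception as err:
            # 4. Log the failure and continue with the remaining keys
            print(f"populate: job {key} failed: {err}")
```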
```python
# Process all pending work
Detection.populate(display_progress=True)

# Process a specific subset
Detection.populate(Image & "image_id < 10")

# Distribute across workers
Detection.populate(reserve_jobs=True)
```

The reserve_jobs=True option enables parallel execution across multiple processes or machines by using the database itself for job coordination.
The make Method¶
The make() method defines the computational logic for each entry.
It receives a key dictionary identifying which entity to compute and must fetch inputs, compute results, and insert them into the table.
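A minimal make() for the Detection table sketched earlier might look like the following; the fetched attribute names and the detect_blobs helper are placeholders rather than part of DataJoint or the original example.

```python
def make(self, key):
    # 1. Fetch: retrieve the inputs identified by `key`
    image = (Image & key).fetch1("image")
    params = (BlobParamSet & key).fetch1()
    # 2. Compute: run the analysis outside the database
    blobs = detect_blobs(image, threshold=params["threshold"])  # placeholder function
    # 3. Insert: store the results under the same primary key
    self.insert1(dict(key, n_blobs=len(blobs)))
```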
See the dedicated make Method chapter for:
The three-part anatomy (fetch, compute, insert)
Restrictions on auto-populated tables
The three-part pattern for long-running computations
Transaction handling strategies
Schema Dimensions and the Key Source¶
Auto-populated tables have a fundamental constraint: they cannot introduce new schema dimensions. A schema dimension is created when a table defines a new primary key attribute directly (see Primary Keys). For Computed and Imported tables, the primary key must be fully determined by foreign keys to upstream dependencies.
This constraint is what makes the key source well-defined. The key source is computed as:
```
key_source = (join of all primary-key dependencies).proj() - Table
```

In other words: take the Cartesian product of all upstream tables referenced in the primary key, project to just the primary key attributes, and subtract the entries already present in the table. The result is the set of pending work items.
Because auto-populated tables cannot add new dimensions, each key in the key source corresponds to exactly one make() call. The computation receives a fully-specified key and produces results for that key.
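For the case study below, the same pending-work computation can be written directly in DataJoint's query algebra (a conceptual sketch using the tables declared earlier):

```python
# Join the upstream tables in the primary key, keep only key attributes,
# and remove combinations that already have results.
pending = (Image * BlobParamSet).proj() - Detection
# Each row of `pending` is one key that populate() will hand to make().
```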
What if a computation produces multiple outputs? Use part tables. Part tables can introduce new dimensions. For example, a blob detection algorithm might find 200 blobs in one image—the Detection master cannot introduce a blob_id dimension, but Detection.Blob (the part table) can.
Transactional Integrity¶
Each make() call executes inside an ACID transaction. This provides critical guarantees for computational workflows:
Atomicity — The entire computation either commits or rolls back as a unit
Isolation — Partial results are never visible to other processes
Consistency — The database moves from one valid state to another
When a computed table has part tables, the transaction boundary encompasses both the master and all its parts. The master’s make() method is responsible for inserting everything within a single transactional scope. See the Master-Part chapter for detailed coverage of ACID semantics and the master’s responsibility pattern.
Case Study: Blob Detection¶
The Blob Detection example demonstrates these concepts in a compact image-analysis workflow:
Source data — Image (manual) stores NumPy arrays as longblob fields
Parameter space — BlobParamSet (lookup) defines detection configurations via contents
Computation — Detection (computed) depends on both upstream tables
The Detection table uses a master-part structure: the master row stores an aggregate (blob count), while Detection.Blob parts store per-feature coordinates. Notice how schema dimensions work here:
Detection inherits its primary key entirely from Image and BlobParamSet — it cannot introduce new dimensions
Detection.Blob introduces the blob_id dimension to identify individual blobs within each detection
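A sketch of this master-part structure, extending the earlier declaration (the per-blob coordinate attributes and the detect_blobs helper are illustrative assumptions):

```python
@schema
class Detection(dj.Computed):
    definition = """
    -> Image
    -> BlobParamSet
    ---
    n_blobs : int        # aggregate stored in the master (name assumed)
    """

    class Blob(dj.Part):
        definition = """
        -> master
        blob_id : int    # new dimension introduced by the part table
        ---
        x : float        # illustrative per-blob coordinates
        y : float
        """

    def make(self, key):
        # Fetch inputs
        image = (Image & key).fetch1("image")
        params = (BlobParamSet & key).fetch1()
        # Compute: one detection may yield many blobs
        blobs = detect_blobs(image, threshold=params["threshold"])  # placeholder
        # Insert master and parts together; populate() wraps this call
        # in a single transaction
        self.insert1(dict(key, n_blobs=len(blobs)))
        self.Blob.insert(
            dict(key, blob_id=i, x=bx, y=by) for i, (bx, by) in enumerate(blobs)
        )
```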
When populate() runs:
Each (image_id, blob_paramset) combination triggers one make() call
The make() method fetches inputs, runs detection, and inserts both master and parts
The transaction ensures all blob coordinates appear atomically with their count
```python
Detection.populate(display_progress=True)
# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]
```

This pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the SelectDetection manual table records the preferred parameter set for each image. Because SelectDetection depends on Detection, it implicitly has access to all Detection.Blob parts for the selected detection.
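One plausible declaration for this curation step (the original example does not show its definition, so the exact structure is an assumption):

```python
@schema
class SelectDetection(dj.Manual):
    definition = """
    -> Image          # one selection per image
    ---
    -> Detection      # the preferred parameter set chosen for that image
    """
```

Placing the foreign key to Detection below the separator keeps the primary key at one row per image while recording which (image_id, blob_paramset) detection was selected.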
Why Computational Databases Matter¶
The Relational Workflow Model provides several key benefits:
| Benefit | Description |
|---|---|
| Reproducibility | Rerunning populate() regenerates derived tables from raw inputs |
| Dependency-aware scheduling | DataJoint infers job order from foreign keys (the DAG structure) |
| Computational validity | Transactions ensure downstream results stay consistent with upstream inputs |
| Provenance tracking | The schema documents what was computed from what |
Practical Tips¶
Develop incrementally — Test make() logic with restrictions (e.g., Table.populate(restriction)) before processing all data
Monitor progress — Use display_progress=True for visibility during development
Distribute work — Use reserve_jobs=True when running multiple workers
Use master-part for multi-row results — When a computation produces multiple output entities, use part tables to introduce new dimensions while keeping everything in a single transaction
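A short sequence tying these tips together, assuming the schema sketched above (the restriction value is illustrative; suppress_errors tells populate to log failures and continue):

```python
# Develop: exercise make() on a small restriction with visible progress
Detection.populate(Image & "image_id < 3", display_progress=True)

# Deploy: run this same line in several worker processes; reserve_jobs
# coordinates the workers through the database, and suppress_errors keeps
# one failed key from stopping the rest of the queue
Detection.populate(reserve_jobs=True, suppress_errors=True)
```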