Populate

The populate operation is the engine of workflow automation in DataJoint. While insert, delete, and update are operations for Manual tables, populate automates data entry for Imported and Computed tables based on the dependencies defined in the schema.

As introduced in Workflow Operations, the distinction between external and automatic data entry maps directly to table tiers:

| Table Tier | Data Entry Method |
|---|---|
| Lookup | contents property (part of schema) |
| Manual | insert from external sources |
| Imported | Automatic populate |
| Computed | Automatic populate |

This chapter shows how populate transforms the schema’s dependency graph into executable computations.

The Relational Workflow Model in Action

Recall that the Relational Workflow Model is built on four fundamental concepts:

  1. Workflow Entity — Each table represents an entity type created at a specific workflow step

  2. Workflow Dependencies — Foreign keys prescribe the order of operations

  3. Workflow Steps — Distinct phases where entity types are created (manual or automated)

  4. Directed Acyclic Graph (DAG) — The schema forms a graph structure ensuring valid execution sequences

The Relational Workflow Model defines a new class of databases: Computational Databases, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.

From Declarative Schema to Executable Pipeline

A DataJoint schema uses table tiers to distinguish different workflow roles:

| Tier | Color | Role in Workflow |
|---|---|---|
| Lookup | Gray | Static reference data and configuration parameters |
| Manual | Green | Data from external systems or human entry |
| Imported | Blue | Data acquired automatically from instruments or files |
| Computed | Red | Derived data produced by computational transformations |

Because dependencies are explicit through foreign keys, DataJoint’s populate() method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table’s make() method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving computational validity—the guarantee that all derived data remains consistent with its upstream dependencies.

This is the essence of workflow automation: each table advertises what it depends on, and populate() runs only the computations that are still missing.

The populate Method

The populate() method is the engine of workflow automation. When called on a Computed or Imported table, it:

  1. Identifies missing work — Queries the key source (the join of the upstream tables referenced in the primary key) and subtracts keys already present in the table

  2. Iterates over pending keys — For each missing key, calls the table’s make() method

  3. Wraps each make() in a transaction — Ensures atomicity: either all inserts succeed or none do

  4. Handles errors gracefully — Failed jobs are logged but do not stop the remaining work

# Process all pending work
Detection.populate(display_progress=True)

# Process a specific subset
Detection.populate(Image & "image_id < 10")

# Distribute across workers
Detection.populate(reserve_jobs=True)

The reserve_jobs=True option enables parallel execution across multiple processes or machines by using the database itself for job coordination.
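
For example, several workers can run the same call concurrently; job reservation in the database prevents two workers from computing the same key. The sketch below uses standard populate options (order, suppress_errors) from the DataJoint Python API; exact option availability may vary by version.

# Run this same snippet in each worker process or on each machine.
# The schema's jobs table coordinates which worker claims which key.
Detection.populate(
    reserve_jobs=True,      # claim each pending key before computing it
    order="random",         # reduce contention between workers
    suppress_errors=True,   # log failures and continue with remaining keys
    display_progress=True,
)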

The make Method

The make() method defines the computational logic for each entry. It receives a key dictionary identifying which entity to compute and must fetch inputs, compute results, and insert them into the table.
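
As a preview, a minimal make() follows this shape. The sketch below uses hypothetical tables (Recording, Activity) and assumes a schema object is already defined; it is not the case-study code.

import datajoint as dj

@schema
class Activity(dj.Computed):
    definition = """
    -> Recording
    ---
    mean_signal : float
    """

    def make(self, key):
        # 1. Fetch: retrieve the inputs identified by the key
        signal = (Recording & key).fetch1("signal")

        # 2. Compute: derive the result
        mean_signal = float(signal.mean())

        # 3. Insert: store the result under the same key
        self.insert1(dict(key, mean_signal=mean_signal))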

See the dedicated make Method chapter for:

  • The three-part anatomy (fetch, compute, insert)

  • Restrictions on auto-populated tables

  • The three-part pattern for long-running computations

  • Transaction handling strategies

Schema Dimensions and the Key Source

Auto-populated tables have a fundamental constraint: they cannot introduce new schema dimensions. A schema dimension is created when a table defines a new primary key attribute directly (see Primary Keys). For Computed and Imported tables, the primary key must be fully determined by foreign keys to upstream dependencies.

This constraint is what makes the key source well-defined. The key source is computed as:

key_source = (join of all primary-key dependencies).proj() - Table

In other words: take the join of all upstream tables referenced in the primary key (a Cartesian product when those tables share no attributes), project to just the primary key attributes, and subtract the entries already present in the table. The result is the set of pending work items.
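
In the DataJoint Python API, key_source is exposed as a property of auto-populated tables, so the pending work can be inspected directly (shown here for the Detection table from the case study below):

# Keys that populate() has not yet processed
pending = (Detection.key_source - Detection).fetch("KEY")
print(f"{len(pending)} keys remaining")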

Because auto-populated tables cannot add new dimensions, each key in the key source corresponds to exactly one make() call. The computation receives a fully-specified key and produces results for that key.

What if a computation produces multiple outputs? Use part tables. Part tables can introduce new dimensions. For example, a blob detection algorithm might find 200 blobs in one image—the Detection master cannot introduce a blob_id dimension, but Detection.Blob (the part table) can.
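
For illustration, such a master-part pair might be declared as in the sketch below; attribute names like blob_count, x, and y are placeholders rather than the case study's exact definitions.

@schema
class Detection(dj.Computed):
    definition = """
    -> Image
    -> BlobParamSet
    ---
    blob_count : int        # aggregate stored in the master row
    """

    class Blob(dj.Part):
        definition = """
        -> master
        blob_id : int       # new dimension introduced by the part
        ---
        x : float           # blob coordinates
        y : float
        """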

Transactional Integrity

Each make() call executes inside an ACID transaction. This provides critical guarantees for computational workflows:

  • Atomicity — The entire computation either commits or rolls back as a unit

  • Isolation — Partial results are never visible to other processes

  • Consistency — The database moves from one valid state to another

When a computed table has part tables, the transaction boundary encompasses both the master and all its parts. The master’s make() method is responsible for inserting everything within a single transactional scope. See the Master-Part chapter for detailed coverage of ACID semantics and the master’s responsibility pattern.
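
In code, that responsibility might look like the sketch below: the master's make() inserts its own row and all part rows before the transaction commits. The detect_blobs routine and the attribute names are placeholders.

def make(self, key):
    # Fetch the inputs identified by the key
    img = (Image & key).fetch1("image")          # placeholder attribute name
    params = (BlobParamSet & key).fetch1()

    # Compute: placeholder routine returning a list of (x, y) coordinates
    blobs = detect_blobs(img, params)

    # Insert master first, then all parts -- same transaction, single make() call
    self.insert1(dict(key, blob_count=len(blobs)))
    self.Blob.insert(
        dict(key, blob_id=i, x=x, y=y) for i, (x, y) in enumerate(blobs)
    )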

Case Study: Blob Detection

The Blob Detection example demonstrates these concepts in a compact image-analysis workflow:

  1. Source data — Image (manual) stores NumPy arrays as longblob fields

  2. Parameter space — BlobParamSet (lookup) defines detection configurations via contents

  3. Computation — Detection (computed) depends on both upstream tables

The Detection table uses a master-part structure: the master row stores an aggregate (blob count), while Detection.Blob parts store per-feature coordinates. Notice how schema dimensions work here:

  • Detection inherits its primary key entirely from Image and BlobParamSet—it cannot introduce new dimensions

  • Detection.Blob introduces the blob_id dimension to identify individual blobs within each detection

When populate() runs:

  • Each (image_id, blob_paramset) combination triggers one make() call

  • The make() method fetches inputs, runs detection, and inserts both master and parts

  • The transaction ensures all blob coordinates appear atomically with their count

Detection.populate(display_progress=True)
# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]

This pattern—automated computation over every parameter combination, followed by human curation—is common in scientific workflows. After reviewing results, the SelectDetection manual table records the preferred parameter set for each image. Because SelectDetection depends on Detection, it implicitly has access to all Detection.Blob parts for the selected detection.
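
For instance, restricting the part table by the curation table yields only the blobs belonging to each image's selected parameter set (a sketch using the tables named above):

# Blobs from the detection each image's curator selected
curated_blobs = Detection.Blob & SelectDetection
curated_blobs.fetch(format="frame")   # as a pandas DataFrame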

Why Computational Databases Matter

The Relational Workflow Model provides several key benefits:

| Benefit | Description |
|---|---|
| Reproducibility | Rerunning populate() regenerates derived tables from raw inputs |
| Dependency-aware scheduling | DataJoint infers job order from foreign keys (the DAG structure) |
| Computational validity | Transactions ensure downstream results stay consistent with upstream inputs |
| Provenance tracking | The schema documents what was computed from what |

Practical Tips

  • Develop incrementally — Test make() logic with restrictions (e.g., Table.populate(restriction)) before processing all data

  • Monitor progress — Use display_progress=True for visibility during development

  • Distribute work — Use reserve_jobs=True when running multiple workers

  • Use master-part for multi-row results — When a computation produces multiple output entities, use part tables to introduce new dimensions while keeping everything in a single transaction