Populate

The populate operation is the engine of workflow automation in DataJoint. While insert, delete, and update are operations for Manual tables, populate automates data entry for Imported and Computed tables based on the dependencies defined in the schema.

As introduced in Workflow Operations, the distinction between external and automatic data entry maps directly to table tiers:

| Table Tier | Data Entry Method |
|---|---|
| Lookup | contents property (part of schema) |
| Manual | insert from external sources |
| Imported | Automatic populate |
| Computed | Automatic populate |

This chapter shows how populate transforms the schema’s dependency graph into executable computations.

The Relational Workflow Model in Action

Recall that the Relational Workflow Model is built on four fundamental concepts:

  1. Workflow Entity — Each table represents an entity type created at a specific workflow step

  2. Workflow Dependencies — Foreign keys prescribe the order of operations

  3. Workflow Steps — Distinct phases where entity types are created (manual or automated)

  4. Directed Acyclic Graph (DAG) — The schema forms a graph structure ensuring valid execution sequences

The Relational Workflow Model defines a new class of databases: Computational Databases, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.

From Declarative Schema to Executable Pipeline

A DataJoint schema uses table tiers to distinguish different workflow roles:

| Tier | Color | Role in Workflow |
|---|---|---|
| Lookup | Gray | Static reference data and configuration parameters |
| Manual | Green | Data from external systems or human entry |
| Imported | Blue | Data acquired automatically from instruments or files |
| Computed | Red | Derived data produced by computational transformations |

Because dependencies are explicit through foreign keys, DataJoint’s populate() method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table’s make() method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving computational validity—the guarantee that all derived data remains consistent with its upstream dependencies.

This is the essence of workflow automation: each table advertises what it depends on, and populate() runs only the computations that are still missing.

The populate Method

The populate() method is the engine of workflow automation. When called on a Computed or Imported table, it:

  1. Identifies missing work — Queries the key source (the join of the upstream tables referenced in the primary key) and subtracts keys already present in the table

  2. Iterates over pending keys — For each missing key, calls the table’s make() method

  3. Wraps each make() in a transaction — Ensures atomicity: either all inserts succeed or none do

  4. Handles errors gracefully — Failed jobs are logged but do not stop the remaining work

# Process all pending work
Detection.populate(display_progress=True)

# Process a specific subset
Detection.populate(Image & "image_id < 10")

# Distribute across workers
Detection.populate(reserve_jobs=True)

The reserve_jobs=True option enables parallel execution across multiple processes or machines by using the database itself for job coordination.
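
For example, several workers can run the same call concurrently; job reservation in the database prevents two workers from computing the same key. The sketch below uses standard populate options (order, suppress_errors) from the DataJoint Python API; exact option availability may vary by version.

# Run this same snippet in each worker process or on each machine.
# The schema's jobs table coordinates which worker claims which key.
Detection.populate(
    reserve_jobs=True,      # claim each pending key before computing it
    order="random",         # reduce contention between workers
    suppress_errors=True,   # log failures and continue with remaining keys
    display_progress=True,
)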

The make Method

The make() method defines the computational logic for each entry. It receives a key dictionary identifying which entity to compute and must fetch inputs, compute results, and insert them into the table.
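
As a preview, a minimal make() follows this shape. The sketch below uses hypothetical tables (Recording, Activity) and assumes a schema object is already defined; it is not the case-study code.

import datajoint as dj

@schema
class Activity(dj.Computed):
    definition = """
    -> Recording
    ---
    mean_signal : float
    """

    def make(self, key):
        # 1. Fetch: retrieve the inputs identified by the key
        signal = (Recording & key).fetch1("signal")

        # 2. Compute: derive the result
        mean_signal = float(signal.mean())

        # 3. Insert: store the result under the same key
        self.insert1(dict(key, mean_signal=mean_signal))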

See the dedicated make Method chapter for:

  • The three-part anatomy (fetch, compute, insert)

  • Restrictions on auto-populated tables

  • The three-part pattern for long-running computations

  • Transaction handling strategies

Schema Dimensions and the Key Source

Auto-populated tables have a fundamental constraint: they cannot introduce new schema dimensions. A schema dimension is created when a table defines a new primary key attribute directly (see Primary Keys). For Computed and Imported tables, the primary key must be fully determined by foreign keys to upstream dependencies.

This constraint is what makes the key source well-defined. The key source is computed as:

key_source = (join of all primary-key dependencies).proj() - Table

In other words: take the join of all upstream tables referenced in the primary key (a Cartesian product when those tables share no attributes), project to just the primary key attributes, and subtract the entries already present in the table. The result is the set of pending work items.
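
In the DataJoint Python API, key_source is exposed as a property of auto-populated tables, so the pending work can be inspected directly (shown here for the Detection table from the case study below):

# Keys that populate() has not yet processed
pending = (Detection.key_source - Detection).fetch("KEY")
print(f"{len(pending)} keys remaining")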

Because auto-populated tables cannot add new dimensions, each key in the key source corresponds to exactly one make() call. The computation receives a fully-specified key and produces results for that key.

What if a computation produces multiple outputs? Use part tables. Part tables can introduce new dimensions. For example, a blob detection algorithm might find 200 blobs in one image—the Detection master cannot introduce a blob_id dimension, but Detection.Blob (the part table) can.
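
For illustration, such a master-part pair might be declared as in the sketch below; attribute names like blob_count, x, and y are placeholders rather than the case study's exact definitions.

@schema
class Detection(dj.Computed):
    definition = """
    -> Image
    -> BlobParamSet
    ---
    blob_count : int        # aggregate stored in the master row
    """

    class Blob(dj.Part):
        definition = """
        -> master
        blob_id : int       # new dimension introduced by the part
        ---
        x : float           # blob coordinates
        y : float
        """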

Transactional Integrity

Each make() call executes inside an ACID transaction. This provides critical guarantees for computational workflows:

  • Atomicity — The entire computation either commits or rolls back as a unit

  • Isolation — Partial results are never visible to other processes

  • Consistency — The database moves from one valid state to another

When a computed table has part tables, the transaction boundary encompasses both the master and all its parts. The master’s make() method is responsible for inserting everything within a single transactional scope. See the Master-Part chapter for detailed coverage of ACID semantics and the master’s responsibility pattern.
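
In code, that responsibility might look like the sketch below: the master's make() inserts its own row and all part rows before the transaction commits. The detect_blobs routine and the attribute names are placeholders.

def make(self, key):
    # Fetch the inputs identified by the key
    img = (Image & key).fetch1("image")          # placeholder attribute name
    params = (BlobParamSet & key).fetch1()

    # Compute: placeholder routine returning a list of (x, y) coordinates
    blobs = detect_blobs(img, params)

    # Insert master first, then all parts -- same transaction, single make() call
    self.insert1(dict(key, blob_count=len(blobs)))
    self.Blob.insert(
        dict(key, blob_id=i, x=x, y=y) for i, (x, y) in enumerate(blobs)
    )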

Case Study: Blob Detection

The Blob Detection example demonstrates these concepts in a compact image-analysis workflow:

  1. Source data — Image (manual) stores NumPy arrays as longblob fields

  2. Parameter space — BlobParamSet (lookup) defines detection configurations via contents

  3. Computation — Detection (computed) depends on both upstream tables

The Detection table uses a master-part structure: the master row stores an aggregate (blob count), while Detection.Blob parts store per-feature coordinates. Notice how schema dimensions work here:

  • Detection inherits its primary key entirely from Image and BlobParamSet—it cannot introduce new dimensions

  • Detection.Blob introduces the blob_id dimension to identify individual blobs within each detection

When populate() runs:

  • Each (image_id, blob_paramset) combination triggers one make() call

  • The make() method fetches inputs, runs detection, and inserts both master and parts

  • The transaction ensures all blob coordinates appear atomically with their count

Detection.populate(display_progress=True)
# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]

This pattern—automated computation over every parameter combination, followed by human curation—is common in scientific workflows. After reviewing results, the SelectDetection manual table records the preferred parameter set for each image. Because SelectDetection depends on Detection, it implicitly has access to all Detection.Blob parts for the selected detection.
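
For instance, restricting the part table by the curation table yields only the blobs belonging to each image's selected parameter set (a sketch using the tables named above):

# Blobs from the detection each image's curator selected
curated_blobs = Detection.Blob & SelectDetection
curated_blobs.fetch(format="frame")   # as a pandas DataFrame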

Why Computational Databases Matter

The Relational Workflow Model provides several key benefits:

| Benefit | Description |
|---|---|
| Reproducibility | Rerunning populate() regenerates derived tables from raw inputs |
| Dependency-aware scheduling | DataJoint infers job order from foreign keys (the DAG structure) |
| Computational validity | Transactions ensure downstream results stay consistent with upstream inputs |
| Provenance tracking | The schema documents what was computed from what |

Practical Tips

  • Develop incrementally — Test make() logic with restrictions (e.g., Table.populate(restriction)) before processing all data

  • Monitor progress — Use display_progress=True for visibility during development

  • Distribute work — Use reserve_jobs=True when running multiple workers

  • Use master-part for multi-row results — When a computation produces multiple output entities, use part tables to introduce new dimensions while keeping everything in a single transaction