Computation as Workflow - The DataJoint Book

Draft — subject to revision

DataJoint reframes databases as workflows: each table declares what it depends on, and DataJoint’s populate() method runs only the computations that are still missing. The Blob-detection Pipeline from the examples chapter shows how this plays out in practice and how it supports a reproducible scientific workflow: repeatable processing and a clear path from primary data to interface.

From Declarative Schema to Executable Pipeline

A DataJoint schema mixes several table roles:

Manual / lookup tables capture authoritative inputs and configuration options.
Computed tables declare derived data and embed the logic that produces it.
Part tables attach one-to-many detail that should always be inserted atomically with their parent.

Because dependencies are explicit, populate() can work out what remains to be done: for every upstream key that has not yet been processed, it calls the table’s make() method inside a transaction. If make() fails, the transaction for that key is rolled back, so no partial results are stored.
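
The loop described above can be sketched in plain Python. This is a toy model, not DataJoint's implementation: the names toy_populate, stored, and staging are hypothetical, a dict stands in for the computed table, and a discarded staging dict stands in for a rolled-back transaction.

```python
# Toy model of a populate()-style loop. NOT DataJoint's implementation;
# `toy_populate`, `stored`, and the make(key, staging) signature are hypothetical.

def toy_populate(upstream_keys, stored, make):
    """Call make() for each upstream key that has no stored result yet.

    `stored` is a dict standing in for the computed table. Each make() call
    writes into a private staging dict, which is merged into `stored` only
    if make() returns normally (commit) and is dropped otherwise (rollback).
    Returns the list of keys whose make() raised.
    """
    failed = []
    for key in upstream_keys:
        if key in stored:
            continue                 # already computed, nothing to do
        staging = {}
        try:
            make(key, staging)       # may raise partway through
        except Exception:
            failed.append(key)       # "rollback": staging is discarded
            continue
        stored.update(staging)       # "commit": results become visible at once
    return failed


def square(key, staging):
    """Example computation that fails for one particular key."""
    if key == 3:
        raise ValueError("simulated failure")
    staging[key] = key * key


results = {}
first_failures = toy_populate([1, 2, 3], results, square)
second_failures = toy_populate([1, 2, 3, 4], results, square)  # only 3 and 4 are attempted
```

The second call skips keys 1 and 2 because their results already exist. Unlike this toy, which collects failures, populate() re-raises the error by default unless suppress_errors=True is passed.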

Case Study: Blob Detection

The notebook 075-blob-detection.ipynb assembles a compact image-analysis workflow:

Store source imagery – Image is a manual table with a longblob field. NumPy arrays fetched from skimage are serialized automatically, illustrating the lecture’s warning that binary payloads need a serializer when you save them in a relational database.
Scan parameter space – BlobParamSet is a lookup table of min/max sigma and threshold values for skimage.feature.blob_doh. Each combination represents an alternative experiment configuration—exactly the “experiment parameters” mindset stressed in class.
Compute detections – Detection depends on both upstream tables. Its part table Detection.Blob holds every circle (x, y, radius) produced by the detector so that master and detail rows stay in sync.

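import datajoint as dj
from skimage.feature import blob_doh

# `schema`, `Image`, and `BlobParamSet` are assumed to be defined earlier in the notebook.

@schema
class Detection(dj.Computed):
    definition = """
    -> Image
    -> BlobParamSet
    ---
    nblobs : int
    """

    class Blob(dj.Part):
        definition = """
        -> master
        blob_id : int
        ---
        x : float
        y : float
        r : float
        """

    def make(self, key):
        # Fetch the image and the detector parameters for this key.
        img = (Image & key).fetch1("image")
        params = (BlobParamSet & key).fetch1()
        # blob_doh returns one (row, col, sigma) triple per blob, i.e. (y, x, r).
        blobs = blob_doh(img,
                         min_sigma=params['min_sigma'],
                         max_sigma=params['max_sigma'],
                         threshold=params['threshold'])
        # Insert the master row first, then its part rows.
        self.insert1(dict(key, nblobs=len(blobs)))
        self.Blob.insert(dict(key, blob_id=i, x=x, y=y, r=r)
                         for i, (y, x, r) in enumerate(blobs))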

Running Detection.populate(display_progress=True) fans out over every (image, paramset) pair, creating six jobs in the demo notebook. Because each make() call runs in its own transaction, half-written results never become visible to other users, which reflects the atomicity and isolation guarantees highlighted in the lecture’s ACID recap.
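
The fan-out can be pictured as a Cartesian product of the upstream primary keys, minus the pairs already computed. The sketch below assumes two images and three parameter sets, which is one split consistent with six jobs; the actual counts and key values in the notebook may differ, and pending_keys is a hypothetical helper.

```python
from itertools import product

# Hypothetical primary keys; the real notebook's values may differ.
image_ids = [1, 2]
paramset_ids = [1, 2, 3]


def pending_keys(image_ids, paramset_ids, done):
    """All (image, paramset) pairs that are not yet in `done`."""
    return [k for k in product(image_ids, paramset_ids) if k not in done]


done = set()
initial_jobs = pending_keys(image_ids, paramset_ids, done)
done.update(initial_jobs)        # pretend populate() finished all of them

image_ids.append(3)              # a new image arrives upstream
new_jobs = pending_keys(image_ids, paramset_ids, done)
```

After the new image is added, only its three pairs remain pending; the six earlier results are left untouched.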

Curate the Preferred Result

After inspecting the plots, a small manual table SelectDetection records the “best” parameter set for each image. That drives a final visualization that renders only the chosen detections. This illustrates a common pattern for the final project: let automation explore the combinatorics, then capture human judgment in a concise manual table. In the presentation, this curated view is what you would surface through Dash, Streamlit, or another GUI toolkit.
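
The curation step amounts to keeping only the detections whose parameter set was chosen for their image. A minimal plain-Python sketch follows; the data values, the dict layouts, and the helper chosen_detections are all hypothetical stand-ins, and in DataJoint this would typically be expressed as a restriction of the detection tables by SelectDetection.

```python
# Hypothetical data: detections[(image_id, paramset_id)] is a list of (x, y, r) circles.
detections = {
    (1, 1): [(10.0, 12.0, 3.0)],
    (1, 2): [(10.0, 12.0, 3.0), (40.0, 8.0, 5.0)],
    (2, 1): [(7.0, 7.0, 2.0)],
    (2, 2): [],
}

# Stand-in for the manual SelectDetection table: one chosen paramset per image.
selected = {1: 2, 2: 1}


def chosen_detections(detections, selected):
    """Keep only the detections whose paramset was picked for their image."""
    return {
        (image_id, paramset_id): circles
        for (image_id, paramset_id), circles in detections.items()
        if selected.get(image_id) == paramset_id
    }


curated = chosen_detections(detections, selected)
```

Here curated holds one entry per image, which is the view a dashboard would render.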

Why It Matters for the Final Project

Reproducibility – every derived row is produced by make() from upstream data, so deleting derived results and rerunning populate() regenerates them from the raw inputs, satisfying the requirement for trustworthy analyses.
Dependency-aware scheduling – you do not need to script job order; DataJoint infers it from foreign keys, exactly as promised in lecture.
Extensibility – adding a new image or parameter set triggers only the necessary new jobs, so the pipeline scales to the “at least six tables” complexity target.

Practical Tips

Develop make() logic with restrictions (e.g., Detection.populate(key)) before unlocking the entire pipeline.
Use display_progress=True when you need visibility; use reserve_jobs=True when distributing work across multiple machines.
If your computed table writes both summary and detail rows, keep the detail rows in a part table so that the master row and its parts are inserted inside the same transaction.
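
The idea behind reserving jobs can be sketched with threads standing in for machines: each worker claims a key atomically before computing it, so no key is processed twice. This is a toy model under stated assumptions, not DataJoint's job-reservation mechanism; the Reservations class and worker function are hypothetical.

```python
import threading


class Reservations:
    """Toy shared registry in which each key can be claimed by one worker only."""

    def __init__(self):
        self._owner = {}
        self._lock = threading.Lock()

    def reserve(self, key, worker_name):
        """Atomically claim `key`; return False if another worker already has it."""
        with self._lock:
            if key in self._owner:
                return False
            self._owner[key] = worker_name
            return True


def worker(name, keys, reservations, done_by_worker):
    """Process every key this worker manages to reserve."""
    mine = done_by_worker[name]      # each worker appends only to its own list
    for key in keys:
        if reservations.reserve(key, name):
            mine.append(key)         # stand-in for running make(key)


keys = [(image_id, paramset_id) for image_id in (1, 2) for paramset_id in (1, 2, 3)]
reservations = Reservations()
names = [f"worker-{i}" for i in range(3)]
done_by_worker = {name: [] for name in names}

threads = [threading.Thread(target=worker, args=(name, keys, reservations, done_by_worker))
           for name in names]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_done = sorted(key for done in done_by_worker.values() for key in done)
```

Which worker ends up with which key varies from run to run, but every key is processed exactly once.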

The blob-detection notebook is a self-contained template: swap in your own raw data source, adjust the parameter search, and you have the skeleton for an end-to-end computational database ready to feed a dashboard demo on presentation day.
