Draft — subject to revision
DataJoint reframes databases as workflows: each table advertises what it depends on, and DataJoint’s populate() method runs only the computations that are still missing. The Blob-detection Pipeline from the examples chapter demonstrates how this plays out in practice and meets the demands of scientific reproducibility: derived results can be regenerated from raw inputs, and there is a clear path from primary data to the final interface.
From Declarative Schema to Executable Pipeline
A DataJoint schema mixes several table roles:
Manual / lookup tables capture authoritative inputs and configuration options.
Computed tables declare derived data and embed the logic that produces it.
Part tables attach one-to-many detail that should always be inserted atomically with their parent.
Because dependencies are explicit, populate() can explore the graph top-down: for every upstream key that has not been processed, it executes the table’s make() method; if anything fails, the transaction is rolled back.
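As a rough illustration of that workflow (the table names follow the case study below, and the `schema` object bound here is an assumption), the typical interaction looks like this:

```python
import datajoint as dj

# Visualize the dependency graph that populate() will traverse.
dj.Diagram(schema)

# Keys still waiting to be computed: everything in the key source
# (the join of the parent tables) minus what is already in Detection.
pending = Detection.key_source - Detection

# Run only the missing work; each make() call runs in its own transaction.
Detection.populate(display_progress=True)
```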
Case Study: Blob Detection
The notebook 075-blob-detection.ipynb assembles a compact image-analysis workflow:
Store source imagery – Image is a manual table with a longblob field. NumPy arrays fetched from skimage are serialized automatically, illustrating the lecture’s warning that binary payloads need a serializer when you save them in a relational database.
Scan parameter space – BlobParamSet is a lookup table of min/max sigma and threshold values for skimage.feature.blob_doh. Each combination represents an alternative experiment configuration—exactly the “experiment parameters” mindset stressed in class. (A sketch of the Image and BlobParamSet definitions appears after this list.)
Compute detections – Detection depends on both upstream tables. Its part table Detection.Blob holds every circle (x, y, radius) produced by the detector so that master and detail rows stay in sync.
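The two upstream tables come first. The sketch below is illustrative rather than a copy of the notebook: the `image` field name and the parameter names come from the make() code further down, while the schema name, the key attributes (`image_id`, `paramset_id`), and the example lookup contents are assumptions.

```python
import datajoint as dj

schema = dj.schema("blob_detection")   # schema name is an assumption


@schema
class Image(dj.Manual):
    definition = """
    image_id : int
    ---
    image    : longblob    # NumPy array, serialized automatically
    """


@schema
class BlobParamSet(dj.Lookup):
    definition = """
    paramset_id : int
    ---
    min_sigma   : float
    max_sigma   : float
    threshold   : float
    """
    # One row per detector configuration to try.
    contents = [
        (0, 1.0, 30.0, 0.01),
        (1, 1.0, 30.0, 0.005),
    ]
```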
```python
from skimage.feature import blob_doh   # determinant-of-Hessian blob detector


@schema
class Detection(dj.Computed):
    definition = """
    -> Image
    -> BlobParamSet
    ---
    nblobs : int
    """

    class Blob(dj.Part):
        definition = """
        -> master
        blob_id : int
        ---
        x : float
        y : float
        r : float
        """

    def make(self, key):
        # Fetch the upstream image and the detector settings for this key.
        img = (Image & key).fetch1("image")
        params = (BlobParamSet & key).fetch1()
        blobs = blob_doh(img,
                         min_sigma=params['min_sigma'],
                         max_sigma=params['max_sigma'],
                         threshold=params['threshold'])
        # The master row and its detail rows are inserted in the same transaction.
        self.insert1(dict(key, nblobs=len(blobs)))
        self.Blob.insert(dict(key, blob_id=i, x=x, y=y, r=r)
                         for i, (x, y, r) in enumerate(blobs))
```

Running Detection.populate(display_progress=True) fans out over every (image, paramset) pair, creating six jobs in the demo notebook. Because each job lives in a transaction, half-written results never leak—one of the isolation guarantees highlighted in the lecture’s ACID recap.
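As a quick check on what populate() produced, a join plus a DataFrame fetch works well. This is a minimal sketch assuming the definitions above; fetch(format="frame") returns a pandas DataFrame.

```python
Detection.populate(display_progress=True)

# One row per (image, paramset) pair, with its blob count and the parameters used.
results = (Detection * BlobParamSet).fetch(format="frame")
print(results)
```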
Curate the Preferred Result
After you inspect the plots, a small manual table SelectDetection records the “best” parameter set for each image and drives a final visualization that renders only the chosen detections. This illustrates a common pattern for the final project: let automation explore the combinatorics, then capture human judgment in a concise manual table (a sketch follows below). In the presentation, this curated view is what you would surface through Dash, Streamlit, or another GUI toolkit.
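A minimal sketch of this curation step, reusing the key names assumed earlier; the notebook’s actual SelectDetection definition may differ:

```python
@schema
class SelectDetection(dj.Manual):
    definition = """
    -> Image
    ---
    -> BlobParamSet
    """


# Record the human judgment, then pull only the curated detections for plotting.
SelectDetection.insert1({"image_id": 1, "paramset_id": 0}, skip_duplicates=True)
chosen = (SelectDetection & {"image_id": 1}).fetch1()
curated_blobs = (Detection.Blob & chosen).fetch(as_dict=True)
```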
Why It Matters for the Final Project
Reproducibility – rerunning populate() regenerates every derived table from raw inputs, satisfying the requirement for trustworthy analyses.
Dependency-aware scheduling – you do not need to script job order; DataJoint infers it from foreign keys, exactly as promised in lecture.
Extensibility – adding a new image or parameter set triggers only the necessary new jobs, so the pipeline scales to the “at least six tables” complexity target (see the snippet after this list).
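For instance, extending the parameter scan is a single insert followed by another populate() call; existing rows are left alone. The tuple layout matches the BlobParamSet sketch above and is an assumption:

```python
# Add one more detector configuration; nothing is recomputed for existing rows.
BlobParamSet.insert1((2, 2.0, 40.0, 0.008), skip_duplicates=True)

# Only the new (image, paramset) combinations are picked up as jobs.
Detection.populate(display_progress=True)
```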
Practical Tips
Develop make() logic with restrictions (e.g., Detection.populate(key)) before turning it loose on the entire pipeline.
Use display_progress=True when you need visibility; use reserve_jobs=True when distributing work across multiple machines. (Both calls are shown in the snippet after this list.)
If your computed table writes both summary and detail rows, keep them in a part table so the transaction boundary protects them together.
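In code, the first two tips look roughly like this (the restriction key is an assumption):

```python
# Develop against a single key before running populate over everything.
Detection.populate({"image_id": 1, "paramset_id": 0}, display_progress=True)

# When several workers share the queue, let them reserve jobs to avoid collisions.
Detection.populate(reserve_jobs=True, suppress_errors=True)
```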
The blob-detection notebook is a self-contained template: swap in your own raw data source, adjust the parameter search, and you have the skeleton for an end-to-end computational database ready to feed a dashboard demo on presentation day.