
Data Integrity

Why Data Integrity Matters

Imagine a neuroscience lab where recording sessions are tracked in a database. Without proper safeguards, you might encounter duplicate animal IDs, sessions that reference animals no longer in the database, or analysis results detached from their raw data.

Each scenario represents a failure of data integrity — the database’s ability to maintain accurate, consistent, and reliable data that faithfully represents reality.

Data Integrity is the ability of a database to define, express, and enforce rules for valid data states and transformations.

The Challenge

Scientific databases face unique challenges:

  • Multiple users entering data concurrently

  • Long-running experiments generating data over months or years

  • Complex relationships between experimental entities

  • Evolving protocols requiring schema updates

  • Collaborative teams with different data entry practices

Without robust integrity mechanisms, these challenges lead to:

  • Invalid or incomplete data entry

  • Loss of data during updates

  • Unwarranted alteration of historical records

  • Misidentification or mismatch of experimental subjects

  • Data duplication across tables

  • Broken references between related datasets

From Real-World Rules to Database Constraints

The core challenge of database design is translating organizational rules into enforceable constraints. Consider a simple example:

Lab Rule: “Each mouse must have a unique ID, and every recording session must reference a valid mouse.”

Database Implementation:
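One possible translation of this rule, sketched here with SQLite for portability (the table and column names are illustrative, not a prescribed schema): a primary key makes each mouse ID unique, and a foreign key makes every session reference an existing mouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# "Each mouse must have a unique ID" -> PRIMARY KEY
conn.execute("CREATE TABLE mouse (mouse_id INTEGER PRIMARY KEY)")

# "Every recording session must reference a valid mouse" -> FOREIGN KEY
conn.execute("""
    CREATE TABLE recording_session (
        session_id INTEGER PRIMARY KEY,
        mouse_id   INTEGER NOT NULL REFERENCES mouse (mouse_id)
    )
""")

conn.execute("INSERT INTO mouse VALUES (1)")
conn.execute("INSERT INTO recording_session VALUES (100, 1)")  # valid reference

try:
    conn.execute("INSERT INTO recording_session VALUES (101, 99)")  # mouse 99 does not exist
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the database itself refuses the orphan row
```

No application code had to check anything: the constraint is declared once and enforced on every insert.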

Relational databases excel at expressing and enforcing such rules through integrity constraints — declarative rules that the database automatically enforces.

Types of Data Integrity Constraints

This section introduces seven fundamental types of integrity constraints. Each will be covered in detail in subsequent chapters, with DataJoint implementation examples.

1. Domain Integrity

Ensures values fall within valid ranges and belong to valid types.

Domain integrity restricts attribute values to predefined valid sets through data types and value constraints such as range checks.

Example: Recording temperature must be between 20 °C and 25 °C.
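The temperature rule above can be declared as a range constraint. A minimal sketch using SQLite's CHECK clause (the table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE recording (
        recording_id INTEGER PRIMARY KEY,
        temp_celsius REAL NOT NULL CHECK (temp_celsius BETWEEN 20 AND 25)
    )
""")

conn.execute("INSERT INTO recording VALUES (1, 22.5)")  # within range: accepted

try:
    conn.execute("INSERT INTO recording VALUES (2, 30.0)")  # out of range
    in_range_only = False
except sqlite3.IntegrityError:
    in_range_only = True  # rejected by the CHECK constraint
```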

Covered in: Tables — Data type specification


2. Completeness

Guarantees required data is present.

Completeness prevents missing values that could invalidate analyses, typically by declaring attributes as required (NOT NULL).

Example: Every experiment must have a start date.
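The required start date can be declared with a NOT NULL constraint, sketched here in SQLite (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE experiment (
        experiment_id INTEGER PRIMARY KEY,
        start_date    TEXT NOT NULL  -- a start date is required
    )
""")

conn.execute("INSERT INTO experiment VALUES (1, '2024-05-01')")

try:
    conn.execute("INSERT INTO experiment (experiment_id) VALUES (2)")  # no start date
    start_date_required = False
except sqlite3.IntegrityError:
    start_date_required = True  # incomplete row rejected
```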

Covered in:


3. Entity Integrity

Each real-world entity corresponds to exactly one database record, and each database record corresponds to exactly one real-world entity.

Entity integrity ensures a one-to-one correspondence between real-world entities and their digital representations in the database. This is not simply about having unique identifiers—it’s about establishing a reliable, bidirectional mapping: every real-world entity has exactly one record, and every record describes exactly one real-world entity.

Example: Each mouse in the lab has exactly one unique ID, and that ID refers to exactly one mouse—never two different mice sharing the same ID, and never one mouse having multiple IDs.
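The database can enforce half of this guarantee directly: a primary key prevents two records from sharing the same ID. (The other half, ensuring one mouse never receives two IDs, depends on lab identification procedures rather than a constraint.) A sketch with hypothetical names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mouse (mouse_id TEXT PRIMARY KEY, cage TEXT)")

conn.execute("INSERT INTO mouse VALUES ('M001', 'A1')")

try:
    conn.execute("INSERT INTO mouse VALUES ('M001', 'B2')")  # second animal, same ID
    one_record_per_id = False
except sqlite3.IntegrityError:
    one_record_per_id = True  # the primary key forbids a shared ID
```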

Covered in:


4. Referential Integrity

Relationships between entities remain consistent.

Referential integrity maintains logical associations across tables, typically through foreign keys.

Example: A recording session cannot reference a non-existent mouse.

Covered in:


5. Compositional Integrity

Complex entities remain complete with all parts.

Compositional integrity ensures multi-part entities are never partially stored: either all parts are recorded together or none are.

Example: An imaging session’s metadata and all acquired frames are stored together or not at all.
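All-or-nothing storage is what database transactions provide. In the sketch below (illustrative names, SQLite again), a simulated mid-write failure rolls back both the session metadata and the frames already inserted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE imaging_session (session_id INTEGER PRIMARY KEY, modality TEXT);
    CREATE TABLE frame (
        session_id INTEGER REFERENCES imaging_session (session_id),
        frame_num  INTEGER,
        PRIMARY KEY (session_id, frame_num)
    );
""")

try:
    with conn:  # one transaction: commit everything or roll back everything
        conn.execute("INSERT INTO imaging_session VALUES (1, 'two-photon')")
        conn.execute("INSERT INTO frame VALUES (1, 0)")
        raise IOError("acquisition interrupted")  # simulated failure mid-write
except IOError:
    pass

# Neither the metadata nor the partial frames survive the rollback.
n_sessions = conn.execute("SELECT COUNT(*) FROM imaging_session").fetchone()[0]
n_frames = conn.execute("SELECT COUNT(*) FROM frame").fetchone()[0]
```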

Covered in:


6. Consistency

All users see the same valid data state.

Consistency provides a unified view of valid data during concurrent access.

Example: Two researchers inserting experiments simultaneously don’t create duplicates.
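A short sketch of the duplicate-experiment scenario: two connections (standing in for two researchers) insert into the same database, and a uniqueness constraint on the experiment name guarantees the second insert cannot create a duplicate. Names and the SQLite setup are illustrative.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "lab.db")
alice = sqlite3.connect(db_path)
bob = sqlite3.connect(db_path)  # a second, independent connection

alice.execute("CREATE TABLE experiment (experiment_name TEXT PRIMARY KEY)")
alice.commit()

alice.execute("INSERT INTO experiment VALUES ('maze-2024')")
alice.commit()

try:
    bob.execute("INSERT INTO experiment VALUES ('maze-2024')")  # same experiment, second user
    no_duplicates = False
except sqlite3.IntegrityError:
    no_duplicates = True  # the constraint holds across all connections
```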

Covered in:


7. Workflow Integrity

Operations execute in valid sequences that respect enterprise processes.

Workflow integrity extends referential integrity by enforcing not just what relationships exist, but when and how entities are created through operational sequences. While traditional databases ensure that a recording session references a valid mouse (referential integrity), workflow integrity also ensures that the mouse must be created before the recording session can be created—preserving the temporal and operational order of your enterprise processes.

Workflow integrity maintains valid operation sequences through declared dependencies between tables.

Example: An analysis pipeline cannot compute results before acquiring raw data. If NeuronAnalysis depends on SpikeData, which depends on RecordingSession, the database enforces that recordings are created before spike detection, which occurs before analysis—maintaining the integrity of the entire scientific workflow.
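The dependency chain above can be sketched with chained foreign keys (illustrative names; DataJoint expresses the same dependencies more concisely, as later chapters show). Because each table references the one upstream, rows can only be created in workflow order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE recording_session (session_id INTEGER PRIMARY KEY);
    CREATE TABLE spike_data (
        session_id INTEGER PRIMARY KEY REFERENCES recording_session (session_id));
    CREATE TABLE neuron_analysis (
        session_id INTEGER PRIMARY KEY REFERENCES spike_data (session_id));
""")

try:
    conn.execute("INSERT INTO neuron_analysis VALUES (1)")  # upstream rows missing
    order_enforced = False
except sqlite3.IntegrityError:
    order_enforced = True  # analysis cannot precede acquisition

# The only valid sequence: session, then spikes, then analysis.
conn.execute("INSERT INTO recording_session VALUES (1)")
conn.execute("INSERT INTO spike_data VALUES (1)")
conn.execute("INSERT INTO neuron_analysis VALUES (1)")
```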

Covered in:


The Power of Declarative Constraints

Unlike application-level validation (checking rules in Python code), database constraints are:

  1. Always enforced — Cannot be bypassed by any application

  2. Automatically checked — No developer implementation needed

  3. Concurrent-safe — Work correctly with multiple users

  4. Self-documenting — Schema explicitly declares rules

  5. Performance-optimized — Database engine enforces efficiently

Example Contrast:

# Application-level (fragile)
if mouse_id not in existing_mice:
    raise ValueError("Invalid mouse ID")
# Can be bypassed by other applications

# Database-level (robust)
# RecordingSession.mouse → FOREIGN KEY → Mouse.mouse_id
# Automatically enforced for all applications

DataJoint’s Approach to Integrity

DataJoint builds on SQL’s integrity mechanisms with additional features, such as declarative dependencies between tables that capture the order of workflow operations.

As you progress through the following chapters, you’ll see how DataJoint implements each integrity type through concise, expressive table declarations.