Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Primary Keys

Primary Keys: Ensuring Entity Integrity

In the Tables chapter, we learned that attributes above the --- line form the primary key. But why does this matter? The primary key is the cornerstone of entity integrity—the guarantee that each real-world entity corresponds to exactly one database record, and vice versa.

What is a Primary Key?

A primary key is a column or combination of columns that uniquely identifies each row in a table.

Primary Key Requirements

In DataJoint, every table must have a primary key. Primary key attributes:

  • Cannot be NULL

  • Must be unique across all rows

  • Cannot be changed after insertion (immutable)

  • Are declared above the --- line in the table definition

Entity Integrity: The Core Concept

Entity integrity ensures a one-to-one correspondence between real-world entities and their database records:

Without entity integrity, databases become unreliable:

Integrity FailureConsequence
Same entity, multiple recordsFragmented data, conflicting information
Multiple entities, same recordMixed data, privacy violations
Cannot match entity to recordLost data, broken workflows

Imagine what kinds of difficulties would arise if entity integrity broke down in the systems you interact with every day:

Example: If your university had two student records for you, your transcript might show incomplete courses, financial aid could be miscalculated, and graduation requirements might be incorrectly tracked.

The Three Questions of Entity Integrity

When designing a primary key, you must answer three questions:

  1. How do I prevent duplicate records? — Ensure the same entity cannot appear twice

  2. How do I prevent record sharing? — Ensure different entities cannot share a record

  3. How do I match entities to records? — When an entity arrives, how do I find its record?

Example: Laboratory Mouse Database

Consider a neuroscience lab tracking mice:

QuestionAnswer
Prevent duplicates?Each mouse gets a unique ear tag at arrival; database rejects duplicate tags
Prevent sharing?Ear tags are never reused; retired tags are archived
Match entities?Read the ear tag → look up record by primary key
DataJoint
SQL
@schema
class Mouse(dj.Manual):
    definition = """
    ear_tag : char(6)   # unique ear tag (e.g., 'M00142')
    ---
    date_of_birth : date
    sex : enum('M', 'F', 'U')
    strain : varchar(50)
    """

Example: University Student Database

Consider a university registrar’s office tracking students:

QuestionAnswer
Prevent duplicates?Each student gets a unique ID at enrollment; verification against existing records using name, date of birth, and government ID
Prevent sharing?Photo ID cards issued; IDs are never reused even after graduation
Match entities?Student presents ID card → look up record by student ID
DataJoint
SQL
@schema
class Student(dj.Manual):
    definition = """
    student_id : char(8)   # unique student ID (e.g., 'S2024001')
    ---
    first_name : varchar(50)
    last_name : varchar(50)
    date_of_birth : date
    enrollment_date : date
    """

Notice how both examples follow the same pattern: a real-world identification system (ear tags, student IDs) enables the three questions to be answered consistently.

The database enforces the first two questions automatically through the primary key constraint. The third question requires a physical identification system—ear tags, barcodes, or RFID chips that link physical entities to database records.

Types of Primary Keys

Primary keys can be classified along two independent dimensions:

  1. Usage: Natural keys (used in the real world) vs. Surrogate keys (used only inside the database)

  2. Composition: Simple keys (one attribute) vs. Composite keys (multiple attributes)

These dimensions are independent—a natural key can be simple or composite, and so can a surrogate key.

Natural Keys

A natural key is an identifier used outside the database to refer to entities in the real world. The defining characteristic is that the key requires a real-world mechanism to establish and maintain the permanent association between entities and their identifiers.

Natural keys may originate from:

Even when a database or management system generates the identifier, if that identifier is then used in the real world to refer to the entity—printed on labels, written in lab notebooks, referenced in conversations—it functions as a natural key.

Example: Laboratory Animal IDs

A colony management system might generate animal IDs like M00142. Once that ID is printed on an ear tag and attached to a mouse, it becomes the natural key. The real-world mechanism (the ear tag) maintains the association between the physical mouse and its identifier.

DataJoint
SQL
@schema
class Mouse(dj.Manual):
    definition = """
    animal_id : char(6)   # colony-assigned ID (e.g., 'M00142')
    ---
    date_of_birth : date
    sex : enum('M', 'F', 'U')
    strain : varchar(50)
    """

Examples of composite natural keys:

Advantages:

Disadvantages:

Surrogate Keys

A surrogate key is an identifier used primarily inside the database, with minimal or no exposure to end users. Users typically don’t search for entities by surrogate keys or use them in conversation.

Examples:

DataJoint
SQL
@schema
class InternalRecord(dj.Manual):
    definition = """
    record_id : int unsigned   # internal identifier, not exposed to users
    ---
    created_timestamp : timestamp
    data : longblob
    """

Key distinction from natural keys: Surrogate keys don’t require external identification systems because users don’t need to match physical entities to records by these keys. The database maintains uniqueness, but the key itself isn’t used for entity identification in the real world.

When surrogate keys are appropriate:

Composite Keys in Hierarchical Relationships

Composite primary keys commonly arise when tables inherit foreign keys as part of their primary key. This creates hierarchical relationships where child entities are identified within the context of their parent.

DataJoint
SQL
@schema
class Subject(dj.Manual):
    definition = """
    subject_id : varchar(12)   # subject identifier
    ---
    species : varchar(30)
    """

@schema
class Session(dj.Manual):
    definition = """
    -> Subject
    session : smallint unsigned  # session number within subject
    ---
    session_date : date
    """

In this example, Session has a composite primary key (subject_id, session). Each session is uniquely identified by which subject and which session number. This pattern is covered in detail in the Relationships chapter.

Schema Dimensions

A schema dimension is created when a table defines a new primary key attribute directly, rather than inheriting it through a foreign key. Tables that introduce new primary key attributes are said to create new schema dimensions.

Identifying Schema Dimensions

Consider this hierarchy:

DataJoint
SQL
@schema
class Subject(dj.Manual):
    definition = """
    subject_id : varchar(12)   # NEW DIMENSION: defines subject identity
    ---
    species : varchar(30)
    """

@schema
class Session(dj.Manual):
    definition = """
    -> Subject                         # inherits subject_id dimension
    session : smallint unsigned        # NEW DIMENSION: defines session identity within subject
    ---
    session_date : date
    """

@schema
class Scan(dj.Manual):
    definition = """
    -> Session                         # inherits subject_id and session dimensions
    scan : smallint unsigned           # NEW DIMENSION: defines scan identity within session
    ---
    scan_time : time
    """

In this example:

Diagram Notation

In DataJoint diagrams, tables that introduce new schema dimensions have their names underlined. Tables that only inherit their primary key through foreign keys (without adding new attributes) are not underlined—they represent the same identity as their parent.

Why Schema Dimensions Matter

Schema dimensions are fundamental to how DataJoint performs semantic matching in queries. When you join tables or use one table to restrict another, DataJoint matches rows based on shared schema dimensions—not just attributes with the same name.

Two attributes match semantically when they:

  1. Have the same name

  2. Trace back to the same original dimension through foreign key chains

This is why subject_id in Subject, Session, and Scan all refer to the same dimension and will be matched in joins, while an unrelated subject_id in a completely separate table hierarchy would not match.

Schema Dimensions and Auto-Populated Tables

Auto-populated tables (dj.Computed and dj.Imported) have a special constraint: they cannot introduce new schema dimensions directly. Their primary key must be fully determined by their upstream dependencies through foreign keys.

This constraint ensures that auto-populated tables compute results for entities that are already defined elsewhere in the pipeline. The make method receives a key from the key source (derived from parent tables), and the computation produces results for that specific key.

DataJoint
SQL
@schema
class ProcessedScan(dj.Computed):
    definition = """
    -> Scan                    # inherits subject_id, session, scan dimensions
    ---                        # NO new primary key attributes allowed here
    processed_data : longblob
    quality_score : float
    """

However, part tables can introduce new dimensions. When a computation produces multiple related results (e.g., detecting multiple cells in an image), the part table can add a new dimension to distinguish them:

DataJoint
SQL
@schema
class CellDetection(dj.Computed):
    definition = """
    -> Scan                    # master table inherits dimensions
    ---
    detection_method : varchar(60)
    """

    class Cell(dj.Part):
        definition = """
        -> master
        cell_id : smallint unsigned   # NEW DIMENSION: identifies cells within scan
        ---
        cell_x : float
        cell_y : float
        cell_type : varchar(30)
        """

In this example, CellDetection (the master) cannot introduce new dimensions, but CellDetection.Cell (the part table) introduces the cell_id dimension to identify individual detected cells.

Choosing the Right Primary Key Strategy

ScenarioRecommended Approach
Established external ID system existsUse the natural key
Entity naturally identified by multiple attributesUse composite natural key
Entity identified within parent contextInherit foreign key + add local identifier
No natural identifier existsCreate explicit surrogate key
Privacy-sensitive contextSurrogate key (not natural)

Entity Integrity Varies by Context

Different applications require different levels of entity integrity:

LevelExampleEnforcement
StrictAirlines, banksGovernment ID verification, biometrics
ModerateUniversities, hospitalsPhoto ID, documentation
FlexibleGyms, loyalty programsBasic verification, some sharing tolerated
MinimalSocial mediaEmail verification only

Example: Strict vs. Flexible

An airline must know exactly who boards each flight (strict entity integrity). A grocery store loyalty program may not care if family members share a card (flexible entity integrity).

Partial Entity Integrity

Sometimes only one direction of entity integrity is required:

Example: A social media platform might ensure that each user account is tied to exactly one person (preventing account sharing), but not prevent a person from creating multiple accounts. This is partial entity integrity—the record-to-entity direction is enforced, but not entity-to-record.

For many applications, partial integrity is sufficient. Design your primary keys to match your actual requirements—don’t over-engineer for scenarios that don’t matter to your domain.

Entity Integrity Without Natural Keys

When no natural key can be established—no external identifier exists and no real-world mechanism can maintain the entity-to-record association—full entity integrity is still possible but requires a multi-step identification process.

Consider a scenario where anonymous survey responses must be linked to follow-up surveys from the same respondent:

  1. Generate a unique token at the time of first response

  2. Provide the token to the respondent (email, printed card, etc.)

  3. Require the token for follow-up responses

  4. Trust the process to maintain the association

The database ensures uniqueness of records through the primary key, but matching records to real-world entities requires comprehensive process design outside the database. The token becomes a natural key only if the external process reliably maintains the association.

Primary Keys in DataJoint Queries

Primary keys have special significance in DataJoint queries:

  1. Semantic matching in joins — When you join tables with *, DataJoint matches on shared schema dimensions, not just attribute names

  2. Semantic matching in restrictions — When you restrict a table by another (A & B), matching is performed on shared schema dimensions

  3. Restrictions are efficient — Queries by primary key use indexes for fast lookups

  4. Results always have primary keys — Every query result is itself a valid relation with a well-defined primary key

DataJoint
SQL
# Efficient: restriction by primary key
Mouse & {'ear_tag': 'M00142'}

# Join matches on shared schema dimensions
Subject * Session * Scan  # All three share the subject_id dimension

# The result of any query has a well-defined primary key
(Subject * Session).primary_key  # Combines dimensions from both tables

Summary

Primary keys are the foundation of entity integrity in relational databases:

ConceptKey Points
Entity Integrity1:1 correspondence between entities and records; requires external processes
Three QuestionsPrevent duplicates, prevent sharing, enable matching
Natural KeysIdentifiers used in the real world to refer to entities; require external association mechanisms
Surrogate KeysIdentifiers used only inside the database; not exposed to users
Composite KeysMultiple attributes forming the key (applies to both natural and surrogate)
Partial IntegritySometimes only one direction of entity-record correspondence is needed
Schema DimensionsNew primary key attributes define dimensions; inherited attributes share them
Semantic MatchingJoins and restrictions match on shared schema dimensions