Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Schemas

What is a schema?

The term schema has two related meanings in the context of databases:

1. Schema as a Data Blueprint

A schema is a formal specification of the structure of data and the rules governing its integrity. It serves as a blueprint that defines how data is organized, stored, and accessed within a database. This ensures that the database reflects the rules and requirements of the underlying business or research project it supports.

In structured data models, such as the relational model, a schema provides a robust framework for defining:

Aims of Good Schema Design

Key Elements of Schema Design

Through careful schema design, database architects create systems that are both efficient and flexible, meeting the current and future needs of an organization. The schema acts as a living document that guides the structure, operations, and integrity of the database.

2. Schema as a Database Module

In complex database designs, the term “schema” is also used to describe a distinct module of a larger database with its own namespace that groups related tables together. This modular approach:

Declaring a schema

Before you can create tables, you must declare a schema to serve as a namespace for your tables. Each schema requires a unique name to distinguish it within the database.

Here’s how to declare a schema in DataJoint:

import datajoint as dj

# Define the schema
schema = dj.Schema('schema_name')
[2024-08-27 04:10:41,167][INFO]: Connecting root@localhost:3306
[2024-08-27 04:10:41,184][INFO]: Connected root@localhost:3306

Using the schema Object

The schema object groups related tables together and helps prevent naming conflicts.

By convention, the object created by dj.Schema is named schema. Typically, only one schema object is used in any given Python namespace, usually at the level of a Python module.

The schema object serves multiple purposes:

  • Creating Tables: Used as a class decorator (@schema) to declare tables within the schema. For details, see the next section, Tables

  • Visualizing the Schema: Generates diagrams to illustrate relationships between tables.

  • Exporting Data: Facilitates exporting data for external use or backup.

With this foundation, you are ready to begin declaring tables and building your data pipeline.

Multi-Schema Pipelines

As pipelines grow, you will organize tables into multiple schemas. Each schema groups related tables together—for example, subject, acquisition, processing, and analysis.

Simple Scripts vs. Full Projects

For learning, exploration, and simple pipelines, you can define schemas directly in Python scripts or Jupyter notebooks, just like the examples throughout this book. This is the easiest way to get started:

# Simple script: my_pipeline.py
import datajoint as dj

schema = dj.Schema('my_experiment')

@schema
class Subject(dj.Manual):
    definition = """
    subject_id : int
    ---
    subject_name : varchar(100)
    """

@schema  
class Session(dj.Manual):
    definition = """
    -> Subject
    session_date : date
    ---
    notes : varchar(500)
    """

For production deployment with multiple collaborators, version control, and automated workers, you should organize the pipeline as a proper Python package. See Pipeline Projects for the full project structure including:

  • Standard layout with src/workflow/

  • Configuration with pyproject.toml

  • Docker deployment

Convention: One Schema = One Module

Whether using simple scripts or full projects, the fundamental convention is: one database schema corresponds to one Python module (or one script/notebook for simple cases).

This ensures:

  • Each module has exactly one schema object

  • Clear dependency management between schemas

  • No circular imports

Example Schema Module

Here is a typical schema module defining the subject schema:

%pycat code/subject.py
import datajoint as dj

# Define the subject management schema
schema = dj.Schema("subject_management")

@schema
class Subject(dj.Manual):
    definition = """
    subject_id : int
    ---
    subject_name : varchar(50)
    species : varchar(50)
    """

Working with Existing Schemas

This section describes how to work with database schemas without access to the original code that generated the schema. These situations often arise when:

  • The database is created by another user who has not shared the generating code yet

  • The database schema is created from a programming language other than Python

  • You need to explore an existing database before writing new code

Listing Available Schemas

You can use the dj.list_schemas function to list the names of database schemas available to you:

import datajoint as dj
dj.list_schemas()

Connecting to an Existing Schema

Just as with a new schema, you start by creating a schema object to connect to the chosen database schema:

schema = dj.Schema('existing_schema_name')

If the schema already exists, dj.Schema is initialized as usual and you may plot the schema diagram. But instead of seeing class names, you will see the raw table names as they appear in the database:

dj.Diagram(schema)

Spawning Missing Classes

When you connect to an existing schema without the original Python code, you can view the diagram but cannot interact with the tables. A similar situation arises when another developer has added new tables to the schema but has not yet shared the updated module code with you.

You may use the schema.spawn_missing_classes method to spawn classes into the local namespace for any tables missing their classes:

schema.spawn_missing_classes()

Now you may interact with these tables as if they were declared right here in your namespace.

Creating a Virtual Module

The spawn_missing_classes method creates new classes in the local namespace. However, it is often more convenient to import a schema with its Python module, equivalent to:

import university as uni

We can mimic this import without having access to university.py using the create_virtual_module function:

import datajoint as dj

uni = dj.create_virtual_module('university', 'existing_schema_name')

Now uni behaves as an imported module complete with the schema object and all the table classes. You can use it like any other module:

dj.Diagram(uni)
uni.Student - uni.StudentMajor

Virtual Module Options

dj.create_virtual_module takes optional arguments:

create_schema

The create_schema=False argument (default) assures that an error is raised when the schema does not already exist. Set it to True if you want to create an empty schema:

# This will raise an error if 'nonexistent' schema doesn't exist
dj.create_virtual_module('what', 'nonexistent')

# This will create the schema if it doesn't exist
dj.create_virtual_module('what', 'nonexistent', create_schema=True)

create_tables

The create_tables=False argument is passed to the schema object. It prevents the use of the schema object of the virtual module for creating new tables in the existing schema. This is a precautionary measure since virtual modules are often used for completed schemas.

You may set this argument to True if you wish to add new tables to the existing schema:

uni = dj.create_virtual_module('university', 'existing_schema_name', create_tables=True)

@uni.schema
class NewTable(dj.Manual):
    definition = """
    -> uni.Student 
    ---
    example : varchar(255)
    """

A more common approach when you need to add tables is to create a new schema object and use the spawn_missing_classes function to make the existing classes available.

Dropping a Schema

Dropping a schema in DataJoint involves permanently deleting all the tables within that schema and the schema itself from the database. This is a significant action because it removes not only the tables but also all the data stored within those tables. To drop a schema, you typically use the schema.drop() method, where schema is the schema object you defined earlier in your code.

When you execute this command, DataJoint will prompt you to confirm the action, as it is irreversible and will result in the loss of all data associated with the schema. This operation is especially powerful because it cascades through all tables within the schema, removing each one.

It’s crucial to ensure that any data within the schema is either no longer needed or has been adequately backed up before dropping the schema, as this action will permanently remove all traces of the data and the schema’s structure from the database.

# dropping a schema
schema.drop()