What is a schema?¶
The term schema has two related meanings in the context of databases:
1. Schema as a Data Blueprint¶
A schema is a formal specification of the structure of data and the rules governing its integrity. It serves as a blueprint that defines how data is organized, stored, and accessed within a database. This ensures that the database reflects the rules and requirements of the underlying business or research project it supports.
In structured data models, such as the relational model, a schema provides a robust framework for defining:
The structure of tables (relations) and their attributes (columns).
Rules and constraints that ensure data consistency, accuracy, and reliability.
Relationships between tables, such as primary keys (unique identifiers for records) and foreign keys (references to related records in other tables).
Aims of Good Schema Design¶
Data Integrity: Ensures consistency and prevents anomalies.
Query Efficiency: Facilitates fast and accurate data retrieval, supports complex queries, and optimizes database performance.
Scalability: Allows the database to grow and adapt as data volumes increase.
Key Elements of Schema Design¶
Tables and Attributes: Each table is defined with specific attributes (columns), each assigned a data type.
Primary Keys: Uniquely identify each record in a table.
Foreign Keys: Establish relationships between entities in tables.
Indexes: Support efficient queries.
Through careful schema design, database architects create systems that are both efficient and flexible, meeting the current and future needs of an organization. The schema acts as a living document that guides the structure, operations, and integrity of the database.
2. Schema as a Database Module¶
In complex database designs, the term “schema” is also used to describe a distinct module of a larger database with its own namespace that groups related tables together. This modular approach:
Separates tables into logical groups for better organization.
Avoids naming conflicts in large databases with multiple schemas.
Declaring a schema¶
Before you can create tables, you must declare a schema to serve as a namespace for your tables. Each schema requires a unique name to distinguish it within the database.
Here’s how to declare a schema in DataJoint:
import datajoint as dj
# Define the schema
schema = dj.Schema('schema_name')[2024-08-27 04:10:41,167][INFO]: Connecting root@localhost:3306
[2024-08-27 04:10:41,184][INFO]: Connected root@localhost:3306
Using the schema Object¶
The schema object groups related tables together and helps prevent naming conflicts.
By convention, the object created by dj.Schema is named schema. Typically, only one schema object is used in any given Python namespace, usually at the level of a Python module.
The schema object serves multiple purposes:
Creating Tables: Used as a class decorator (
@schema) to declare tables within the schema. For details, see the next section, TablesVisualizing the Schema: Generates diagrams to illustrate relationships between tables.
Exporting Data: Facilitates exporting data for external use or backup.
With this foundation, you are ready to begin declaring tables and building your data pipeline.
Multi-Schema Pipelines¶
As pipelines grow, you will organize tables into multiple schemas. Each schema groups related tables together—for example, subject, acquisition, processing, and analysis.
Simple Scripts vs. Full Projects¶
For learning, exploration, and simple pipelines, you can define schemas directly in Python scripts or Jupyter notebooks, just like the examples throughout this book. This is the easiest way to get started:
# Simple script: my_pipeline.py
import datajoint as dj
schema = dj.Schema('my_experiment')
@schema
class Subject(dj.Manual):
definition = """
subject_id : int
---
subject_name : varchar(100)
"""
@schema
class Session(dj.Manual):
definition = """
-> Subject
session_date : date
---
notes : varchar(500)
"""For production deployment with multiple collaborators, version control, and automated workers, you should organize the pipeline as a proper Python package. See Pipeline Projects for the full project structure including:
Standard layout with
src/workflow/Configuration with
pyproject.tomlDocker deployment
Convention: One Schema = One Module¶
Whether using simple scripts or full projects, the fundamental convention is: one database schema corresponds to one Python module (or one script/notebook for simple cases).
This ensures:
Each module has exactly one
schemaobjectClear dependency management between schemas
No circular imports
Example Schema Module¶
Here is a typical schema module defining the subject schema:
%pycat code/subject.pyimport datajoint as dj
# Define the subject management schema
schema = dj.Schema("subject_management")
@schema
class Subject(dj.Manual):
definition = """
subject_id : int
---
subject_name : varchar(50)
species : varchar(50)
"""
Working with Existing Schemas¶
This section describes how to work with database schemas without access to the original code that generated the schema. These situations often arise when:
The database is created by another user who has not shared the generating code yet
The database schema is created from a programming language other than Python
You need to explore an existing database before writing new code
Listing Available Schemas¶
You can use the dj.list_schemas function to list the names of database schemas available to you:
import datajoint as dj
dj.list_schemas()Connecting to an Existing Schema¶
Just as with a new schema, you start by creating a schema object to connect to the chosen database schema:
schema = dj.Schema('existing_schema_name')If the schema already exists, dj.Schema is initialized as usual and you may plot the schema diagram. But instead of seeing class names, you will see the raw table names as they appear in the database:
dj.Diagram(schema)Spawning Missing Classes¶
When you connect to an existing schema without the original Python code, you can view the diagram but cannot interact with the tables. A similar situation arises when another developer has added new tables to the schema but has not yet shared the updated module code with you.
You may use the schema.spawn_missing_classes method to spawn classes into the local namespace for any tables missing their classes:
schema.spawn_missing_classes()Now you may interact with these tables as if they were declared right here in your namespace.
Creating a Virtual Module¶
The spawn_missing_classes method creates new classes in the local namespace. However, it is often more convenient to import a schema with its Python module, equivalent to:
import university as uniWe can mimic this import without having access to university.py using the create_virtual_module function:
import datajoint as dj
uni = dj.create_virtual_module('university', 'existing_schema_name')Now uni behaves as an imported module complete with the schema object and all the table classes. You can use it like any other module:
dj.Diagram(uni)
uni.Student - uni.StudentMajorVirtual Module Options¶
dj.create_virtual_module takes optional arguments:
create_schema¶
The create_schema=False argument (default) assures that an error is raised when the schema does not already exist. Set it to True if you want to create an empty schema:
# This will raise an error if 'nonexistent' schema doesn't exist
dj.create_virtual_module('what', 'nonexistent')
# This will create the schema if it doesn't exist
dj.create_virtual_module('what', 'nonexistent', create_schema=True)create_tables¶
The create_tables=False argument is passed to the schema object. It prevents the use of the schema object of the virtual module for creating new tables in the existing schema. This is a precautionary measure since virtual modules are often used for completed schemas.
You may set this argument to True if you wish to add new tables to the existing schema:
uni = dj.create_virtual_module('university', 'existing_schema_name', create_tables=True)
@uni.schema
class NewTable(dj.Manual):
definition = """
-> uni.Student
---
example : varchar(255)
"""A more common approach when you need to add tables is to create a new schema object and use the spawn_missing_classes function to make the existing classes available.
Dropping a Schema¶
Dropping a schema in DataJoint involves permanently deleting all the tables within that schema and the schema itself from the database. This is a significant action because it removes not only the tables but also all the data stored within those tables. To drop a schema, you typically use the schema.drop() method, where schema is the schema object you defined earlier in your code.
When you execute this command, DataJoint will prompt you to confirm the action, as it is irreversible and will result in the loss of all data associated with the schema. This operation is especially powerful because it cascades through all tables within the schema, removing each one.
It’s crucial to ensure that any data within the schema is either no longer needed or has been adequately backed up before dropping the schema, as this action will permanently remove all traces of the data and the schema’s structure from the database.
# dropping a schema
schema.drop()