University - The DataJoint Book

Why University Databases?¶

University registration databases are among the most common examples in introductory database design texts. They provide an ideal learning environment because:

Familiar domain: Everyone understands students, courses, enrollments, and grades
Rich relationships: Demonstrate one-to-many (student → enrollments), many-to-many (students ↔ courses), and hierarchical structures (department → course → section)
Real-world complexity: Capture actual business rules like prerequisites, term-based scheduling, and grade tracking
Clear entity types: Students, courses, departments, and terms are easily recognizable entities

This example demonstrates how relational workflow principles and normalization work together to create a well-structured database that accurately represents the university’s operations.

Applying Relational Workflow Principles¶

This schema follows DataJoint’s Relational Workflow Model, where each table represents entities created at specific workflow steps:

Student registration (Student): Students are registered when they first enroll
Department creation (Department): Departments exist independently and are created administratively
Major declaration (StudentMajor): Students declare majors at a specific point in time
Course catalog (Course): Courses are defined by departments before they can be offered
Term scheduling (Term, CurrentTerm): Academic terms are created, and one is designated as current
Section creation (Section): Course sections are created for specific terms
Enrollment (Enroll): Students enroll in sections during registration periods
Grading (Grade): Grades are assigned after course completion

The foreign keys (->) establish workflow dependencies that prescribe the order of operations. For example, you cannot enroll a student (Enroll) until both the student (Student) and section (Section) exist. You cannot create a section (Section) until the course (Course) and term (Term) exist. This DAG structure ensures workflows execute in the correct sequence.

Applying Normalization Principles¶

This design follows entity normalization principles, where each table represents exactly one well-defined entity type:

Student: One table for student entities with their permanent attributes
Department: One table for department entities
Course: One table for course entities (depends on Department)
Term: One table for academic term entities
Section: One table for section entities (depends on Course and Term)
Enroll: Represents the enrollment relationship between Student and Section
Grade: Represents the grading of an enrollment (depends on Enroll)

Key normalization decisions:

StudentMajor is separate from Student: A student can have multiple majors over time, and each major declaration has its own date. This avoids update anomalies and preserves history.
Grade is separate from Enroll: Not all enrollments have grades (students may drop), and grades are assigned at a different workflow step than enrollment.
Section combines Course and Term: A section represents a specific offering of a course in a specific term, requiring both as part of its identity.
CurrentTerm uses a singleton pattern: Only one term can be current at a time, enforced by a surrogate key (omega) with a single value.

This normalized design eliminates redundancy, prevents update anomalies, and ensures each fact is stored in exactly one place. The foreign keys maintain referential integrity while also establishing the workflow execution order.

Schema definition¶

The schema definition code is in the university.py module:

%pycat university.py

import datajoint as dj


schema = dj.Schema('university')


@schema
class Student(dj.Manual):
    definition = """
    student_id : int unsigned   # university-wide ID number
    ---
    first_name      : varchar(40)
    last_name       : varchar(40)
    sex             : enum('F', 'M', 'U')
    date_of_birth   : date
    home_address    : varchar(120) # mailing street address
    home_city       : varchar(60)  # mailing address
    home_state      : char(2)      # US state acronym: e.g. OH
    home_zip        : char(10)     # zipcode e.g. 93979-4979
    home_phone      : varchar(20)  # e.g. 414.657.6883x0881
    """
    
@schema
class Department(dj.Manual):
    definition = """
    dept : varchar(6)   # abbreviated department name, e.g. BIOL
    ---
    dept_name    : varchar(200)  # full department name
    dept_address : varchar(200)  # mailing address
    dept_phone   : varchar(20)
    """

@schema
class StudentMajor(dj.Manual):
    definition = """
    -> Student
    ---
    -> Department
    declare_date :  date  # when student declared her major
    """

@schema
class Course(dj.Manual):
    definition = """
    -> Department
    course  : int unsigned   # course number, e.g. 1010
    ---
    course_name :  varchar(200)  # e.g. "Neurobiology of Sensation and Movement."
    credits     :  decimal(3,1)  # number of credits earned by completing the course
    """

@schema
class Term(dj.Manual):
    definition = """
    term_year : year
    term      : enum('Spring', 'Summer', 'Fall')
    """

@schema
class Section(dj.Manual):
    definition = """
    -> Course
    -> Term
    section : char(1)
    ---
    auditorium   :  varchar(12)
    """

@schema
class CurrentTerm(dj.Manual):
    definition = """
    omega : enum('1')  # surrogate
    ---
    -> Term
    """

@schema
class Enroll(dj.Manual):
    definition = """
    -> Student
    -> Section
    """

@schema
class LetterGrade(dj.Lookup):
    definition = """
    grade : char(2)
    ---
    points : decimal(3,2)
    """
    contents = [
        ['A',  4.00],
        ['A-', 3.67],
        ['B+', 3.33],
        ['B',  3.00],
        ['B-', 2.67],
        ['C+', 2.33],
        ['C',  2.00],
        ['C-', 1.67],
        ['D+', 1.33],
        ['D',  1.00],
        ['F',  0.00]
        ]



@schema
class Grade(dj.Manual):
    definition = """
    -> Enroll
    ---
    -> LetterGrade
    """

from university import *

[2025-09-12 18:32:22,035][INFO]: DataJoint 0.14.6 connected to root@localhost:3306

dj.Diagram(schema)

Populate¶

To keep the example self-contained we generate synthetic data programmatically. The goal is to approximate how a registrar’s office would introduce information into the system while exercising every workflow dependency in the schema.

Our approach:

Guard against re-population – we check whether Grade already has records and stop if the schema was populated before. This keeps the notebook idempotent.
Create base entities with Faker – a helper generator (yield_students) continuously yields realistic student records (names, addresses, dates of birth, phone numbers). We use it to insert 2,000 students into Student.
Insert reference data manually – departments and courses are seeded from curated lists so the catalog looks plausible.
Generate workflow records – in dependency order we insert majors (StudentMajor), terms (Term, CurrentTerm), sections (Section), and finally enrollments (Enroll). Random choices guarantee variety while respecting dependencies (e.g., a section cannot exist without its course and term).
Assign grades – once enrollments exist, we assign grades to roughly 90% of them using LetterGrade, leaving some records ungraded to mimic withdrawals or incomplete courses.

The cells that follow execute these steps in the correct order so that each table is populated only after its parents exist, showcasing the relational workflow in action.

if Grade():
    raise RuntimeError(
        "Already populated. You only need to execute this notebook once.\n"
        "Drop the schema and restart the notebook to re-populate.")
    # schema.drop()

from tqdm import tqdm
import faker
import random
import datetime
fake = faker.Faker()
faker.Faker.seed(42)  # For reproducible results

def yield_students():
    fake_name = {'F': fake.name_female, 'M': fake.name_male}
    while True:  # ignore invalid values
        try:
            sex = random.choice(('F', 'M'))
            first_name, last_name = fake_name[sex]().split(' ')[:2]
            street_address, city = fake.address().split('\n')
            city, state = city.split(', ')
            state, zipcode = state.split(' ')            
        except ValueError:
            continue
        else:
            yield dict(
                first_name=first_name,
                last_name=last_name,
                sex=sex,
                home_address=street_address,
                home_city=city,
                home_state=state,
                home_zip=zipcode,
                date_of_birth=str(
                    fake.date_time_between(start_date="-35y", end_date="-15y").date()),
                home_phone = fake.phone_number()[:20])

Student.insert(
    dict(k, student_id=i) for i, k in zip(range(1000,3000), yield_students()))

Student()

Department.insert(
    dict(dept=dept, 
         dept_name=name, 
         dept_address=fake.address(), 
         dept_phone=fake.phone_number()[:20])
    for dept, name in [
        ["CS", "Computer Science"],
        ["BIOL", "Life Sciences"],
        ["PHYS", "Physics"],
        ["MATH", "Mathematics"]])

StudentMajor.insert({**s, **d, 
                     'declare_date':fake.date_between(start_date=datetime.date(1999,1,1))}
    for s, d in zip(Student.fetch('KEY'), random.choices(Department.fetch('KEY'), k=len(Student())))
                   if random.random() < 0.75)

StudentMajor()

# from https://www.utah.edu/
Course.insert([
    ['BIOL', 1006, 'World of Dinosaurs', 3],
    ['BIOL', 1010, 'Biology in the 21st Century', 3],
    ['BIOL', 1030, 'Human Biology', 3],
    ['BIOL', 1210, 'Principles of Biology', 4],
    ['BIOL', 2010, 'Evolution & Diversity of Life', 3],
    ['BIOL', 2020, 'Principles of Cell Biology',  3],
    ['BIOL', 2021, 'Principles of Cell Science',  4],
    ['BIOL', 2030, 'Principles of Genetics',  3],
    ['BIOL', 2210, 'Human Genetics',3],
    ['BIOL', 2325, 'Human Anatomy', 4],
    ['BIOL', 2330, 'Plants & Society', 3],
    ['BIOL', 2355, 'Field Botany', 2],
    ['BIOL', 2420, 'Human Physiology', 4],

    ['PHYS', 2040, 'Classcal Theoretical Physics II', 4],
    ['PHYS', 2060, 'Quantum Mechanics', 3],
    ['PHYS', 2100, 'General Relativity and Cosmology', 3],
    ['PHYS', 2140, 'Statistical Mechanics', 4],
    
    ['PHYS', 2210, 'Physics for Scientists and Engineers I', 4], 
    ['PHYS', 2220, 'Physics for Scientists and Engineers II', 4],
    ['PHYS', 3210, 'Physics for Scientists I (Honors)', 4],
    ['PHYS', 3220, 'Physics for Scientists II (Honors)', 4],
    
    ['MATH', 1250, 'Calculus for AP Students I', 4],
    ['MATH', 1260, 'Calculus for AP Students II', 4],
    ['MATH', 1210, 'Calculus I', 4],
    ['MATH', 1220, 'Calculus II', 4],
    ['MATH', 2210, 'Calculus III', 3],
 
    ['MATH', 2270, 'Linear Algebra', 4],
    ['MATH', 2280, 'Introduction to Differential Equations', 4],
    ['MATH', 3210, 'Foundations of Analysis I', 4],
    ['MATH', 3220, 'Foundations of Analysis II', 4],
    
    ['CS', 1030, 'Foundations of Computer Science', 3],
    ['CS', 1410, 'Introduction to Object-Oriented Programming', 4],
    ['CS', 2420, 'Introduction to Algorithms & Data Structures', 4],
    ['CS', 2100, 'Discrete Structures', 3],
    ['CS', 3500, 'Software Practice', 4],
    ['CS', 3505, 'Software Practice II', 3],
    ['CS', 3810, 'Computer Organization', 4],
    ['CS', 4400, 'Computer Systems', 4],
    ['CS', 4150, 'Algorithms', 3],
    ['CS', 3100, 'Models of Computation', 3],
    ['CS', 3200, 'Introduction to Scientific Computing', 3],
    ['CS', 4000, 'Senior Capstone Project - Design Phase', 3],
    ['CS', 4500, 'Senior Capstone Project', 3],
    ['CS', 4940, 'Undergraduate Research', 3],
    ['CS', 4970, 'Computer Science Bachelor''s Thesis', 3]])

Term.insert(dict(term_year=year, term=term) 
            for year in range(1999, 2025) 
            for term in ['Spring', 'Summer', 'Fall'])

Term().fetch(order_by=('term_year DESC', 'term DESC'), as_dict=True, limit=1)[0]

CurrentTerm().insert1({
    **Term().fetch(order_by=('term_year DESC', 'term DESC'), as_dict=True, limit=1)[0]})

def make_section(prob):
    for c in (Course * Term).proj():
        for sec in 'abcd':
            if random.random() < prob:
                break
            yield {
                **c, 'section': sec, 
                'auditorium': random.choice('ABCDEF') + str(random.randint(1,100))}   

Section.insert(make_section(0.5))

# Enrollment 
terms = Term().fetch('KEY')
quit_prob = 0.1
for student in tqdm(Student.fetch('KEY')):
    start_term = random.randrange(len(terms))
    for term in terms[start_term:]:
        if random.random() < quit_prob:
            break
        else:
            sections = ((Section & term) - (Course & (Enroll & student))).fetch('KEY')
            if sections:
                Enroll.insert({**student, **section} for section in 
                              random.sample(sections, random.randrange(min(5, len(sections)))))

  0%|          | 0/2000 [00:00<?, ?it/s]

100%|██████████| 2000/2000 [00:18<00:00, 110.90it/s]

# assign random grades
grades = LetterGrade.fetch('grade')

grade_keys = Enroll.fetch('KEY')
random.shuffle(grade_keys)
grade_keys = grade_keys[:len(grade_keys)*9//10]

Grade.insert({**key, 'grade':grade} 
             for key, grade in zip(grade_keys, random.choices(grades, k=len(grade_keys))))

Grade()

Student()