Why University Databases?¶
University registration databases are among the most common examples in introductory database design texts. They provide an ideal learning environment because:
Familiar domain: Everyone understands students, courses, enrollments, and grades
Rich relationships: Demonstrate one-to-many (student → enrollments), many-to-many (students ↔ courses), and hierarchical structures (department → course → section)
Real-world complexity: Capture actual business rules like prerequisites, term-based scheduling, and grade tracking
Clear entity types: Students, courses, departments, and terms are easily recognizable entities
This example demonstrates how relational workflow principles and normalization work together to create a well-structured database that accurately represents the university’s operations.
Applying Relational Workflow Principles¶
This schema follows DataJoint’s Relational Workflow Model, where each table represents entities created at specific workflow steps:
Student registration (
Student): Students are registered when they first enrollDepartment creation (
Department): Departments exist independently and are created administrativelyMajor declaration (
StudentMajor): Students declare majors at a specific point in timeCourse catalog (
Course): Courses are defined by departments before they can be offeredTerm scheduling (
Term,CurrentTerm): Academic terms are created, and one is designated as currentSection creation (
Section): Course sections are created for specific termsEnrollment (
Enroll): Students enroll in sections during registration periodsGrading (
Grade): Grades are assigned after course completion
The foreign keys (->) establish workflow dependencies that prescribe the order of operations. For example, you cannot enroll a student (Enroll) until both the student (Student) and section (Section) exist. You cannot create a section (Section) until the course (Course) and term (Term) exist. This DAG structure ensures workflows execute in the correct sequence.
Applying Normalization Principles¶
This design follows entity normalization principles, where each table represents exactly one well-defined entity type:
Student: One table for student entities with their permanent attributesDepartment: One table for department entitiesCourse: One table for course entities (depends on Department)Term: One table for academic term entitiesSection: One table for section entities (depends on Course and Term)Enroll: Represents the enrollment relationship between Student and SectionGrade: Represents the grading of an enrollment (depends on Enroll)
Key normalization decisions:
StudentMajor is separate from Student: A student can have multiple majors over time, and each major declaration has its own date. This avoids update anomalies and preserves history.
Grade is separate from Enroll: Not all enrollments have grades (students may drop), and grades are assigned at a different workflow step than enrollment.
Section combines Course and Term: A section represents a specific offering of a course in a specific term, requiring both as part of its identity.
CurrentTerm uses a singleton pattern: Only one term can be current at a time, enforced by a surrogate key (
omega) with a single value.
This normalized design eliminates redundancy, prevents update anomalies, and ensures each fact is stored in exactly one place. The foreign keys maintain referential integrity while also establishing the workflow execution order.
Schema definition¶
The schema definition code is in the university.py module:
%pycat university.pyimport datajoint as dj
schema = dj.Schema('university')
@schema
class Student(dj.Manual):
definition = """
student_id : int unsigned # university-wide ID number
---
first_name : varchar(40)
last_name : varchar(40)
sex : enum('F', 'M', 'U')
date_of_birth : date
home_address : varchar(120) # mailing street address
home_city : varchar(60) # mailing address
home_state : char(2) # US state acronym: e.g. OH
home_zip : char(10) # zipcode e.g. 93979-4979
home_phone : varchar(20) # e.g. 414.657.6883x0881
"""
@schema
class Department(dj.Manual):
definition = """
dept : varchar(6) # abbreviated department name, e.g. BIOL
---
dept_name : varchar(200) # full department name
dept_address : varchar(200) # mailing address
dept_phone : varchar(20)
"""
@schema
class StudentMajor(dj.Manual):
definition = """
-> Student
---
-> Department
declare_date : date # when student declared her major
"""
@schema
class Course(dj.Manual):
definition = """
-> Department
course : int unsigned # course number, e.g. 1010
---
course_name : varchar(200) # e.g. "Neurobiology of Sensation and Movement."
credits : decimal(3,1) # number of credits earned by completing the course
"""
@schema
class Term(dj.Manual):
definition = """
term_year : year
term : enum('Spring', 'Summer', 'Fall')
"""
@schema
class Section(dj.Manual):
definition = """
-> Course
-> Term
section : char(1)
---
auditorium : varchar(12)
"""
@schema
class CurrentTerm(dj.Manual):
definition = """
omega : enum('1') # surrogate
---
-> Term
"""
@schema
class Enroll(dj.Manual):
definition = """
-> Student
-> Section
"""
@schema
class LetterGrade(dj.Lookup):
definition = """
grade : char(2)
---
points : decimal(3,2)
"""
contents = [
['A', 4.00],
['A-', 3.67],
['B+', 3.33],
['B', 3.00],
['B-', 2.67],
['C+', 2.33],
['C', 2.00],
['C-', 1.67],
['D+', 1.33],
['D', 1.00],
['F', 0.00]
]
@schema
class Grade(dj.Manual):
definition = """
-> Enroll
---
-> LetterGrade
"""
from university import *[2025-09-12 18:32:22,035][INFO]: DataJoint 0.14.6 connected to root@localhost:3306
dj.Diagram(schema)Populate¶
To keep the example self-contained we generate synthetic data programmatically. The goal is to approximate how a registrar’s office would introduce information into the system while exercising every workflow dependency in the schema.
Our approach:
Guard against re-population – we check whether
Gradealready has records and stop if the schema was populated before. This keeps the notebook idempotent.Create base entities with Faker – a helper generator (
yield_students) continuously yields realistic student records (names, addresses, dates of birth, phone numbers). We use it to insert 2,000 students intoStudent.Insert reference data manually – departments and courses are seeded from curated lists so the catalog looks plausible.
Generate workflow records – in dependency order we insert majors (
StudentMajor), terms (Term,CurrentTerm), sections (Section), and finally enrollments (Enroll). Random choices guarantee variety while respecting dependencies (e.g., a section cannot exist without its course and term).Assign grades – once enrollments exist, we assign grades to roughly 90% of them using
LetterGrade, leaving some records ungraded to mimic withdrawals or incomplete courses.
The cells that follow execute these steps in the correct order so that each table is populated only after its parents exist, showcasing the relational workflow in action.
if Grade():
raise RuntimeError(
"Already populated. You only need to execute this notebook once.\n"
"Drop the schema and restart the notebook to re-populate.")
# schema.drop()from tqdm import tqdm
import faker
import random
import datetime
fake = faker.Faker()
faker.Faker.seed(42) # For reproducible resultsdef yield_students():
fake_name = {'F': fake.name_female, 'M': fake.name_male}
while True: # ignore invalid values
try:
sex = random.choice(('F', 'M'))
first_name, last_name = fake_name[sex]().split(' ')[:2]
street_address, city = fake.address().split('\n')
city, state = city.split(', ')
state, zipcode = state.split(' ')
except ValueError:
continue
else:
yield dict(
first_name=first_name,
last_name=last_name,
sex=sex,
home_address=street_address,
home_city=city,
home_state=state,
home_zip=zipcode,
date_of_birth=str(
fake.date_time_between(start_date="-35y", end_date="-15y").date()),
home_phone = fake.phone_number()[:20])Student.insert(
dict(k, student_id=i) for i, k in zip(range(1000,3000), yield_students()))Student()Department.insert(
dict(dept=dept,
dept_name=name,
dept_address=fake.address(),
dept_phone=fake.phone_number()[:20])
for dept, name in [
["CS", "Computer Science"],
["BIOL", "Life Sciences"],
["PHYS", "Physics"],
["MATH", "Mathematics"]])StudentMajor.insert({**s, **d,
'declare_date':fake.date_between(start_date=datetime.date(1999,1,1))}
for s, d in zip(Student.fetch('KEY'), random.choices(Department.fetch('KEY'), k=len(Student())))
if random.random() < 0.75)StudentMajor()# from https://www.utah.edu/
Course.insert([
['BIOL', 1006, 'World of Dinosaurs', 3],
['BIOL', 1010, 'Biology in the 21st Century', 3],
['BIOL', 1030, 'Human Biology', 3],
['BIOL', 1210, 'Principles of Biology', 4],
['BIOL', 2010, 'Evolution & Diversity of Life', 3],
['BIOL', 2020, 'Principles of Cell Biology', 3],
['BIOL', 2021, 'Principles of Cell Science', 4],
['BIOL', 2030, 'Principles of Genetics', 3],
['BIOL', 2210, 'Human Genetics',3],
['BIOL', 2325, 'Human Anatomy', 4],
['BIOL', 2330, 'Plants & Society', 3],
['BIOL', 2355, 'Field Botany', 2],
['BIOL', 2420, 'Human Physiology', 4],
['PHYS', 2040, 'Classcal Theoretical Physics II', 4],
['PHYS', 2060, 'Quantum Mechanics', 3],
['PHYS', 2100, 'General Relativity and Cosmology', 3],
['PHYS', 2140, 'Statistical Mechanics', 4],
['PHYS', 2210, 'Physics for Scientists and Engineers I', 4],
['PHYS', 2220, 'Physics for Scientists and Engineers II', 4],
['PHYS', 3210, 'Physics for Scientists I (Honors)', 4],
['PHYS', 3220, 'Physics for Scientists II (Honors)', 4],
['MATH', 1250, 'Calculus for AP Students I', 4],
['MATH', 1260, 'Calculus for AP Students II', 4],
['MATH', 1210, 'Calculus I', 4],
['MATH', 1220, 'Calculus II', 4],
['MATH', 2210, 'Calculus III', 3],
['MATH', 2270, 'Linear Algebra', 4],
['MATH', 2280, 'Introduction to Differential Equations', 4],
['MATH', 3210, 'Foundations of Analysis I', 4],
['MATH', 3220, 'Foundations of Analysis II', 4],
['CS', 1030, 'Foundations of Computer Science', 3],
['CS', 1410, 'Introduction to Object-Oriented Programming', 4],
['CS', 2420, 'Introduction to Algorithms & Data Structures', 4],
['CS', 2100, 'Discrete Structures', 3],
['CS', 3500, 'Software Practice', 4],
['CS', 3505, 'Software Practice II', 3],
['CS', 3810, 'Computer Organization', 4],
['CS', 4400, 'Computer Systems', 4],
['CS', 4150, 'Algorithms', 3],
['CS', 3100, 'Models of Computation', 3],
['CS', 3200, 'Introduction to Scientific Computing', 3],
['CS', 4000, 'Senior Capstone Project - Design Phase', 3],
['CS', 4500, 'Senior Capstone Project', 3],
['CS', 4940, 'Undergraduate Research', 3],
['CS', 4970, 'Computer Science Bachelor''s Thesis', 3]])Term.insert(dict(term_year=year, term=term)
for year in range(1999, 2025)
for term in ['Spring', 'Summer', 'Fall'])
Term().fetch(order_by=('term_year DESC', 'term DESC'), as_dict=True, limit=1)[0]
CurrentTerm().insert1({
**Term().fetch(order_by=('term_year DESC', 'term DESC'), as_dict=True, limit=1)[0]})
def make_section(prob):
for c in (Course * Term).proj():
for sec in 'abcd':
if random.random() < prob:
break
yield {
**c, 'section': sec,
'auditorium': random.choice('ABCDEF') + str(random.randint(1,100))}
Section.insert(make_section(0.5))# Enrollment
terms = Term().fetch('KEY')
quit_prob = 0.1
for student in tqdm(Student.fetch('KEY')):
start_term = random.randrange(len(terms))
for term in terms[start_term:]:
if random.random() < quit_prob:
break
else:
sections = ((Section & term) - (Course & (Enroll & student))).fetch('KEY')
if sections:
Enroll.insert({**student, **section} for section in
random.sample(sections, random.randrange(min(5, len(sections))))) 0%| | 0/2000 [00:00<?, ?it/s]100%|██████████| 2000/2000 [00:18<00:00, 110.90it/s]
# assign random grades
grades = LetterGrade.fetch('grade')
grade_keys = Enroll.fetch('KEY')
random.shuffle(grade_keys)
grade_keys = grade_keys[:len(grade_keys)*9//10]
Grade.insert({**key, 'grade':grade}
for key, grade in zip(grade_keys, random.choices(grades, k=len(grade_keys))))Grade()Student()