The aggregation operator computes summary statistics from related entities. It augments each entity in one table with values computed from matching entities in another table.
Understanding Aggregation¶
Aggregation answers questions like:
How many experiments has each animal participated in?
What is each student’s GPA?
How many direct reports does each manager have?
The result preserves the primary key of the grouping table while adding computed attributes.
Key Concepts¶
Grouping entity: The table whose entities you’re augmenting (e.g., Student)
Aggregated entity: The table whose data is being summarized (e.g., Grade)
Algebraic closure: The result has the same primary key as the grouping entity
Left join semantics: Entities without matches still appear (with NULL or default values)
Basic Syntax¶
# Aggregate related entities
result = GroupingTable.aggr(AggregatedTable, new_attr='AGG_FUNC(expression)', ...)Components¶
| Component | Description |
|---|---|
GroupingTable | The table whose entities define the groups |
AggregatedTable | The table (or query) whose data is being summarized |
new_attr='...' | Named aggregate expressions using SQL aggregate functions |
Aggregate Functions¶
Common SQL aggregate functions available in expressions:
| Function | Description |
|---|---|
COUNT(*) | Count of matching rows |
COUNT(attr) | Count of non-NULL values |
SUM(attr) | Sum of values |
AVG(attr) | Average of values |
MIN(attr) | Minimum value |
MAX(attr) | Maximum value |
GROUP_CONCAT(attr) | Concatenate values into a string |
Counting Related Entities¶
The most common aggregation counts how many related entities exist for each grouping entity.
Example: Count Enrollments per Student¶
# How many courses is each student enrolled in?
enrollment_counts = Student.aggr(Enroll, n_courses='COUNT(*)')Result structure:
Primary key:
student_id(from Student)New attribute:
n_courses(count of enrollments)
SQL Equivalent:
SELECT s.*, COUNT(e.student_id) AS n_courses
FROM student s
LEFT JOIN enroll e ON s.student_id = e.student_id
GROUP BY s.student_id;Example: Count Students per Section¶
# How many students are enrolled in each section?
section_sizes = Section.aggr(Enroll, n_students='COUNT(*)')Example: Count Direct Reports per Manager¶
# For each manager, count their direct reports
managers = Employee.proj(manager_id='employee_id')
report_counts = managers.aggr(ReportsTo, n_reports='COUNT(*)')Computing Statistics¶
Aggregation can compute any SQL aggregate function on related data.
Example: Average Grade per Student¶
# Compute average grade for each student
avg_grades = Student.aggr(Grade, avg_grade='AVG(grade_value)')Example: GPA Calculation¶
GPA requires weighting grades by credits:
# Compute weighted GPA for each student
student_gpa = Student.aggr(
Course * Grade * LetterGrade,
gpa='SUM(points * credits) / SUM(credits)',
total_credits='SUM(credits)'
)Here, Course * Grade * LetterGrade joins the tables to access both credits (from Course) and points (from LetterGrade).
Example: Order Statistics¶
# For each order, compute item statistics
order_stats = Order.aggr(
OrderItem,
n_items='COUNT(*)',
total='SUM(quantity * unit_price)',
avg_item_price='AVG(unit_price)'
)Multiple Aggregate Expressions¶
You can compute multiple aggregates in a single operation:
# Compute multiple statistics per student
student_stats = Student.aggr(
Grade,
n_grades='COUNT(*)',
avg_grade='AVG(grade_value)',
min_grade='MIN(grade_value)',
max_grade='MAX(grade_value)'
)All aggregate expressions are computed simultaneously for each group.
Aggregation with Restrictions¶
Apply restrictions to either the grouping table or the aggregated table.
Restricting the Grouping Table¶
Filter which entities receive aggregated values:
# GPA only for CS majors
cs_student_gpa = (Student & (StudentMajor & {'dept': 'CS'})).aggr(
Course * Grade * LetterGrade,
gpa='SUM(points * credits) / SUM(credits)'
)Restricting the Aggregated Table¶
Filter which data is included in the aggregation:
# Count only current term enrollments per student
current_enrollments = Student.aggr(
Enroll & CurrentTerm,
n_current='COUNT(*)'
)
# Average grade for math courses only
math_avg = Student.aggr(
Grade & {'dept': 'MATH'},
math_avg='AVG(grade_value)'
)Combining Both¶
# For seniors only, compute GPA from upper-division courses
senior_upper_gpa = (Student & {'class_standing': 'Senior'}).aggr(
Course * Grade * LetterGrade & 'course >= 3000',
upper_gpa='SUM(points * credits) / SUM(credits)'
)Filtering Aggregation Results¶
After aggregation, you can restrict based on the computed values:
# Students with GPA above 3.5
student_gpa = Student.aggr(
Course * Grade * LetterGrade,
gpa='SUM(points * credits) / SUM(credits)'
)
honor_students = student_gpa & 'gpa >= 3.5'# Sections with more than 30 students
section_sizes = Section.aggr(Enroll, n='COUNT(*)')
large_sections = section_sizes & 'n > 30'SQL Equivalent (using HAVING):
SELECT s.*, COUNT(*) AS n
FROM section s
LEFT JOIN enroll e USING (dept, course, section_id)
GROUP BY s.dept, s.course, s.section_id
HAVING n > 30;Left Join Behavior¶
Aggregation uses left join semantics: all entities from the grouping table appear in the result, even if they have no matching records in the aggregated table.
Example: Students Without Grades¶
# All students with their grade count (0 for students without grades)
grade_counts = Student.aggr(Grade, n_grades='COUNT(*)')Students without any grades will have n_grades = 0.
Example: Preserving All Sections¶
# All sections with enrollment count (0 for empty sections)
all_section_sizes = Section.aggr(Enroll, n_students='COUNT(*)')Empty sections appear with n_students = 0.
Examples from the University Database¶
Example 1: Student Statistics¶
# Comprehensive student statistics
student_stats = Student.aggr(
Course * Grade * LetterGrade,
n_courses='COUNT(*)',
total_credits='SUM(credits)',
gpa='SUM(points * credits) / SUM(credits)'
)Example 2: Department Statistics¶
# Count majors per department
dept_major_counts = Department.aggr(StudentMajor, n_majors='COUNT(*)')
# Average GPA per department (from majors in that department)
dept_gpa = Department.aggr(
StudentMajor * Student.aggr(
Course * Grade * LetterGrade,
gpa='SUM(points * credits) / SUM(credits)'
),
dept_avg_gpa='AVG(gpa)'
)Example 3: Course Popularity¶
# Count total enrollments per course (across all sections and terms)
course_popularity = Course.aggr(
Section * Enroll,
total_enrollments='COUNT(*)'
)
# Most popular courses
popular_courses = course_popularity & 'total_enrollments > 100'Example 4: Grade Distribution¶
# Count each grade type per course
grade_distribution = Course.aggr(
Grade,
a_count='SUM(grade="A")',
b_count='SUM(grade="B")',
c_count='SUM(grade="C")',
d_count='SUM(grade="D")',
f_count='SUM(grade="F")'
)Aggregation vs. Join¶
| Operation | Purpose | Result Rows | Result Attributes |
|---|---|---|---|
A.aggr(B, ...) | Summarize B for each A | One row per A | A’s attributes + aggregates |
A * B | Combine A and B | One row per A-B pair | All attributes from A and B |
Example Comparison¶
# Join: One row per student-enrollment pair
Student * Enroll # Many rows per student (one per enrollment)
# Aggregation: One row per student with enrollment count
Student.aggr(Enroll, n='COUNT(*)') # One row per studentUse join when you need individual related records.
Use aggregation when you need summary statistics per entity.
SQL Translation¶
DataJoint’s aggregation translates to SQL with LEFT JOIN and GROUP BY:
Basic Aggregation¶
# DataJoint
Student.aggr(Enroll, n='COUNT(*)')-- SQL
SELECT s.*, COUNT(e.student_id) AS n
FROM student s
LEFT JOIN enroll e ON s.student_id = e.student_id
GROUP BY s.student_id;Aggregation with Joined Tables¶
# DataJoint
Student.aggr(
Course * Grade * LetterGrade,
gpa='SUM(points * credits) / SUM(credits)'
)-- SQL
SELECT s.*, SUM(lg.points * c.credits) / SUM(c.credits) AS gpa
FROM student s
LEFT JOIN grade g ON s.student_id = g.student_id
LEFT JOIN course c ON g.dept = c.dept AND g.course = c.course
LEFT JOIN letter_grade lg ON g.grade = lg.grade
GROUP BY s.student_id;Aggregation with Restriction on Result¶
# DataJoint
Student.aggr(Enroll, n='COUNT(*)') & 'n > 5'-- SQL
SELECT s.*, COUNT(*) AS n
FROM student s
LEFT JOIN enroll e ON s.student_id = e.student_id
GROUP BY s.student_id
HAVING COUNT(*) > 5;Best Practices¶
1. Understand the Grouping¶
The grouping table determines:
The primary key of the result
Which entities appear in the output
The entity type represented by each row
2. Use Meaningful Aggregate Names¶
# Good: descriptive names
Student.aggr(Enroll, n_enrollments='COUNT(*)', total_credits='SUM(credits)')
# Avoid: generic names
Student.aggr(Enroll, n='COUNT(*)', x='SUM(credits)')3. Restrict Before Aggregating When Possible¶
# More efficient: restrict first
(Student & {'home_state': 'CA'}).aggr(Enroll, n='COUNT(*)')
# Less efficient: restrict after
Student.aggr(Enroll, n='COUNT(*)') & {'home_state': 'CA'}4. Handle NULL Values¶
When the aggregated table has NULL values, use IFNULL or COALESCE:
Student.aggr(Grade, avg_grade='AVG(IFNULL(grade_value, 0))')5. Test with Small Data First¶
# Verify the aggregation works correctly
test_result = (Student & {'student_id': 1001}).aggr(Grade, n='COUNT(*)')
print(test_result.fetch())Summary¶
The aggregation operator computes summary statistics from related entities:
Syntax:
GroupingTable.aggr(AggregatedTable, new_attr='AGG_FUNC(...)')Result: Has the same primary key as the grouping table
Left join: All grouping entities appear, even without matches
Multiple aggregates: Compute several statistics in one operation
Chaining: Combine with restrictions and other operators
Use aggregation when you need summary statistics per entity rather than individual related records.
Practice Exercises¶
Exercise 1: Count Enrollments¶
Task: Count how many students are enrolled in each section.
section_counts = Section.aggr(Enroll, n_students='COUNT(*)')Exercise 2: Compute GPA¶
Task: Compute weighted GPA for each student.
student_gpa = Student.aggr(
Course * Grade * LetterGrade,
gpa='SUM(points * credits) / SUM(credits)'
)Exercise 3: Department Statistics¶
Task: Count the number of courses offered by each department.
dept_course_counts = Department.aggr(Course, n_courses='COUNT(*)')Exercise 4: High Enrollment Courses¶
Task: Find courses with more than 50 total enrollments across all sections.
course_enrollments = Course.aggr(Section * Enroll, total='COUNT(*)')
high_enrollment = course_enrollments & 'total > 50'Exercise 5: Student Course Load¶
Task: For each student enrolled in the current term, compute the total credits.
current_load = Student.aggr(
(Enroll & CurrentTerm) * Course,
current_credits='SUM(credits)'
)