Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Operator: Aggregation

The aggregation operator computes summary statistics from related entities. It augments each entity in one table with values computed from matching entities in another table.

Understanding Aggregation

Aggregation answers questions like:

  • How many experiments has each animal participated in?

  • What is each student’s GPA?

  • How many direct reports does each manager have?

The result preserves the primary key of the grouping table while adding computed attributes.

Key Concepts

  • Grouping entity: The table whose entities you’re augmenting (e.g., Student)

  • Aggregated entity: The table whose data is being summarized (e.g., Grade)

  • Algebraic closure: The result has the same primary key as the grouping entity

  • Left join semantics: Entities without matches still appear (with NULL or default values)

Basic Syntax

# Aggregate related entities
result = GroupingTable.aggr(AggregatedTable, new_attr='AGG_FUNC(expression)', ...)

Components

ComponentDescription
GroupingTableThe table whose entities define the groups
AggregatedTableThe table (or query) whose data is being summarized
new_attr='...'Named aggregate expressions using SQL aggregate functions

Aggregate Functions

Common SQL aggregate functions available in expressions:

FunctionDescription
COUNT(*)Count of matching rows
COUNT(attr)Count of non-NULL values
SUM(attr)Sum of values
AVG(attr)Average of values
MIN(attr)Minimum value
MAX(attr)Maximum value
GROUP_CONCAT(attr)Concatenate values into a string

The most common aggregation counts how many related entities exist for each grouping entity.

Example: Count Enrollments per Student

# How many courses is each student enrolled in?
enrollment_counts = Student.aggr(Enroll, n_courses='COUNT(*)')

Result structure:

  • Primary key: student_id (from Student)

  • New attribute: n_courses (count of enrollments)

SQL Equivalent:

SELECT s.*, COUNT(e.student_id) AS n_courses
FROM student s
LEFT JOIN enroll e ON s.student_id = e.student_id
GROUP BY s.student_id;

Example: Count Students per Section

# How many students are enrolled in each section?
section_sizes = Section.aggr(Enroll, n_students='COUNT(*)')

Example: Count Direct Reports per Manager

# For each manager, count their direct reports
managers = Employee.proj(manager_id='employee_id')
report_counts = managers.aggr(ReportsTo, n_reports='COUNT(*)')

Computing Statistics

Aggregation can compute any SQL aggregate function on related data.

Example: Average Grade per Student

# Compute average grade for each student
avg_grades = Student.aggr(Grade, avg_grade='AVG(grade_value)')

Example: GPA Calculation

GPA requires weighting grades by credits:

# Compute weighted GPA for each student
student_gpa = Student.aggr(
    Course * Grade * LetterGrade,
    gpa='SUM(points * credits) / SUM(credits)',
    total_credits='SUM(credits)'
)

Here, Course * Grade * LetterGrade joins the tables to access both credits (from Course) and points (from LetterGrade).

Example: Order Statistics

# For each order, compute item statistics
order_stats = Order.aggr(
    OrderItem,
    n_items='COUNT(*)',
    total='SUM(quantity * unit_price)',
    avg_item_price='AVG(unit_price)'
)

Multiple Aggregate Expressions

You can compute multiple aggregates in a single operation:

# Compute multiple statistics per student
student_stats = Student.aggr(
    Grade,
    n_grades='COUNT(*)',
    avg_grade='AVG(grade_value)',
    min_grade='MIN(grade_value)',
    max_grade='MAX(grade_value)'
)

All aggregate expressions are computed simultaneously for each group.

Aggregation with Restrictions

Apply restrictions to either the grouping table or the aggregated table.

Restricting the Grouping Table

Filter which entities receive aggregated values:

# GPA only for CS majors
cs_student_gpa = (Student & (StudentMajor & {'dept': 'CS'})).aggr(
    Course * Grade * LetterGrade,
    gpa='SUM(points * credits) / SUM(credits)'
)

Restricting the Aggregated Table

Filter which data is included in the aggregation:

# Count only current term enrollments per student
current_enrollments = Student.aggr(
    Enroll & CurrentTerm,
    n_current='COUNT(*)'
)

# Average grade for math courses only
math_avg = Student.aggr(
    Grade & {'dept': 'MATH'},
    math_avg='AVG(grade_value)'
)

Combining Both

# For seniors only, compute GPA from upper-division courses
senior_upper_gpa = (Student & {'class_standing': 'Senior'}).aggr(
    Course * Grade * LetterGrade & 'course >= 3000',
    upper_gpa='SUM(points * credits) / SUM(credits)'
)

Filtering Aggregation Results

After aggregation, you can restrict based on the computed values:

# Students with GPA above 3.5
student_gpa = Student.aggr(
    Course * Grade * LetterGrade,
    gpa='SUM(points * credits) / SUM(credits)'
)
honor_students = student_gpa & 'gpa >= 3.5'
# Sections with more than 30 students
section_sizes = Section.aggr(Enroll, n='COUNT(*)')
large_sections = section_sizes & 'n > 30'

SQL Equivalent (using HAVING):

SELECT s.*, COUNT(*) AS n
FROM section s
LEFT JOIN enroll e USING (dept, course, section_id)
GROUP BY s.dept, s.course, s.section_id
HAVING n > 30;

Left Join Behavior

Aggregation uses left join semantics: all entities from the grouping table appear in the result, even if they have no matching records in the aggregated table.

Example: Students Without Grades

# All students with their grade count (0 for students without grades)
grade_counts = Student.aggr(Grade, n_grades='COUNT(*)')

Students without any grades will have n_grades = 0.

Example: Preserving All Sections

# All sections with enrollment count (0 for empty sections)
all_section_sizes = Section.aggr(Enroll, n_students='COUNT(*)')

Empty sections appear with n_students = 0.

Examples from the University Database

Example 1: Student Statistics

# Comprehensive student statistics
student_stats = Student.aggr(
    Course * Grade * LetterGrade,
    n_courses='COUNT(*)',
    total_credits='SUM(credits)',
    gpa='SUM(points * credits) / SUM(credits)'
)

Example 2: Department Statistics

# Count majors per department
dept_major_counts = Department.aggr(StudentMajor, n_majors='COUNT(*)')

# Average GPA per department (from majors in that department)
dept_gpa = Department.aggr(
    StudentMajor * Student.aggr(
        Course * Grade * LetterGrade,
        gpa='SUM(points * credits) / SUM(credits)'
    ),
    dept_avg_gpa='AVG(gpa)'
)

Example 3: Course Popularity

# Count total enrollments per course (across all sections and terms)
course_popularity = Course.aggr(
    Section * Enroll,
    total_enrollments='COUNT(*)'
)

# Most popular courses
popular_courses = course_popularity & 'total_enrollments > 100'

Example 4: Grade Distribution

# Count each grade type per course
grade_distribution = Course.aggr(
    Grade,
    a_count='SUM(grade="A")',
    b_count='SUM(grade="B")',
    c_count='SUM(grade="C")',
    d_count='SUM(grade="D")',
    f_count='SUM(grade="F")'
)

Aggregation vs. Join

OperationPurposeResult RowsResult Attributes
A.aggr(B, ...)Summarize B for each AOne row per AA’s attributes + aggregates
A * BCombine A and BOne row per A-B pairAll attributes from A and B

Example Comparison

# Join: One row per student-enrollment pair
Student * Enroll  # Many rows per student (one per enrollment)

# Aggregation: One row per student with enrollment count
Student.aggr(Enroll, n='COUNT(*)')  # One row per student

Use join when you need individual related records.
Use aggregation when you need summary statistics per entity.

SQL Translation

DataJoint’s aggregation translates to SQL with LEFT JOIN and GROUP BY:

Basic Aggregation

# DataJoint
Student.aggr(Enroll, n='COUNT(*)')
-- SQL
SELECT s.*, COUNT(e.student_id) AS n
FROM student s
LEFT JOIN enroll e ON s.student_id = e.student_id
GROUP BY s.student_id;

Aggregation with Joined Tables

# DataJoint
Student.aggr(
    Course * Grade * LetterGrade,
    gpa='SUM(points * credits) / SUM(credits)'
)
-- SQL
SELECT s.*, SUM(lg.points * c.credits) / SUM(c.credits) AS gpa
FROM student s
LEFT JOIN grade g ON s.student_id = g.student_id
LEFT JOIN course c ON g.dept = c.dept AND g.course = c.course
LEFT JOIN letter_grade lg ON g.grade = lg.grade
GROUP BY s.student_id;

Aggregation with Restriction on Result

# DataJoint
Student.aggr(Enroll, n='COUNT(*)') & 'n > 5'
-- SQL
SELECT s.*, COUNT(*) AS n
FROM student s
LEFT JOIN enroll e ON s.student_id = e.student_id
GROUP BY s.student_id
HAVING COUNT(*) > 5;

Best Practices

1. Understand the Grouping

The grouping table determines:

  • The primary key of the result

  • Which entities appear in the output

  • The entity type represented by each row

2. Use Meaningful Aggregate Names

# Good: descriptive names
Student.aggr(Enroll, n_enrollments='COUNT(*)', total_credits='SUM(credits)')

# Avoid: generic names
Student.aggr(Enroll, n='COUNT(*)', x='SUM(credits)')

3. Restrict Before Aggregating When Possible

# More efficient: restrict first
(Student & {'home_state': 'CA'}).aggr(Enroll, n='COUNT(*)')

# Less efficient: restrict after
Student.aggr(Enroll, n='COUNT(*)') & {'home_state': 'CA'}

4. Handle NULL Values

When the aggregated table has NULL values, use IFNULL or COALESCE:

Student.aggr(Grade, avg_grade='AVG(IFNULL(grade_value, 0))')

5. Test with Small Data First

# Verify the aggregation works correctly
test_result = (Student & {'student_id': 1001}).aggr(Grade, n='COUNT(*)')
print(test_result.fetch())

Summary

The aggregation operator computes summary statistics from related entities:

  1. Syntax: GroupingTable.aggr(AggregatedTable, new_attr='AGG_FUNC(...)')

  2. Result: Has the same primary key as the grouping table

  3. Left join: All grouping entities appear, even without matches

  4. Multiple aggregates: Compute several statistics in one operation

  5. Chaining: Combine with restrictions and other operators

Use aggregation when you need summary statistics per entity rather than individual related records.

Practice Exercises

Exercise 1: Count Enrollments

Task: Count how many students are enrolled in each section.

section_counts = Section.aggr(Enroll, n_students='COUNT(*)')

Exercise 2: Compute GPA

Task: Compute weighted GPA for each student.

student_gpa = Student.aggr(
    Course * Grade * LetterGrade,
    gpa='SUM(points * credits) / SUM(credits)'
)

Exercise 3: Department Statistics

Task: Count the number of courses offered by each department.

dept_course_counts = Department.aggr(Course, n_courses='COUNT(*)')

Exercise 4: High Enrollment Courses

Task: Find courses with more than 50 total enrollments across all sections.

course_enrollments = Course.aggr(Section * Enroll, total='COUNT(*)')
high_enrollment = course_enrollments & 'total > 50'

Exercise 5: Student Course Load

Task: For each student enrolled in the current term, compute the total credits.

current_load = Student.aggr(
    (Enroll & CurrentTerm) * Course,
    current_credits='SUM(credits)'
)