Operator: Aggregation - The DataJoint Book

(This is an AI-generated template and work in progress.)

The aggregation operator in DataJoint enables users to compute summary statistics or aggregate data from tables. It is a powerful tool for extracting meaningful insights by grouping and summarizing data directly within the database.

Overview of the Aggregation Operator¶

The aggregation operator in DataJoint is implemented using the .aggr() method. This operator allows users to group data by specific attributes and compute aggregate functions such as sums, averages, counts, and more on the grouped data.

Syntax¶

<Table>.aggr(<other_table>, *aggregates, **renamed_aggregates)

Components¶

<other_table>:
- The table or query whose data is being aggregated.
*aggregates:
- A list of aggregate functions to compute.
**renamed_aggregates:
- Key-value pairs for creating new aggregated attributes, where the key is the new attribute name and the value is the aggregate function.

Using Aggregation¶

Example: Counting Rows¶

The simplest aggregation operation is counting the number of rows for each group.

import datajoint as dj

schema = dj.Schema('example_schema')

@schema
class Animal(dj.Manual):
    definition = """
    animal_id: int  # Unique identifier for the animal
    ---
    species: varchar(64)  # Species of the animal
    age: int             # Age of the animal in years
    """

@schema
class Experiment(dj.Manual):
    definition = """
    experiment_id: int  # Unique experiment identifier
    ---
    animal_id: int      # ID of the animal used in the experiment
    result: float       # Result of the experiment
    """

# Insert example data
Animal.insert([
    {'animal_id': 1, 'species': 'Dog', 'age': 5},
    {'animal_id': 2, 'species': 'Cat', 'age': 3},
    {'animal_id': 3, 'species': 'Rabbit', 'age': 2}
])

Experiment.insert([
    {'experiment_id': 101, 'animal_id': 1, 'result': 75.0},
    {'experiment_id': 102, 'animal_id': 1, 'result': 82.5},
    {'experiment_id': 103, 'animal_id': 2, 'result': 90.0}
])

# Aggregate experiments by animal_id, counting rows
experiment_counts = Animal.aggr(Experiment, count='count(*)')
print(experiment_counts.fetch())

Example: Computing Summary Statistics¶

Aggregation can also compute summary statistics such as sums, averages, and maximums.

# Aggregate experiments by animal_id, computing the average result
average_results = Animal.aggr(Experiment, avg_result='avg(result)')
print(average_results.fetch())

Combining Aggregation with Restrictions¶

Aggregation can be combined with restrictions to focus on specific subsets of data.

Example¶

# Compute the average result for animals older than 3 years
restricted_avg = (Animal & 'age > 3').aggr(Experiment, avg_result='avg(result)')
print(restricted_avg.fetch())

Best Practices¶

Understand Grouping:
- The grouping is determined by the attributes of the primary table (e.g., Animal in the examples).
Use Meaningful Aggregates:
- Choose aggregates that provide actionable insights, such as averages, counts, or maximum values.
Test Incrementally:
- Test your aggregations with smaller datasets to verify correctness before applying them to larger datasets.
Combine with Restrictions:
- Apply restrictions to narrow down the data being aggregated for more focused results.
Avoid Ambiguity:
- Clearly define attribute names for aggregates using renamed_aggregates to avoid confusion in the results.

Summary¶

The aggregation operator in DataJoint is an essential tool for summarizing data. By computing statistics like counts, averages, and more, it allows users to derive insights from their pipelines. Mastering this operator will enable efficient and meaningful data analysis directly within your database.

import pymysql
pymysql.install_as_MySQLdb()
%load_ext sql
%config SqlMagic.autocommit=True
%sql mysql://root:simple@127.0.0.1

import datajoint as dj

sales = dj.Schema('classicsales')
sales.spawn_missing_classes()

nations = dj.Schema('nation')
nations.spawn_missing_classes()

hotel = dj.Schema('hotel')
hotel.spawn_missing_classes()

university = dj.Schema('university')
university.spawn_missing_classes()

app = dj.Schema('app')
app.spawn_missing_classes()

[2023-11-01 00:25:59,466][INFO]: Connecting root@fakeservices.datajoint.io:3306
[2023-11-01 00:25:59,482][INFO]: Connected root@fakeservices.datajoint.io:3306

Concepts¶

Review the MySQL aggregate functions: https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html

Three types of queries

Aggregation functions with no GROUP BY clause produce 1 row.
Aggregation functions combined with a GROUP BY clause. The unique key of the result is composed of the columns of the GROUP BY clause.
Most common pattern: JOIN or LEFT JOIN of a table pair in a one-to-many relationship, grouped by the primary key of the left table. This aggregates the right entity set with respect to the left entity set.

Note that MySQL with the default settings allows mixing aggregated and non-aggregated values (See https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_only_full_group_by). So you have to watch avoid invalid mixes of values.

Using HAVING is equivalent to using a WHERE clause in an outer query.

import datajoint as dj
schema = dj.Schema('app')
schema.spawn_missing_classes()
dj.Diagram(schema)

import pymysql
pymysql.install_as_MySQLdb()
%load_ext sql
%config SqlMagic.autocommit=True
%sql mysql://root:simple@127.0.0.1

Aggregation Queries¶

Queries using aggregation functions, GROUP BY, and HAVING clauses. Using LEFT JOIN in combination with GROUP BY.

Aggregation functions: MAX, MIN, AVG, SUM, and COUNT.

%%sql
USE app

%%sql
-- show the date of the last purchase 
SELECT * FROM purchase ORDER BY purchase_date DESC LIMIT 1

%%sql
-- show the data of the last pruchase 
SELECT max(purchase_date) last_purchase, min(purchase_date) first_purchase, phone, card_number FROM purchase

Aggregation functions MAX, MIN, AVG, SUM, COUNT¶

%%sql
-- show the date of birth of the youngest person
SELECT * FROM account ORDER BY dob DESC LIMIT 1

%%sql
-- show the date of birth of the youngest person 
-- This is an invalid query because it mixes aggregation and regular fields
SELECT max(dob) as dob, phone  FROM account

%%sql
SELECT * FROM account where phone=10013740006

%%sql
-- show the youngest person 
SELECT * FROM account WHERE dob = (SELECT max(dob) FROM account)

# show average male age
dj.U().aggr(Account & 'sex="M"' , avg_age="floor(avg(DATEDIFF(now(), dob)) / 365.25)")

%%sql
SELECT floor(avg(DATEDIFF(now(), dob)) / 365.25) as avg_age FROM account WHERE sex="M"

%%sql
SELECT count(*), count(phone), count(DISTINCT first_name, last_name), count(dob) FROM account;

%%sql
-- show how many of purchases have been done for each addon

SELECT addon_id, count(*) n FROM purchase GROUP BY addon_id

%%sql
SELECT * FROM `#add_on` LIMIT 10

%%sql
SELECT * FROM purchase NATURAL JOIN `#add_on` LIMIT 10

%%sql
-- show the total money spent by each account (limit to top 10)

SELECT phone, sum(price) as total_spending 
    FROM purchase NATURAL JOIN `#add_on` 
    GROUP BY (phone) 
    ORDER BY total_spending DESC LIMIT 10

%%sql
-- show the names of people who spent less than $100

SELECT phone, sum(price) as total_spending 
    FROM purchase NATURAL JOIN `#add_on` 
    WHERE total_spending < 100
    GROUP BY (phone) 
    LIMIT 10

%%sql
-- show the names of people who spent less than $100

SELECT * FROM (
    SELECT phone, first_name, last_name, sum(price) as total_spending 
    FROM account NATURAL JOIN purchase NATURAL JOIN `#add_on` 
    GROUP BY (phone)) as q    
WHERE total_spending < 100
LIMIT 10

-- almost correct but does not include people who spent nothing

%%sql 
-- explaining LEFT joins
SELECT * FROM account NATURAL LEFT JOIN purchase NATURAL LEFT JOIN `#add_on` LIMIT 10

%%sql
-- show the names of people who spent less than $100
SELECT * FROM (
    SELECT phone, first_name, last_name, sum(ifnull(price), 0) as total_spending 
    FROM account NATURAL LEFT JOIN purchase NATURAL LEFT JOIN `#add_on` 
    GROUP BY (phone)) as q    
WHERE total_spending < 100
LIMIT 10

Summary of principles¶

Without a GROUP BY, aggregation functions collapse the table into a single row.
With GROUP BY, the grouping attributes become the new primary key of the result.
Do not mix aggregated and non-aggregated values in the result with or without a GROUP BY.
HAVING plays the same role as the WHERE clause in a nesting outer query so it can use the output of the aggregation functions.
LEFT JOIN is often follwed with a GROUP BY by the primary key attributes of the left table. In this scenario the entities in the right table are aggregated for each matching row in the left table.

Always aggregate entity B grouped by entity A
Then GROUP BY the primary of A
Aggregate the attribute B but not A
SELECT non-aggregated attributes of A but not B
Use an left join if you need to include rows of A for which there is no match in B

dj.Diagram(sales)

dj.Diagram(sales)

# Show all the orders made in March 2003 and the total number of items on each
(Order & 'order_date between "2003-03-01" and "2003-03-31"').aggr(Order.Item(),  n='count(*)', keep_all_rows=True)

# SHOW ALL the employees, the number of their direct reports

Employee.proj(reports_to='employee_number').aggr(Report, n='count(employee_number)')

%%sql

use classicsales;

SELECT employee.employee_number, first_name, last_name, count(report.employee_number) as n 
FROM employee LEFT JOIN report ON (employee.employee_number = report.reports_to)
GROUP BY employee.employee_number

%%sql

SELECT employee.employee_number, first_name, last_name, report.employee_number as subordinate 
FROM employee LEFT JOIN report ON (employee.employee_number = report.reports_to)