Skip to article frontmatterSkip to article content

Aggregation queries

import pymysql
pymysql.install_as_MySQLdb()
%load_ext sql
%config SqlMagic.autocommit=True
%sql mysql://root:simple@127.0.0.1
import datajoint as dj

sales = dj.Schema('classicsales')
sales.spawn_missing_classes()

nations = dj.Schema('nation')
nations.spawn_missing_classes()

hotel = dj.Schema('hotel')
hotel.spawn_missing_classes()

university = dj.Schema('university')
university.spawn_missing_classes()

app = dj.Schema('app')
app.spawn_missing_classes()
[2023-11-01 00:25:59,466][INFO]: Connecting root@fakeservices.datajoint.io:3306
[2023-11-01 00:25:59,482][INFO]: Connected root@fakeservices.datajoint.io:3306

Concepts

Review the MySQL aggregate functions: https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html

Three types of queries

  1. Aggregation functions with no GROUP BY clause produce 1 row.
  2. Aggregation functions combined with a GROUP BY clause. The unique key of the result is composed of the columns of the GROUP BY clause.
  3. Most common pattern: JOIN or LEFT JOIN of a table pair in a one-to-many relationship, grouped by the primary key of the left table. This aggregates the right entity set with respect to the left entity set.

Note that MySQL with the default settings allows mixing aggregated and non-aggregated values (See https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_only_full_group_by). So you have to watch avoid invalid mixes of values.

Using HAVING is equivalent to using a WHERE clause in an outer query.

import datajoint as dj
schema = dj.Schema('app')
schema.spawn_missing_classes()
dj.Diagram(schema)
import pymysql
pymysql.install_as_MySQLdb()
%load_ext sql
%config SqlMagic.autocommit=True
%sql mysql://root:simple@127.0.0.1

Aggregation Queries

Queries using aggregation functions, GROUP BY, and HAVING clauses. Using LEFT JOIN in combination with GROUP BY.

Aggregation functions: MAX, MIN, AVG, SUM, and COUNT.

%%sql
USE app
%%sql
-- show the date of the last purchase 
SELECT * FROM purchase ORDER BY purchase_date DESC LIMIT 1 
%%sql
-- show the data of the last pruchase 
SELECT max(purchase_date) last_purchase, min(purchase_date) first_purchase, phone, card_number FROM purchase

Aggregation functions MAX, MIN, AVG, SUM, COUNT

%%sql
-- show the date of birth of the youngest person
SELECT * FROM account ORDER BY dob DESC LIMIT 1
%%sql
-- show the date of birth of the youngest person 
-- This is an invalid query because it mixes aggregation and regular fields
SELECT max(dob) as dob, phone  FROM account
%%sql
SELECT * FROM account where phone=10013740006
%%sql
-- show the youngest person 
SELECT * FROM account WHERE dob = (SELECT max(dob) FROM account)
# show average male age
dj.U().aggr(Account & 'sex="M"' , avg_age="floor(avg(DATEDIFF(now(), dob)) / 365.25)")
%%sql
SELECT floor(avg(DATEDIFF(now(), dob)) / 365.25) as avg_age FROM account WHERE sex="M"
%%sql
SELECT count(*), count(phone), count(DISTINCT first_name, last_name), count(dob) FROM account;
%%sql
-- show how many of purchases have been done for each addon

SELECT addon_id, count(*) n FROM purchase GROUP BY addon_id 
%%sql
SELECT * FROM `#add_on` LIMIT 10
%%sql
SELECT * FROM purchase NATURAL JOIN `#add_on` LIMIT 10
%%sql
-- show the total money spent by each account (limit to top 10)

SELECT phone, sum(price) as total_spending 
    FROM purchase NATURAL JOIN `#add_on` 
    GROUP BY (phone) 
    ORDER BY total_spending DESC LIMIT 10
%%sql
-- show the names of people who spent less than $100

SELECT phone, sum(price) as total_spending 
    FROM purchase NATURAL JOIN `#add_on` 
    WHERE total_spending < 100
    GROUP BY (phone) 
    LIMIT 10
%%sql
-- show the names of people who spent less than $100

SELECT * FROM (
    SELECT phone, first_name, last_name, sum(price) as total_spending 
    FROM account NATURAL JOIN purchase NATURAL JOIN `#add_on` 
    GROUP BY (phone)) as q    
WHERE total_spending < 100
LIMIT 10

-- almost correct but does not include people who spent nothing
%%sql 
-- explaining LEFT joins
SELECT * FROM account NATURAL LEFT JOIN purchase NATURAL LEFT JOIN `#add_on` LIMIT 10
%%sql
-- show the names of people who spent less than $100
SELECT * FROM (
    SELECT phone, first_name, last_name, sum(ifnull(price), 0) as total_spending 
    FROM account NATURAL LEFT JOIN purchase NATURAL LEFT JOIN `#add_on` 
    GROUP BY (phone)) as q    
WHERE total_spending < 100
LIMIT 10

Summary of principles

  1. Without a GROUP BY, aggregation functions collapse the table into a single row.
  2. With GROUP BY, the grouping attributes become the new primary key of the result.
  3. Do not mix aggregated and non-aggregated values in the result with or without a GROUP BY.
  4. HAVING plays the same role as the WHERE clause in a nesting outer query so it can use the output of the aggregation functions.
  5. LEFT JOIN is often follwed with a GROUP BY by the primary key attributes of the left table. In this scenario the entities in the right table are aggregated for each matching row in the left table.
  1. Always aggregate entity B grouped by entity A
  2. Then GROUP BY the primary of A
  3. Aggregate the attribute B but not A
  4. SELECT non-aggregated attributes of A but not B
  5. Use an left join if you need to include rows of A for which there is no match in B

dj.Diagram(sales)

dj.Diagram(sales)
Loading...
# Show all the orders made in March 2003 and the total number of items on each
(Order & 'order_date between "2003-03-01" and "2003-03-31"').aggr(Order.Item(),  n='count(*)', keep_all_rows=True)
Loading...
# SHOW ALL the employees, the number of their direct reports

Employee.proj(reports_to='employee_number').aggr(Report, n='count(employee_number)')
Loading...
%%sql

use classicsales;

SELECT employee.employee_number, first_name, last_name, count(report.employee_number) as n 
FROM employee LEFT JOIN report ON (employee.employee_number = report.reports_to)
GROUP BY employee.employee_number
Loading...
%%sql

SELECT employee.employee_number, first_name, last_name, report.employee_number as subordinate 
FROM employee LEFT JOIN report ON (employee.employee_number = report.reports_to)
Loading...