Subqueries are one of the most powerful features in database querying, allowing you to build complex queries by nesting one query inside another. This chapter explores various subquery patterns and their applications.
Understanding Subqueries¶
A subquery is a query nested inside another query. The outer query uses the results of the inner query to filter, join, or otherwise process data. Subqueries are essential for answering complex questions that require information from multiple tables.
Types of Subqueries¶
Scalar subqueries: Return a single value
Row subqueries: Return a single row with multiple columns
Table subqueries: Return multiple rows (used with IN, EXISTS, etc.)
Basic Subquery Patterns¶
Pattern 1: Filtering with IN¶
The most common subquery pattern uses IN to filter one table based on values from another:
# Find all people who speak English
english_speakers = Person & (Fluency & {'lang_code': 'en'})SQL Equivalent:
SELECT *
FROM person
WHERE person_id IN (
SELECT person_id
FROM fluency
WHERE lang_code = 'en'
);Pattern 2: Filtering with NOT IN¶
Use NOT IN to exclude records that match the subquery:
# Find people who don't speak English
non_english_speakers = Person - (Fluency & {'lang_code': 'en'})SQL Equivalent:
SELECT p
FROM person
WHERE person_id NOT IN (
SELECT person_id
FROM fluency
WHERE lang_code = 'ENG'
);Complex Subquery Patterns¶
Pattern 3: Multiple Conditions with AND¶
When you need to satisfy multiple conditions, use multiple subqueries with AND:
# Find people who speak both English AND Spanish
english_speakers = Person & (Fluency & {'lang_code': 'ENG'})
spanish_speakers = Person & (Fluency & {'lang_code': 'SPA'})
bilingual = english_speakers & spanish_speakersSQL Equivalent:
SELECT DISTINCT p.*
FROM person
WHERE person_id IN (
SELECT f.person_id
FROM fluency f
WHERE f.lang_code = 'ENG'
)
AND person_id IN (
SELECT person_id
FROM fluency
WHERE lang_code = 'SPA'
);Pattern 4: Multiple Conditions with OR¶
Use OR to find records that satisfy any of multiple conditions:
# Find people who speak English OR Spanish
english_or_spanish = Person & ((Fluency & {'lang_code': 'ENG'}) | (Fluency & {'lang_code': 'SPA'}))SQL Equivalent:
SELECT *
FROM person
WHERE person_id IN (
SELECT person_id
FROM fluency
WHERE lang_code IN ('ENG', 'SPA')
);Pattern 5: Negated Conditions¶
Find records that don’t match specific criteria:
# Find people who speak Japanese but not fluently
japanese_speakers = Person & (Fluency & {'lang_code': 'JPN'})
fluent_japanese = Person & (Fluency & {'lang_code': 'JPN', 'fluency_level': 'fluent'})
japanese_non_fluent = japanese_speakers - fluent_japaneseSQL Equivalent:
SELECT DISTINCT *
FROM person
WHERE person_id IN (
SELECT person_id
FROM fluency
WHERE lang_code = 'JPN'
)
AND p.person_id NOT IN (
SELECT person_id
FROM fluency f
WHERE lang_code = 'JPN' AND fluency_level = 'fluent'
);Self-Referencing Tables¶
Self-referencing tables create relationships within the same table, such as hierarchical structures.
Example: Management Hierarchy¶
@schema
class Person(dj.Manual):
definition = """
person_id : int
---
name : varchar(60)
"""
@schema
class ReportsTo(dj.Manual):
definition = """
-> Person
manager_id : int # Renamed foreign key to Person
---
"""Querying Hierarchical Data¶
# Find all managers (people who have others reporting to them)
managers = Person & ReportsTo.proj(manager_id='person_id')
# Find all people who have managers
people_with_managers = Person & ReportsTo
# Find top-level managers (people who don't report to anyone)
top_managers = Person - ReportsTo.proj(manager_id='person_id')SQL Equivalents:
-- Find all managers
SELECT *
FROM person p
WHERE person_id IN (
SELECT manager_id
FROM reports_to
);
-- Find people with managers
SELECT *
FROM person p
WHERE person_id IN (
SELECT person_id
FROM reports_to
);
-- Find top-level managers
SELECT *
FROM person
WHERE person_id NOT IN (
SELECT manager_id
FROM reports_to
);Advanced Subquery Patterns¶
Pattern 6: Correlated Subqueries¶
Correlated subqueries reference columns from the outer query:
# Find people who speak more languages than the average
# This requires aggregation and comparisonSQL Equivalent:
SELECT p.*
FROM person p
WHERE (
SELECT COUNT(*)
FROM fluency f
WHERE f.person_id = p.person_id
) > (
SELECT AVG(lang_count)
FROM (
SELECT COUNT(*) as lang_count
FROM fluency
GROUP BY person_id
) counts
);Pattern 7: EXISTS vs IN¶
Use EXISTS for better performance when checking for existence:
# Find people who speak at least one language fluently
fluent_speakers = Person & (Fluency & {'fluency_level': 'fluent'})SQL Equivalent:
SELECT p.*
FROM person p
WHERE EXISTS (
SELECT 1
FROM fluency f
WHERE f.person_id = p.person_id
AND f.fluency_level = 'fluent'
);Best Practices for Subqueries¶
Use meaningful aliases: Make your queries readable with clear table aliases
Test subqueries independently: Verify each subquery works before combining them
Consider performance: EXISTS is often more efficient than IN for large datasets
Use parentheses: Group complex conditions clearly
Document complex logic: Add comments explaining the business logic
Common Pitfalls¶
NULL handling: NOT IN with NULLs can produce unexpected results
Performance: Nested subqueries can be slow on large datasets
Readability: Deeply nested subqueries can be hard to understand
Maintenance: Complex subqueries can be difficult to modify
Summary¶
Subqueries are essential for complex data analysis. The key patterns covered include:
Filtering with IN/NOT IN: Basic subquery filtering
Multiple conditions: Combining AND/OR logic
Negated conditions: Finding records that don’t match criteria
Self-referencing tables: Hierarchical data structures
Correlated subqueries: Advanced comparisons
EXISTS vs IN: Performance considerations
Mastering these patterns will enable you to answer complex questions about your data and build sophisticated database applications.