Skip to article frontmatterSkip to article content

Subqueries and Complex Query Patterns

Subqueries are one of the most powerful features in database querying, allowing you to build complex queries by nesting one query inside another. This chapter explores various subquery patterns and their applications.

Understanding Subqueries

A subquery is a query nested inside another query. The outer query uses the results of the inner query to filter, join, or otherwise process data. Subqueries are essential for answering complex questions that require information from multiple tables.

Types of Subqueries

  1. Scalar subqueries: Return a single value

  2. Row subqueries: Return a single row with multiple columns

  3. Table subqueries: Return multiple rows (used with IN, EXISTS, etc.)

Basic Subquery Patterns

Pattern 1: Filtering with IN

The most common subquery pattern uses IN to filter one table based on values from another:

# Find all people who speak English
english_speakers = Person & (Fluency & {'lang_code': 'en'})

SQL Equivalent:

SELECT *
FROM person
WHERE person_id IN (
    SELECT person_id
    FROM fluency
    WHERE lang_code = 'en'
);

Pattern 2: Filtering with NOT IN

Use NOT IN to exclude records that match the subquery:

# Find people who don't speak English
non_english_speakers = Person - (Fluency & {'lang_code': 'en'})

SQL Equivalent:

SELECT p
FROM person
WHERE person_id NOT IN (
    SELECT person_id
    FROM fluency
    WHERE lang_code = 'ENG'
);

Complex Subquery Patterns

Pattern 3: Multiple Conditions with AND

When you need to satisfy multiple conditions, use multiple subqueries with AND:

# Find people who speak both English AND Spanish
english_speakers = Person & (Fluency & {'lang_code': 'ENG'})
spanish_speakers = Person & (Fluency & {'lang_code': 'SPA'})
bilingual = english_speakers & spanish_speakers

SQL Equivalent:

SELECT DISTINCT p.*
FROM person
WHERE person_id IN (
    SELECT f.person_id
    FROM fluency f
    WHERE f.lang_code = 'ENG'
)
AND person_id IN (
    SELECT person_id
    FROM fluency
    WHERE lang_code = 'SPA'
);

Pattern 4: Multiple Conditions with OR

Use OR to find records that satisfy any of multiple conditions:

# Find people who speak English OR Spanish
english_or_spanish = Person & ((Fluency & {'lang_code': 'ENG'}) | (Fluency & {'lang_code': 'SPA'}))

SQL Equivalent:

SELECT *
FROM person
WHERE person_id IN (
    SELECT person_id
    FROM fluency
    WHERE lang_code IN ('ENG', 'SPA')
);

Pattern 5: Negated Conditions

Find records that don’t match specific criteria:

# Find people who speak Japanese but not fluently
japanese_speakers = Person & (Fluency & {'lang_code': 'JPN'})
fluent_japanese = Person & (Fluency & {'lang_code': 'JPN', 'fluency_level': 'fluent'})
japanese_non_fluent = japanese_speakers - fluent_japanese

SQL Equivalent:

SELECT DISTINCT *
FROM person
WHERE person_id IN (
    SELECT person_id
    FROM fluency
    WHERE lang_code = 'JPN'
)
AND p.person_id NOT IN (
    SELECT person_id
    FROM fluency f
    WHERE lang_code = 'JPN' AND fluency_level = 'fluent'
);

Self-Referencing Tables

Self-referencing tables create relationships within the same table, such as hierarchical structures.

Example: Management Hierarchy

@schema
class Person(dj.Manual):
    definition = """
    person_id : int
    ---
    name : varchar(60)
    """

@schema
class ReportsTo(dj.Manual):
    definition = """
    -> Person
    manager_id : int  # Renamed foreign key to Person
    ---
    """

Querying Hierarchical Data

# Find all managers (people who have others reporting to them)
managers = Person & ReportsTo.proj(manager_id='person_id')

# Find all people who have managers
people_with_managers = Person & ReportsTo

# Find top-level managers (people who don't report to anyone)
top_managers = Person - ReportsTo.proj(manager_id='person_id')

SQL Equivalents:

-- Find all managers
SELECT *
FROM person p
WHERE person_id IN (
    SELECT manager_id
    FROM reports_to
);

-- Find people with managers
SELECT *
FROM person p
WHERE person_id IN (
    SELECT person_id
    FROM reports_to
);

-- Find top-level managers
SELECT *
FROM person
WHERE person_id NOT IN (
    SELECT manager_id
    FROM reports_to
);

Advanced Subquery Patterns

Pattern 6: Correlated Subqueries

Correlated subqueries reference columns from the outer query:

# Find people who speak more languages than the average
# This requires aggregation and comparison

SQL Equivalent:

SELECT p.*
FROM person p
WHERE (
    SELECT COUNT(*)
    FROM fluency f
    WHERE f.person_id = p.person_id
) > (
    SELECT AVG(lang_count)
    FROM (
        SELECT COUNT(*) as lang_count
        FROM fluency
        GROUP BY person_id
    ) counts
);

Pattern 7: EXISTS vs IN

Use EXISTS for better performance when checking for existence:

# Find people who speak at least one language fluently
fluent_speakers = Person & (Fluency & {'fluency_level': 'fluent'})

SQL Equivalent:

SELECT p.*
FROM person p
WHERE EXISTS (
    SELECT 1
    FROM fluency f
    WHERE f.person_id = p.person_id
    AND f.fluency_level = 'fluent'
);

Best Practices for Subqueries

  1. Use meaningful aliases: Make your queries readable with clear table aliases

  2. Test subqueries independently: Verify each subquery works before combining them

  3. Consider performance: EXISTS is often more efficient than IN for large datasets

  4. Use parentheses: Group complex conditions clearly

  5. Document complex logic: Add comments explaining the business logic

Common Pitfalls

  1. NULL handling: NOT IN with NULLs can produce unexpected results

  2. Performance: Nested subqueries can be slow on large datasets

  3. Readability: Deeply nested subqueries can be hard to understand

  4. Maintenance: Complex subqueries can be difficult to modify

Summary

Subqueries are essential for complex data analysis. The key patterns covered include:

  • Filtering with IN/NOT IN: Basic subquery filtering

  • Multiple conditions: Combining AND/OR logic

  • Negated conditions: Finding records that don’t match criteria

  • Self-referencing tables: Hierarchical data structures

  • Correlated subqueries: Advanced comparisons

  • EXISTS vs IN: Performance considerations

Mastering these patterns will enable you to answer complex questions about your data and build sophisticated database applications.