Skip to article frontmatterSkip to article content

Operator: Projection

The projection operator is the second fundamental operation in relational algebra. It manipulates columns (attributes) of a relation by selecting, renaming, or computing new attributes. While restriction selects rows, projection deals with columns.

Understanding Projection

Projection selects and manipulates columns from a table. It can do three main things:

  1. Select specific columns (reduce the number of attributes)

  2. Rename columns (change attribute names)

  3. Compute new columns (create derived attributes)

Key Concepts from the Lecture

  • Projection never changes the primary key entity type - if you start with a Person table, you still have a Person table

  • Algebraic closure - the result of projection is still a valid relation

  • Entity integrity - the primary key must always be included (explicitly or implicitly)

  • Order matters with restrictions - you can only reference attributes that exist at projection time

Basic Syntax

# DataJoint projection syntax
result = Table.proj(*attributes, **computed_attributes)

# Common patterns:
Table.proj()                    # Select all attributes
Table.proj('attr1', 'attr2')    # Select specific attributes  
Table.proj(new_name='old_name') # Rename attributes
Table.proj(computed='expression') # Compute new attributes

The Three Functions of Projection

1. Selecting Specific Attributes

Use projection to reduce the number of columns returned:

# Setup from the lecture
import datajoint as dj
schema = dj.Schema('languages_demo')

@schema
class Person(dj.Manual):
    definition = """
    person_id : int
    ---
    name : varchar(60)
    date_of_birth : date
    """

# Select only specific attributes
names_only = Person.proj('name')              # Just names
ids_and_names = Person.proj('person_id', 'name')  # ID and names

Key principle from lecture: Always include the primary key to maintain entity integrity.

2. Renaming Attributes

Change attribute names for clarity or consistency:

# Rename attributes for clarity
renamed = Person.proj(birth_date='date_of_birth')

# Multiple renames
clean_names = Person.proj(
    id='person_id',
    full_name='name', 
    birth_date='date_of_birth'
)

SQL Equivalent:

SELECT person_id AS id, 
       name AS full_name, 
       date_of_birth AS birth_date
FROM person;

3. Computing New Attributes

Create derived attributes through calculations:

# Calculate age from date of birth (from the lecture)
people_with_age = Person.proj(
    'name',
    age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
    age_months='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())'
)

SQL Equivalent:

SELECT person_id, name,
       TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age_years,
       TIMESTAMPDIFF(MONTH, date_of_birth, NOW()) AS age_months
FROM person;

The Ellipses Operator (...) - Including All Attributes

DataJoint provides a powerful shorthand using ellipses (...) to include all existing attributes while adding new ones. This is especially useful when you want to add computed attributes without losing any existing columns.

Basic Ellipses Usage

# Add a new computed attribute while keeping ALL existing attributes
people_with_age = Person.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')

# This is equivalent to explicitly listing all attributes:
people_with_age = Person.proj(
    'person_id',           # Primary key
    'name',                # Existing attribute
    'date_of_birth',       # Existing attribute  
    age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'  # New computed attribute
)

Key advantage: You don’t need to remember or list all existing attributes when adding new ones.

Ellipses with Multiple Computed Attributes

# Add multiple computed attributes while preserving all existing ones
enhanced_people = Person.proj(
    ...,  # Keep all existing attributes
    age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
    age_months='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())',
    age_days='TIMESTAMPDIFF(DAY, date_of_birth, NOW())',
    is_adult='TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) >= 18'
)

Ellipses with Renaming

You can combine ellipses with attribute renaming:

# Rename one attribute and add computed ones, keep everything else
person_enhanced = Person.proj(
    ...,                    # Keep all other attributes as-is
    birth_date='date_of_birth',  # Rename this attribute
    age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'  # Add computed attribute
)

Important note: When renaming with ellipses, the original attribute is excluded and the renamed version is included.

Working with the Languages Database

Using ellipses with our lecture examples:

# Add fluency score while keeping all original fluency attributes
fluency_with_score = Fluency.proj(
    ...,  # Keep person_id, lang_code, fluency_level
    score='''CASE fluency_level
        WHEN "beginner" THEN 1
        WHEN "intermediate" THEN 2
        WHEN "fluent" THEN 3
    END'''
)

# Add readable language name while keeping language code
languages_enhanced = Language.proj(
    ...,  # Keep lang_code, language
    lang_name='language',  # Alternative name for clarity
    code_upper='UPPER(lang_code)'  # Computed uppercase version
)

SQL Translation of Ellipses

The ellipses operator translates to explicit column listing in SQL:

# DataJoint with ellipses
Person.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')

SQL Equivalent:

SELECT person_id, 
       name, 
       date_of_birth,
       TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age
FROM person;

When to Use Ellipses vs Explicit Listing

Use ellipses (...) when:

  • Adding computed attributes to existing data

  • You want to preserve the entire table structure

  • Prototyping and exploratory analysis

  • The source table structure might change

Use explicit listing when:

  • You need only specific columns (data reduction)

  • Creating clean, minimal result sets

  • Building production queries with stable schemas

  • Documentation and clarity are priorities

Common Patterns with Ellipses

Pattern 1: Add Age to Any Person-like Table

# Works with any table that has date_of_birth
any_person_table.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')

Pattern 2: Add Multiple Time-based Calculations

Person.proj(
    ...,
    age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
    days_since_birth='TIMESTAMPDIFF(DAY, date_of_birth, NOW())',
    birth_year='YEAR(date_of_birth)',
    birth_month='MONTH(date_of_birth)'
)

Pattern 3: Add Boolean Flags

Person.proj(
    ...,
    is_adult='TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) >= 18',
    is_senior='TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) >= 65',
    is_millennial='date_of_birth BETWEEN "1981-01-01" AND "1996-12-31"'
)

Debugging with Ellipses

Check what attributes you get with ellipses:

# See the resulting schema
original = Person
with_age = Person.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')

print(f"Original: {original.heading.names}")
print(f"With ellipses: {with_age.heading.names}")

Best Practices for Ellipses

  1. Use for additive operations: When you want to add, not replace

  2. Combine with explicit attributes carefully: Understand what gets included/excluded

  3. Document computed attributes: Make their purpose clear

  4. Test the resulting schema: Verify you get what you expect

The ellipses operator is a powerful DataJoint feature that makes it easy to enhance tables with additional computed information while preserving all existing data structure.

Systematic Examples from the Lecture

Example 1: Age Calculations

The lecture emphasized age calculations as a key use case:

# Calculate age in different units
age_calculations = Person.proj(
    'name',
    age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
    age_months='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())',
    age_days='TIMESTAMPDIFF(DAY, date_of_birth, NOW())'
)

Example 2: Generational Classification

From the lecture’s generational examples:

# Create generation labels
with_generation = Person.proj(
    'name',
    'date_of_birth',
    generation='''CASE 
        WHEN date_of_birth BETWEEN "2000-01-01" AND "2013-12-31" THEN "Gen Z"
        WHEN date_of_birth BETWEEN "1981-01-01" AND "1999-12-31" THEN "Millennial"  
        WHEN date_of_birth BETWEEN "1965-01-01" AND "1980-12-31" THEN "Gen X"
        ELSE "Other"
    END'''
)

Algebraic Closure: Chaining Operations

The lecture emphasized that projection results can be used in further operations:

# Example from the lecture: systematic chaining
step1 = Person
print(f"All people: {len(step1)}")

# Restrict to millennials
step2 = step1 & 'date_of_birth BETWEEN "1981-01-01" AND "1999-12-31"'
print(f"Millennials: {len(step2)}")

# Project to add age calculation
step3 = step2.proj('name', age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
print(f"Millennials with age: {len(step3)}")

# Further restrict using computed attribute
step4 = step3 & 'age_years > 25'
print(f"Adult millennials: {len(step4)}")

Key insight from lecture: Once you compute an attribute, you can use it in subsequent restrictions.

Order of Operations: Critical Concept

The lecture emphasized understanding when attributes are available:

DataJoint Approach (Separate Operations)

# Clear separation: restriction first, then projection
people = Person & 'date_of_birth > "1990-01-01"'  # Can use date_of_birth
people_with_age = people.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
filtered = people_with_age & 'age > 25'  # Can use computed 'age'

SQL Approach (Combined Operations)

-- In SQL, WHERE is executed before SELECT internally
SELECT name, TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age
FROM person
WHERE date_of_birth > '1990-01-01'    -- Can use original columns
  AND TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) > 25;  -- Must repeat calculation

Common Mistake from Lecture

# This would fail - 'age' doesn't exist yet during restriction
# WRONG: Person & 'age > 25'.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')

# CORRECT: Compute age first, then restrict
people_with_age = Person.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
adults = people_with_age & 'age > 25'

Primary Key Preservation

Critical principle from lecture: Projection preserves the primary key entity type.

# All of these are still "Person" tables with person_id as primary key
original = Person                                    # person_id primary key
projected = Person.proj('name')                     # Still person_id primary key  
computed = Person.proj(age='TIMESTAMPDIFF(...)')   # Still person_id primary key
renamed = Person.proj(id='person_id')              # Now 'id' is primary key

Rule: The primary key is always included, even if not explicitly listed.

Working with the Languages Database

Using the lecture’s main example:

@schema
class Language(dj.Lookup):
    definition = """
    lang_code : char(4)
    ---
    language : varchar(30)
    """

@schema
class Fluency(dj.Manual):
    definition = """
    -> Person
    -> Language
    ---
    fluency_level : enum('beginner', 'intermediate', 'fluent')
    """

# Project language information
language_info = Language.proj(code='lang_code', name='language')

# Project fluency with computed fluency score
fluency_scores = Fluency.proj(
    'fluency_level',
    score='''CASE fluency_level 
        WHEN "beginner" THEN 1
        WHEN "intermediate" THEN 2  
        WHEN "fluent" THEN 3
    END'''
)

SQL Translation Patterns

Every DataJoint projection translates systematically to SQL:

Simple Selection

# DataJoint
Person.proj('name')

# SQL
SELECT person_id, name FROM person;

Renaming

# DataJoint  
Person.proj(full_name='name')

# SQL
SELECT person_id, name AS full_name FROM person;

Computation

# DataJoint
Person.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')

# SQL
SELECT person_id, TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age FROM person;

Best Practices from the Lecture

1. Always Consider Entity Integrity

# Include primary key explicitly for clarity
clear_projection = Person.proj('person_id', 'name', computed_attr='...')

2. Build Systematically

# Start simple, add complexity
step1 = Person.proj('name')
step2 = step1.proj(name_length='CHAR_LENGTH(name)')
step3 = step2 & 'name_length > 10'

3. Understand Attribute Availability

# Compute attributes before using them in restrictions
with_age = Person.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
adults = with_age & 'age >= 18'  # 'age' is now available

4. Test Incrementally

# Verify each step
original = Person
print(f"Original columns: {original.heading.names}")

projected = original.proj('name', age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
print(f"Projected columns: {projected.heading.names}")

Common Patterns Summary

  1. Column selection: Table.proj('col1', 'col2')

  2. Column renaming: Table.proj(new_name='old_name')

  3. Column computation: Table.proj(computed='expression')

  4. Mixed operations: Table.proj('existing', renamed='old', computed='expr')

  5. Chained with restriction: Table.proj(...) & condition

Summary

The projection operator is essential for column manipulation in database queries. Key takeaways from the lecture:

  1. Projection selects/manipulates columns (vs restriction selects rows)

  2. Three functions: select, rename, compute

  3. Primary key is preserved - entity integrity maintained

  4. Algebraic closure - results can be used in further operations

  5. Order matters - compute attributes before using them

  6. Think systematically - build complex projections incrementally

Master projection and you can transform any table structure to meet your analysis needs.

Practice Exercises: Systematic Projection Building

Let’s work through practical examples following the lecture methodology. These exercises will help you master the three functions of projection and understand order of operations.

Setup: Complete Languages Database

import datajoint as dj
schema = dj.Schema('projection_practice')

@schema
class Person(dj.Manual):
    definition = """
    person_id : int
    ---
    name : varchar(60)
    date_of_birth : date
    """

@schema
class Language(dj.Lookup):
    definition = """
    lang_code : char(4)
    ---
    language : varchar(30)
    """
    contents = [
        ('en', 'English'),
        ('es', 'Spanish'),
        ('ja', 'Japanese'),
        ('zh', 'Mandarin')
    ]

@schema
class Fluency(dj.Manual):
    definition = """
    -> Person
    -> Language
    ---
    fluency_level : enum('beginner', 'intermediate', 'fluent')
    """

Exercise 1: Basic Column Selection

Question: How do you select only the names from the Person table?

Solution:

names_only = Person.proj('name')

Key insight: The primary key (person_id) is automatically included even though not explicitly listed.

Question: How do you select multiple specific columns?

Solution:

basic_info = Person.proj('person_id', 'name')  # Explicit primary key inclusion

Exercise 2: Column Renaming

Question: Rename date_of_birth to birth_date for clarity.

Solution:

renamed = Person.proj('name', birth_date='date_of_birth')

SQL Equivalent:

SELECT person_id, name, date_of_birth AS birth_date
FROM person;

Question: Rename multiple columns at once.

Solution:

clean_schema = Person.proj(
    id='person_id',
    full_name='name',
    birth_date='date_of_birth'
)

Exercise 3: Computing Age (Core Lecture Example)

Question: Calculate age in years from date of birth.

Solution:

people_with_age = Person.proj(
    'name',
    age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'
)

SQL Equivalent:

SELECT person_id, name, 
       TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age_years
FROM person;

Question: Calculate age in multiple units.

Solution:

age_details = Person.proj(
    'name',
    age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
    age_months='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())',
    age_days='TIMESTAMPDIFF(DAY, date_of_birth, NOW())'
)

Exercise 4: Order of Operations (Critical Concept)

Question: Find people over 25 years old, showing their names and ages.

Wrong Approach (from lecture):

# This fails - 'age_years' doesn't exist during restriction
# WRONG: Person & 'age_years > 25'.proj(age_years='TIMESTAMPDIFF(...)')

Correct Approach:

# Step 1: Compute age first
people_with_age = Person.proj(
    'name', 
    age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'
)

# Step 2: Then restrict using computed attribute
adults = people_with_age & 'age_years > 25'

Alternative SQL Approach:

-- SQL can reference computed columns in subqueries
SELECT name, age_years
FROM (
    SELECT person_id, name, 
           TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age_years
    FROM person
) AS people_with_age
WHERE age_years > 25;

Exercise 5: Generational Classification (Lecture Example)

Question: Classify people into generations based on birth year.

Solution:

with_generations = Person.proj(
    'name',
    'date_of_birth',
    generation='''CASE 
        WHEN date_of_birth BETWEEN "2000-01-01" AND "2013-12-31" THEN "Gen Z"
        WHEN date_of_birth BETWEEN "1981-01-01" AND "1999-12-31" THEN "Millennial"
        WHEN date_of_birth BETWEEN "1965-01-01" AND "1980-12-31" THEN "Gen X"
        WHEN date_of_birth BETWEEN "1946-01-01" AND "1964-12-31" THEN "Boomer"
        ELSE "Silent"
    END'''
)

Then find only Millennials:

millennials = with_generations & {'generation': 'Millennial'}

Exercise 6: Working with Association Tables

Question: Create a numeric fluency score from the enum values.

Solution:

fluency_scores = Fluency.proj(
    'fluency_level',
    score='''CASE fluency_level
        WHEN "beginner" THEN 1
        WHEN "intermediate" THEN 2
        WHEN "fluent" THEN 3
    END'''
)

Question: Find languages with their readable names (not just codes).

Solution:

readable_languages = Language.proj(
    code='lang_code',
    name='language'
)

Exercise 7: Algebraic Closure in Action

Question: Find Gen Z people who speak English fluently, showing their names and ages.

Step-by-step approach:

# Step 1: Add age calculation to Person
people_with_age = Person.proj(
    'name',
    'date_of_birth', 
    age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'
)

# Step 2: Restrict to Gen Z using computed age
gen_z = people_with_age & 'age_years BETWEEN 10 AND 23'

# Step 3: Find those who speak English fluently
gen_z_english_fluent = gen_z & (Fluency & {'lang_code': 'ENG', 'fluency_level': 'fluent'})

# Step 4: Final projection for clean output
result = gen_z_english_fluent.proj('name', 'age_years')

Verification:

print(f"Total people: {len(Person)}")
print(f"People with age: {len(people_with_age)}")
print(f"Gen Z: {len(gen_z)}")
print(f"Gen Z English fluent: {len(result)}")

Exercise 8: Using Ellipses (...) to Preserve All Attributes

Question: Add an age calculation while keeping all existing Person attributes.

Solution:

# Using ellipses - keeps person_id, name, date_of_birth AND adds age
people_with_age = Person.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')

Comparison without ellipses:

# Without ellipses - must explicitly list all attributes you want to keep
people_with_age = Person.proj(
    'person_id',
    'name', 
    'date_of_birth',
    age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'
)

Question: Add multiple computed attributes while preserving everything.

Solution:

enhanced_people = Person.proj(
    ...,  # Keep all existing attributes
    age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
    age_months='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())',
    is_adult='TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) >= 18',
    generation='''CASE 
        WHEN date_of_birth BETWEEN "2000-01-01" AND "2013-12-31" THEN "Gen Z"
        WHEN date_of_birth BETWEEN "1981-01-01" AND "1999-12-31" THEN "Millennial"
        ELSE "Other"
    END'''
)

Question: Add fluency scores to the Fluency table while keeping all original data.

Solution:

fluency_enhanced = Fluency.proj(
    ...,  # Keep person_id, lang_code, fluency_level
    numeric_score='''CASE fluency_level
        WHEN "beginner" THEN 1
        WHEN "intermediate" THEN 2
        WHEN "fluent" THEN 3
    END''',
    is_fluent='fluency_level = "fluent"'
)

Verification:

print(f"Original Fluency columns: {Fluency.heading.names}")
print(f"Enhanced Fluency columns: {fluency_enhanced.heading.names}")

Exercise 9: Advanced Calculations

Question: Calculate how long each person has been alive in different units.

Solution using ellipses:

# Keep all original data and add lifetime calculations
lifetime_calculations = Person.proj(
    ...,  # Preserve all existing attributes
    years_alive='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
    months_alive='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())',
    days_alive='TIMESTAMPDIFF(DAY, date_of_birth, NOW())',
    hours_alive='TIMESTAMPDIFF(HOUR, date_of_birth, NOW())'
)

Question: Find the oldest and youngest people (using projections).

Solution:

# Add age, then find extremes
with_age = Person.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')

# Note: This requires aggregation concepts, but conceptually:
# oldest = with_age & 'age = (SELECT MAX(age) FROM with_age)'

Testing and Debugging Projections

Verify Column Existence

# Check what columns exist after projection
original = Person
projected = Person.proj('name', age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')

print(f"Original columns: {original.heading.names}")
print(f"Projected columns: {projected.heading.names}")

Test Incremental Building

# Build step by step
step1 = Person
print(f"Step 1 - Original: {len(step1)} people")

step2 = step1.proj('name', age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
print(f"Step 2 - With age: {len(step2)} people")

step3 = step2 & 'age > 25'
print(f"Step 3 - Adults: {len(step3)} people")

Common Mistakes to Avoid

  1. Using computed attributes before creating them:

    # WRONG: Person & 'age > 25'.proj(age='...')
    # RIGHT: Person.proj(age='...') & 'age > 25'
  2. Forgetting primary key preservation:

    # Remember: projection always preserves primary key
    names = Person.proj('name')  # Still has person_id
  3. Misunderstanding entity integrity:

    # After projection, you still have a "Person" table
    projected_people = Person.proj('name')  # These are still people!

SQL Translation Reference

DataJoint PatternSQL Equivalent
Table.proj('col')SELECT primary_key, col FROM table
Table.proj(new='old')SELECT primary_key, old AS new FROM table
Table.proj(calc='expr')SELECT primary_key, expr AS calc FROM table
Table.proj('a', 'b')SELECT primary_key, a, b FROM table
Table.proj(..., new='expr')SELECT *, expr AS new FROM table
Table.proj(..., renamed='old')SELECT col1, col2, old AS renamed FROM table

Summary: The Power of Projection

Projection is essential for data transformation. Key patterns:

  1. Column selection: Reduce data volume by selecting only needed columns

  2. Column renaming: Create clearer, more consistent naming

  3. Column computation: Create derived attributes for analysis

  4. Order awareness: Compute attributes before using them in restrictions

  5. Entity preservation: Remember that projections maintain entity type

These exercises demonstrate the systematic approach emphasized in the lecture: start simple, build incrementally, and always consider entity integrity and order of operations.