The projection operator is the second fundamental operation in relational algebra. It manipulates columns (attributes) of a relation by selecting, renaming, or computing new attributes. While restriction selects rows, projection deals with columns.
Understanding Projection¶
Projection selects and manipulates columns from a table. It can do three main things:
Select specific columns (reduce the number of attributes)
Rename columns (change attribute names)
Compute new columns (create derived attributes)
Key Concepts from the Lecture¶
Projection never changes the primary key entity type - if you start with a Person table, you still have a Person table
Algebraic closure - the result of projection is still a valid relation
Entity integrity - the primary key must always be included (explicitly or implicitly)
Order matters with restrictions - you can only reference attributes that exist at projection time
Basic Syntax¶
# DataJoint projection syntax
result = Table.proj(*attributes, **computed_attributes)
# Common patterns:
Table.proj() # Select all attributes
Table.proj('attr1', 'attr2') # Select specific attributes
Table.proj(new_name='old_name') # Rename attributes
Table.proj(computed='expression') # Compute new attributesThe Three Functions of Projection¶
1. Selecting Specific Attributes¶
Use projection to reduce the number of columns returned:
# Setup from the lecture
import datajoint as dj
schema = dj.Schema('languages_demo')
@schema
class Person(dj.Manual):
definition = """
person_id : int
---
name : varchar(60)
date_of_birth : date
"""
# Select only specific attributes
names_only = Person.proj('name') # Just names
ids_and_names = Person.proj('person_id', 'name') # ID and namesKey principle from lecture: Always include the primary key to maintain entity integrity.
2. Renaming Attributes¶
Change attribute names for clarity or consistency:
# Rename attributes for clarity
renamed = Person.proj(birth_date='date_of_birth')
# Multiple renames
clean_names = Person.proj(
id='person_id',
full_name='name',
birth_date='date_of_birth'
)SQL Equivalent:
SELECT person_id AS id,
name AS full_name,
date_of_birth AS birth_date
FROM person;3. Computing New Attributes¶
Create derived attributes through calculations:
# Calculate age from date of birth (from the lecture)
people_with_age = Person.proj(
'name',
age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
age_months='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())'
)SQL Equivalent:
SELECT person_id, name,
TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age_years,
TIMESTAMPDIFF(MONTH, date_of_birth, NOW()) AS age_months
FROM person;The Ellipses Operator (...) - Including All Attributes¶
DataJoint provides a powerful shorthand using ellipses (...) to include all existing attributes while adding new ones. This is especially useful when you want to add computed attributes without losing any existing columns.
Basic Ellipses Usage¶
# Add a new computed attribute while keeping ALL existing attributes
people_with_age = Person.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
# This is equivalent to explicitly listing all attributes:
people_with_age = Person.proj(
'person_id', # Primary key
'name', # Existing attribute
'date_of_birth', # Existing attribute
age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())' # New computed attribute
)Key advantage: You don’t need to remember or list all existing attributes when adding new ones.
Ellipses with Multiple Computed Attributes¶
# Add multiple computed attributes while preserving all existing ones
enhanced_people = Person.proj(
..., # Keep all existing attributes
age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
age_months='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())',
age_days='TIMESTAMPDIFF(DAY, date_of_birth, NOW())',
is_adult='TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) >= 18'
)Ellipses with Renaming¶
You can combine ellipses with attribute renaming:
# Rename one attribute and add computed ones, keep everything else
person_enhanced = Person.proj(
..., # Keep all other attributes as-is
birth_date='date_of_birth', # Rename this attribute
age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())' # Add computed attribute
)Important note: When renaming with ellipses, the original attribute is excluded and the renamed version is included.
Working with the Languages Database¶
Using ellipses with our lecture examples:
# Add fluency score while keeping all original fluency attributes
fluency_with_score = Fluency.proj(
..., # Keep person_id, lang_code, fluency_level
score='''CASE fluency_level
WHEN "beginner" THEN 1
WHEN "intermediate" THEN 2
WHEN "fluent" THEN 3
END'''
)
# Add readable language name while keeping language code
languages_enhanced = Language.proj(
..., # Keep lang_code, language
lang_name='language', # Alternative name for clarity
code_upper='UPPER(lang_code)' # Computed uppercase version
)SQL Translation of Ellipses¶
The ellipses operator translates to explicit column listing in SQL:
# DataJoint with ellipses
Person.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')SQL Equivalent:
SELECT person_id,
name,
date_of_birth,
TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age
FROM person;When to Use Ellipses vs Explicit Listing¶
Use ellipses (...) when:
Adding computed attributes to existing data
You want to preserve the entire table structure
Prototyping and exploratory analysis
The source table structure might change
Use explicit listing when:
You need only specific columns (data reduction)
Creating clean, minimal result sets
Building production queries with stable schemas
Documentation and clarity are priorities
Common Patterns with Ellipses¶
Pattern 1: Add Age to Any Person-like Table¶
# Works with any table that has date_of_birth
any_person_table.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')Pattern 2: Add Multiple Time-based Calculations¶
Person.proj(
...,
age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
days_since_birth='TIMESTAMPDIFF(DAY, date_of_birth, NOW())',
birth_year='YEAR(date_of_birth)',
birth_month='MONTH(date_of_birth)'
)Pattern 3: Add Boolean Flags¶
Person.proj(
...,
is_adult='TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) >= 18',
is_senior='TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) >= 65',
is_millennial='date_of_birth BETWEEN "1981-01-01" AND "1996-12-31"'
)Debugging with Ellipses¶
Check what attributes you get with ellipses:
# See the resulting schema
original = Person
with_age = Person.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
print(f"Original: {original.heading.names}")
print(f"With ellipses: {with_age.heading.names}")Best Practices for Ellipses¶
Use for additive operations: When you want to add, not replace
Combine with explicit attributes carefully: Understand what gets included/excluded
Document computed attributes: Make their purpose clear
Test the resulting schema: Verify you get what you expect
The ellipses operator is a powerful DataJoint feature that makes it easy to enhance tables with additional computed information while preserving all existing data structure.
Systematic Examples from the Lecture¶
Example 1: Age Calculations¶
The lecture emphasized age calculations as a key use case:
# Calculate age in different units
age_calculations = Person.proj(
'name',
age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
age_months='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())',
age_days='TIMESTAMPDIFF(DAY, date_of_birth, NOW())'
)Example 2: Generational Classification¶
From the lecture’s generational examples:
# Create generation labels
with_generation = Person.proj(
'name',
'date_of_birth',
generation='''CASE
WHEN date_of_birth BETWEEN "2000-01-01" AND "2013-12-31" THEN "Gen Z"
WHEN date_of_birth BETWEEN "1981-01-01" AND "1999-12-31" THEN "Millennial"
WHEN date_of_birth BETWEEN "1965-01-01" AND "1980-12-31" THEN "Gen X"
ELSE "Other"
END'''
)Algebraic Closure: Chaining Operations¶
The lecture emphasized that projection results can be used in further operations:
# Example from the lecture: systematic chaining
step1 = Person
print(f"All people: {len(step1)}")
# Restrict to millennials
step2 = step1 & 'date_of_birth BETWEEN "1981-01-01" AND "1999-12-31"'
print(f"Millennials: {len(step2)}")
# Project to add age calculation
step3 = step2.proj('name', age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
print(f"Millennials with age: {len(step3)}")
# Further restrict using computed attribute
step4 = step3 & 'age_years > 25'
print(f"Adult millennials: {len(step4)}")Key insight from lecture: Once you compute an attribute, you can use it in subsequent restrictions.
Order of Operations: Critical Concept¶
The lecture emphasized understanding when attributes are available:
DataJoint Approach (Separate Operations)¶
# Clear separation: restriction first, then projection
people = Person & 'date_of_birth > "1990-01-01"' # Can use date_of_birth
people_with_age = people.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
filtered = people_with_age & 'age > 25' # Can use computed 'age'SQL Approach (Combined Operations)¶
-- In SQL, WHERE is executed before SELECT internally
SELECT name, TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age
FROM person
WHERE date_of_birth > '1990-01-01' -- Can use original columns
AND TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) > 25; -- Must repeat calculationCommon Mistake from Lecture¶
# This would fail - 'age' doesn't exist yet during restriction
# WRONG: Person & 'age > 25'.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
# CORRECT: Compute age first, then restrict
people_with_age = Person.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
adults = people_with_age & 'age > 25'Primary Key Preservation¶
Critical principle from lecture: Projection preserves the primary key entity type.
# All of these are still "Person" tables with person_id as primary key
original = Person # person_id primary key
projected = Person.proj('name') # Still person_id primary key
computed = Person.proj(age='TIMESTAMPDIFF(...)') # Still person_id primary key
renamed = Person.proj(id='person_id') # Now 'id' is primary keyRule: The primary key is always included, even if not explicitly listed.
Working with the Languages Database¶
Using the lecture’s main example:
@schema
class Language(dj.Lookup):
definition = """
lang_code : char(4)
---
language : varchar(30)
"""
@schema
class Fluency(dj.Manual):
definition = """
-> Person
-> Language
---
fluency_level : enum('beginner', 'intermediate', 'fluent')
"""
# Project language information
language_info = Language.proj(code='lang_code', name='language')
# Project fluency with computed fluency score
fluency_scores = Fluency.proj(
'fluency_level',
score='''CASE fluency_level
WHEN "beginner" THEN 1
WHEN "intermediate" THEN 2
WHEN "fluent" THEN 3
END'''
)SQL Translation Patterns¶
Every DataJoint projection translates systematically to SQL:
Simple Selection¶
# DataJoint
Person.proj('name')
# SQL
SELECT person_id, name FROM person;Renaming¶
# DataJoint
Person.proj(full_name='name')
# SQL
SELECT person_id, name AS full_name FROM person;Computation¶
# DataJoint
Person.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
# SQL
SELECT person_id, TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age FROM person;Best Practices from the Lecture¶
1. Always Consider Entity Integrity¶
# Include primary key explicitly for clarity
clear_projection = Person.proj('person_id', 'name', computed_attr='...')2. Build Systematically¶
# Start simple, add complexity
step1 = Person.proj('name')
step2 = step1.proj(name_length='CHAR_LENGTH(name)')
step3 = step2 & 'name_length > 10'3. Understand Attribute Availability¶
# Compute attributes before using them in restrictions
with_age = Person.proj(age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
adults = with_age & 'age >= 18' # 'age' is now available4. Test Incrementally¶
# Verify each step
original = Person
print(f"Original columns: {original.heading.names}")
projected = original.proj('name', age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
print(f"Projected columns: {projected.heading.names}")Common Patterns Summary¶
Column selection:
Table.proj('col1', 'col2')Column renaming:
Table.proj(new_name='old_name')Column computation:
Table.proj(computed='expression')Mixed operations:
Table.proj('existing', renamed='old', computed='expr')Chained with restriction:
Table.proj(...) & condition
Summary¶
The projection operator is essential for column manipulation in database queries. Key takeaways from the lecture:
Projection selects/manipulates columns (vs restriction selects rows)
Three functions: select, rename, compute
Primary key is preserved - entity integrity maintained
Algebraic closure - results can be used in further operations
Order matters - compute attributes before using them
Think systematically - build complex projections incrementally
Master projection and you can transform any table structure to meet your analysis needs.
Practice Exercises: Systematic Projection Building¶
Let’s work through practical examples following the lecture methodology. These exercises will help you master the three functions of projection and understand order of operations.
Setup: Complete Languages Database¶
import datajoint as dj
schema = dj.Schema('projection_practice')
@schema
class Person(dj.Manual):
definition = """
person_id : int
---
name : varchar(60)
date_of_birth : date
"""
@schema
class Language(dj.Lookup):
definition = """
lang_code : char(4)
---
language : varchar(30)
"""
contents = [
('en', 'English'),
('es', 'Spanish'),
('ja', 'Japanese'),
('zh', 'Mandarin')
]
@schema
class Fluency(dj.Manual):
definition = """
-> Person
-> Language
---
fluency_level : enum('beginner', 'intermediate', 'fluent')
"""Exercise 1: Basic Column Selection¶
Question: How do you select only the names from the Person table?
Solution:
names_only = Person.proj('name')Key insight: The primary key (person_id) is automatically included even though not explicitly listed.
Question: How do you select multiple specific columns?
Solution:
basic_info = Person.proj('person_id', 'name') # Explicit primary key inclusionExercise 2: Column Renaming¶
Question: Rename date_of_birth to birth_date for clarity.
Solution:
renamed = Person.proj('name', birth_date='date_of_birth')SQL Equivalent:
SELECT person_id, name, date_of_birth AS birth_date
FROM person;Question: Rename multiple columns at once.
Solution:
clean_schema = Person.proj(
id='person_id',
full_name='name',
birth_date='date_of_birth'
)Exercise 3: Computing Age (Core Lecture Example)¶
Question: Calculate age in years from date of birth.
Solution:
people_with_age = Person.proj(
'name',
age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'
)SQL Equivalent:
SELECT person_id, name,
TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age_years
FROM person;Question: Calculate age in multiple units.
Solution:
age_details = Person.proj(
'name',
age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
age_months='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())',
age_days='TIMESTAMPDIFF(DAY, date_of_birth, NOW())'
)Exercise 4: Order of Operations (Critical Concept)¶
Question: Find people over 25 years old, showing their names and ages.
Wrong Approach (from lecture):
# This fails - 'age_years' doesn't exist during restriction
# WRONG: Person & 'age_years > 25'.proj(age_years='TIMESTAMPDIFF(...)')Correct Approach:
# Step 1: Compute age first
people_with_age = Person.proj(
'name',
age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'
)
# Step 2: Then restrict using computed attribute
adults = people_with_age & 'age_years > 25'Alternative SQL Approach:
-- SQL can reference computed columns in subqueries
SELECT name, age_years
FROM (
SELECT person_id, name,
TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) AS age_years
FROM person
) AS people_with_age
WHERE age_years > 25;Exercise 5: Generational Classification (Lecture Example)¶
Question: Classify people into generations based on birth year.
Solution:
with_generations = Person.proj(
'name',
'date_of_birth',
generation='''CASE
WHEN date_of_birth BETWEEN "2000-01-01" AND "2013-12-31" THEN "Gen Z"
WHEN date_of_birth BETWEEN "1981-01-01" AND "1999-12-31" THEN "Millennial"
WHEN date_of_birth BETWEEN "1965-01-01" AND "1980-12-31" THEN "Gen X"
WHEN date_of_birth BETWEEN "1946-01-01" AND "1964-12-31" THEN "Boomer"
ELSE "Silent"
END'''
)Then find only Millennials:
millennials = with_generations & {'generation': 'Millennial'}Exercise 6: Working with Association Tables¶
Question: Create a numeric fluency score from the enum values.
Solution:
fluency_scores = Fluency.proj(
'fluency_level',
score='''CASE fluency_level
WHEN "beginner" THEN 1
WHEN "intermediate" THEN 2
WHEN "fluent" THEN 3
END'''
)Question: Find languages with their readable names (not just codes).
Solution:
readable_languages = Language.proj(
code='lang_code',
name='language'
)Exercise 7: Algebraic Closure in Action¶
Question: Find Gen Z people who speak English fluently, showing their names and ages.
Step-by-step approach:
# Step 1: Add age calculation to Person
people_with_age = Person.proj(
'name',
'date_of_birth',
age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'
)
# Step 2: Restrict to Gen Z using computed age
gen_z = people_with_age & 'age_years BETWEEN 10 AND 23'
# Step 3: Find those who speak English fluently
gen_z_english_fluent = gen_z & (Fluency & {'lang_code': 'ENG', 'fluency_level': 'fluent'})
# Step 4: Final projection for clean output
result = gen_z_english_fluent.proj('name', 'age_years')Verification:
print(f"Total people: {len(Person)}")
print(f"People with age: {len(people_with_age)}")
print(f"Gen Z: {len(gen_z)}")
print(f"Gen Z English fluent: {len(result)}")Exercise 8: Using Ellipses (...) to Preserve All Attributes¶
Question: Add an age calculation while keeping all existing Person attributes.
Solution:
# Using ellipses - keeps person_id, name, date_of_birth AND adds age
people_with_age = Person.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')Comparison without ellipses:
# Without ellipses - must explicitly list all attributes you want to keep
people_with_age = Person.proj(
'person_id',
'name',
'date_of_birth',
age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())'
)Question: Add multiple computed attributes while preserving everything.
Solution:
enhanced_people = Person.proj(
..., # Keep all existing attributes
age_years='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
age_months='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())',
is_adult='TIMESTAMPDIFF(YEAR, date_of_birth, NOW()) >= 18',
generation='''CASE
WHEN date_of_birth BETWEEN "2000-01-01" AND "2013-12-31" THEN "Gen Z"
WHEN date_of_birth BETWEEN "1981-01-01" AND "1999-12-31" THEN "Millennial"
ELSE "Other"
END'''
)Question: Add fluency scores to the Fluency table while keeping all original data.
Solution:
fluency_enhanced = Fluency.proj(
..., # Keep person_id, lang_code, fluency_level
numeric_score='''CASE fluency_level
WHEN "beginner" THEN 1
WHEN "intermediate" THEN 2
WHEN "fluent" THEN 3
END''',
is_fluent='fluency_level = "fluent"'
)Verification:
print(f"Original Fluency columns: {Fluency.heading.names}")
print(f"Enhanced Fluency columns: {fluency_enhanced.heading.names}")Exercise 9: Advanced Calculations¶
Question: Calculate how long each person has been alive in different units.
Solution using ellipses:
# Keep all original data and add lifetime calculations
lifetime_calculations = Person.proj(
..., # Preserve all existing attributes
years_alive='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())',
months_alive='TIMESTAMPDIFF(MONTH, date_of_birth, NOW())',
days_alive='TIMESTAMPDIFF(DAY, date_of_birth, NOW())',
hours_alive='TIMESTAMPDIFF(HOUR, date_of_birth, NOW())'
)Question: Find the oldest and youngest people (using projections).
Solution:
# Add age, then find extremes
with_age = Person.proj(..., age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
# Note: This requires aggregation concepts, but conceptually:
# oldest = with_age & 'age = (SELECT MAX(age) FROM with_age)'Testing and Debugging Projections¶
Verify Column Existence¶
# Check what columns exist after projection
original = Person
projected = Person.proj('name', age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
print(f"Original columns: {original.heading.names}")
print(f"Projected columns: {projected.heading.names}")Test Incremental Building¶
# Build step by step
step1 = Person
print(f"Step 1 - Original: {len(step1)} people")
step2 = step1.proj('name', age='TIMESTAMPDIFF(YEAR, date_of_birth, NOW())')
print(f"Step 2 - With age: {len(step2)} people")
step3 = step2 & 'age > 25'
print(f"Step 3 - Adults: {len(step3)} people")Common Mistakes to Avoid¶
Using computed attributes before creating them:
# WRONG: Person & 'age > 25'.proj(age='...') # RIGHT: Person.proj(age='...') & 'age > 25'Forgetting primary key preservation:
# Remember: projection always preserves primary key names = Person.proj('name') # Still has person_idMisunderstanding entity integrity:
# After projection, you still have a "Person" table projected_people = Person.proj('name') # These are still people!
SQL Translation Reference¶
| DataJoint Pattern | SQL Equivalent |
|---|---|
Table.proj('col') | SELECT primary_key, col FROM table |
Table.proj(new='old') | SELECT primary_key, old AS new FROM table |
Table.proj(calc='expr') | SELECT primary_key, expr AS calc FROM table |
Table.proj('a', 'b') | SELECT primary_key, a, b FROM table |
Table.proj(..., new='expr') | SELECT *, expr AS new FROM table |
Table.proj(..., renamed='old') | SELECT col1, col2, old AS renamed FROM table |
Summary: The Power of Projection¶
Projection is essential for data transformation. Key patterns:
Column selection: Reduce data volume by selecting only needed columns
Column renaming: Create clearer, more consistent naming
Column computation: Create derived attributes for analysis
Order awareness: Compute attributes before using them in restrictions
Entity preservation: Remember that projections maintain entity type
These exercises demonstrate the systematic approach emphasized in the lecture: start simple, build incrementally, and always consider entity integrity and order of operations.