
Languages

This example demonstrates a classic many-to-many relationship design using an association table. We’ll track people and the languages they speak, along with their proficiency levels, measured on the internationally recognized CEFR scale.

Database Schema

The database consists of four tables:

  1. Language - A lookup table containing ISO 639-1 language codes and names

  2. CEFRLevel - A lookup table defining the six CEFR proficiency levels (A1-C2)

  3. Person - Individual people with basic information

  4. Proficiency - An association table linking people, languages, and proficiency levels

This design demonstrates:

  • Many-to-many relationship: Each person can speak multiple languages, and each language is spoken by multiple people

  • Lookup tables: Both Language and CEFRLevel are lookup tables with predefined, standardized content

  • Association table with multiple foreign keys: Proficiency references Person, Language, and CEFRLevel

  • Normalization: CEFR levels are stored in their own table with additional metadata (descriptions, categories)

  • International standards: Uses ISO 639-1 codes for languages and CEFR levels for proficiency

Language Codes: ISO 639-1 Standard

We use ISO 639-1 language codes, the international standard for representing languages.

Background

  • ISO 639-1 is part of the ISO 639 series of standards for language codes

  • Established by the International Organization for Standardization (ISO)

  • Provides two-letter codes for major world languages

  • Used globally in software, databases, and international systems

Benefits of ISO 639-1 Codes

  1. International Standard: Recognized worldwide across industries

  2. Consistent: Two-letter format ensures uniform representation

  3. Comprehensive: Covers major languages with official status

  4. Future-proof: Maintained and updated by ISO

  5. Integration: Compatible with web standards (HTML lang attributes, etc.)

Examples of ISO 639-1 Codes

  • en - English

  • es - Spanish

  • fr - French

  • de - German

  • ja - Japanese

  • zh - Chinese

  • ar - Arabic

  • hi - Hindi

Database Design Considerations

Using standardized codes in our database ensures:

  • Data consistency across different systems

  • Easy integration with external APIs and services

  • Future compatibility with international standards

  • Reduced ambiguity compared to custom codes
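Before inserting new rows, a quick format check can catch obvious mistakes (such as country codes or three-letter ISO 639-2 codes slipping in). The helper below is a hypothetical sketch, not part of DataJoint; it validates only the two-lowercase-letter shape, not membership in the actual ISO 639-1 registry:

```python
def looks_like_iso639_1(code: str) -> bool:
    """Check that a candidate language code has the ISO 639-1 shape:
    exactly two lowercase ASCII letters. Format check only, not a
    lookup against the official ISO 639-1 registry."""
    return len(code) == 2 and code.isascii() and code.isalpha() and code.islower()

assert looks_like_iso639_1('en')       # valid format
assert not looks_like_iso639_1('eng')  # three letters: ISO 639-2, not 639-1
assert not looks_like_iso639_1('EN')   # ISO 639-1 codes are lowercase
```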

Table Definition

%xmode minimal
import datajoint as dj
dj.config['display.limit'] = 6  # keep output concise

# Create schema
schema = dj.Schema('languages_example')

@schema
class Language(dj.Lookup):
    definition = """
    lang_code : char(2)  # ISO 639-1 language code (e.g., 'en', 'es', 'ja')
    ---
    language : varchar(30)  # Full language name
    native_name : varchar(50)  # Language name in its native script
    """
    contents = [
        # Format: (lang_code, language, native_name)
        ('ar', 'Arabic', 'العربية'), ('da', 'Danish', 'Dansk'),
        ('de', 'German', 'Deutsch'), ('el', 'Greek', 'Ελληνικά'),
        ('en', 'English', 'English'), ('es', 'Spanish', 'Español'),
        ('fi', 'Finnish', 'Suomi'), ('fr', 'French', 'Français'),
        ('he', 'Hebrew', 'עברית'), ('hi', 'Hindi', 'हिन्दी'),
        ('id', 'Indonesian', 'Bahasa Indonesia'),
        ('it', 'Italian', 'Italiano'), ('ja', 'Japanese', '日本語'),
        ('ko', 'Korean', '한국어'), ('ms', 'Malay', 'Bahasa Melayu'),
        ('nl', 'Dutch', 'Nederlands'), ('no', 'Norwegian', 'Norsk'),
        ('tl', 'Tagalog', 'Tagalog'), ('pl', 'Polish', 'Polski'),
        ('pt', 'Portuguese', 'Português'), ('ru', 'Russian', 'Русский'),
        ('sa', 'Sanskrit', 'संस्कृतम्'), ('sv', 'Swedish', 'Svenska'),
        ('th', 'Thai', 'ไทย'), ('tr', 'Turkish', 'Türkçe'),
        ('uk', 'Ukrainian', 'Українська'),
        ('vi', 'Vietnamese', 'Tiếng Việt'), ('zh', 'Chinese', '中文')]

@schema
class CEFRLevel(dj.Lookup):
    definition = """
    cefr_level : char(2)  # CEFR proficiency level code (A1, A2, B1, B2, C1, C2)
    ---
    level_name : varchar(30)  # Full name of the level
    category : enum('Basic', 'Independent', 'Proficient')  # User category
    description : varchar(255)  # Brief description of abilities at this level
    """
    contents = [
        ('A1', 'Beginner', 'Basic', 
         'Can understand and use familiar everyday expressions and very basic phrases'),
        ('A2', 'Elementary', 'Basic',
         'Can communicate in simple routine tasks requiring direct exchange of information'),
        ('B1', 'Intermediate', 'Independent',
         'Can deal with most situations while traveling and produce simple connected text'),
        ('B2', 'Upper Intermediate', 'Independent',
         'Can interact with fluency and spontaneity and produce clear, detailed text'),
        ('C1', 'Advanced', 'Proficient',
         'Can express ideas fluently and use language flexibly for social and professional purposes'),
        ('C2', 'Mastery', 'Proficient',
         'Can understand virtually everything and express themselves with precision')
    ]

Benefits of Using ISO 639-1 Codes

1. International Compatibility

# These codes work with web standards and international APIs
lang_code = 'en'  # any ISO 639-1 code from the Language table
web_lang_attr = f'<html lang="{lang_code}">'  # e.g., <html lang="en">
api_request = f'https://api.translate.com?lang={lang_code}'  # e.g., lang=en

2. Consistent Data Representation

# All systems recognize these codes
browser_detection = {'en': 'English', 'es': 'Spanish', 'ja': 'Japanese'}
database_lookup = Language & {'lang_code': 'en'}  # unambiguous restriction by standard code

3. Future-Proof Design

# New languages can be added following the same standard
# ISO maintains and updates the standard regularly
new_languages = [
    ('sw', 'Swahili', 'Kiswahili'),
    ('bn', 'Bengali', 'বাংলা'),
    ('fa', 'Persian', 'فارسی')
]

4. Integration with External Services

# Compatible with translation services, content management systems
translation_api = f'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl={lang_code}'
content_management = f'<meta name="language" content="{lang_code}">'

Language Proficiency: CEFR Levels

For measuring language proficiency, we use the CEFR (Common European Framework of Reference for Languages), which is the international standard for describing language ability.

Background

The Common European Framework of Reference for Languages (CEFR) was developed by the Council of Europe and published in 2001. It provides:

  • Standardized proficiency descriptors recognized worldwide

  • Six proficiency levels from beginner to mastery

  • Can-do statements describing practical abilities at each level

  • Common reference for language teaching, testing, and certification

The Six CEFR Levels

The CEFR defines six levels, grouped into three broad categories:

A - Basic User
  • A1 (Beginner): Can understand and use familiar everyday expressions and very basic phrases. Can introduce themselves and ask and answer simple personal questions.

  • A2 (Elementary): Can understand sentences and frequently used expressions. Can communicate in simple routine tasks requiring direct exchange of information.

B - Independent User
  • B1 (Intermediate): Can understand main points on familiar matters. Can deal with most situations while traveling. Can produce simple connected text on familiar topics.

  • B2 (Upper Intermediate): Can understand main ideas of complex text. Can interact with native speakers with fluency and spontaneity. Can produce clear, detailed text on various subjects.

C - Proficient User
  • C1 (Advanced): Can understand a wide range of demanding texts. Can express ideas fluently and spontaneously. Can use language flexibly for social, academic, and professional purposes.

  • C2 (Mastery): Can understand virtually everything heard or read. Can summarize information from different sources. Can express themselves with precision and subtle distinction of meaning.
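The level-to-category grouping above can be mirrored in plain Python when no database connection is at hand. This is an illustrative helper that simply restates the CEFRLevel contents:

```python
# Mirror of the CEFR level -> broad category grouping described above
CEFR_CATEGORY = {
    'A1': 'Basic', 'A2': 'Basic',
    'B1': 'Independent', 'B2': 'Independent',
    'C1': 'Proficient', 'C2': 'Proficient',
}

def cefr_category(level):
    """Return the broad user category for a CEFR level code."""
    return CEFR_CATEGORY[level]
```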

Benefits of Using CEFR in Our Database

  1. International Recognition: CEFR is used by universities, employers, and governments worldwide

  2. Precise Measurement: Six levels provide nuanced proficiency assessment

  3. Practical Focus: Levels describe what learners can actually do with the language

  4. Career Relevance: Many job postings and educational programs reference CEFR levels

  5. Standardized Testing: Major language tests (TOEFL, IELTS, DELE, etc.) map to CEFR levels

CEFR Level Equivalencies

Common language tests map to CEFR as follows:

  • TOEFL iBT: 42-71 (B1), 72-94 (B2), 95-120 (C1-C2)

  • IELTS: 4.0-5.0 (B1), 5.5-6.5 (B2), 7.0-9.0 (C1-C2)

  • Cambridge: PET (B1), FCE (B2), CAE (C1), CPE (C2)
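The TOEFL equivalencies above can be written as a small lookup function. The boundaries come directly from the rough ranges listed, so treat this as an illustrative sketch rather than an official conversion:

```python
def toefl_to_cefr(score):
    """Map a TOEFL iBT score (0-120) to an approximate CEFR band,
    using the rough equivalencies listed above."""
    if 95 <= score <= 120:
        return 'C1-C2'
    if 72 <= score <= 94:
        return 'B2'
    if 42 <= score <= 71:
        return 'B1'
    return None  # below the B1 range (or out of bounds)
```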

Using Language and Proficiency Tables

Let’s use the Language table to create a set of persons with different languages they speak.

@schema
class Person(dj.Manual):
    definition = """
    person_id : int  # Unique identifier for each person
    ---
    name : varchar(60)  # Person's name
    date_of_birth : date  # Date of birth
    """

@schema
class Proficiency(dj.Manual):
    definition = """
    -> Person
    -> Language
    ---
    -> CEFRLevel
    """

Note that -> CEFRLevel appears below the --- divider, making the CEFR level a non-key attribute: each (person, language) pair is recorded at most once, with exactly one proficiency level.

Complete Schema Diagram

dj.Diagram(schema)

Populating the Person and Proficiency data

Let’s use Faker to generate realistic sample data for our Person table:

# Generate sample people data using Faker
import numpy as np
from faker import Faker

fake = Faker()

# Set seed for reproducible results
np.random.seed(42)
fake.seed_instance(42)

# Generate n people with diverse backgrounds
n = 500  # number of people to generate
Person.insert(
    {
        'person_id': i,
        'name': fake.name(),
        'date_of_birth': fake.date_of_birth(minimum_age=18, maximum_age=80)
    } for i in range(n))

Person()

Now let’s create random language fluency data by assigning people various language skills:

lang_keys = Language.fetch("KEY")
cefr_keys = CEFRLevel.fetch("KEY")
# Sampling weights for levels A1 through C2, in primary-key order
cefr_probabilities = [0.08, 0.12, 0.13, 0.17, 0.20, 0.30]
average_languages = 2.5

for person_key in Person.fetch("KEY"):
    num_languages = min(np.random.poisson(average_languages), len(lang_keys))  # cap so replace=False cannot fail
    Proficiency.insert(
        {
            **person_key,
            **lang_key,
            **np.random.choice(cefr_keys, p=cefr_probabilities)
        } for lang_key in np.random.choice(lang_keys, num_languages, replace=False)
    )

Proficiency()

Why CEFR Levels Improve Queries

Using CEFR levels in our database enables more meaningful queries:

Granular Filtering: Query for specific proficiency ranges

# Find people with working proficiency (B1-B2)
Proficiency & 'cefr_level in ("B1", "B2")'

# Find advanced speakers (C1-C2)
Proficiency & 'cefr_level in ("C1", "C2")'

Comparable Across Languages: CEFR provides consistent measurement

# Find people who are intermediate (B1) or better in ANY language
Person & (Proficiency & 'cefr_level >= "B1"')
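The restriction 'cefr_level >= "B1"' works because the CEFR codes happen to sort lexicographically in proficiency order (A1 < A2 < B1 < B2 < C1 < C2), so plain string comparison matches the intended ordering. A quick check in Python:

```python
levels = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']

# Lexicographic order coincides with proficiency order,
# so sorting the codes as strings preserves the CEFR ordering
assert sorted(levels) == levels
assert 'C1' >= 'B1' and 'A2' < 'B1'
```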

Career-Relevant: Matches real-world job requirements

# Many job postings require "B2 or higher"
Person & (Proficiency & 'cefr_level >= "B2"')  # candidates meeting this criterion

Sample Queries with Populated Data

Now that we have data in all three tables, let’s run some example queries:

# Query 1: Find the names of all proficient English speakers (C1 or C2)
Person.proj('name') & (Proficiency & {'lang_code': 'en'} & 'cefr_level in ("C1", "C2")')
# Query 2: Names of people who speak English or Spanish at any level 
Person.proj('name') & (Proficiency & 'lang_code in ("en", "es")')
# Query 3: Names of people who speak English AND Spanish at any level 
Person.proj('name') & (Proficiency & {'lang_code': 'en'}) & (Proficiency & {'lang_code': 'es'}) 
# Query 4: Show the people who speak at least four languages
Person.aggr(Proficiency, 'name',
            nlanguages='count(lang_code)', languages='GROUP_CONCAT(lang_code)'
            ) & 'nlanguages >= 4'
# Query 5: Show the top 3 people by number of languages spoken
Person.aggr(Proficiency, 'name',
            nlanguages='count(lang_code)'
            ) & dj.Top(3, order_by='nlanguages desc')
# Query 6: Show all the people Lindsay Roman (person_id=7) can communicate with

Person & (
    Proficiency * Proficiency.proj(other_person='person_id') & {'other_person': 7}
    )
# Query 7: Find people with at least intermediate proficiency (B1+) in Spanish
Person.proj('name') & (Proficiency & {'lang_code': 'es'} & 'cefr_level >= "B1"')
# Query 8: Show each language with the number of people who speak it
Language.aggr(Proficiency, 'language', nspeakers='count(person_id)')
# Query 9: Count people at each CEFR level for English
CEFRLevel.aggr(Proficiency & {'lang_code': 'en'}, nspeakers='count(person_id)')
# Query 10: Show the top 3 languages by number of speakers
Language.aggr(Proficiency, 'language', nspeakers='count(person_id)') & dj.Top(3, order_by='nspeakers desc')
# Query 11: Show language skills with full CEFR level descriptions for person 0
# This demonstrates the benefit of having CEFRLevel as a lookup table
(Person * Proficiency * Language * CEFRLevel).proj(
    'name', 'language', 'level_name', 'category', 'description'
) & {'person_id': 0}
# Query 12: Find polyglots with C1 or C2 (advanced/mastery) proficiency in at least two languages
Person.aggr(
    Proficiency & 'cefr_level>="C1"',
    'name',
    num_mastered='count(*)'
) & 'num_mastered >= 2'
# If you need to re-run this example, you can drop the schema by uncommenting the following line:

# schema.drop()