Architecture Overview

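The XO-Data platform is a monorepo-based data engineering platform. It extracts data from source systems, stages it in S3, loads it into Snowflake, and transforms it through BRONZE, SILVER, and GOLD layers, with scalability, reusability, and maintainability as design goals.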

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Data Sources                            │
│                                                              │
│  • Gladly API (customer service data)                       │
│  • Sprout Social API (social media data)                    │
│  • Gmail (email attachments)                                │
│  • Google Sheets (manual data entry)                        │
│  • S3 (file uploads)                                        │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       │ Extract (xo-foundry tasks)
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                  S3 Staging Layer                            │
│                                                              │
│  Ingest Bucket → Stage Bucket                               │
│  • Copy-then-Peek pattern (8KB header read)                 │
│  • Standardize column names                                 │
│  • Load strategy path segmentation                          │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       │ Load (TRUNCATE + COPY INTO)
                       ▼
┌─────────────────────────────────────────────────────────────┐
│            Snowflake Medallion Architecture                  │
│                                                              │
│  BRONZE Layer (Raw Data)                                     │
│  • All VARCHAR, truncated daily, 6 metadata columns          │
│  • Tables: GLADLY_CONTACT_TIMESTAMPS, SPROUT_MESSAGES       │
│                    ▼                                         │
│  SILVER Layer (Historical Preservation)                      │
│  • Proper data types, no enrichment, no filtering            │
│  • Tables: CONTACT_TIMESTAMPS, MESSAGES                     │
│                    ▼                                         │
│  GOLD Layer (Analytics - 4 Types)                            │
│  • fct_contacts, dim_agents, agg_agent_daily, rpt_dashboard │
└─────────────────────────────────────────────────────────────┘

Core Design Principles

1. Separation of Concerns

The platform separates concerns into distinct layers:

Extraction (xo-core): Data source connectors and extractors
Orchestration (xo-foundry): DAG Factory, Airflow tasks, pipeline configuration
Storage (Snowflake): Medallion architecture for data quality
Analytics (xo-lens): BI tools and visualizations
Navigation (xo-bosun): Monorepo CLI for developer productivity

2. Reusable Components

Common utilities are packaged for reuse:

xo-core: Foundation package with extractors, managers, utilities
xo-foundry: DAG Factory, task library, time windows, CLI tools
xo-lens: Analytics and visualization tools
xo-bosun: Monorepo navigation CLI

3. Configuration-Driven Pipelines (DAG Factory)

Pipelines are defined through YAML configuration and generated into Python DAGs:

dag:
  domain: warbyparker
  pipeline_name: gladly_daily
  pipeline_type: snowflake_load
  schedule: "50 6 * * *"
  time_window:
    refresh_type: daily
    lag: { days: 1 }
    timezone: "America/New_York"

globals:
  snowflake:
    database: WBP_DB_DEV
    schema: BRONZE

sources:
  contact_timestamps:
    source_type: gladly_api
    load_strategy: full_refresh
    extractor:
      metric_set: ContactTimestampsReport
    snowflake:
      target_table: GLADLY_CONTACT_TIMESTAMPS
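As a rough illustration of what validating the `dag` block could involve, the sketch below uses standard-library dataclasses so it runs without extra dependencies. The platform itself validates configuration with Pydantic models; the class name, the five-field cron check, and the `dag_id` naming scheme here are assumptions made for the example, not the real schema.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class DagConfig:
    """Hypothetical stand-in for the `dag` block of a pipeline config."""

    domain: str
    pipeline_name: str
    pipeline_type: str
    schedule: str

    def __post_init__(self) -> None:
        # Every identifier must be a non-empty string.
        for field_name in ("domain", "pipeline_name", "pipeline_type"):
            value = getattr(self, field_name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{field_name} must be a non-empty string")
        # Assumption: schedules are classic five-field cron expressions.
        if len(self.schedule.split()) != 5:
            raise ValueError(f"schedule must have 5 cron fields: {self.schedule!r}")

    @property
    def dag_id(self) -> str:
        # Assumed naming scheme, mirroring the DAG function name used on this page.
        return f"{self.domain}_{self.pipeline_name}_dag"


def parse_dag_config(raw: dict[str, Any]) -> DagConfig:
    """Build a DagConfig from the parsed YAML mapping, ignoring unrelated keys."""
    dag = raw["dag"]
    return DagConfig(
        domain=dag["domain"],
        pipeline_name=dag["pipeline_name"],
        pipeline_type=dag["pipeline_type"],
        schedule=dag["schedule"],
    )


config = parse_dag_config(
    {
        "dag": {
            "domain": "warbyparker",
            "pipeline_name": "gladly_daily",
            "pipeline_type": "snowflake_load",
            "schedule": "50 6 * * *",
            "time_window": {"refresh_type": "daily", "lag": {"days": 1}},
        }
    }
)
print(config.dag_id)  # → warbyparker_gladly_daily_dag
```

Rejecting a malformed config before any DAG code is generated keeps errors close to the YAML that caused them.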

4. Type Safety

All code must pass ty with zero errors:

uv run ty check --project packages/xo-core
uv run ty check --project packages/xo-foundry

Modern Python typing (list[str], dict[str, Any])
Type hints on all functions
Pydantic models for configuration validation

Monorepo Structure

xo-data/
├── packages/              # Reusable Python packages
│   ├── xo-core/          # Foundation utilities
│   │   ├── extractors/   # Data source connectors
│   │   ├── processors/   # Data transformations
│   │   ├── loaders/      # Data loading
│   │   └── utils/        # Shared utilities
│   │
│   ├── xo-foundry/       # Orchestration layer
│   │   ├── dag_factory/  # YAML → Python DAG generation
│   │   ├── tasks/        # Airflow task library
│   │   ├── schemas/      # Pydantic config models
│   │   ├── time_window/  # Time window management
│   │   └── cli/          # CLI tools
│   │
│   ├── xo-lens/          # Analytics layer
│   │   ├── dashboards/   # Streamlit apps
│   │   └── notebooks/    # Jupyter analysis
│   │
│   └── xo-bosun/         # Monorepo navigation CLI
│       └── cli/          # xo cd, xo list, xo setup
│
├── apps/                 # Deployment targets
│   ├── airflow/xo-pipelines/  # Airflow deployment (DAGs + configs)
│   ├── snowflake-schema/      # Snowflake schema migrations
│   └── material-mkdocs/       # This documentation
│
└── .claude/             # Project documentation & ADRs
    └── ongoing/         # Active documentation

Package Dependencies

xo-lens (Analytics)
    └── xo-core (Utilities)

xo-foundry (Orchestration)
    └── xo-core (Utilities)

xo-core (Foundation)
    └── pandas, snowflake-connector, boto3, etc.

xo-bosun (CLI)
    └── typer (standalone)

Key Principle: Packages may depend on xo-core but should not depend directly on each other; anything they share should come through xo-core.
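One way to keep this rule from drifting is a small check over the declared dependencies. The sketch below is illustrative only: the function name and the hard-coded dependency map are assumptions, and a real check would more likely read each package's `pyproject.toml`.

```python
# Internal packages named on this page.
INTERNAL_PACKAGES = {"xo-core", "xo-foundry", "xo-lens", "xo-bosun"}


def find_violations(dependencies: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (package, dependency) pairs where an internal package
    depends on an internal package other than xo-core."""
    violations: list[tuple[str, str]] = []
    for package in sorted(dependencies):
        for dep in sorted(dependencies[package]):
            if dep in INTERNAL_PACKAGES and dep not in ("xo-core", package):
                violations.append((package, dep))
    return violations


# Dependency map copied from the tree above.
declared = {
    "xo-core": {"pandas", "snowflake-connector", "boto3"},
    "xo-foundry": {"xo-core"},
    "xo-lens": {"xo-core"},
    "xo-bosun": {"typer"},
}
print(find_violations(declared))  # → []

# A hypothetical bad edit: xo-lens importing from xo-foundry.
broken = {**declared, "xo-lens": {"xo-core", "xo-foundry"}}
print(find_violations(broken))  # → [('xo-lens', 'xo-foundry')]
```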

Data Flow Pattern

ELT Workflow

All pipelines follow a standard Extract → Stage → Load → Transform pattern:

1. Extract
   Source System → S3 Ingest Bucket
   • API calls (Gladly, Sprout Social, Gmail, etc.)
   • Native Python csv.DictWriter (never pandas)
   • Original column names preserved

2. Stage
   S3 Ingest → S3 Stage Bucket
   • Copy-then-Peek pattern (S3-to-S3 copy + 8KB header read)
   • Standardize column names (UPPERCASE)
   • Load strategy path segmentation

3. Load
   S3 Stage → Snowflake BRONZE
   • TRUNCATE + COPY INTO with FORCE = TRUE in a transaction
   • All VARCHAR columns + 6 metadata columns
   • Idempotent (same result every run)

4. Transform
   BRONZE → SILVER → GOLD
   • dbt transformations
   • Silver: type conversions, historical preservation
   • Gold: enrichment, aggregation, reporting views
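The Extract and Stage steps above can be sketched with the standard library alone. This is a minimal illustration, not the xo-foundry task code: the function names and sample columns are hypothetical, and replacing spaces with underscores is an assumption on top of the documented uppercasing.

```python
import csv
import io


def write_rows_as_csv(rows: list[dict[str, str]]) -> str:
    """Extract step: serialize records with csv.DictWriter, keeping source column names."""
    if not rows:
        raise ValueError("rows must not be empty")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def standardize_header(csv_text: str) -> str:
    """Stage step: uppercase the header row; data rows pass through unchanged.

    Assumes the header sits on the first physical line (no quoted newlines).
    """
    header, _, body = csv_text.partition("\n")
    columns = next(csv.reader([header]))
    # Assumption: standardizing also trims whitespace and replaces spaces.
    cleaned = [column.strip().upper().replace(" ", "_") for column in columns]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerow(cleaned)
    return out.getvalue() + body


rows = [{"contactId": "c-1", "Agent Name": "Ada"}]  # hypothetical API records
ingest_csv = write_rows_as_csv(rows)
staged_csv = standardize_header(ingest_csv)
print(ingest_csv.splitlines()[0])  # → contactId,Agent Name
print(staged_csv.splitlines()[0])  # → CONTACTID,AGENT_NAME
```

Only the header line is rewritten, so the data rows never need to be parsed during staging.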

Learn more about ELT Flow →

DAG Factory

The DAG Factory converts YAML configurations into production-ready Airflow DAGs:

YAML Config → Pydantic Validation → Jinja2 Template → Python DAG
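The final two stages of that chain amount to filling a code template with validated values. The sketch below uses the standard library's `string.Template` as a dependency-free stand-in for Jinja2; the template text and the `render_dag` helper are hypothetical and much simpler than a real generated DAG.

```python
from string import Template

# Hypothetical template; the real DAG Factory renders Jinja2 templates.
DAG_TEMPLATE = Template('''\
from airflow.decorators import dag


@dag(schedule="$schedule", catchup=False)
def ${dag_id}():
    ...


${dag_id}()
''')


def render_dag(domain: str, pipeline_name: str, schedule: str) -> str:
    """Render Python source for a DAG module from already-validated values."""
    dag_id = f"{domain}_{pipeline_name}_dag"  # assumed naming scheme
    return DAG_TEMPLATE.substitute(dag_id=dag_id, schedule=schedule)


source = render_dag("warbyparker", "gladly_daily", "50 6 * * *")
# Syntax check only: compile() parses the source without importing Airflow.
compile(source, "<generated_dag>", "exec")
print(source)
```

Compiling the rendered source is a cheap guard against template mistakes before the file is written into the Airflow project.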

Learn more about DAG Factory →

Load Strategies

Three strategies per ADR 001:

| Strategy | Description | Use Case |
|---|---|---|
| `full_refresh` | Immutable daily snapshots | Most common (Gladly reports) |
| `incremental` | Full pulls with warehouse dedup | Google Sheets |
| `historical` | Late-arriving data, SCD Type 2 | Avoid when possible |
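To make "load strategy path segmentation" concrete, the sketch below treats the strategy as one segment of the stage-bucket key. The three strategy names come from the table; the key layout, the `stage_key` helper, and the file name are purely hypothetical, since the actual layout is defined in ADR 001.

```python
from datetime import date
from enum import Enum


class LoadStrategy(str, Enum):
    FULL_REFRESH = "full_refresh"
    INCREMENTAL = "incremental"
    HISTORICAL = "historical"


def stage_key(
    domain: str,
    source: str,
    strategy: LoadStrategy,
    window_date: date,
    filename: str,
) -> str:
    """Build a hypothetical stage-bucket key with the strategy as a path segment."""
    return f"{domain}/{source}/{strategy.value}/{window_date.isoformat()}/{filename}"


# LoadStrategy("...") rejects values outside the three documented strategies.
strategy = LoadStrategy("full_refresh")
key = stage_key("warbyparker", "contact_timestamps", strategy, date(2025, 1, 15), "data.csv")
print(key)  # → warbyparker/contact_timestamps/full_refresh/2025-01-15/data.csv
```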

Time Windows

Centralized time window management supports:

Daily: Single date (execution date minus lag)
Intraday Relative: Window from now minus lookback to now minus lag
Intraday Absolute: Fixed start/end times
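The daily and intraday-relative definitions reduce to short datetime arithmetic. The sketch below is an illustration under assumptions: the function names and signatures are hypothetical and do not reflect xo-foundry's time window module.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo


def daily_window(execution_time: datetime, lag_days: int, tz: tzinfo) -> date:
    """Daily: the local execution date minus the lag."""
    if execution_time.tzinfo is None:
        raise ValueError("execution_time must be timezone-aware")
    return execution_time.astimezone(tz).date() - timedelta(days=lag_days)


def intraday_relative_window(
    now: datetime, lookback: timedelta, lag: timedelta
) -> tuple[datetime, datetime]:
    """Intraday relative: (now - lookback, now - lag)."""
    if lookback <= lag:
        raise ValueError("lookback must be longer than lag")
    return now - lookback, now - lag


# A fixed UTC-5 offset keeps the example dependency-free; a real pipeline would
# more likely use zoneinfo.ZoneInfo("America/New_York") so DST is handled.
eastern_standard = timezone(timedelta(hours=-5))
run = datetime(2025, 1, 15, 3, 0, tzinfo=timezone.utc)  # 22:00 on Jan 14 at UTC-5
print(daily_window(run, lag_days=1, tz=eastern_standard))  # → 2025-01-13

now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
start, end = intraday_relative_window(now, timedelta(hours=2), timedelta(minutes=15))
print(start.isoformat(), end.isoformat())
# → 2025-01-15T10:00:00+00:00 2025-01-15T11:45:00+00:00
```

Converting to the pipeline's timezone before taking the date matters: the same instant is January 15 in UTC but still January 14 in the UTC-5 example.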

Copy-then-Peek Pattern

A performance optimization for the S3 staging step that precedes each Snowflake load:

from xo_foundry.s3_utils import copy_and_peek_s3_file

# 1. S3-to-S3 copy (fast, no download)
# 2. Range request for the first 8KB (headers only)
# Constant time (~0.5s) regardless of file size
headers = copy_and_peek_s3_file(source_bucket, source_key, dest_bucket, dest_key)
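For readers who want to see the two steps spelled out, here is a minimal sketch of the pattern. It is not the implementation of `copy_and_peek_s3_file`: the function and class names are hypothetical, and it assumes a client exposing boto3-style `copy_object` and `get_object(Range=...)` calls and a UTF-8 CSV whose header fits in the first 8KB. An in-memory stand-in replaces the real S3 client so the example runs anywhere.

```python
import csv
import io
from typing import Any

PEEK_BYTES = 8 * 1024  # 8KB header window


def parse_header(chunk: bytes) -> list[str]:
    """Extract CSV column names from the first bytes of a file."""
    first_line = chunk.split(b"\n", 1)[0].rstrip(b"\r")
    # utf-8-sig drops a byte-order mark if the file starts with one.
    return next(csv.reader([first_line.decode("utf-8-sig")]))


def copy_then_peek(
    s3: Any, source_bucket: str, source_key: str, dest_bucket: str, dest_key: str
) -> list[str]:
    # Server-side copy: the object never passes through this process.
    s3.copy_object(
        Bucket=dest_bucket,
        Key=dest_key,
        CopySource={"Bucket": source_bucket, "Key": source_key},
    )
    # Ranged GET: fetch only the first 8KB, enough for the header row.
    response = s3.get_object(
        Bucket=dest_bucket, Key=dest_key, Range=f"bytes=0-{PEEK_BYTES - 1}"
    )
    return parse_header(response["Body"].read())


class InMemoryS3:
    """Tiny stand-in for an S3 client, only for running this example."""

    def __init__(self) -> None:
        self.objects: dict[tuple[str, str], bytes] = {}

    def copy_object(self, Bucket: str, Key: str, CopySource: dict[str, str]) -> None:
        source = (CopySource["Bucket"], CopySource["Key"])
        self.objects[(Bucket, Key)] = self.objects[source]

    def get_object(self, Bucket: str, Key: str, Range: str) -> dict[str, Any]:
        start, end = (int(part) for part in Range.removeprefix("bytes=").split("-"))
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)][start : end + 1])}


s3 = InMemoryS3()
s3.objects[("ingest", "gladly/report.csv")] = b"contactId,Agent Name\r\nc-1,Ada\r\n"
print(copy_then_peek(s3, "ingest", "gladly/report.csv", "stage", "gladly/report.csv"))
# → ['contactId', 'Agent Name']
```

Because only the header bytes are downloaded, the cost of the peek does not grow with the size of the staged file.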

Learn more about Copy-then-Peek →

Snowflake Architecture

Medallion Layers

| Layer | Purpose | Naming | Key Rules |
|---|---|---|---|
| BRONZE | Raw landing zone | `{SOURCE}_{OBJECT}` | All VARCHAR, truncated daily, 6 metadata columns |
| SILVER | Historical preservation | `{OBJECT}` | Typed, no enrichment, no filtering |
| GOLD | Analytics (4 types) | `fct_`, `dim_`, `agg_`, `rpt_` | Enriched, aggregated, consumption-ready |
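The naming column lends itself to an automated check. The sketch below encodes the three conventions as regular expressions; the patterns and the casing rules (uppercase for BRONZE and SILVER, lowercase for GOLD) are inferred from the example table names on this page and are assumptions, not the official naming standard.

```python
import re

# Patterns inferred from the examples on this page (assumptions).
LAYER_PATTERNS = {
    "BRONZE": re.compile(r"[A-Z][A-Z0-9]*_[A-Z0-9_]+"),  # {SOURCE}_{OBJECT}
    "SILVER": re.compile(r"[A-Z][A-Z0-9_]*"),            # {OBJECT}
    "GOLD": re.compile(r"(fct|dim|agg|rpt)_[a-z0-9_]+"),  # typed prefix
}


def is_valid_table_name(layer: str, table: str) -> bool:
    """Check a table name against the convention for its layer."""
    try:
        pattern = LAYER_PATTERNS[layer.upper()]
    except KeyError:
        raise ValueError(f"unknown layer: {layer!r}") from None
    return pattern.fullmatch(table) is not None


print(is_valid_table_name("BRONZE", "GLADLY_CONTACT_TIMESTAMPS"))  # → True
print(is_valid_table_name("BRONZE", "MESSAGES"))                   # → False
print(is_valid_table_name("GOLD", "fct_contacts"))                 # → True
print(is_valid_table_name("GOLD", "contacts"))                     # → False
```

A pattern check cannot tell whether a SILVER name accidentally kept its source prefix, since `{OBJECT}` may itself contain underscores; that still needs review.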

Database Structure

WBP_DB (Warby Parker)
├── BRONZE.GLADLY_CONTACT_TIMESTAMPS
├── BRONZE.GLADLY_WORK_SESSIONS
├── BRONZE.SPROUT_MESSAGES
├── SILVER.CONTACT_TIMESTAMPS
├── SILVER.WORK_SESSIONS
├── GOLD.fct_contacts
├── GOLD.agg_agent_daily
└── GOLD.rpt_agent_dashboard

CND_DB (Conde Nast)
├── BRONZE.GLADLY_CONVERSATIONS
├── SILVER.CONVERSATIONS
└── GOLD.rpt_email_daily

CORE_DB (Shared)
├── BRONZE.BAMBOOHR_EMPLOYEES
├── SILVER.ROSTER_WARBYPARKER
├── SILVER.ROSTER_CONDENAST
└── GOLD.(cross-client dimensions)

Learn more about Medallion Architecture →

Orchestration with Airflow

TaskFlow API (Airflow 3.0)

DAGs are written with Airflow's TaskFlow decorators, and the tasks come from the xo-foundry task library:

from airflow.decorators import dag
from xo_foundry.tasks.extract_tasks import extract_gladly_data
from xo_foundry.tasks.stage_tasks import copy_and_standardize
from xo_foundry.tasks.snowflake_tasks import copy_to_snowflake

@dag(schedule="50 6 * * *", catchup=False)
def warbyparker_gladly_daily_dag():
    # Passing each task's output to the next sets the dependency order
    ingest = extract_gladly_data(...)      # Extract: source → S3 ingest bucket
    stage = copy_and_standardize(ingest)   # Stage: ingest bucket → stage bucket
    copy_to_snowflake(stage)               # Load: stage bucket → Snowflake BRONZE

# Calling the decorated function registers the DAG
warbyparker_gladly_daily_dag()

Deployment

Location: apps/airflow/xo-pipelines/
Local: astro dev start
Production: astro deploy <deployment-id>

Key Technologies

| Component | Technology | Purpose |
|---|---|---|
| Language | Python 3.12+ | Core platform |
| Package Manager | uv | Fast dependency management |
| Orchestration | Apache Airflow 3.0 | Pipeline scheduling |
| Data Warehouse | Snowflake | Data storage and transformation |
| Object Storage | AWS S3 | File staging |
| Type Checking | ty | Static type analysis |
| Linting | ruff | Code quality |
| Config Validation | Pydantic | YAML schema validation |
| Template Engine | Jinja2 | DAG code generation |
| Schema Management | schemachange | Snowflake migrations |
| Transformations | dbt | SILVER/GOLD layer SQL |

Security

Credentials Management

Environment variables for local development (.env)
Airflow Connections for deployed pipelines
AWS Secrets Manager for production credentials
Never commit credentials to git

Access Control

Snowflake RBAC with role hierarchy
S3 bucket policies for data isolation
Airflow role-based UI access

Next Steps

ELT Pipeline Flow -- Detailed pipeline walkthrough
ELT Layer Architecture -- Layer responsibilities
xo-core Package -- Foundation utilities
xo-foundry Package -- Orchestration layer
Snowflake Medallion Architecture -- Data layers
Naming Conventions -- Standards
Architecture Decisions -- ADRs

See Also:

Architecture Decisions -- All 10 ADRs
Client Registry -- Client codes and domains