Architecture Overview

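The XO-Data platform is a monorepo-based data engineering platform. Reusable Python packages extract data from source systems, stage it in S3, and load it into a Snowflake medallion architecture, with Airflow orchestrating each pipeline.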

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Data Sources                            │
│                                                              │
│  • Gladly API (customer service data)                       │
│  • Sprout Social API (social media data)                    │
│  • Gmail (email attachments)                                │
│  • Google Sheets (manual data entry)                        │
│  • S3 (file uploads)                                        │
└──────────────────────┬──────────────────────────────────────┘
                       │ Extract (xo-foundry tasks)
┌─────────────────────────────────────────────────────────────┐
│                  S3 Staging Layer                            │
│                                                              │
│  Ingest Bucket → Stage Bucket                               │
│  • Copy-then-Peek pattern (8KB header read)                 │
│  • Standardize column names                                 │
│  • Load strategy path segmentation                          │
└──────────────────────┬──────────────────────────────────────┘
                       │ Load (TRUNCATE + COPY INTO)
┌─────────────────────────────────────────────────────────────┐
│            Snowflake Medallion Architecture                  │
│                                                              │
│  BRONZE Layer (Raw Data)                                     │
│  • All VARCHAR, truncated daily, 6 metadata columns          │
│  • Tables: GLADLY_CONTACT_TIMESTAMPS, SPROUT_MESSAGES       │
│                    ▼                                         │
│  SILVER Layer (Historical Preservation)                      │
│  • Proper data types, no enrichment, no filtering            │
│  • Tables: CONTACT_TIMESTAMPS, MESSAGES                     │
│                    ▼                                         │
│  GOLD Layer (Analytics - 4 Types)                            │
│  • fct_contacts, dim_agents, agg_agent_daily, rpt_dashboard │
└─────────────────────────────────────────────────────────────┘

Core Design Principles

1. Separation of Concerns

The platform separates concerns into distinct layers:

  • Extraction (xo-core): Data source connectors and extractors
  • Orchestration (xo-foundry): DAG Factory, Airflow tasks, pipeline configuration
  • Storage (Snowflake): Medallion architecture for data quality
  • Analytics (xo-lens): BI tools and visualizations
  • Navigation (xo-bosun): Monorepo CLI for developer productivity

2. Reusable Components

Common utilities are packaged for reuse:

  • xo-core: Foundation package with extractors, managers, utilities
  • xo-foundry: DAG Factory, task library, time windows, CLI tools
  • xo-lens: Analytics and visualization tools
  • xo-bosun: Monorepo navigation CLI

3. Configuration-Driven Pipelines (DAG Factory)

Pipelines are defined through YAML configuration and generated into Python DAGs:

dag:
  domain: warbyparker
  pipeline_name: gladly_daily
  pipeline_type: snowflake_load
  schedule: "50 6 * * *"
  time_window:
    refresh_type: daily
    lag: { days: 1 }
    timezone: "America/New_York"

globals:
  snowflake:
    database: WBP_DB_DEV
    schema: BRONZE

sources:
  contact_timestamps:
    source_type: gladly_api
    load_strategy: full_refresh
    extractor:
      metric_set: ContactTimestampsReport
    snowflake:
      target_table: GLADLY_CONTACT_TIMESTAMPS
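
A configuration like the one above is validated before any DAG code is generated. The platform uses Pydantic for this; the sketch below uses stdlib dataclasses instead so that it runs without dependencies. The names `SourceConfig`, `PipelineConfig`, and `parse_config` are hypothetical and are not the xo-foundry schema.

```python
from dataclasses import dataclass
from typing import Any

# The three strategies this document lists under "Load Strategies".
LOAD_STRATEGIES = {"full_refresh", "incremental", "historical"}


@dataclass(frozen=True)
class SourceConfig:
    """One entry under `sources:` (hypothetical model)."""

    name: str
    source_type: str
    load_strategy: str
    target_table: str

    def __post_init__(self) -> None:
        if self.load_strategy not in LOAD_STRATEGIES:
            raise ValueError(
                f"{self.name}: unknown load_strategy {self.load_strategy!r}"
            )


@dataclass(frozen=True)
class PipelineConfig:
    """The `dag:` block plus its sources (hypothetical model)."""

    domain: str
    pipeline_name: str
    schedule: str
    sources: list[SourceConfig]


def parse_config(raw: dict[str, Any]) -> PipelineConfig:
    """Build typed config objects from an already-parsed YAML mapping."""
    dag = raw["dag"]
    sources = [
        SourceConfig(
            name=name,
            source_type=body["source_type"],
            load_strategy=body["load_strategy"],
            target_table=body["snowflake"]["target_table"],
        )
        for name, body in raw["sources"].items()
    ]
    if not sources:
        raise ValueError("at least one source is required")
    return PipelineConfig(
        domain=dag["domain"],
        pipeline_name=dag["pipeline_name"],
        schedule=dag["schedule"],
        sources=sources,
    )


# The same values as the YAML example, written as a parsed mapping.
RAW_CONFIG: dict[str, Any] = {
    "dag": {
        "domain": "warbyparker",
        "pipeline_name": "gladly_daily",
        "schedule": "50 6 * * *",
    },
    "sources": {
        "contact_timestamps": {
            "source_type": "gladly_api",
            "load_strategy": "full_refresh",
            "snowflake": {"target_table": "GLADLY_CONTACT_TIMESTAMPS"},
        }
    },
}

config = parse_config(RAW_CONFIG)
```

A misspelled `load_strategy` fails at parse time with a `ValueError`, before any DAG file is written.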

4. Type Safety

All code must pass ty with zero errors:

uv run ty check --project packages/xo-core
uv run ty check --project packages/xo-foundry

Typing conventions:

  • Modern Python typing (list[str], dict[str, Any])
  • Type hints on all functions
  • Pydantic models for configuration validation

Monorepo Structure

xo-data/
├── packages/              # Reusable Python packages
│   ├── xo-core/          # Foundation utilities
│   │   ├── extractors/   # Data source connectors
│   │   ├── processors/   # Data transformations
│   │   ├── loaders/      # Data loading
│   │   └── utils/        # Shared utilities
│   │
│   ├── xo-foundry/       # Orchestration layer
│   │   ├── dag_factory/  # YAML → Python DAG generation
│   │   ├── tasks/        # Airflow task library
│   │   ├── schemas/      # Pydantic config models
│   │   ├── time_window/  # Time window management
│   │   └── cli/          # CLI tools
│   │
│   ├── xo-lens/          # Analytics layer
│   │   ├── dashboards/   # Streamlit apps
│   │   └── notebooks/    # Jupyter analysis
│   │
│   └── xo-bosun/         # Monorepo navigation CLI
│       └── cli/          # xo cd, xo list, xo setup
├── apps/                 # Deployment targets
│   ├── airflow/xo-pipelines/  # Airflow deployment (DAGs + configs)
│   ├── snowflake-schema/      # Snowflake schema migrations
│   └── material-mkdocs/       # This documentation
└── .claude/             # Project documentation & ADRs
    └── ongoing/         # Active documentation

Package Dependencies

xo-lens (Analytics)
    └── xo-core (Utilities)

xo-foundry (Orchestration)
    └── xo-core (Utilities)

xo-core (Foundation)
    └── pandas, snowflake-connector, boto3, etc.

xo-bosun (CLI)
    └── typer (standalone)

Key Principle: Packages may depend on xo-core but should not depend on one another. Code that two packages both need belongs in xo-core.
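
The dependency rule above can be checked mechanically. The following is a minimal sketch that assumes the internal dependencies are available as a plain mapping; `INTERNAL_DEPS` and `find_violations` are hypothetical names, and third-party packages (pandas, typer, and so on) are left out.

```python
# Internal dependencies as listed in the tree above.
INTERNAL_DEPS: dict[str, set[str]] = {
    "xo-core": set(),
    "xo-foundry": {"xo-core"},
    "xo-lens": {"xo-core"},
    "xo-bosun": set(),
}


def find_violations(
    deps: dict[str, set[str]], foundation: str = "xo-core"
) -> list[tuple[str, str]]:
    """Return (package, dependency) pairs that break the rule.

    A pair breaks the rule when the dependency is an internal package
    other than the foundation package.
    """
    internal = set(deps)
    return sorted(
        (package, dependency)
        for package, package_deps in deps.items()
        for dependency in package_deps
        if dependency in internal and dependency != foundation
    )


violations = find_violations(INTERNAL_DEPS)
```

With the dependencies listed in this document, `violations` is empty. Adding `xo-foundry` to the dependencies of `xo-lens` would report that pair.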

Data Flow Pattern

ELT Workflow

All pipelines follow a standard Extract → Stage → Load → Transform pattern:

1. Extract
   Source System → S3 Ingest Bucket
   • API calls (Gladly, Sprout Social, Gmail, etc.)
   • Native Python csv.DictWriter (never pandas)
   • Original column names preserved

2. Stage
   S3 Ingest → S3 Stage Bucket
   • Copy-then-Peek pattern (S3-to-S3 copy + 8KB header read)
   • Standardize column names (UPPERCASE)
   • Load strategy path segmentation

3. Load
   S3 Stage → Snowflake BRONZE
   • TRUNCATE + COPY INTO with FORCE = TRUE in transaction
   • All VARCHAR columns + 6 metadata columns
   • Idempotent (same result every run)

4. Transform
   BRONZE → SILVER → GOLD
   • dbt transformations
   • Silver: type conversions, historical preservation
   • Gold: enrichment, aggregation, reporting views
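
The Load step is idempotent because the table is emptied before every load and FORCE = TRUE makes COPY INTO reload staged files even when Snowflake's load metadata says they were loaded before. A minimal sketch of the statement sequence follows. `build_load_statements` and the stage path are hypothetical, and file format options are omitted; the real load task may differ.

```python
def build_load_statements(table: str, stage_path: str) -> list[str]:
    """Return the statements for one BRONZE load, in execution order."""
    return [
        "BEGIN",
        # Empty the table so a rerun does not append duplicate rows.
        f"TRUNCATE TABLE {table}",
        # FORCE = TRUE reloads files that were already loaded once.
        f"COPY INTO {table} FROM {stage_path} FORCE = TRUE",
        "COMMIT",
    ]


statements = build_load_statements(
    table="BRONZE.GLADLY_CONTACT_TIMESTAMPS",
    stage_path="@BRONZE.EXAMPLE_STAGE/gladly/",  # hypothetical stage name
)
```

Running the same four statements twice against the same staged files leaves the table with the same rows, which is the "same result every run" property described above.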

Learn more about ELT Flow →

DAG Factory

The DAG Factory converts YAML configurations into production-ready Airflow DAGs:

YAML Config → Pydantic Validation → Jinja2 Template → Python DAG
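
The last two stages of this chain can be sketched as template rendering followed by a syntax check. The platform uses Jinja2; the sketch below substitutes the stdlib `string.Template` so that it runs without dependencies. `DAG_TEMPLATE` and `render_dag` are hypothetical, and the `<domain>_<pipeline_name>_dag` naming is inferred from the TaskFlow example in this document rather than confirmed.

```python
import ast
from string import Template

# A deliberately tiny template; the real one would also emit the tasks.
DAG_TEMPLATE = Template('''\
from airflow.decorators import dag


@dag(schedule="$schedule", catchup=False)
def $dag_id():
    ...


$dag_id()
''')


def render_dag(domain: str, pipeline_name: str, schedule: str) -> str:
    """Render DAG source code from validated config values."""
    dag_id = f"{domain}_{pipeline_name}_dag"
    if not dag_id.isidentifier():
        raise ValueError(f"{dag_id!r} is not a valid Python function name")
    source = DAG_TEMPLATE.substitute(dag_id=dag_id, schedule=schedule)
    # Fail fast if the rendered module is not syntactically valid Python.
    ast.parse(source)
    return source


dag_source = render_dag("warbyparker", "gladly_daily", "50 6 * * *")
```

Parsing the rendered source with `ast.parse` catches template mistakes at generation time instead of when Airflow imports the file.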

Learn more about DAG Factory →

Load Strategies

Three strategies per ADR 001:

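| Strategy     | Description                     | Use Case                     |
|--------------|---------------------------------|------------------------------|
| full_refresh | Immutable daily snapshots       | Most common (Gladly reports) |
| incremental  | Full pulls with warehouse dedup | Google Sheets                |
| historical   | Late-arriving data, SCD Type 2  | Avoid when possible          |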
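
One way to picture "load strategy path segmentation" is a stage key whose first segment is the strategy. The sketch below is only an illustration: the key layout, `LoadStrategy`, and `stage_key` are all hypothetical, and the layout xo-foundry actually uses is not specified in this document.

```python
from datetime import date
from enum import Enum


class LoadStrategy(str, Enum):
    """The three strategies from the table above."""

    FULL_REFRESH = "full_refresh"
    INCREMENTAL = "incremental"
    HISTORICAL = "historical"


def stage_key(
    strategy: LoadStrategy, domain: str, table: str, run_date: date
) -> str:
    """Build a stage object key.

    Hypothetical layout: <strategy>/<domain>/<table>/<YYYY-MM-DD>/data.csv
    """
    return (
        f"{strategy.value}/{domain}/{table}/{run_date.isoformat()}/data.csv"
    )


key = stage_key(
    LoadStrategy("full_refresh"),  # lookup by the value used in the YAML
    domain="warbyparker",
    table="GLADLY_CONTACT_TIMESTAMPS",
    run_date=date(2024, 3, 9),
)
```

Looking the strategy up by value means an unknown strategy string raises a `ValueError` before any path is built.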

Time Windows

Centralized time window management supports:

  • Daily: Single date (execution date minus lag)
  • Intraday Relative: Window from now minus lookback to now minus lag
  • Intraday Absolute: Fixed start/end times
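
The first two window types reduce to simple date arithmetic. A minimal sketch, assuming the caller passes a timezone-aware datetime that is already in the pipeline's configured timezone; `daily_window` and `intraday_relative_window` are hypothetical names, not the xo-foundry API.

```python
from datetime import date, datetime, timedelta, timezone


def daily_window(execution: datetime, lag_days: int) -> date:
    """Daily: the execution date minus the lag, as a single date."""
    return execution.date() - timedelta(days=lag_days)


def intraday_relative_window(
    now: datetime, lookback: timedelta, lag: timedelta
) -> tuple[datetime, datetime]:
    """Intraday relative: from (now - lookback) to (now - lag)."""
    if lookback <= lag:
        raise ValueError("lookback must be longer than lag")
    return now - lookback, now - lag


# A run at 06:50 with a one-day lag targets the previous calendar day.
run_time = datetime(2024, 3, 10, 6, 50, tzinfo=timezone.utc)
target_date = daily_window(run_time, lag_days=1)

# A three-hour lookback with a 15-minute lag, evaluated at noon.
noon = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)
start, end = intraday_relative_window(
    noon, lookback=timedelta(hours=3), lag=timedelta(minutes=15)
)
```

The example uses UTC to stay self-contained; a pipeline configured with `timezone: "America/New_York"` would convert the run time to that zone before calling `daily_window`.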

Copy-then-Peek Pattern

A performance optimization for S3-to-Snowflake operations:

from xo_foundry.s3_utils import copy_and_peek_s3_file

# 1. S3-to-S3 copy (fast, no download)
# 2. Range request for the first 8KB (headers only)
# Runs in roughly constant time (~0.5s) regardless of file size.
headers = copy_and_peek_s3_file(source_bucket, source_key, dest_bucket, dest_key)
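
The "peek" half of the pattern only needs the first 8KB of the file to recover the column names. The sketch below shows that parsing step on in-memory bytes, with no S3 calls. `peek_header` and `standardize` are hypothetical helpers, the sample row is made up, and stripping whitespace is an assumption beyond the uppercasing this document describes.

```python
import csv
import io

PEEK_BYTES = 8 * 1024  # the 8KB range described above


def peek_header(first_bytes: bytes) -> list[str]:
    """Parse the CSV header row from the first bytes of a file."""
    # errors="ignore" drops a multi-byte character cut off at the boundary.
    text = first_bytes[:PEEK_BYTES].decode("utf-8", errors="ignore")
    return next(csv.reader(io.StringIO(text)))


def standardize(columns: list[str]) -> list[str]:
    """Uppercase column names (whitespace stripping is an assumption)."""
    return [column.strip().upper() for column in columns]


# Build a small sample file the way the Extract step writes one:
# csv.DictWriter with the source's original column names.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["contactId", "createdAt"])
writer.writeheader()
writer.writerow({"contactId": "c-1", "createdAt": "2024-03-09T12:00:00Z"})
sample = buffer.getvalue().encode("utf-8")

columns = standardize(peek_header(sample))
```

Because only the first 8KB is read, the cost of this step does not grow with file size. The trade-off is that a header row longer than 8KB would come back truncated.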

Learn more about Copy-then-Peek →

Snowflake Architecture

Medallion Layers

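| Layer  | Purpose                 | Naming                 | Key Rules                                        |
|--------|-------------------------|------------------------|--------------------------------------------------|
| BRONZE | Raw landing zone        | {SOURCE}_{OBJECT}      | All VARCHAR, truncated daily, 6 metadata columns |
| SILVER | Historical preservation | {OBJECT}               | Typed, no enrichment, no filtering               |
| GOLD   | Analytics (4 types)     | fct_, dim_, agg_, rpt_ | Enriched, aggregated, consumption-ready          |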
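
The naming rules in this table are simple enough to encode as helpers, for example in a migration lint. A minimal sketch follows; `bronze_table`, `silver_table`, and `is_valid_gold_name` are hypothetical names.

```python
GOLD_PREFIXES = ("fct_", "dim_", "agg_", "rpt_")


def bronze_table(source: str, obj: str) -> str:
    """BRONZE tables are named {SOURCE}_{OBJECT}."""
    return f"{source}_{obj}".upper()


def silver_table(obj: str) -> str:
    """SILVER tables drop the source prefix: {OBJECT}."""
    return obj.upper()


def is_valid_gold_name(name: str) -> bool:
    """GOLD objects must start with one of the four type prefixes."""
    return name.startswith(GOLD_PREFIXES)


bronze = bronze_table("gladly", "contact_timestamps")
silver = silver_table("contact_timestamps")
```

Here `bronze` and `silver` match the `GLADLY_CONTACT_TIMESTAMPS` and `CONTACT_TIMESTAMPS` tables shown in the diagrams in this document.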

Database Structure

WBP_DB (Warby Parker)
├── BRONZE.GLADLY_CONTACT_TIMESTAMPS
├── BRONZE.GLADLY_WORK_SESSIONS
├── BRONZE.SPROUT_MESSAGES
├── SILVER.CONTACT_TIMESTAMPS
├── SILVER.WORK_SESSIONS
├── GOLD.fct_contacts
├── GOLD.agg_agent_daily
└── GOLD.rpt_agent_dashboard

CND_DB (Conde Nast)
├── BRONZE.GLADLY_CONVERSATIONS
├── SILVER.CONVERSATIONS
└── GOLD.rpt_email_daily

CORE_DB (Shared)
├── BRONZE.BAMBOOHR_EMPLOYEES
├── SILVER.ROSTER_WARBYPARKER
├── SILVER.ROSTER_CONDENAST
└── GOLD.(cross-client dimensions)

Learn more about Medallion Architecture →

Orchestration with Airflow

TaskFlow API (Airflow 3.0)

We use modern Airflow decorators:

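from airflow.decorators import dag  # Airflow 3 also exposes `dag` via airflow.sdk
from xo_foundry.tasks.extract_tasks import extract_gladly_data
from xo_foundry.tasks.stage_tasks import copy_and_standardize
from xo_foundry.tasks.snowflake_tasks import copy_to_snowflake

@dag(schedule="50 6 * * *", catchup=False)
def warbyparker_gladly_daily_dag():
    # Passing each task's output to the next sets both data flow and task order.
    ingest = extract_gladly_data(...)      # Source system → S3 ingest bucket
    stage = copy_and_standardize(ingest)   # S3 ingest → S3 stage bucket
    copy_to_snowflake(stage)               # S3 stage → Snowflake BRONZE

# Calling the decorated function creates the DAG when Airflow imports this file.
warbyparker_gladly_daily_dag()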

Deployment

  • Location: apps/airflow/xo-pipelines/
  • Local: astro dev start
  • Production: astro deploy <deployment-id>

Key Technologies

| Component         | Technology         | Purpose                         |
|-------------------|--------------------|---------------------------------|
| Language          | Python 3.12+       | Core platform                   |
| Package Manager   | uv                 | Fast dependency management      |
| Orchestration     | Apache Airflow 3.0 | Pipeline scheduling             |
| Data Warehouse    | Snowflake          | Data storage and transformation |
| Object Storage    | AWS S3             | File staging                    |
| Type Checking     | ty                 | Static type analysis            |
| Linting           | ruff               | Code quality                    |
| Config Validation | Pydantic           | YAML schema validation          |
| Template Engine   | Jinja2             | DAG code generation             |
| Schema Management | schemachange       | Snowflake migrations            |
| Transformations   | dbt                | SILVER/GOLD layer SQL           |

Security

Credentials Management

  • Environment variables for local development (.env)
  • Airflow Connections for deployed pipelines
  • AWS Secrets Manager for production credentials
  • Never commit credentials to git

Access Control

  • Snowflake RBAC with role hierarchy
  • S3 bucket policies for data isolation
  • Airflow role-based UI access
