Architecture Overview¶
XO-Data is a modern, monorepo-based data engineering platform built for scalability, reusability, and maintainability.
High-Level Architecture¶
┌─────────────────────────────────────────────────────────────┐
│ Data Sources │
│ │
│ • Gladly API (customer service data) │
│ • Sprout Social API (social media data) │
│ • Gmail (email attachments) │
│ • Google Sheets (manual data entry) │
│ • S3 (file uploads) │
└──────────────────────┬──────────────────────────────────────┘
│
│ Extract (xo-foundry tasks)
▼
┌─────────────────────────────────────────────────────────────┐
│ S3 Staging Layer │
│ │
│ Ingest Bucket → Stage Bucket │
│ • Copy-then-Peek pattern (8KB header read) │
│ • Standardize column names │
│ • Load strategy path segmentation │
└──────────────────────┬──────────────────────────────────────┘
│
│ Load (TRUNCATE + COPY INTO)
▼
┌─────────────────────────────────────────────────────────────┐
│ Snowflake Medallion Architecture │
│ │
│ BRONZE Layer (Raw Data) │
│ • All VARCHAR, truncated daily, 6 metadata columns │
│ • Tables: GLADLY_CONTACT_TIMESTAMPS, SPROUT_MESSAGES │
│ ▼ │
│ SILVER Layer (Historical Preservation) │
│ • Proper data types, no enrichment, no filtering │
│ • Tables: CONTACT_TIMESTAMPS, MESSAGES │
│ ▼ │
│ GOLD Layer (Analytics - 4 Types) │
│ • fct_contacts, dim_agents, agg_agent_daily, rpt_dashboard │
└─────────────────────────────────────────────────────────────┘
Core Design Principles¶
1. Separation of Concerns¶
The platform separates concerns into distinct layers:
- Extraction (xo-core): Data source connectors and extractors
- Orchestration (xo-foundry): DAG Factory, Airflow tasks, pipeline configuration
- Storage (Snowflake): Medallion architecture for data quality
- Analytics (xo-lens): BI tools and visualizations
- Navigation (xo-bosun): Monorepo CLI for developer productivity
2. Reusable Components¶
Common utilities are packaged for reuse:
- xo-core: Foundation package with extractors, managers, utilities
- xo-foundry: DAG Factory, task library, time windows, CLI tools
- xo-lens: Analytics and visualization tools
- xo-bosun: Monorepo navigation CLI
3. Configuration-Driven Pipelines (DAG Factory)¶
Pipelines are defined through YAML configuration and generated into Python DAGs:
```yaml
dag:
  domain: warbyparker
  pipeline_name: gladly_daily
  pipeline_type: snowflake_load
  schedule: "50 6 * * *"
  time_window:
    refresh_type: daily
    lag: { days: 1 }
    timezone: "America/New_York"

globals:
  snowflake:
    database: WBP_DB_DEV
    schema: BRONZE

sources:
  contact_timestamps:
    source_type: gladly_api
    load_strategy: full_refresh
    extractor:
      metric_set: ContactTimestampsReport
    snowflake:
      target_table: GLADLY_CONTACT_TIMESTAMPS
```
4. Type Safety¶
All code must pass `ty` with zero errors:
- Modern Python typing (`list[str]`, `dict[str, Any]`)
- Type hints on all functions
- Pydantic models for configuration validation
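As a sketch of how Pydantic validation of the YAML configuration above might look, the model below mirrors a subset of the `dag` block. The class and field names here are illustrative assumptions; the real schemas live in `xo-foundry`'s `schemas/` package and may differ.

```python
from typing import Literal

from pydantic import BaseModel


class TimeWindowConfig(BaseModel):
    # Illustrative fields mirroring the YAML example; not the actual
    # xo_foundry.schemas models.
    refresh_type: Literal["daily", "intraday_relative", "intraday_absolute"]
    lag: dict[str, int]
    timezone: str = "UTC"


class DagConfig(BaseModel):
    domain: str
    pipeline_name: str
    pipeline_type: str
    schedule: str
    time_window: TimeWindowConfig


# Nested dicts (e.g. parsed YAML) are validated into typed models,
# so a typo like refresh_type: "dialy" fails loudly at load time.
cfg = DagConfig(
    domain="warbyparker",
    pipeline_name="gladly_daily",
    pipeline_type="snowflake_load",
    schedule="50 6 * * *",
    time_window={
        "refresh_type": "daily",
        "lag": {"days": 1},
        "timezone": "America/New_York",
    },
)
```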
Monorepo Structure¶
xo-data/
├── packages/ # Reusable Python packages
│ ├── xo-core/ # Foundation utilities
│ │ ├── extractors/ # Data source connectors
│ │ ├── processors/ # Data transformations
│ │ ├── loaders/ # Data loading
│ │ └── utils/ # Shared utilities
│ │
│ ├── xo-foundry/ # Orchestration layer
│ │ ├── dag_factory/ # YAML → Python DAG generation
│ │ ├── tasks/ # Airflow task library
│ │ ├── schemas/ # Pydantic config models
│ │ ├── time_window/ # Time window management
│ │ └── cli/ # CLI tools
│ │
│ ├── xo-lens/ # Analytics layer
│ │ ├── dashboards/ # Streamlit apps
│ │ └── notebooks/ # Jupyter analysis
│ │
│ └── xo-bosun/ # Monorepo navigation CLI
│ └── cli/ # xo cd, xo list, xo setup
│
├── apps/ # Deployment targets
│ ├── airflow/xo-pipelines/ # Airflow deployment (DAGs + configs)
│ ├── snowflake-schema/ # Snowflake schema migrations
│ └── material-mkdocs/ # This documentation
│
└── .claude/ # Project documentation & ADRs
└── ongoing/ # Active documentation
Package Dependencies¶
xo-lens (Analytics)
└── xo-core (Utilities)
xo-foundry (Orchestration)
└── xo-core (Utilities)
xo-core (Foundation)
└── pandas, snowflake-connector, boto3, etc.
xo-bosun (CLI)
└── typer (standalone)
Key Principle: Packages can depend on xo-core, but should not depend on each other (except through xo-core).
Data Flow Pattern¶
ELT Workflow¶
All pipelines follow a standard Extract → Stage → Load → Transform pattern:
1. Extract
Source System → S3 Ingest Bucket
• API calls (Gladly, Sprout Social, Gmail, etc.)
• Native Python csv.DictWriter (never pandas)
• Original column names preserved
2. Stage
S3 Ingest → S3 Stage Bucket
• Copy-then-Peek pattern (S3-to-S3 copy + 8KB header read)
• Standardize column names (UPPERCASE)
• Load strategy path segmentation
3. Load
S3 Stage → Snowflake BRONZE
• TRUNCATE + COPY INTO with FORCE = TRUE in transaction
• All VARCHAR columns + 6 metadata columns
• Idempotent (same result every run)
4. Transform
BRONZE → SILVER → GOLD
• dbt transformations
• Silver: type conversions, historical preservation
• Gold: enrichment, aggregation, reporting views
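The Load step above can be sketched as the SQL sequence a task would issue. This is an illustrative statement builder, not the real implementation (which lives in `xo-foundry`'s Snowflake task library); the function name and file-format options are assumptions.

```python
def build_load_statements(table: str, stage_path: str) -> list[str]:
    """Hypothetical sketch of the BRONZE load: TRUNCATE + COPY INTO
    with FORCE = TRUE, wrapped in one transaction so a failed COPY
    never leaves the table empty. Re-running produces the same result
    (idempotent), because FORCE re-loads files already in Snowflake's
    load history.
    """
    return [
        "BEGIN",
        f"TRUNCATE TABLE {table}",
        (
            f"COPY INTO {table} FROM {stage_path} "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1) "
            "FORCE = TRUE"  # re-load even previously loaded files
        ),
        "COMMIT",
    ]
```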
DAG Factory¶
The DAG Factory converts YAML configurations into production-ready Airflow DAGs:
Learn more about DAG Factory →
Load Strategies¶
Three strategies per ADR 001:
| Strategy | Description | Use Case |
|---|---|---|
| `full_refresh` | Immutable daily snapshots | Most common (Gladly reports) |
| `incremental` | Full pulls with warehouse dedup | Google Sheets |
| `historical` | Late-arriving data, SCD Type 2 | Avoid when possible |
Time Windows¶
Centralized time window management supports:
- Daily: Single date (execution date minus lag)
- Intraday Relative: Window from now minus lookback to now minus lag
- Intraday Absolute: Fixed start/end times
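The first two window types can be sketched in a few lines. These function names are hypothetical (the real logic lives in `xo-foundry`'s `time_window/` package) and the sketch ignores timezone handling for brevity.

```python
from datetime import date, datetime, timedelta


def daily_window(execution_date: date, lag_days: int = 1) -> tuple[date, date]:
    """Daily: a single date, the execution date minus the lag."""
    target = execution_date - timedelta(days=lag_days)
    return target, target


def intraday_relative_window(
    now: datetime, lookback: timedelta, lag: timedelta
) -> tuple[datetime, datetime]:
    """Intraday relative: from (now - lookback) to (now - lag)."""
    return now - lookback, now - lag
```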
Copy-then-Peek Pattern¶
A performance optimization for S3-to-Snowflake operations:
```python
from xo_foundry.s3_utils import copy_and_peek_s3_file

# S3-to-S3 copy (fast, no download), then a range request for the
# first 8KB (headers only). Constant time (~0.5s) regardless of file size.
headers = copy_and_peek_s3_file(source_bucket, source_key, dest_bucket, dest_key)
```
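Under the hood, the pattern amounts to a server-side `copy_object` plus a ranged `get_object`. The sketch below is an assumed implementation (the real one is `copy_and_peek_s3_file`); it takes any S3-compatible client so the two calls are the only S3 dependency.

```python
def copy_and_peek(s3, src_bucket: str, src_key: str,
                  dst_bucket: str, dst_key: str) -> list[str]:
    """Illustrative Copy-then-Peek: server-side S3-to-S3 copy (no
    download, constant time for any file size), then a ranged GET of
    the first 8 KB, which is enough to read the CSV header row.
    """
    s3.copy_object(
        Bucket=dst_bucket,
        Key=dst_key,
        CopySource={"Bucket": src_bucket, "Key": src_key},
    )
    head = s3.get_object(Bucket=dst_bucket, Key=dst_key, Range="bytes=0-8191")
    first_line = head["Body"].read().split(b"\n", 1)[0].decode("utf-8")
    # Standardize column names to UPPERCASE, as the stage layer does.
    return [c.strip().strip('"').upper() for c in first_line.split(",")]
```

With boto3, `s3` would be `boto3.client("s3")`; both `copy_object` and `get_object` (with a `Range` header) are standard S3 client operations.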
Learn more about Copy-then-Peek →
Snowflake Architecture¶
Medallion Layers¶
| Layer | Purpose | Naming | Key Rules |
|---|---|---|---|
| BRONZE | Raw landing zone | `{SOURCE}_{OBJECT}` | All VARCHAR, truncated daily, 6 metadata columns |
| SILVER | Historical preservation | `{OBJECT}` | Typed, no enrichment, no filtering |
| GOLD | Analytics (4 types) | `fct_`, `dim_`, `agg_`, `rpt_` | Enriched, aggregated, consumption-ready |
Database Structure¶
WBP_DB (Warby Parker)
├── BRONZE.GLADLY_CONTACT_TIMESTAMPS
├── BRONZE.GLADLY_WORK_SESSIONS
├── BRONZE.SPROUT_MESSAGES
├── SILVER.CONTACT_TIMESTAMPS
├── SILVER.WORK_SESSIONS
├── GOLD.fct_contacts
├── GOLD.agg_agent_daily
└── GOLD.rpt_agent_dashboard
CND_DB (Conde Nast)
├── BRONZE.GLADLY_CONVERSATIONS
├── SILVER.CONVERSATIONS
└── GOLD.rpt_email_daily
CORE_DB (Shared)
├── BRONZE.BAMBOOHR_EMPLOYEES
├── SILVER.ROSTER_WARBYPARKER
├── SILVER.ROSTER_CONDENAST
└── GOLD.(cross-client dimensions)
Learn more about Medallion Architecture →
Orchestration with Airflow¶
TaskFlow API (Airflow 3.0)¶
We use modern Airflow decorators:
```python
from airflow.decorators import dag, task
from xo_foundry.tasks.extract_tasks import extract_gladly_data
from xo_foundry.tasks.stage_tasks import copy_and_standardize
from xo_foundry.tasks.snowflake_tasks import copy_to_snowflake


@dag(schedule="50 6 * * *", catchup=False)
def warbyparker_gladly_daily_dag():
    ingest = extract_gladly_data(...)
    stage = copy_and_standardize(ingest)
    load = copy_to_snowflake(stage)


warbyparker_gladly_daily_dag()
```
Deployment¶
- Location: `apps/airflow/xo-pipelines/`
- Local: `astro dev start`
- Production: `astro deploy <deployment-id>`
Key Technologies¶
| Component | Technology | Purpose |
|---|---|---|
| Language | Python 3.12+ | Core platform |
| Package Manager | uv | Fast dependency management |
| Orchestration | Apache Airflow 3.0 | Pipeline scheduling |
| Data Warehouse | Snowflake | Data storage and transformation |
| Object Storage | AWS S3 | File staging |
| Type Checking | ty | Static type analysis |
| Linting | ruff | Code quality |
| Config Validation | Pydantic | YAML schema validation |
| Template Engine | Jinja2 | DAG code generation |
| Schema Management | schemachange | Snowflake migrations |
| Transformations | dbt | SILVER/GOLD layer SQL |
Security¶
Credentials Management¶
- Environment variables for local development (`.env`)
- Airflow Connections for deployed pipelines
- AWS Secrets Manager for production credentials
- Never commit credentials to git
Access Control¶
- Snowflake RBAC with role hierarchy
- S3 bucket policies for data isolation
- Airflow role-based UI access
Next Steps¶
- ELT Pipeline Flow -- Detailed pipeline walkthrough
- ELT Layer Architecture -- Layer responsibilities
- xo-core Package -- Foundation utilities
- xo-foundry Package -- Orchestration layer
- Snowflake Medallion Architecture -- Data layers
- Naming Conventions -- Standards
- Architecture Decisions -- ADRs
See Also:
- Architecture Decisions -- All 10 ADRs
- Client Registry -- Client codes and domains