
xo-foundry Roadmap

Status: Active Development
Current Version: 0.4.0
Last Updated: 2025-12-03

Overview

This document tracks the development roadmap for xo-foundry, the Airflow DAG generation and orchestration package for xo-data pipelines.

✅ Phase 1: Core DAG Factory (COMPLETE)

Status: Deployed to feature/xo-foundry-dag-factory branch
Completion Date: 2025-12-03

Completed Features

1. Path Builder Module ✅

  • S3 path generation with load strategy support
  • Three load strategies: full_refresh, incremental, historical
  • Separate builders for ingest and stage paths
  • Industry-standard terminology for pipeline layers (ingest, stage)

Files:
  • packages/xo-foundry/src/xo_foundry/dag_factory/builders/path_builder.py
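
The idea can be sketched as follows; function names and the exact path layout here are illustrative, not the actual path_builder.py API:

```python
from datetime import date

# Hypothetical sketch of a load-strategy-aware S3 path builder. The real
# builder lives in dag_factory/builders/path_builder.py and its signatures
# may differ; the point is that the load strategy is encoded in the path,
# which is what gives downstream consumers clear lineage.
VALID_STRATEGIES = {"full_refresh", "incremental", "historical"}


def build_ingest_path(bucket: str, source: str, table: str,
                      load_strategy: str, run_date: date) -> str:
    """Return an S3 prefix for raw ingested data."""
    if load_strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown load strategy: {load_strategy}")
    return (f"s3://{bucket}/ingest/{source}/{table}/"
            f"{load_strategy}/{run_date:%Y/%m/%d}/")


def build_stage_path(bucket: str, source: str, table: str,
                     load_strategy: str, run_date: date) -> str:
    """Return an S3 prefix for staged (load-ready) data."""
    if load_strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown load strategy: {load_strategy}")
    return (f"s3://{bucket}/stage/{source}/{table}/"
            f"{load_strategy}/{run_date:%Y/%m/%d}/")
```

Because the strategy sits in a fixed path segment, a full-refresh run can never silently overwrite incremental partitions.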

2. Pydantic Schemas ✅

  • Full YAML configuration validation
  • Type-safe with mypy (zero errors)
  • Support for multiple source types and pipeline types

Files:
  • packages/xo-foundry/src/xo_foundry/schemas/dag_config.py
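
The validation pattern looks roughly like this. Note this is a stdlib dataclass sketch only, since the real schemas use Pydantic in schemas/dag_config.py; all field names and allowed values here are hypothetical:

```python
from dataclasses import dataclass

# Illustrative stand-in for the Pydantic models: reject bad configs at
# parse time, before any DAG is generated. Field names and enum values
# are assumptions, not the real dag_config.py schema.
ALLOWED_SOURCE_TYPES = {"gladly_api", "gmail", "gsheet", "s3"}
ALLOWED_STRATEGIES = {"full_refresh", "incremental", "historical"}


@dataclass
class SourceConfig:
    name: str
    source_type: str
    load_strategy: str = "full_refresh"

    def __post_init__(self) -> None:
        if self.source_type not in ALLOWED_SOURCE_TYPES:
            raise ValueError(f"Unsupported source_type: {self.source_type}")
        if self.load_strategy not in ALLOWED_STRATEGIES:
            raise ValueError(f"Unsupported load_strategy: {self.load_strategy}")
```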

3. DAG Generator ✅

  • YAML → Python DAG generation
  • Template-based approach using Jinja2
  • Batch generation support
  • Reproducible DAG generation

Files:
  • packages/xo-foundry/src/xo_foundry/dag_factory/factory.py
  • packages/xo-foundry/src/xo_foundry/dag_factory/templates/snowflake_load.py.j2
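
The rendering step reduces to "validated config in, Python DAG source out". The real factory renders Jinja2 templates (snowflake_load.py.j2); the sketch below uses stdlib string.Template purely to stay self-contained, and the config keys are hypothetical:

```python
from string import Template

# Minimal stand-in for the Jinja2-based factory: substitute config values
# into a DAG source template. Deterministic input -> deterministic output
# is what makes generation reproducible.
DAG_TEMPLATE = Template('''\
from airflow.decorators import dag

@dag(dag_id="$dag_id", schedule="$schedule", catchup=False)
def $dag_id():
    ...

$dag_id()
''')


def render_dag(config: dict) -> str:
    """Render DAG source code from a validated config dict."""
    return DAG_TEMPLATE.substitute(
        dag_id=config["dag_id"], schedule=config["schedule"])
```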

4. CLI Tool ✅

  • xo-foundry command-line interface
  • Subcommands: generate-dag, generate-dags, validate-config
  • Easy CI/CD integration

Files:
  • packages/xo-foundry/src/xo_foundry/cli/generate_dags.py

5. Load Strategy Integration ✅

  • Updated extract tasks to use path builder
  • Updated stage tasks to use path builder
  • S3 paths include load strategy
  • Clear data lineage

Files:
  • packages/xo-foundry/src/xo_foundry/tasks/extract_tasks.py
  • packages/xo-foundry/src/xo_foundry/tasks/stage_tasks.py

6. Working Reference Implementation ✅

  • warbyparker_timestamps_daily DAG
  • Gladly API extraction
  • Parallel loading
  • Production-ready

Files:
  • apps/airflow/xo-pipelines/dags/warbyparker_timestamps_daily.py
  • packages/xo-foundry/configs/warbyparker-timestamps.yaml

🚧 Phase 2: Additional Extractors (IN PROGRESS)

Priority: High
Target: Q1 2025

2.1 Gmail Extractor Task ⏳

Status: Not Started
Dependencies: xo-core Gmail extractor (exists)

Requirements:
  • Create extract_gmail_data() task in extract_tasks.py
  • Support label filtering
  • Handle attachment downloads
  • Extract email metadata (subject, from, to, date)
  • Support both ingest-only and full pipeline modes

Acceptance Criteria:
  - [ ] Task wrapper for GmailExtractor from xo-core
  - [ ] YAML configuration schema for Gmail sources
  - [ ] Example YAML config
  - [ ] Integration test with local Airflow
  - [ ] Documentation in task docstring

Estimated Effort: 1 day

2.2 Google Sheets Extractor Task ⏳

Status: Not Started
Dependencies: xo-core Google Sheets extractor (exists)

Requirements:
  • Create extract_gsheet_data() task in extract_tasks.py
  • Support range specifications
  • Handle sheet-level extraction
  • Support incremental load strategy (full pulls)
  • Proper column type detection

Acceptance Criteria:
  - [ ] Task wrapper for GoogleSheetsExtractor from xo-core
  - [ ] YAML configuration schema for GSheet sources
  - [ ] Example YAML config
  - [ ] Integration test with local Airflow
  - [ ] Documentation in task docstring

Estimated Effort: 1 day

2.3 S3 File Extractor Task ⏳

Status: Not Started
Dependencies: xo-core S3 utilities (exist)

Requirements:
  • Create extract_s3_files() task in extract_tasks.py
  • Support glob patterns for file matching
  • Handle multiple file formats (CSV, JSON, Parquet)
  • Support cross-bucket operations
  • Metadata extraction (file size, last modified)

Acceptance Criteria:
  - [ ] Task for S3 file discovery and processing
  - [ ] YAML configuration schema for S3 sources
  - [ ] Example YAML config
  - [ ] Integration test with local Airflow
  - [ ] Documentation in task docstring

Estimated Effort: 2 days
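
The glob-matching requirement amounts to filtering a listing of object keys against a pattern. A sketch, assuming keys come from something like boto3's list_objects_v2 paginator (not shown, so this stays self-contained):

```python
from fnmatch import fnmatch

# Hypothetical sketch of glob-based S3 key filtering for the extractor.
# Note fnmatch's "*" also crosses "/" (unlike shell globbing), so
# "raw/*.csv" matches keys in nested prefixes too -- a deliberate choice
# to illustrate; the real task may want stricter prefix handling.
def match_keys(keys: list[str], pattern: str) -> list[str]:
    return [k for k in keys if fnmatch(k, pattern)]
```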

2.4 Generic API Extractor Task ⏳

Status: Not Started
Dependencies: None

Requirements:
  • Create extract_api_data() task for generic REST APIs
  • Support configurable authentication (API key, OAuth, Basic)
  • Pagination support (offset, cursor, page-based)
  • Request/response logging
  • Retry logic with exponential backoff

Acceptance Criteria:
  - [ ] Generic API task with flexible configuration
  - [ ] YAML configuration schema for API sources
  - [ ] Example YAML configs (multiple auth types)
  - [ ] Integration test with mock API
  - [ ] Documentation in task docstring

Estimated Effort: 3 days
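
The pagination-plus-retry core of this task can be sketched independently of any HTTP client. Here fetch_page is injected (which is also what makes the mock-API integration test cheap); all names are hypothetical:

```python
import time

# Sketch of offset pagination with exponential-backoff retries for the
# generic API extractor. fetch_page(offset, limit) is an injected callable
# standing in for the real HTTP call; cursor and page-based pagination
# would follow the same loop shape with a different continuation token.
def extract_api_data(fetch_page, page_size: int = 100,
                     max_retries: int = 3, base_delay: float = 0.01) -> list:
    rows, offset = [], 0
    while True:
        for attempt in range(max_retries):
            try:
                page = fetch_page(offset=offset, limit=page_size)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # retries exhausted: surface the error
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        if not page:          # empty page signals the end of the dataset
            return rows
        rows.extend(page)
        offset += len(page)
```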

🔮 Phase 3: dbt Integration (PLANNED)

Priority: High
Target: Q1 2025

3.1 dbt Cloud API Integration ⏳

Status: Not Started
Dependencies: dbt Cloud account and API access

Requirements:
  • Replace placeholder trigger_dbt_run() with dbt Cloud API calls
  • Support job triggering with parameters
  • Poll for job completion
  • Retrieve run artifacts (logs, results)
  • Handle failures gracefully

Acceptance Criteria:
  - [ ] trigger_dbt_cloud_job() task
  - [ ] Configuration schema for dbt Cloud settings
  - [ ] Run status polling with timeout
  - [ ] Artifact retrieval and logging
  - [ ] Example YAML config with dbt enabled
  - [ ] Integration test with dbt Cloud sandbox

Estimated Effort: 3 days
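
The "poll for job completion with timeout" piece looks roughly like this. get_status is injected in place of the real dbt Cloud runs API call; the status strings and signatures are assumptions for the sketch:

```python
import time

# Sketch of the run-status polling loop trigger_dbt_cloud_job() would
# need: poll until a terminal state, raising on failure or on timeout.
def wait_for_run(get_status, timeout: float = 60.0,
                 poll_interval: float = 0.01) -> str:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == "success":
            return status
        if status in ("error", "cancelled"):
            raise RuntimeError(f"dbt run finished with status: {status}")
        time.sleep(poll_interval)  # still running: back off and re-poll
    raise TimeoutError("dbt run did not finish before the timeout")
```

Polling with a hard deadline (rather than looping forever) keeps a hung dbt job from stalling the whole DAG past its SLA.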

3.2 dbt Core Integration (Alternative) ⏳

Status: Not Started
Dependencies: dbt-core package

Requirements:
  • Support local dbt execution via BashOperator
  • Handle dbt project discovery
  • Parse dbt run results
  • Support selective model execution
  • Environment-specific target configuration

Acceptance Criteria:
  - [ ] trigger_dbt_core() task
  - [ ] Configuration schema for dbt Core settings
  - [ ] Result parsing and error handling
  - [ ] Example YAML config with dbt Core
  - [ ] Integration test with sample dbt project

Estimated Effort: 2 days

3.3 dbt Metadata Integration 🔮

Status: Future Consideration
Priority: Medium

Ideas:
  • Generate YAML configs from dbt source definitions
  • Bidirectional sync between DAG configs and dbt sources
  • Automatic lineage documentation
  • Data quality test integration

🌟 Phase 4: Additional Pipeline Templates (PLANNED)

Priority: Medium
Target: Q2 2025

4.1 Data Export Template ⏳

Status: Not Started
Dependencies: None

Requirements:
  • Jinja2 template for Snowflake → External system pipelines
  • Support API exports (POST requests with data)
  • Support S3 exports (Snowflake COPY TO)
  • Support SFTP exports
  • Incremental export support (watermarking)

Acceptance Criteria:
  - [ ] data_export.py.j2 template
  - [ ] Configuration schema for export pipelines
  - [ ] Example YAML configs (API, S3, SFTP)
  - [ ] Generated DAG validation
  - [ ] Integration test with mock endpoints

Estimated Effort: 4 days
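
The watermarking requirement reduces to: export only rows newer than the last successfully exported watermark, then advance it. A sketch, with the "updated_at" column name as a hypothetical:

```python
# Sketch of incremental export watermarking. The watermark is the highest
# updated_at value already exported; it should only be advanced after the
# destination confirms the export succeeded, so failed runs re-send
# rather than drop rows.
def rows_to_export(rows: list[dict], watermark) -> list[dict]:
    return [r for r in rows if r["updated_at"] > watermark]


def advance_watermark(exported_rows: list[dict], watermark):
    return max((r["updated_at"] for r in exported_rows), default=watermark)
```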

4.2 Hybrid Pipeline Template ⏳

Status: Not Started
Dependencies: None

Requirements:
  • Template for multi-source → multi-destination pipelines
  • Support mixed operations (import + export in same DAG)
  • Complex task dependencies
  • Parallel branch execution
  • Conditional path execution

Acceptance Criteria:
  - [ ] hybrid.py.j2 template
  - [ ] Configuration schema for hybrid pipelines
  - [ ] Example YAML config
  - [ ] Generated DAG validation
  - [ ] Integration test

Estimated Effort: 5 days

4.3 Reverse ETL Template 🔮

Status: Future Consideration
Priority: Low

Ideas:
  • Snowflake → Operational systems (Salesforce, HubSpot, etc.)
  • Audience sync patterns
  • Change detection and delta exports
  • Conflict resolution strategies

🔧 Phase 5: Quality & Observability (PLANNED)

Priority: Medium
Target: Q2 2025

5.1 Data Quality Checks ⏳

Status: Not Started
Dependencies: None

Requirements:
  • Pre-load data quality validation
  • Row count checks
  • Schema validation
  • Custom quality rules (configurable)
  • Failure handling (warn vs. fail)

Acceptance Criteria:
  - [ ] validate_data_quality() task
  - [ ] Configuration schema for quality checks
  - [ ] Example checks (row count, nulls, duplicates)
  - [ ] Integration with alerting
  - [ ] Documentation

Estimated Effort: 3 days
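
The warn-vs-fail handling described above can be sketched like this; the parameter names (min_rows, required_fields, on_failure) are hypothetical config fields, not the eventual schema:

```python
# Sketch of configurable pre-load quality checks. Each rule appends a
# problem description; "fail" raises before the load runs, while "warn"
# returns the problems so an alerting task can report them.
def validate_data_quality(rows: list[dict], min_rows: int = 1,
                          required_fields: tuple = (),
                          on_failure: str = "fail") -> list[str]:
    problems = []
    if len(rows) < min_rows:
        problems.append(f"row count {len(rows)} below minimum {min_rows}")
    for name in required_fields:
        nulls = sum(1 for r in rows if r.get(name) is None)
        if nulls:
            problems.append(f"{nulls} null values in required field {name!r}")
    if problems and on_failure == "fail":
        raise ValueError("; ".join(problems))
    return problems
```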

5.2 Monitoring & Alerting ⏳

Status: Not Started
Dependencies: Alerting platform (Slack, PagerDuty, etc.)

Requirements:
  • Task-level success/failure notifications
  • Pipeline-level SLA monitoring
  • Custom alerting rules
  • Slack integration
  • Email integration

Acceptance Criteria:
  - [ ] Alerting task wrappers
  - [ ] Configuration schema for alerts
  - [ ] Example YAML configs
  - [ ] Integration test with mock alerting
  - [ ] Documentation

Estimated Effort: 3 days

5.3 Lineage Tracking 🔮

Status: Future Consideration
Priority: Low

Ideas:
  • Automatic lineage diagram generation
  • Integration with data catalogs (Atlan, Alation)
  • Source → Target mapping documentation
  • dbt lineage integration

📦 Phase 6: Developer Experience (PLANNED)

Priority: Low
Target: Q2 2025

6.1 VS Code Extension 🔮

Status: Future Consideration

Ideas:
  • YAML syntax highlighting for xo-foundry configs
  • Auto-completion for config fields
  • Inline validation errors
  • "Generate DAG" command from editor

6.2 Web UI 🔮

Status: Future Consideration

Ideas:
  • Browser-based DAG configuration builder
  • Visual pipeline designer
  • Config validation and preview
  • One-click DAG generation

6.3 Testing Framework ⏳

Status: Not Started
Priority: Medium

Requirements:
  • Unit test utilities for DAG validation
  • Mock data generators
  • Integration test helpers
  • CI/CD test runner

Estimated Effort: 4 days

🐛 Bug Fixes & Tech Debt

Known Issues

None currently identified.

Technical Debt

  1. Pydantic Schema Warning: schema field name conflicts with BaseModel
     • Status: Workaround implemented (using schema_ with alias)
     • Future: Consider renaming to snowflake_schema in next major version

  2. Template Duplication: Only one template exists (snowflake_load.py.j2)
     • Status: Acceptable for Phase 1
     • Future: Extract common patterns when adding more templates

Version History

v0.4.0 (2025-12-03) - Current

  • ✅ DAG factory with YAML generation
  • ✅ Load strategy path builder
  • ✅ Pydantic schemas
  • ✅ CLI tool
  • ✅ Gladly API extractor task
  • ✅ Snowflake load tasks

Future Versions

  • v0.5.0 (Q1 2025): Additional extractors (Gmail, GSheets, S3, API)
  • v0.6.0 (Q1 2025): dbt integration (Cloud or Core)
  • v0.7.0 (Q2 2025): Data export templates
  • v0.8.0 (Q2 2025): Quality & observability features
  • v1.0.0 (Q3 2025): Stable release with comprehensive features

Contributing

When adding new features:

  1. Update Schemas: Add/modify Pydantic schemas in schemas/dag_config.py
  2. Create Tests: Write unit tests for new functionality
  3. Update Templates: Modify or create Jinja2 templates
  4. Update Docs: Add examples and documentation
  5. Update This Roadmap: Mark features as complete and add new items

Related Documentation

  • Architecture: .claude/ongoing/archived/2025-12-03-dag-factory-implementation/
  • Load Strategies: .claude/ongoing/reference/data-refresh-patterns.md
  • Main Docs: .claude/CLAUDE.md