
xo-foundry Roadmap

Status: Active Development
Current Version: 0.4.0
Last Updated: 2025-12-03

Overview

This document tracks the development roadmap for xo-foundry, the Airflow DAG generation and orchestration package for xo-data pipelines.

✅ Phase 1: Core DAG Factory (COMPLETE)

Status: Deployed to feature/xo-foundry-dag-factory branch
Completion Date: 2025-12-03

Completed Features

1. Path Builder Module ✅

  • S3 path generation with load strategy support
  • Three load strategies: full_refresh, incremental, historical
  • Separate builders for ingest and stage paths
  • Industry-standard terminology for pipeline layers (ingest, stage)

Files:
  • packages/xo-foundry/src/xo_foundry/dag_factory/builders/path_builder.py
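
The idea can be sketched as follows; function names and the exact path layout here are illustrative, not the actual path_builder.py API:

```python
from datetime import date

# Hypothetical sketch of a load-strategy-aware S3 path builder. The real
# builder lives in dag_factory/builders/path_builder.py and its signatures
# may differ; the point is that the load strategy is encoded in the path,
# which is what gives downstream consumers clear lineage.
VALID_STRATEGIES = {"full_refresh", "incremental", "historical"}


def build_ingest_path(bucket: str, source: str, table: str,
                      load_strategy: str, run_date: date) -> str:
    """Return an S3 prefix for raw ingested data."""
    if load_strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown load strategy: {load_strategy}")
    return (f"s3://{bucket}/ingest/{source}/{table}/"
            f"{load_strategy}/{run_date:%Y/%m/%d}/")


def build_stage_path(bucket: str, source: str, table: str,
                     load_strategy: str, run_date: date) -> str:
    """Return an S3 prefix for staged (load-ready) data."""
    if load_strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown load strategy: {load_strategy}")
    return (f"s3://{bucket}/stage/{source}/{table}/"
            f"{load_strategy}/{run_date:%Y/%m/%d}/")
```

Because the strategy sits in a fixed path segment, a full-refresh run can never silently overwrite incremental partitions.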

2. Pydantic Schemas ✅

  • Full YAML configuration validation
  • Type-safe with mypy (zero errors)
  • Support for multiple source types and pipeline types

Files:
  • packages/xo-foundry/src/xo_foundry/schemas/dag_config.py
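
The validation pattern looks roughly like this. Note this is a stdlib dataclass sketch only, since the real schemas use Pydantic in schemas/dag_config.py; all field names and allowed values here are hypothetical:

```python
from dataclasses import dataclass

# Illustrative stand-in for the Pydantic models: reject bad configs at
# parse time, before any DAG is generated. Field names and enum values
# are assumptions, not the real dag_config.py schema.
ALLOWED_SOURCE_TYPES = {"gladly_api", "gmail", "gsheet", "s3"}
ALLOWED_STRATEGIES = {"full_refresh", "incremental", "historical"}


@dataclass
class SourceConfig:
    name: str
    source_type: str
    load_strategy: str = "full_refresh"

    def __post_init__(self) -> None:
        if self.source_type not in ALLOWED_SOURCE_TYPES:
            raise ValueError(f"Unsupported source_type: {self.source_type}")
        if self.load_strategy not in ALLOWED_STRATEGIES:
            raise ValueError(f"Unsupported load_strategy: {self.load_strategy}")
```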

3. DAG Generator ✅

  • YAML → Python DAG generation
  • Template-based approach using Jinja2
  • Batch generation support
  • Reproducible DAG generation

Files:
  • packages/xo-foundry/src/xo_foundry/dag_factory/factory.py
  • packages/xo-foundry/src/xo_foundry/dag_factory/templates/snowflake_load.py.j2
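
The rendering step reduces to "validated config in, Python DAG source out". The real factory renders Jinja2 templates (snowflake_load.py.j2); the sketch below uses stdlib string.Template purely to stay self-contained, and the config keys are hypothetical:

```python
from string import Template

# Minimal stand-in for the Jinja2-based factory: substitute config values
# into a DAG source template. Deterministic input -> deterministic output
# is what makes generation reproducible.
DAG_TEMPLATE = Template('''\
from airflow.decorators import dag

@dag(dag_id="$dag_id", schedule="$schedule", catchup=False)
def $dag_id():
    ...

$dag_id()
''')


def render_dag(config: dict) -> str:
    """Render DAG source code from a validated config dict."""
    return DAG_TEMPLATE.substitute(
        dag_id=config["dag_id"], schedule=config["schedule"])
```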

4. CLI Tool ✅

  • xo-foundry command-line interface
  • Subcommands: generate-dag, generate-dags, validate-config
  • Easy CI/CD integration

Files:
  • packages/xo-foundry/src/xo_foundry/cli/generate_dags.py

5. Load Strategy Integration ✅

  • Updated extract tasks to use path builder
  • Updated stage tasks to use path builder
  • S3 paths include load strategy
  • Clear data lineage

Files:
  • packages/xo-foundry/src/xo_foundry/tasks/extract_tasks.py
  • packages/xo-foundry/src/xo_foundry/tasks/stage_tasks.py

6. Working Reference Implementation ✅

  • warbyparker_timestamps_daily DAG
  • Gladly API extraction
  • Parallel loading
  • Production-ready

Files:
  • apps/airflow/xo-pipelines/dags/warbyparker_timestamps_daily.py
  • packages/xo-foundry/configs/warbyparker-timestamps.yaml

🚧 Phase 2: Additional Extractors (IN PROGRESS)

Priority: High
Target: Q1 2025

2.1 Gmail Extractor Task ⏳

Status: Not Started
Dependencies: xo-core Gmail extractor (exists)

Requirements:
  • Create extract_gmail_data() task in extract_tasks.py
  • Support label filtering
  • Handle attachment downloads
  • Extract email metadata (subject, from, to, date)
  • Support both ingest-only and full pipeline modes

Acceptance Criteria:
  - [ ] Task wrapper for GmailExtractor from xo-core
  - [ ] YAML configuration schema for Gmail sources
  - [ ] Example YAML config
  - [ ] Integration test with local Airflow
  - [ ] Documentation in task docstring

Estimated Effort: 1 day

2.2 Google Sheets Extractor Task ⏳

Status: Not Started
Dependencies: xo-core Google Sheets extractor (exists)

Requirements:
  • Create extract_gsheet_data() task in extract_tasks.py
  • Support range specifications
  • Handle sheet-level extraction
  • Support incremental load strategy (full pulls)
  • Proper column type detection

Acceptance Criteria:
  - [ ] Task wrapper for GoogleSheetsExtractor from xo-core
  - [ ] YAML configuration schema for GSheet sources
  - [ ] Example YAML config
  - [ ] Integration test with local Airflow
  - [ ] Documentation in task docstring

Estimated Effort: 1 day

2.3 S3 File Extractor Task ⏳

Status: Not Started
Dependencies: xo-core S3 utilities (exist)

Requirements:
  • Create extract_s3_files() task in extract_tasks.py
  • Support glob patterns for file matching
  • Handle multiple file formats (CSV, JSON, Parquet)
  • Support cross-bucket operations
  • Metadata extraction (file size, last modified)

Acceptance Criteria:
  - [ ] Task for S3 file discovery and processing
  - [ ] YAML configuration schema for S3 sources
  - [ ] Example YAML config
  - [ ] Integration test with local Airflow
  - [ ] Documentation in task docstring

Estimated Effort: 2 days
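
The glob-matching requirement amounts to filtering a listing of object keys against a pattern. A sketch, assuming keys come from something like boto3's list_objects_v2 paginator (not shown, so this stays self-contained):

```python
from fnmatch import fnmatch

# Hypothetical sketch of glob-based S3 key filtering for the extractor.
# Note fnmatch's "*" also crosses "/" (unlike shell globbing), so
# "raw/*.csv" matches keys in nested prefixes too -- a deliberate choice
# to illustrate; the real task may want stricter prefix handling.
def match_keys(keys: list[str], pattern: str) -> list[str]:
    return [k for k in keys if fnmatch(k, pattern)]
```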

2.4 Generic API Extractor Task ⏳

Status: Not Started
Dependencies: None

Requirements:
  • Create extract_api_data() task for generic REST APIs
  • Support configurable authentication (API key, OAuth, Basic)
  • Pagination support (offset, cursor, page-based)
  • Request/response logging
  • Retry logic with exponential backoff

Acceptance Criteria:
  - [ ] Generic API task with flexible configuration
  - [ ] YAML configuration schema for API sources
  - [ ] Example YAML configs (multiple auth types)
  - [ ] Integration test with mock API
  - [ ] Documentation in task docstring

Estimated Effort: 3 days
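
The pagination-plus-retry core of this task can be sketched independently of any HTTP client. Here fetch_page is injected (which is also what makes the mock-API integration test cheap); all names are hypothetical:

```python
import time

# Sketch of offset pagination with exponential-backoff retries for the
# generic API extractor. fetch_page(offset, limit) is an injected callable
# standing in for the real HTTP call; cursor and page-based pagination
# would follow the same loop shape with a different continuation token.
def extract_api_data(fetch_page, page_size: int = 100,
                     max_retries: int = 3, base_delay: float = 0.01) -> list:
    rows, offset = [], 0
    while True:
        for attempt in range(max_retries):
            try:
                page = fetch_page(offset=offset, limit=page_size)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # retries exhausted: surface the error
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        if not page:          # empty page signals the end of the dataset
            return rows
        rows.extend(page)
        offset += len(page)
```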

🔮 Phase 3: dbt Integration (PLANNED)

Priority: High
Target: Q1 2025

3.1 dbt Cloud API Integration ⏳

Status: Not Started
Dependencies: dbt Cloud account and API access

Requirements:
  • Replace placeholder trigger_dbt_run() with dbt Cloud API calls
  • Support job triggering with parameters
  • Poll for job completion
  • Retrieve run artifacts (logs, results)
  • Handle failures gracefully

Acceptance Criteria:
  - [ ] trigger_dbt_cloud_job() task
  - [ ] Configuration schema for dbt Cloud settings
  - [ ] Run status polling with timeout
  - [ ] Artifact retrieval and logging
  - [ ] Example YAML config with dbt enabled
  - [ ] Integration test with dbt Cloud sandbox

Estimated Effort: 3 days
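
The "poll for job completion with timeout" piece looks roughly like this. get_status is injected in place of the real dbt Cloud runs API call; the status strings and signatures are assumptions for the sketch:

```python
import time

# Sketch of the run-status polling loop trigger_dbt_cloud_job() would
# need: poll until a terminal state, raising on failure or on timeout.
def wait_for_run(get_status, timeout: float = 60.0,
                 poll_interval: float = 0.01) -> str:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == "success":
            return status
        if status in ("error", "cancelled"):
            raise RuntimeError(f"dbt run finished with status: {status}")
        time.sleep(poll_interval)  # still running: back off and re-poll
    raise TimeoutError("dbt run did not finish before the timeout")
```

Polling with a hard deadline (rather than looping forever) keeps a hung dbt job from stalling the whole DAG past its SLA.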

3.2 dbt Core Integration (Alternative) ⏳

Status: Not Started
Dependencies: dbt-core package

Requirements:
  • Support local dbt execution via BashOperator
  • Handle dbt project discovery
  • Parse dbt run results
  • Support selective model execution
  • Environment-specific target configuration

Acceptance Criteria:
  - [ ] trigger_dbt_core() task
  - [ ] Configuration schema for dbt Core settings
  - [ ] Result parsing and error handling
  - [ ] Example YAML config with dbt Core
  - [ ] Integration test with sample dbt project

Estimated Effort: 2 days

3.3 dbt Metadata Integration 🔮

Status: Future Consideration
Priority: Medium

Ideas:
  • Generate YAML configs from dbt source definitions
  • Bidirectional sync between DAG configs and dbt sources
  • Automatic lineage documentation
  • Data quality test integration

🌟 Phase 4: Additional Pipeline Templates (PLANNED)

Priority: Medium
Target: Q2 2025

4.1 Data Export Template ⏳

Status: Not Started
Dependencies: None

Requirements:
  • Jinja2 template for Snowflake → External system pipelines
  • Support API exports (POST requests with data)
  • Support S3 exports (Snowflake COPY TO)
  • Support SFTP exports
  • Incremental export support (watermarking)

Acceptance Criteria:
  - [ ] data_export.py.j2 template
  - [ ] Configuration schema for export pipelines
  - [ ] Example YAML configs (API, S3, SFTP)
  - [ ] Generated DAG validation
  - [ ] Integration test with mock endpoints

Estimated Effort: 4 days
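
The watermarking requirement reduces to: export only rows newer than the last successfully exported watermark, then advance it. A sketch, with the "updated_at" column name as a hypothetical:

```python
# Sketch of incremental export watermarking. The watermark is the highest
# updated_at value already exported; it should only be advanced after the
# destination confirms the export succeeded, so failed runs re-send
# rather than drop rows.
def rows_to_export(rows: list[dict], watermark) -> list[dict]:
    return [r for r in rows if r["updated_at"] > watermark]


def advance_watermark(exported_rows: list[dict], watermark):
    return max((r["updated_at"] for r in exported_rows), default=watermark)
```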

4.2 Hybrid Pipeline Template ⏳

Status: Not Started
Dependencies: None

Requirements:
  • Template for multi-source → multi-destination pipelines
  • Support mixed operations (import + export in same DAG)
  • Complex task dependencies
  • Parallel branch execution
  • Conditional path execution

Acceptance Criteria:
  - [ ] hybrid.py.j2 template
  - [ ] Configuration schema for hybrid pipelines
  - [ ] Example YAML config
  - [ ] Generated DAG validation
  - [ ] Integration test

Estimated Effort: 5 days

4.3 Reverse ETL Template 🔮

Status: Future Consideration
Priority: Low

Ideas:
  • Snowflake → Operational systems (Salesforce, HubSpot, etc.)
  • Audience sync patterns
  • Change detection and delta exports
  • Conflict resolution strategies

🔧 Phase 5: Quality & Observability (PLANNED)

Priority: Medium
Target: Q2 2025

5.1 Data Quality Checks ⏳

Status: Not Started
Dependencies: None

Requirements:
  • Pre-load data quality validation
  • Row count checks
  • Schema validation
  • Custom quality rules (configurable)
  • Failure handling (warn vs. fail)

Acceptance Criteria:
  - [ ] validate_data_quality() task
  - [ ] Configuration schema for quality checks
  - [ ] Example checks (row count, nulls, duplicates)
  - [ ] Integration with alerting
  - [ ] Documentation

Estimated Effort: 3 days
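
The warn-vs-fail handling described above can be sketched like this; the parameter names (min_rows, required_fields, on_failure) are hypothetical config fields, not the eventual schema:

```python
# Sketch of configurable pre-load quality checks. Each rule appends a
# problem description; "fail" raises before the load runs, while "warn"
# returns the problems so an alerting task can report them.
def validate_data_quality(rows: list[dict], min_rows: int = 1,
                          required_fields: tuple = (),
                          on_failure: str = "fail") -> list[str]:
    problems = []
    if len(rows) < min_rows:
        problems.append(f"row count {len(rows)} below minimum {min_rows}")
    for name in required_fields:
        nulls = sum(1 for r in rows if r.get(name) is None)
        if nulls:
            problems.append(f"{nulls} null values in required field {name!r}")
    if problems and on_failure == "fail":
        raise ValueError("; ".join(problems))
    return problems
```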

5.2 Monitoring & Alerting ⏳

Status: Not Started
Dependencies: Alerting platform (Slack, PagerDuty, etc.)

Requirements:
  • Task-level success/failure notifications
  • Pipeline-level SLA monitoring
  • Custom alerting rules
  • Slack integration
  • Email integration

Acceptance Criteria:
  - [ ] Alerting task wrappers
  - [ ] Configuration schema for alerts
  - [ ] Example YAML configs
  - [ ] Integration test with mock alerting
  - [ ] Documentation

Estimated Effort: 3 days

5.3 Lineage Tracking 🔮

Status: Future Consideration
Priority: Low

Ideas:
  • Automatic lineage diagram generation
  • Integration with data catalogs (Atlan, Alation)
  • Source → Target mapping documentation
  • dbt lineage integration

📦 Phase 6: Developer Experience (PLANNED)

Priority: Low
Target: Q2 2025

6.1 VS Code Extension 🔮

Status: Future Consideration

Ideas:
  • YAML syntax highlighting for xo-foundry configs
  • Auto-completion for config fields
  • Inline validation errors
  • "Generate DAG" command from editor

6.2 Web UI 🔮

Status: Future Consideration

Ideas:
  • Browser-based DAG configuration builder
  • Visual pipeline designer
  • Config validation and preview
  • One-click DAG generation

6.3 Testing Framework ⏳

Status: Not Started
Priority: Medium

Requirements:
  • Unit test utilities for DAG validation
  • Mock data generators
  • Integration test helpers
  • CI/CD test runner

Estimated Effort: 4 days

🐛 Bug Fixes & Tech Debt

Known Issues

None currently identified.

Technical Debt

  1. Pydantic Schema Warning: schema field name conflicts with BaseModel
     • Status: Workaround implemented (using schema_ with alias)
     • Future: Consider renaming to snowflake_schema in next major version

  2. Template Duplication: Only one template exists (snowflake_load.py.j2)
     • Status: Acceptable for Phase 1
     • Future: Extract common patterns when adding more templates

Version History

v0.4.0 (2025-12-03) - Current

  • ✅ DAG factory with YAML generation
  • ✅ Load strategy path builder
  • ✅ Pydantic schemas
  • ✅ CLI tool
  • ✅ Gladly API extractor task
  • ✅ Snowflake load tasks

Future Versions

  • v0.5.0 (Q1 2025): Additional extractors (Gmail, GSheets, S3, API)
  • v0.6.0 (Q1 2025): dbt integration (Cloud or Core)
  • v0.7.0 (Q2 2025): Data export templates
  • v0.8.0 (Q2 2025): Quality & observability features
  • v1.0.0 (Q3 2025): Stable release with comprehensive features

Contributing

When adding new features:

  1. Update Schemas: Add/modify Pydantic schemas in schemas/dag_config.py
  2. Create Tests: Write unit tests for new functionality
  3. Update Templates: Modify or create Jinja2 templates
  4. Update Docs: Add examples and documentation
  5. Update This Roadmap: Mark features as complete and add new items

Related Documentation

  • Architecture: .claude/ongoing/archived/2025-12-03-dag-factory-implementation/
  • Load Strategies: .claude/ongoing/reference/data-refresh-patterns.md
  • Main Docs: .claude/CLAUDE.md