xo-foundry Roadmap¶
Status: Active Development
Current Version: 0.4.0
Last Updated: 2025-12-03
Overview¶
This document tracks the development roadmap for xo-foundry, the Airflow DAG generation and orchestration package for xo-data pipelines.
✅ Phase 1: Core DAG Factory (COMPLETE)¶
Status: Deployed to feature/xo-foundry-dag-factory branch
Completion Date: 2025-12-03
Completed Features¶
1. Path Builder Module ✅¶
- S3 path generation with load strategy support
- Three load strategies: full_refresh, incremental, historical
- Separate builders for ingest and stage paths
- Industry-standard terminology
Files:
- packages/xo-foundry/src/xo_foundry/dag_factory/builders/path_builder.py
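This roadmap does not spell out the path layout, so the following is only a minimal sketch of how load-strategy-aware ingest and stage builders could look. The `layer/strategy/source/date` layout, the bucket name, and every function and class name here are assumptions for illustration, not the actual path_builder.py API.

```python
from datetime import date
from enum import Enum


class LoadStrategy(str, Enum):
    """The three load strategies named in the roadmap."""
    FULL_REFRESH = "full_refresh"
    INCREMENTAL = "incremental"
    HISTORICAL = "historical"


def _build_path(bucket: str, layer: str, source: str,
                strategy: LoadStrategy, run_date: date) -> str:
    # Hypothetical layout: the strategy sits in the prefix so lineage
    # is visible from the path alone.
    return (
        f"s3://{bucket}/{layer}/{strategy.value}/{source}/"
        f"{run_date:%Y/%m/%d}/"
    )


def build_ingest_path(bucket: str, source: str,
                      strategy: LoadStrategy, run_date: date) -> str:
    """Path for raw extracted data."""
    return _build_path(bucket, "ingest", source, strategy, run_date)


def build_stage_path(bucket: str, source: str,
                     strategy: LoadStrategy, run_date: date) -> str:
    """Path for data prepared for loading."""
    return _build_path(bucket, "stage", source, strategy, run_date)
```

Keeping the two public builders as thin wrappers over one private helper means the ingest and stage layouts cannot drift apart.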
2. Pydantic Schemas ✅¶
- Full YAML configuration validation
- Type-safe with mypy (zero errors)
- Support for multiple source types and pipeline types
Files:
- packages/xo-foundry/src/xo_foundry/schemas/dag_config.py
3. DAG Generator ✅¶
- YAML → Python DAG generation
- Template-based approach using Jinja2
- Batch generation support
- Reproducible DAG generation
Files:
- packages/xo-foundry/src/xo_foundry/dag_factory/factory.py
- packages/xo-foundry/src/xo_foundry/dag_factory/templates/snowflake_load.py.j2
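To illustrate the config-to-source rendering step, here is a minimal sketch. It deliberately swaps in the standard library's `string.Template` for Jinja2 so that it runs without dependencies, and it takes an already-parsed mapping instead of reading YAML. The config keys (`dag_id`, `schedule`, `load_strategy`, `tables`) and function names are hypothetical and do not reflect the real schema or factory.py.

```python
from string import Template

# Stand-in for a .j2 template; the real project uses Jinja2.
DAG_TEMPLATE = Template('''\
"""Generated by a DAG factory sketch. Do not edit by hand."""
DAG_ID = "$dag_id"
SCHEDULE = "$schedule"
LOAD_STRATEGY = "$load_strategy"
TABLES = $tables
''')


def render_dag(config: dict) -> str:
    """Render one DAG module from a parsed config mapping."""
    return DAG_TEMPLATE.substitute(
        dag_id=config["dag_id"],
        schedule=config["schedule"],
        load_strategy=config["load_strategy"],
        # Sorting keeps output byte-identical for equivalent configs,
        # which is one way to make generation reproducible.
        tables=repr(sorted(config["tables"])),
    )


def render_all(configs: list[dict]) -> dict[str, str]:
    """Batch generation: map each output filename to its rendered source."""
    return {f"{c['dag_id']}.py": render_dag(c) for c in configs}
```

Because `substitute` raises `KeyError` on a missing placeholder, an incomplete config fails at generation time instead of producing a broken DAG file.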
4. CLI Tool ✅¶
- xo-foundry command-line interface
- Subcommands: generate-dag, generate-dags, validate-config
- Easy CI/CD integration
Files:
- packages/xo-foundry/src/xo_foundry/cli/generate_dags.py
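The three subcommands could be wired up roughly as in the sketch below. Only the subcommand names come from this roadmap. The positional arguments and the `--output-dir` flag are assumptions made for illustration, and the real generate_dags.py may use a different CLI library altogether.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three subcommands listed in the roadmap."""
    parser = argparse.ArgumentParser(prog="xo-foundry")
    sub = parser.add_subparsers(dest="command", required=True)

    # Hypothetical arguments: one config in, one DAG file out.
    one = sub.add_parser("generate-dag", help="generate one DAG from one config")
    one.add_argument("config")
    one.add_argument("--output-dir", default="dags")

    # Hypothetical arguments: a directory of configs, batch generation.
    many = sub.add_parser("generate-dags", help="generate DAGs for every config")
    many.add_argument("config_dir")
    many.add_argument("--output-dir", default="dags")

    # Validation only, so it needs no output location.
    check = sub.add_parser("validate-config", help="validate a config file")
    check.add_argument("config")
    return parser
```

A CI job could call `validate-config` on every changed config and rely on a non-zero exit code to block the merge.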
5. Load Strategy Integration ✅¶
- Updated extract tasks to use path builder
- Updated stage tasks to use path builder
- S3 paths include load strategy
- Clear data lineage
Files:
- packages/xo-foundry/src/xo_foundry/tasks/extract_tasks.py
- packages/xo-foundry/src/xo_foundry/tasks/stage_tasks.py
6. Working Reference Implementation ✅¶
- warbyparker_timestamps_daily DAG
- Gladly API extraction
- Parallel loading
- Production-ready
Files:
- apps/airflow/xo-pipelines/dags/warbyparker_timestamps_daily.py
- packages/xo-foundry/configs/warbyparker-timestamps.yaml
🚧 Phase 2: Additional Extractors (IN PROGRESS)¶
Priority: High
Target: Q1 2025
2.1 Gmail Extractor Task ⏳¶
Status: Not Started
Dependencies: xo-core Gmail extractor (exists)
Requirements:
- Create extract_gmail_data() task in extract_tasks.py
- Support label filtering
- Handle attachment downloads
- Extract email metadata (subject, from, to, date)
- Support both ingest-only and full pipeline modes
Acceptance Criteria:
- [ ] Task wrapper for GmailExtractor from xo-core
- [ ] YAML configuration schema for Gmail sources
- [ ] Example YAML config
- [ ] Integration test with local Airflow
- [ ] Documentation in task docstring
Estimated Effort: 1 day
2.2 Google Sheets Extractor Task ⏳¶
Status: Not Started
Dependencies: xo-core Google Sheets extractor (exists)
Requirements:
- Create extract_gsheet_data() task in extract_tasks.py
- Support range specifications
- Handle sheet-level extraction
- Support full_refresh load strategy (full pulls)
- Proper column type detection
Acceptance Criteria:
- [ ] Task wrapper for GoogleSheetsExtractor from xo-core
- [ ] YAML configuration schema for GSheet sources
- [ ] Example YAML config
- [ ] Integration test with local Airflow
- [ ] Documentation in task docstring
Estimated Effort: 1 day
2.3 S3 File Extractor Task ⏳¶
Status: Not Started
Dependencies: xo-core S3 utilities (exist)
Requirements:
- Create extract_s3_files() task in extract_tasks.py
- Support glob patterns for file matching
- Handle multiple file formats (CSV, JSON, Parquet)
- Support cross-bucket operations
- Metadata extraction (file size, last modified)
Acceptance Criteria:
- [ ] Task for S3 file discovery and processing
- [ ] YAML configuration schema for S3 sources
- [ ] Example YAML config
- [ ] Integration test with local Airflow
- [ ] Documentation in task docstring
Estimated Effort: 2 days
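The glob-matching and format-detection requirements can be prototyped without touching S3 at all, as in the sketch below. It assumes the object keys have already been listed, and the function name and suffix-to-format table are hypothetical. Note that `fnmatch`-style `*` also matches `/`, which differs from shell globbing and may or may not be the behavior the final task wants.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Formats named in the requirements, keyed by file suffix.
FORMATS = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def match_keys(keys: list[str], pattern: str) -> list[tuple[str, str]]:
    """Return (key, format) pairs for keys that match the glob pattern
    and carry a supported suffix; everything else is skipped."""
    matched = []
    for key in keys:
        if not fnmatchcase(key, pattern):
            continue
        fmt = FORMATS.get(PurePosixPath(key).suffix.lower())
        if fmt is not None:
            matched.append((key, fmt))
    return matched
```

Skipping unsupported suffixes silently is a design choice for the sketch; a production task might prefer to log or fail on them.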
2.4 Generic API Extractor Task ⏳¶
Status: Not Started
Dependencies: None
Requirements:
- Create extract_api_data() task for generic REST APIs
- Support configurable authentication (API key, OAuth, Basic)
- Pagination support (offset, cursor, page-based)
- Request/response logging
- Retry logic with exponential backoff
Acceptance Criteria:
- [ ] Generic API task with flexible configuration
- [ ] YAML configuration schema for API sources
- [ ] Example YAML configs (multiple auth types)
- [ ] Integration test with mock API
- [ ] Documentation in task docstring
Estimated Effort: 3 days
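Two of the requirements above, retry with exponential backoff and cursor-based pagination, combine naturally. The sketch below keeps the HTTP call abstract by injecting a `fetch_page` callable, so it makes no claim about which HTTP client the task will use. The response shape (`records`, `next_cursor`) and all names are assumptions for illustration.

```python
import time
from typing import Any, Callable, Iterator, Optional

Page = dict[str, Any]


def with_backoff(call: Callable[[], Page], max_attempts: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Page:
    """Retry `call` on ConnectionError, doubling the delay each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def iter_records(fetch_page: Callable[[Optional[str]], Page],
                 sleep: Callable[[float], None] = time.sleep) -> Iterator[Any]:
    """Yield records page by page until no next cursor is returned."""
    cursor: Optional[str] = None
    while True:
        page = with_backoff(lambda: fetch_page(cursor), sleep=sleep)
        yield from page["records"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break
```

Injecting `sleep` lets an integration test against a mock API run without real delays.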
🔮 Phase 3: dbt Integration (PLANNED)¶
Priority: High
Target: Q1 2025
3.1 dbt Cloud API Integration ⏳¶
Status: Not Started
Dependencies: dbt Cloud account and API access
Requirements:
- Replace placeholder trigger_dbt_run() with dbt Cloud API calls
- Support job triggering with parameters
- Poll for job completion
- Retrieve run artifacts (logs, results)
- Handle failures gracefully
Acceptance Criteria:
- [ ] trigger_dbt_cloud_job() task
- [ ] Configuration schema for dbt Cloud settings
- [ ] Run status polling with timeout
- [ ] Artifact retrieval and logging
- [ ] Example YAML config with dbt enabled
- [ ] Integration test with dbt Cloud sandbox
Estimated Effort: 3 days
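The "run status polling with timeout" criterion could be approached along the lines of the sketch below. It does not call the dbt Cloud API: the status lookup is an injected callable, and the state names in `TERMINAL_STATES` are placeholders, not the API's actual status values. The function name and defaults are likewise assumptions.

```python
import time
from typing import Callable

# Placeholder state names; map the real API's statuses onto these.
TERMINAL_STATES = {"success", "error", "cancelled"}


def wait_for_run(get_status: Callable[[], str],
                 timeout_s: float = 3600.0,
                 poll_interval_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the run reaches a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # Give up if the next poll would land past the deadline.
        if clock() + poll_interval_s > deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_s}s")
        sleep(poll_interval_s)
```

The caller decides what to do with a terminal `"error"` or `"cancelled"`, which keeps failure handling separate from the polling loop.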
3.2 dbt Core Integration (Alternative) ⏳¶
Status: Not Started
Dependencies: dbt-core package
Requirements:
- Support local dbt execution via BashOperator
- Handle dbt project discovery
- Parse dbt run results
- Support selective model execution
- Environment-specific target configuration
Acceptance Criteria:
- [ ] trigger_dbt_core() task
- [ ] Configuration schema for dbt Core settings
- [ ] Result parsing and error handling
- [ ] Example YAML config with dbt Core
- [ ] Integration test with sample dbt project
Estimated Effort: 2 days
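For the result-parsing requirement, one option is to read the `run_results.json` artifact that dbt writes after a run. The sketch below assumes the artifact's top-level `results` list with `unique_id` and `status` fields on each entry. Treating `error` and `fail` as the failing statuses is an assumption to verify against the dbt version in use, and the function name is hypothetical.

```python
import json
from collections import Counter

# Assumed failing statuses; confirm against the dbt artifact schema in use.
FAILING = {"error", "fail"}


def summarize_run_results(raw: str) -> tuple[Counter, list[str]]:
    """Return per-status counts and the unique_ids of failing nodes."""
    results = json.loads(raw).get("results", [])
    counts = Counter(r["status"] for r in results)
    failed = [r["unique_id"] for r in results if r["status"] in FAILING]
    return counts, failed
```

A task wrapper could log the counts and raise if the failed list is non-empty, so the Airflow task state reflects the dbt outcome.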
3.3 dbt Metadata Integration 🔮¶
Status: Future Consideration
Priority: Medium
Ideas:
- Generate YAML configs from dbt source definitions
- Bidirectional sync between DAG configs and dbt sources
- Automatic lineage documentation
- Data quality test integration
🌟 Phase 4: Additional Pipeline Templates (PLANNED)¶
Priority: Medium
Target: Q2 2025
4.1 Data Export Template ⏳¶
Status: Not Started
Dependencies: None
Requirements:
- Jinja2 template for Snowflake → External system pipelines
- Support API exports (POST requests with data)
- Support S3 exports (Snowflake COPY INTO <location>)
- Support SFTP exports
- Incremental export support (watermarking)
Acceptance Criteria:
- [ ] data_export.py.j2 template
- [ ] Configuration schema for export pipelines
- [ ] Example YAML configs (API, S3, SFTP)
- [ ] Generated DAG validation
- [ ] Integration test with mock endpoints
Estimated Effort: 4 days
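Watermark-based incremental export boils down to "send only rows newer than the last exported timestamp, then advance the timestamp". The sketch below shows that selection step on in-memory rows. The `updated_at` column name and the function name are assumptions, and a real template would push the filter into the Snowflake query.

```python
from datetime import datetime
from typing import Iterable, Optional


def select_new_rows(
    rows: Iterable[dict], watermark: Optional[datetime]
) -> tuple[list[dict], Optional[datetime]]:
    """Return rows updated after the watermark, plus the new watermark.

    A watermark of None means nothing has been exported yet, so every
    row qualifies. If no rows qualify, the watermark is left unchanged.
    """
    new_rows = [
        r for r in rows
        if watermark is None or r["updated_at"] > watermark
    ]
    if not new_rows:
        return [], watermark
    return new_rows, max(r["updated_at"] for r in new_rows)
```

Persisting the new watermark only after the export succeeds means a failed run is retried with the same rows.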
4.2 Hybrid Pipeline Template ⏳¶
Status: Not Started
Dependencies: None
Requirements:
- Template for multi-source → multi-destination pipelines
- Support mixed operations (import + export in same DAG)
- Complex task dependencies
- Parallel branch execution
- Conditional path execution
Acceptance Criteria:
- [ ] hybrid.py.j2 template
- [ ] Configuration schema for hybrid pipelines
- [ ] Example YAML config
- [ ] Generated DAG validation
- [ ] Integration test
Estimated Effort: 5 days
4.3 Reverse ETL Template 🔮¶
Status: Future Consideration
Priority: Low
Ideas:
- Snowflake → Operational systems (Salesforce, HubSpot, etc.)
- Audience sync patterns
- Change detection and delta exports
- Conflict resolution strategies
🔧 Phase 5: Quality & Observability (PLANNED)¶
Priority: Medium
Target: Q2 2025
5.1 Data Quality Checks ⏳¶
Status: Not Started
Dependencies: None
Requirements:
- Pre-load data quality validation
- Row count checks
- Schema validation
- Custom quality rules (configurable)
- Failure handling (warn vs. fail)
Acceptance Criteria:
- [ ] validate_data_quality() task
- [ ] Configuration schema for quality checks
- [ ] Example checks (row count, nulls, duplicates)
- [ ] Integration with alerting
- [ ] Documentation
Estimated Effort: 3 days
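The example checks (row count, nulls, duplicates) and the warn-versus-fail distinction could fit together as in the sketch below. It operates on rows as plain dicts; the class, function names, and default severities are hypothetical and are not the planned validate_data_quality() interface.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    severity: str  # "warn" or "fail"
    detail: str


def check_row_count(rows: list[dict], minimum: int,
                    severity: str = "fail") -> CheckResult:
    n = len(rows)
    return CheckResult("row_count", n >= minimum, severity,
                       f"{n} rows, expected at least {minimum}")


def check_no_nulls(rows: list[dict], column: str,
                   severity: str = "fail") -> CheckResult:
    nulls = sum(1 for r in rows if r.get(column) is None)
    return CheckResult(f"no_nulls[{column}]", nulls == 0, severity,
                       f"{nulls} null values")


def check_unique(rows: list[dict], column: str,
                 severity: str = "warn") -> CheckResult:
    values = [r.get(column) for r in rows]
    dupes = len(values) - len(set(values))
    return CheckResult(f"unique[{column}]", dupes == 0, severity,
                       f"{dupes} duplicate values")


def enforce(results: list[CheckResult]) -> list[str]:
    """Raise if any 'fail' check failed; return messages for failed
    'warn' checks so the caller can log or alert on them."""
    blocking = [r for r in results if not r.passed and r.severity == "fail"]
    if blocking:
        raise ValueError("; ".join(f"{r.name}: {r.detail}" for r in blocking))
    return [f"{r.name}: {r.detail}" for r in results if not r.passed]
```

Returning warnings instead of logging them directly keeps the checks independent of whichever alerting integration is chosen later.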
5.2 Monitoring & Alerting ⏳¶
Status: Not Started
Dependencies: Alerting platform (Slack, PagerDuty, etc.)
Requirements:
- Task-level success/failure notifications
- Pipeline-level SLA monitoring
- Custom alerting rules
- Slack integration
- Email integration
Acceptance Criteria:
- [ ] Alerting task wrappers
- [ ] Configuration schema for alerts
- [ ] Example YAML configs
- [ ] Integration test with mock alerting
- [ ] Documentation
Estimated Effort: 3 days
5.3 Lineage Tracking 🔮¶
Status: Future Consideration
Priority: Low
Ideas:
- Automatic lineage diagram generation
- Integration with data catalogs (Atlan, Alation)
- Source → Target mapping documentation
- dbt lineage integration
📦 Phase 6: Developer Experience (PLANNED)¶
Priority: Low
Target: Q2 2025
6.1 VS Code Extension 🔮¶
Status: Future Consideration
Ideas:
- YAML syntax highlighting for xo-foundry configs
- Auto-completion for config fields
- Inline validation errors
- "Generate DAG" command from editor
6.2 Web UI 🔮¶
Status: Future Consideration
Ideas:
- Browser-based DAG configuration builder
- Visual pipeline designer
- Config validation and preview
- One-click DAG generation
6.3 Testing Framework ⏳¶
Status: Not Started
Priority: Medium
Requirements:
- Unit test utilities for DAG validation
- Mock data generators
- Integration test helpers
- CI/CD test runner
Estimated Effort: 4 days
🐛 Bug Fixes & Tech Debt¶
Known Issues¶
None currently identified.
Technical Debt¶
- Pydantic Schema Warning: schema field name conflicts with BaseModel
  - Status: Workaround implemented (using schema_ with alias)
  - Future: Consider renaming to snowflake_schema in next major version
- Template Duplication: Only one template exists (snowflake_load.py.j2)
  - Status: Acceptable for Phase 1
  - Future: Extract common patterns when adding more templates
Version History¶
v0.4.0 (2025-12-03) - Current¶
- ✅ DAG factory with YAML generation
- ✅ Load strategy path builder
- ✅ Pydantic schemas
- ✅ CLI tool
- ✅ Gladly API extractor task
- ✅ Snowflake load tasks
Future Versions¶
- v0.5.0 (Q1 2025): Additional extractors (Gmail, GSheets, S3, API)
- v0.6.0 (Q1 2025): dbt integration (Cloud or Core)
- v0.7.0 (Q2 2025): Data export templates
- v0.8.0 (Q2 2025): Quality & observability features
- v1.0.0 (Q3 2025): Stable release with comprehensive features
Contributing¶
When adding new features:
- Update Schemas: Add/modify Pydantic schemas in schemas/dag_config.py
- Create Tests: Write unit tests for new functionality
- Update Templates: Modify or create Jinja2 templates
- Update Docs: Add examples and documentation
- Update This Roadmap: Mark features as complete and add new items
Related Documentation¶
- Architecture: .claude/ongoing/archived/2025-12-03-dag-factory-implementation/
- Load Strategies: .claude/ongoing/reference/data-refresh-patterns.md
- Main Docs: .claude/CLAUDE.md