# ADR 001: Load Strategy Terminology
**Status:** Accepted
**Date:** 2025-12-03
**Decision Makers:** Data Engineering Team
**Related:** DAG Factory Implementation, S3 Path Structure
## Context
When implementing the xo-foundry DAG factory, we needed standardized terminology for different data refresh patterns. Our initial internal terms didn't align with industry standards, which could cause confusion when:
- Onboarding new team members
- Discussing patterns with clients
- Integrating with third-party tools
- Reading external documentation
### Original Terminology (Internal)
| XO Term | Meaning |
|---|---|
| Static Refresh | Daily snapshot that doesn't change |
| Incremental Refresh | Pull only new/changed records |
| Refreshed Sources | Data that changes retroactively |
### Issues Identified

- "Incremental Refresh" was misleading:
    - Implied API-level filtering (watermarking)
    - Reality: Google Sheets pulls the ENTIRE dataset every time
    - No way to pull "only new records" from most sources
    - Deduplication happens in the warehouse, not at extraction
- "Static Refresh" was unclear:
    - "Static" could mean "never changes" or "daily snapshot"
    - Not a widely recognized industry term
- "Refreshed Sources" was vague:
    - Doesn't describe the pattern clearly
    - Could apply to any data that updates
## Decision
Adopt industry-standard terminology for load strategies:
### New Terminology (Industry Standard)

| Industry Term | XO Implementation | S3 Path Pattern | Use Cases |
|---|---|---|---|
| Full Refresh | `full_refresh` | `{report}/full_refresh/{YYYY-MM-DD}/` | Daily snapshots, API reports, email attachments |
| Incremental Load | `incremental` | `{report}/incremental/{YYYY-MM-DD}T{HH:MM:SS}/` | Google Sheets (pull full, dedupe in warehouse) |
| Historical Refresh | `historical` | `{report}/historical/{extraction_ts}/{data_date}/` | Late-arriving data, SCD Type 2 (avoid when possible) |
### Key Clarifications

- Full Refresh (formerly "Static Refresh"):
    - Immutable daily snapshots
    - One extraction per day
    - TRUNCATE + INSERT load pattern
    - Used for: API reports, email attachments, exported files
    - S3 Lifecycle: retain 90 days
- Incremental Load (formerly "Incremental Refresh"):
    - Critical: pulls the ENTIRE dataset each time
    - No API-level watermarking or filtering
    - Deduplication in the warehouse using RECORD_HASH or timestamps
    - Multiple extractions per day are possible (timestamped paths)
    - Used for: Google Sheets (no API delta support)
    - S3 Lifecycle: retain 30 days (only the latest is needed)
- Historical Refresh (formerly "Refreshed Sources"):
    - Data that changes retroactively
    - Requires SCD Type 2 historization
    - Complex to manage and query
    - Avoid when possible; negotiate with clients for daily snapshots
    - Used for: metrics that update retroactively (e.g., Gladly historical reports)
    - S3 Lifecycle: retain longer for the audit trail
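The warehouse-side deduplication that Incremental Load relies on can be sketched as follows. This is an illustrative example only, not the production implementation; the `record_hash` helper and the column names are assumptions.

```python
import hashlib
import json

def record_hash(row: dict) -> str:
    """Stable hash of a row's columns (illustrative stand-in for RECORD_HASH,
    which is computed in the warehouse)."""
    canonical = json.dumps(row, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def dedupe(extractions: list[list[dict]]) -> list[dict]:
    """Merge repeated full pulls, keeping one copy of each unique record.

    Each element of `extractions` is one full pull of the source (e.g. a
    Google Sheet); later pulls overwrite earlier copies of the same record.
    """
    seen: dict[str, dict] = {}
    for pull in extractions:  # extractions are ordered oldest-first
        for row in pull:
            seen[record_hash(row)] = row
    return list(seen.values())

# Two full pulls of the same sheet; duplicates collapse, the new row survives.
pull_1 = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
pull_2 = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}, {"id": 3, "value": "c"}]
rows = dedupe([pull_1, pull_2])
print(len(rows))  # → 3
```

This is why each extraction can safely pull the entire dataset: the hash-keyed merge makes repeated pulls idempotent.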
## Consequences
### Positive
- Industry Alignment: Team can reference standard documentation and best practices
- Clear Communication: Unambiguous terminology with clients and partners
- Onboarding: New engineers immediately understand patterns
- Tool Integration: Aligns with dbt incremental strategies and other tools
- S3 Management: Clear path structure enables lifecycle policies
### Negative
- Migration Effort: Need to update existing documentation and code
- Learning Curve: Team needs to adopt new terminology
- Confusion Period: Temporary overlap of old/new terms during transition
### Neutral

- No Code Changes: Internal implementation remains the same; only the naming changes
- Backward Compatibility: Old configs still work (defaulting to `full_refresh`)
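The backward-compatibility default can be sketched as follows. This uses a plain dataclass for illustration; the actual schemas are Pydantic models in `dag_config.py`, and only the field name and its allowed values come from this ADR.

```python
from dataclasses import dataclass

# Valid load strategies, per this ADR.
LOAD_STRATEGIES = {"full_refresh", "incremental", "historical"}

@dataclass
class ReportConfig:
    """Minimal stand-in for the real Pydantic schema."""
    report_name: str
    load_strategy: str = "full_refresh"  # old configs omit this field

    def __post_init__(self) -> None:
        if self.load_strategy not in LOAD_STRATEGIES:
            raise ValueError(f"unknown load_strategy: {self.load_strategy}")

# An old config that never set load_strategy still validates.
legacy = ReportConfig(report_name="daily_sales")
print(legacy.load_strategy)  # → full_refresh
```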
## Implementation

### Changes Made

- Pydantic Schemas (`packages/xo-foundry/src/xo_foundry/schemas/dag_config.py`)
- Path Builder (`packages/xo-foundry/src/xo_foundry/dag_factory/builders/path_builder.py`):

    ```python
    def build_s3_path(domain, report_name, load_strategy, execution_date, data_date=None):
        """Build the S3 key prefix for one extraction, following this ADR's path patterns."""
        if load_strategy == "full_refresh":
            return f"{domain}/{report_name}/full_refresh/{execution_date:%Y-%m-%d}/"
        elif load_strategy == "incremental":
            return f"{domain}/{report_name}/incremental/{execution_date:%Y-%m-%dT%H:%M:%S}/"
        elif load_strategy == "historical":
            # extraction_ts: when the pull ran; data_date: the date the data describes
            extraction_ts = f"{execution_date:%Y-%m-%dT%H:%M:%S}"
            return f"{domain}/{report_name}/historical/{extraction_ts}/{data_date}/"
        raise ValueError(f"unknown load_strategy: {load_strategy}")
    ```

- YAML Configurations
- Warnings: the system warns when `historical` is used (to discourage usage)
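The `historical` warning can be sketched as follows. This is illustrative: the real check lives in the DAG factory, and the function name and message wording are assumptions.

```python
import warnings

def check_load_strategy(load_strategy: str) -> None:
    """Warn when a config opts into the discouraged `historical` strategy."""
    if load_strategy == "historical":
        warnings.warn(
            "load_strategy 'historical' requires SCD Type 2 historization; "
            "prefer daily snapshots (full_refresh) where possible.",
            UserWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_load_strategy("full_refresh")  # no warning
    check_load_strategy("historical")    # one warning
print(len(caught))  # → 1
```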
### Migration Guide

For existing pipelines:

1. Review the current data refresh pattern
2. Choose the appropriate load strategy:
    - Daily snapshot? → `full_refresh`
    - Full pulls with warehouse dedupe? → `incremental`
    - Late-arriving data? → `historical` (try to avoid)
3. Add `load_strategy` to the YAML config
4. Regenerate the DAG using the xo-foundry CLI
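Step 3 might look like the following in a report's YAML config. The surrounding keys and file path are illustrative; only `load_strategy` and its allowed values come from this ADR.

```yaml
# reports/daily_sales.yaml (hypothetical layout)
report_name: daily_sales
domain: retail
load_strategy: full_refresh   # one of: full_refresh | incremental | historical
```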
## Alternatives Considered

### Option 1: Keep Internal Terminology

Pros:

- No migration effort
- Team already familiar

Cons:

- Confusion with industry standards
- Misleading names ("incremental" without actual incrementation)
- Onboarding friction

Decision: Rejected - technical debt would accumulate

### Option 2: Create Custom Terminology

Pros:

- Could be more descriptive for our specific use cases

Cons:

- Creates yet another standard
- No external references
- Harder to find solutions to problems

Decision: Rejected - reinventing the wheel

### Option 3: Adopt Industry Standards (SELECTED)

Pros:

- Clear alignment with external resources
- Well-understood patterns
- Future-proof for tool integrations

Cons:

- Migration effort (minimal)

Decision: Accepted ✅
## References
- Data Refresh Patterns Reference
- dbt Incremental Models: https://docs.getdbt.com/docs/build/incremental-models
- Snowflake CDC Patterns: https://docs.snowflake.com/en/user-guide/data-pipelines-intro
## Review Schedule
- Next Review: 2026-06 (6 months after the decision date)
- Triggers for Earlier Review:
- Major tool integration requiring different terminology
- Client feedback indicating confusion
- Industry terminology evolution
## Approval
- Data Engineering Lead
- Implementation Complete
- Documentation Updated
- Team Notified