ADR 001: Load Strategy Terminology

Status: Accepted
Date: 2025-12-03
Decision Makers: Data Engineering Team
Related: DAG Factory Implementation, S3 Path Structure

Context

When implementing the xo-foundry DAG factory, we needed standardized terminology for different data refresh patterns. Our initial internal terms didn't align with industry standards, which could cause confusion when:

  • Onboarding new team members
  • Discussing patterns with clients
  • Integrating with third-party tools
  • Reading external documentation

Original Terminology (Internal)

| XO Term | Meaning |
|---|---|
| Static Refresh | Daily snapshot that doesn't change |
| Incremental Refresh | Pull only new/changed records |
| Refreshed Sources | Data that changes retroactively |

Issues Identified

  1. "Incremental Refresh" was misleading:
     • Implied API-level filtering (watermarking)
     • Reality: Google Sheets pulls the ENTIRE dataset every time
     • No way to pull "only new records" from most sources
     • Deduplication happens in the warehouse, not at extraction

  2. "Static Refresh" was unclear:
     • "Static" could mean "never changes" or "daily snapshot"
     • Not a widely recognized industry term

  3. "Refreshed Sources" was vague:
     • Doesn't describe the pattern clearly
     • Could apply to any data that updates

Decision

Adopt industry-standard terminology for load strategies:

New Terminology (Industry Standard)

| Industry Term | XO Implementation | S3 Path Pattern | Use Cases |
|---|---|---|---|
| Full Refresh | full_refresh | {report}/full_refresh/{YYYY-MM-DD}/ | Daily snapshots, API reports, email attachments |
| Incremental Load | incremental | {report}/incremental/{YYYY-MM-DD}T{HH:MM:SS}/ | Google Sheets (pull full, dedupe in warehouse) |
| Historical Refresh | historical | {report}/historical/{extraction_ts}/{data_date}/ | Late-arriving data, SCD Type 2 (avoid when possible) |
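The timestamp segments in these path patterns map directly onto standard strftime formatting. A minimal sketch, using an illustrative report name ("daily_report") rather than real xo-foundry config:

```python
from datetime import datetime

extraction = datetime(2025, 12, 3, 14, 30, 0)

# {YYYY-MM-DD} segment for a full-refresh daily snapshot
full_refresh_path = f"daily_report/full_refresh/{extraction:%Y-%m-%d}/"

# {YYYY-MM-DD}T{HH:MM:SS} segment for a timestamped incremental pull
incremental_path = f"daily_report/incremental/{extraction:%Y-%m-%dT%H:%M:%S}/"

# full_refresh_path -> "daily_report/full_refresh/2025-12-03/"
# incremental_path  -> "daily_report/incremental/2025-12-03T14:30:00/"
```

The second-resolution timestamp is what allows multiple incremental extractions per day to coexist under distinct prefixes.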

Key Clarifications

  1. Full Refresh (formerly "Static Refresh"):
     • Immutable daily snapshots
     • One extraction per day
     • TRUNCATE + INSERT load pattern
     • Used for: API reports, email attachments, exported files
     • S3 Lifecycle: Retain 90 days

  2. Incremental Load (formerly "Incremental Refresh"):
     • Critical: Pull the ENTIRE dataset each time
     • No API-level watermarking or filtering
     • Deduplication in the warehouse using RECORD_HASH or timestamps
     • Multiple extractions per day possible (timestamped paths)
     • Used for: Google Sheets (no API delta support)
     • S3 Lifecycle: Retain 30 days (only latest needed)

  3. Historical Refresh (formerly "Refreshed Sources"):
     • Data that changes retroactively
     • Requires SCD Type 2 historization
     • Complex to manage and query
     • Avoid when possible; negotiate with clients for daily snapshots
     • Used for: Metrics that update retroactively (e.g., Gladly historical reports)
     • S3 Lifecycle: Retain longer for audit trail
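The RECORD_HASH-based warehouse deduplication mentioned above can be sketched as follows. This is an illustration of the technique, not the actual xo-foundry code; `record_hash` and `dedupe` are hypothetical helper names:

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) gives a stable
    # hash for identical records regardless of field order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe(records: list[dict]) -> list[dict]:
    # Keep one copy of each distinct record, preserving first-seen order.
    seen: set[str] = set()
    unique = []
    for rec in records:
        h = record_hash(rec)
        if h not in seen:
            seen.add(h)
            unique.append(rec)
    return unique
```

Because the full dataset is re-pulled on every incremental extraction, a hash comparison like this (or a timestamp-based MERGE) is what actually prevents duplicate rows in the warehouse.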

Consequences

Positive

  1. Industry Alignment: Team can reference standard documentation and best practices
  2. Clear Communication: Unambiguous terminology with clients and partners
  3. Onboarding: New engineers immediately understand patterns
  4. Tool Integration: Aligns with dbt incremental strategies and other tools
  5. S3 Management: Clear path structure enables lifecycle policies

Negative

  1. Migration Effort: Need to update existing documentation and code
  2. Learning Curve: Team needs to adopt new terminology
  3. Confusion Period: Temporary overlap of old/new terms during transition

Neutral

  1. No Code Changes: Internal implementation remains the same, only naming changes
  2. Backward Compatibility: Old configs still work (defaulting to full_refresh)

Implementation

Changes Made

  1. Pydantic Schemas (packages/xo-foundry/src/xo_foundry/schemas/dag_config.py):

    load_strategy: Literal["full_refresh", "incremental", "historical"]
    
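The Literal annotation restricts configs to exactly these three values. A minimal sketch of the same check in plain Python, without requiring pydantic (`validate_load_strategy` is a hypothetical helper, not part of xo-foundry):

```python
from typing import Literal, get_args

LoadStrategy = Literal["full_refresh", "incremental", "historical"]
VALID_STRATEGIES = frozenset(get_args(LoadStrategy))

def validate_load_strategy(value: str) -> str:
    # Rejects anything outside the three accepted strategy names,
    # mirroring what the Pydantic Literal field enforces at parse time.
    if value not in VALID_STRATEGIES:
        raise ValueError(
            f"load_strategy must be one of {sorted(VALID_STRATEGIES)}, got {value!r}"
        )
    return value
```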

  2. Path Builder (packages/xo-foundry/src/xo_foundry/dag_factory/builders/path_builder.py):

    def build_s3_path(domain, report_name, load_strategy, execution_date, data_date=None):
        # execution_date: datetime of this extraction run
        # data_date: business date the data describes (historical only)
        if load_strategy == "full_refresh":
            return f"{domain}/{report_name}/full_refresh/{execution_date:%Y-%m-%d}/"
        elif load_strategy == "incremental":
            return f"{domain}/{report_name}/incremental/{execution_date:%Y-%m-%dT%H:%M:%S}/"
        elif load_strategy == "historical":
            return f"{domain}/{report_name}/historical/{execution_date:%Y-%m-%dT%H:%M:%S}/{data_date:%Y-%m-%d}/"
        raise ValueError(f"Unknown load_strategy: {load_strategy!r}")


  3. YAML Configurations:

    sources:
      contact_timestamps:
        source_type: gladly_api
        load_strategy: full_refresh  # NEW FIELD
    

  4. Warnings: System warns when historical is used (discourage usage)
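The backward-compatible default and the historical-usage warning could look like the sketch below; `resolve_load_strategy` is a hypothetical helper written for illustration, not the actual xo-foundry implementation:

```python
import warnings

def resolve_load_strategy(source_config: dict) -> str:
    # Old configs without the field keep working: default to full_refresh.
    strategy = source_config.get("load_strategy", "full_refresh")
    if strategy == "historical":
        # Historical refreshes are discouraged (SCD Type 2 complexity).
        warnings.warn(
            "historical load_strategy is discouraged; "
            "prefer daily snapshots where possible"
        )
    return strategy
```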

Migration Guide

For existing pipelines:

  1. Review the current data refresh pattern
  2. Choose the appropriate load strategy:
     • Daily snapshot? → full_refresh
     • Full pulls with warehouse dedupe? → incremental
     • Late-arriving data? → historical (try to avoid)
  3. Add load_strategy to the YAML config
  4. Regenerate the DAG using the xo-foundry CLI

Alternatives Considered

Option 1: Keep Internal Terminology

Pros:
  • No migration effort
  • Team already familiar

Cons:
  • Confusion with industry standards
  • Misleading names ("incremental" without actual incrementation)
  • Onboarding friction

Decision: Rejected - Technical debt would accumulate

Option 2: Create Custom Terminology

Pros:
  • Could be more descriptive for our specific use cases

Cons:
  • Creates yet another standard
  • No external references
  • Harder to find solutions to problems

Decision: Rejected - Reinventing the wheel

Option 3: Adopt Industry Standards (SELECTED)

Pros:
  • Clear alignment with external resources
  • Well-understood patterns
  • Future-proof for tool integrations

Cons:
  • Migration effort (minimal)

Decision: Accepted ✅

References

  • Data Refresh Patterns Reference
  • dbt Incremental Models: https://docs.getdbt.com/docs/build/incremental-models
  • Snowflake CDC Patterns: https://docs.snowflake.com/en/user-guide/data-pipelines-intro

Review Schedule

  • Next Review: 2026-06 (6 months after acceptance)
  • Triggers for Earlier Review:
    • Major tool integration requiring different terminology
    • Client feedback indicating confusion
    • Industry terminology evolution

Approval

  • Data Engineering Lead
  • Implementation Complete
  • Documentation Updated
  • Team Notified