ADR 001: Load Strategy Terminology

Status: Accepted
Date: 2025-12-03
Decision Makers: Data Engineering Team
Related: DAG Factory Implementation, S3 Path Structure

Context

When implementing the xo-foundry DAG factory, we needed standardized terminology for different data refresh patterns. Our initial internal terms didn't align with industry standards, which could cause confusion when:

  • Onboarding new team members
  • Discussing patterns with clients
  • Integrating with third-party tools
  • Reading external documentation

Original Terminology (Internal)

| XO Term | Meaning |
|---|---|
| Static Refresh | Daily snapshot that doesn't change |
| Incremental Refresh | Pull only new/changed records |
| Refreshed Sources | Data that changes retroactively |

Issues Identified

  1. "Incremental Refresh" was misleading:
     • Implied API-level filtering (watermarking)
     • Reality: Google Sheets pulls the ENTIRE dataset every time
     • No way to pull "only new records" from most sources
     • Deduplication happens in the warehouse, not at extraction

  2. "Static Refresh" was unclear:
     • "Static" could mean "never changes" or "daily snapshot"
     • Not a widely recognized industry term

  3. "Refreshed Sources" was vague:
     • Doesn't describe the pattern clearly
     • Could apply to any data that updates

Decision

Adopt industry-standard terminology for load strategies:

New Terminology (Industry Standard)

| Industry Term | XO Implementation | S3 Path Pattern | Use Cases |
|---|---|---|---|
| Full Refresh | full_refresh | {report}/full_refresh/{YYYY-MM-DD}/ | Daily snapshots, API reports, email attachments |
| Incremental Load | incremental | {report}/incremental/{YYYY-MM-DD}T{HH:MM:SS}/ | Google Sheets (pull full, dedupe in warehouse) |
| Historical Refresh | historical | {report}/historical/{extraction_ts}/{data_date}/ | Late-arriving data, SCD Type 2 (avoid when possible) |
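The timestamp segments in these path patterns map directly onto standard strftime formatting. A minimal sketch, using an illustrative report name ("daily_report") rather than real xo-foundry config:

```python
from datetime import datetime

extraction = datetime(2025, 12, 3, 14, 30, 0)

# {YYYY-MM-DD} segment for a full-refresh daily snapshot
full_refresh_path = f"daily_report/full_refresh/{extraction:%Y-%m-%d}/"

# {YYYY-MM-DD}T{HH:MM:SS} segment for a timestamped incremental pull
incremental_path = f"daily_report/incremental/{extraction:%Y-%m-%dT%H:%M:%S}/"

# full_refresh_path -> "daily_report/full_refresh/2025-12-03/"
# incremental_path  -> "daily_report/incremental/2025-12-03T14:30:00/"
```

The second-resolution timestamp is what allows multiple incremental extractions per day to coexist under distinct prefixes.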

Key Clarifications

  1. Full Refresh (formerly "Static Refresh"):
     • Immutable daily snapshots
     • One extraction per day
     • TRUNCATE + INSERT load pattern
     • Used for: API reports, email attachments, exported files
     • S3 Lifecycle: Retain 90 days

  2. Incremental Load (formerly "Incremental Refresh"):
     • Critical: Pull the ENTIRE dataset each time
     • No API-level watermarking or filtering
     • Deduplication in the warehouse using RECORD_HASH or timestamps
     • Multiple extractions per day possible (timestamped paths)
     • Used for: Google Sheets (no API delta support)
     • S3 Lifecycle: Retain 30 days (only latest needed)

  3. Historical Refresh (formerly "Refreshed Sources"):
     • Data that changes retroactively
     • Requires SCD Type 2 historization
     • Complex to manage and query
     • Avoid when possible; negotiate with clients for daily snapshots
     • Used for: Metrics that update retroactively (e.g., Gladly historical reports)
     • S3 Lifecycle: Retain longer for audit trail
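The RECORD_HASH-based warehouse deduplication mentioned above can be sketched as follows. This is an illustration of the technique, not the actual xo-foundry code; `record_hash` and `dedupe` are hypothetical helper names:

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) gives a stable
    # hash for identical records regardless of field order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe(records: list[dict]) -> list[dict]:
    # Keep one copy of each distinct record, preserving first-seen order.
    seen: set[str] = set()
    unique = []
    for rec in records:
        h = record_hash(rec)
        if h not in seen:
            seen.add(h)
            unique.append(rec)
    return unique
```

Because the full dataset is re-pulled on every incremental extraction, a hash comparison like this (or a timestamp-based MERGE) is what actually prevents duplicate rows in the warehouse.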

Consequences

Positive

  1. Industry Alignment: Team can reference standard documentation and best practices
  2. Clear Communication: Unambiguous terminology with clients and partners
  3. Onboarding: New engineers immediately understand patterns
  4. Tool Integration: Aligns with dbt incremental strategies and other tools
  5. S3 Management: Clear path structure enables lifecycle policies

Negative

  1. Migration Effort: Need to update existing documentation and code
  2. Learning Curve: Team needs to adopt new terminology
  3. Confusion Period: Temporary overlap of old/new terms during transition

Neutral

  1. No Code Changes: Internal implementation remains the same, only naming changes
  2. Backward Compatibility: Old configs still work (defaulting to full_refresh)

Implementation

Changes Made

  1. Pydantic Schemas (packages/xo-foundry/src/xo_foundry/schemas/dag_config.py):

    load_strategy: Literal["full_refresh", "incremental", "historical"]
    
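The Literal annotation restricts configs to exactly these three values. A minimal sketch of the same check in plain Python, without requiring pydantic (`validate_load_strategy` is a hypothetical helper, not part of xo-foundry):

```python
from typing import Literal, get_args

LoadStrategy = Literal["full_refresh", "incremental", "historical"]
VALID_STRATEGIES = frozenset(get_args(LoadStrategy))

def validate_load_strategy(value: str) -> str:
    # Rejects anything outside the three accepted strategy names,
    # mirroring what the Pydantic Literal field enforces at parse time.
    if value not in VALID_STRATEGIES:
        raise ValueError(
            f"load_strategy must be one of {sorted(VALID_STRATEGIES)}, got {value!r}"
        )
    return value
```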

  2. Path Builder (packages/xo-foundry/src/xo_foundry/dag_factory/builders/path_builder.py):

    def build_s3_path(domain, report_name, load_strategy, execution_date, data_date=None):
        # execution_date: datetime of this extraction run
        # data_date: business date the data describes (historical only)
        if load_strategy == "full_refresh":
            return f"{domain}/{report_name}/full_refresh/{execution_date:%Y-%m-%d}/"
        elif load_strategy == "incremental":
            return f"{domain}/{report_name}/incremental/{execution_date:%Y-%m-%dT%H:%M:%S}/"
        elif load_strategy == "historical":
            return f"{domain}/{report_name}/historical/{execution_date:%Y-%m-%dT%H:%M:%S}/{data_date:%Y-%m-%d}/"
        raise ValueError(f"Unknown load_strategy: {load_strategy!r}")


  3. YAML Configurations:

    sources:
      contact_timestamps:
        source_type: gladly_api
        load_strategy: full_refresh  # NEW FIELD
    

  4. Warnings: System warns when historical is used (discourage usage)
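The backward-compatible default and the historical-usage warning could look like the sketch below; `resolve_load_strategy` is a hypothetical helper written for illustration, not the actual xo-foundry implementation:

```python
import warnings

def resolve_load_strategy(source_config: dict) -> str:
    # Old configs without the field keep working: default to full_refresh.
    strategy = source_config.get("load_strategy", "full_refresh")
    if strategy == "historical":
        # Historical refreshes are discouraged (SCD Type 2 complexity).
        warnings.warn(
            "historical load_strategy is discouraged; "
            "prefer daily snapshots where possible"
        )
    return strategy
```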

Migration Guide

For existing pipelines:

  1. Review the current data refresh pattern
  2. Choose the appropriate load strategy:
     • Daily snapshot? → full_refresh
     • Full pulls with warehouse dedupe? → incremental
     • Late-arriving data? → historical (try to avoid)
  3. Add load_strategy to the YAML config
  4. Regenerate the DAG using the xo-foundry CLI

Alternatives Considered

Option 1: Keep Internal Terminology

Pros:
  • No migration effort
  • Team already familiar

Cons:
  • Confusion with industry standards
  • Misleading names ("incremental" without actual incrementation)
  • Onboarding friction

Decision: Rejected - Technical debt would accumulate

Option 2: Create Custom Terminology

Pros:
  • Could be more descriptive for our specific use cases

Cons:
  • Creates yet another standard
  • No external references
  • Harder to find solutions to problems

Decision: Rejected - Reinventing the wheel

Option 3: Adopt Industry Standards (SELECTED)

Pros:
  • Clear alignment with external resources
  • Well-understood patterns
  • Future-proof for tool integrations

Cons:
  • Migration effort (minimal)

Decision: Accepted ✅

References

  • Data Refresh Patterns Reference
  • dbt Incremental Models: https://docs.getdbt.com/docs/build/incremental-models
  • Snowflake CDC Patterns: https://docs.snowflake.com/en/user-guide/data-pipelines-intro

Review Schedule

  • Next Review: 2026-06 (6 months after acceptance)
  • Triggers for Earlier Review:
    • Major tool integration requiring different terminology
    • Client feedback indicating confusion
    • Industry terminology evolution

Approval

  • Data Engineering Lead
  • Implementation Complete
  • Documentation Updated
  • Team Notified