Data Refresh Patterns - Industry Standards & XO Implementation¶
- **Status:** Reference Document
- **Created:** 2025-12-02
- **Purpose:** Define standardized terminology and implementation for data refresh strategies
Industry-Standard Terminology¶
Research Findings¶
After analyzing industry best practices from Airflow, dbt, Fivetran, and modern data platforms, here are the accepted terms:
1. Full Refresh (Your "Static Refresh")¶
- Industry Terms: Full Refresh, Complete Refresh, Snapshot, Full Load
- Definition: Complete dataset extracted at a point in time, never changes after extraction
- Common Use Cases:
- Daily report snapshots (Gladly ContactTimestampsReport)
- Historical point-in-time exports
- Transactional data exports (closed transactions)
- Characteristics:
- Same date → Same data (immutable)
- No need for change tracking
- TRUNCATE + INSERT pattern in warehouse
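The TRUNCATE + INSERT pattern can be sketched as a pair of generated statements. A minimal sketch; the table and stage names are illustrative, not XO's actual objects:

```python
def full_refresh_statements(table: str, stage_path: str) -> list[str]:
    """Generate the TRUNCATE + INSERT (COPY) statements for a full refresh.

    Hypothetical helper: wipe the target, then reload the immutable
    snapshot from the staged file path.
    """
    return [
        f"TRUNCATE TABLE {table}",
        f"COPY INTO {table} FROM @stage/{stage_path}",
    ]
```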
2. Incremental Refresh (Your "Incremental Refresh")¶
- Industry Terms: Incremental Load, Delta Load, Append-Only
- Definition: Extract full dataset, identify new/changed records after loading
- Common Use Cases:
- Google Sheets (no API watermarking, pull full sheet)
- APIs without change tracking
- Files with full exports but only new rows matter
- Characteristics:
- Pull entire dataset each time
- Identify changes after load (in warehouse)
- Append new records, update changed records
3. Change Data Capture (CDC) - Not in your list but relevant¶
- Industry Terms: CDC, Log-Based Replication, Real-time Sync
- Definition: Stream only changed records from source
- Common Use Cases:
- Database replication (Postgres WAL, MySQL binlog)
- SaaS platforms with webhook events
- APIs with `modified_since` parameters
- Characteristics:
- Source provides only deltas
- Highly efficient
- Requires source system support
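A minimal sketch of consuming CDC deltas: the source ships only changed records, and the consumer merges them into current state. The record key and `_deleted` tombstone flag are illustrative assumptions, not a real API contract:

```python
def apply_deltas(state: dict, deltas: list[dict], key: str = "id") -> dict:
    """Merge CDC delta records into the current state.

    Assumes each delta carries the record key and an optional
    `_deleted` flag; both are hypothetical, real sources differ.
    """
    for rec in deltas:
        if rec.get("_deleted"):
            state.pop(rec[key], None)   # tombstone: drop the record
        else:
            state[rec[key]] = rec       # insert or update in place
    return state
```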
4. Refreshed Historical Data (Your "Refreshed Sources")¶
- Industry Terms: Late-Arriving Data, Historical Refresh, Slowly Changing Dimension (SCD) Type 2
- Definition: Historical data changes over time, same period returns different values on different extraction dates
- Common Use Cases:
- Rolling aggregates (last 30 days)
- Late-arriving transactions
- Data corrections/adjustments
- SaaS platforms with retroactive updates
- Characteristics:
- Same date range → Different data on different pull dates
- Requires SCD Type 2 historization
- Complex to manage
- Best Practice: Avoid when possible
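The "same range, different data" behavior is what SCD Type 2 historization resolves: each observed version is valid from its extraction until the next one. A minimal in-memory sketch for a single key (field names are illustrative):

```python
def scd2_intervals(rows):
    """Turn (value, extraction_ts) observations for one key into
    SCD Type 2 intervals. A sketch only, not warehouse code.
    """
    rows = sorted(rows, key=lambda r: r[1])  # order by extraction time
    out = []
    for i, (value, ts) in enumerate(rows):
        # Each version is superseded by the next extraction, if any
        valid_to = rows[i + 1][1] if i + 1 < len(rows) else None
        out.append({
            "value": value,
            "valid_from": ts,
            "valid_to": valid_to,           # None marks the current version
            "is_current": valid_to is None,
        })
    return out
```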
XO Implementation Proposal¶
Terminology Alignment¶
| XO Term (Current) | Industry Standard | XO Term (Proposed) | Use In Code |
|---|---|---|---|
| Static Refresh | Full Refresh | Full Refresh | full_refresh |
| Incremental Refresh | Incremental Load | Incremental Load | incremental |
| Refreshed Sources | Historical Refresh / SCD Type 2 | Historical Refresh | historical |
| N/A | Change Data Capture | CDC (future) | cdc |
Recommended Values¶
```yaml
# In YAML configs
source:
  load_strategy: full_refresh  # Options: full_refresh, incremental, historical, cdc
```
Rationale for naming:
- `full_refresh` - Clear, matches dbt materialization terminology
- `incremental` - Standard across all platforms
- `historical` - Describes the problem (history changes)
- `cdc` - Industry-standard abbreviation
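These four values can be locked down as an enum so typos in YAML fail fast. A sketch, assuming Python-side config parsing (the enum and helper are hypothetical, not existing XO code):

```python
from enum import Enum


class LoadStrategy(str, Enum):
    # The four proposed load_strategy values from the table above
    FULL_REFRESH = "full_refresh"
    INCREMENTAL = "incremental"
    HISTORICAL = "historical"
    CDC = "cdc"


def parse_load_strategy(value: str) -> LoadStrategy:
    """Validate a raw YAML value against the allowed strategies."""
    try:
        return LoadStrategy(value)
    except ValueError:
        raise ValueError(
            f"Unknown load_strategy: {value!r}; "
            f"expected one of {[s.value for s in LoadStrategy]}"
        )
```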
Implementation Details¶
1. Full Refresh Pattern¶
Source Behavior: Gladly API, static reports
Extraction: Complete dataset for a specific date
S3 Path: stage-bucket/{domain}/{report}/full/{date}/
Load Strategy: TRUNCATE + INSERT (daily partition)
```yaml
source:
  load_strategy: full_refresh
  extractor:
    metric_set: ContactTimestampsReport
    date_param: report_date  # Fixed date, immutable
```
S3 Example:
```
stage-bucket/warbyparker/contact_timestamps/full/2025-12-01/contact_timestamps_20251201.csv
stage-bucket/warbyparker/contact_timestamps/full/2025-12-02/contact_timestamps_20251202.csv
```
dbt Downstream:
```sql
-- Simple SELECT, no deduplication needed
{{ config(materialized='table') }}

SELECT * FROM {{ source('bronze', 'gladly_contact_timestamps') }}
```
2. Incremental Pattern (Google Sheets Case)¶
Source Behavior: Google Sheets API
Extraction: Full spreadsheet every time (no API support for deltas)
Identification: Detect changes AFTER loading to warehouse
S3 Path: stage-bucket/{domain}/{report}/incremental/{timestamp}/
Load Strategy: APPEND (keep all extractions), deduplicate in SILVER
```yaml
source:
  load_strategy: incremental
  extractor:
    type: gsheet
    spreadsheet_id: "abc123"
    range: "Sheet1!A:Z"
  deduplication:
    # These define RECORD_KEY for change detection
    unique_columns:
      - row_id
      - email
```
S3 Example:
```
stage-bucket/warbyparker/contacts_sheet/incremental/2025-12-01T07:00:00/contacts_20251201_070000.csv
stage-bucket/warbyparker/contacts_sheet/incremental/2025-12-01T14:30:00/contacts_20251201_143000.csv
```
BRONZE Table: Append all extractions
```
-- Multiple rows per row_id across different extraction times
row_id | email         | name  | extraction_timestamp
1      | a@example.com | Alice | 2025-12-01 07:00:00
2      | b@example.com | Bob   | 2025-12-01 07:00:00
1      | a@example.com | Alice | 2025-12-01 14:30:00  -- duplicate
2      | b@example.com | Bob   | 2025-12-01 14:30:00  -- duplicate
3      | c@example.com | Carol | 2025-12-01 14:30:00  -- NEW
```
dbt SILVER: Deduplicate, keep latest
```sql
{{ config(
    materialized='table',
    unique_key='row_id'
) }}

WITH latest_extraction AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY row_id
            ORDER BY extraction_timestamp DESC
        ) AS rn
    FROM {{ source('bronze', 'contacts_sheet') }}
)

SELECT * EXCLUDE (rn)
FROM latest_extraction
WHERE rn = 1
```
Alternative dbt Approach: Use RECORD_HASH to detect changes
```sql
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY row_id, RECORD_HASH
            ORDER BY extraction_timestamp DESC
        ) AS rn,
        LEAD(RECORD_HASH) OVER (
            PARTITION BY row_id
            ORDER BY extraction_timestamp
        ) AS next_hash
    FROM {{ source('bronze', 'contacts_sheet') }}
)

-- Only keep records where the data actually changed
SELECT * EXCLUDE (rn, next_hash)
FROM ranked
WHERE rn = 1 AND (next_hash IS NULL OR RECORD_HASH != next_hash)
```
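`RECORD_HASH` above is assumed to be a deterministic hash of the payload columns. One hypothetical way to derive it at extraction time (the real pipeline may hash differently: column order, types, null handling):

```python
import hashlib


def record_hash(row: dict, exclude: tuple = ("extraction_timestamp",)) -> str:
    """Compute a deterministic hash of a row's payload columns.

    Sketch only: sorts keys for stability and skips metadata columns
    so re-extractions of unchanged data hash identically.
    """
    payload = "|".join(
        f"{k}={row[k]}" for k in sorted(row) if k not in exclude
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```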
3. Historical Refresh Pattern (Avoid When Possible)¶
Source Behavior: Gladly rolling reports (last 30 days)
Extraction: Same date range returns different data on different extraction dates
S3 Path: stage-bucket/{domain}/{report}/historical/{extraction_ts}/{data_date}/
Load Strategy: APPEND with SCD Type 2
```yaml
source:
  load_strategy: historical
  extractor:
    metric_set: RollingMetricsReport
    lookback_days: 30
  scd:
    type: 2
    unique_columns:
      - contact_id
    valid_from_column: extraction_timestamp
```
S3 Example:
```
# First extraction on Dec 1
stage-bucket/warbyparker/rolling/historical/2025-12-01T07:00:00/2025-11-01/data.csv
stage-bucket/warbyparker/rolling/historical/2025-12-01T07:00:00/2025-11-02/data.csv

# Second extraction on Dec 2 - same dates, different data
stage-bucket/warbyparker/rolling/historical/2025-12-02T07:00:00/2025-11-01/data.csv
stage-bucket/warbyparker/rolling/historical/2025-12-02T07:00:00/2025-11-02/data.csv
```
BRONZE Table: Keep all historical versions
```
contact_id | metric_value | data_date  | extraction_timestamp
101        | 50           | 2025-11-01 | 2025-12-01 07:00:00
101        | 55           | 2025-11-01 | 2025-12-02 07:00:00  -- Updated value
```
dbt SILVER: SCD Type 2 with versioning
```sql
{{ config(materialized='table') }}

WITH versioned AS (
    SELECT
        *,
        -- Version number per (contact_id, data_date) across extractions
        ROW_NUMBER() OVER (
            PARTITION BY contact_id, data_date
            ORDER BY extraction_timestamp
        ) AS version,
        -- The next extraction supersedes this one
        LEAD(extraction_timestamp) OVER (
            PARTITION BY contact_id, data_date
            ORDER BY extraction_timestamp
        ) AS next_extraction_ts
    FROM {{ source('bronze', 'rolling_metrics') }}
)

SELECT
    * EXCLUDE (next_extraction_ts),
    extraction_timestamp AS valid_from,
    COALESCE(next_extraction_ts, '9999-12-31'::TIMESTAMP) AS valid_to,
    (next_extraction_ts IS NULL) AS is_current
FROM versioned
```
S3 Path Convention¶
Proposed Standard¶
```
s3://{bucket}/
├── ingest-bucket/                  # Raw from source
│   └── {domain}/
│       └── {report_name}/
│           ├── full/               # Full refresh
│           │   └── {YYYY-MM-DD}/
│           │       └── (unknown)
│           ├── incremental/        # Incremental load
│           │   └── {YYYY-MM-DD}T{HH:MM:SS}/
│           │       └── (unknown)
│           └── historical/         # Historical refresh
│               └── {extraction_timestamp}/
│                   └── {data_date}/
│                       └── (unknown)
│
└── stage-bucket/                   # Standardized columns
    └── {domain}/
        └── {report_name}/
            ├── full/{YYYY-MM-DD}/
            ├── incremental/{YYYY-MM-DD}T{HH:MM:SS}/
            └── historical/{extraction_ts}/{data_date}/
```
Path Builder Logic¶
```python
def build_s3_path(domain, report_name, load_strategy, execution_date, **kwargs):
    """Build an S3 key prefix based on load strategy."""
    # Path segment follows the convention above ("full", not "full_refresh")
    segment = "full" if load_strategy == "full_refresh" else load_strategy
    base = f"{domain}/{report_name}/{segment}"
    if load_strategy == "full_refresh":
        # Date partition only
        date_str = execution_date.strftime("%Y-%m-%d")
        return f"{base}/{date_str}/"
    elif load_strategy == "incremental":
        # Timestamp for each extraction
        ts_str = execution_date.strftime("%Y-%m-%dT%H:%M:%S")
        return f"{base}/{ts_str}/"
    elif load_strategy == "historical":
        # Extraction timestamp + data date
        extraction_ts = execution_date.strftime("%Y-%m-%dT%H:%M:%S")
        data_date = kwargs.get("data_date", execution_date).strftime("%Y-%m-%d")
        return f"{base}/{extraction_ts}/{data_date}/"
    else:
        raise ValueError(f"Unknown load_strategy: {load_strategy}")
```
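For validation and backfills it can help to parse keys back into their parts. A hypothetical inverse of the builder, assuming the `{domain}/{report_name}/{strategy}/...` layout shown above (`parse_s3_key` and the segment map are not existing XO code):

```python
# Path segment on S3 -> load_strategy value in config
_SEGMENT_TO_STRATEGY = {
    "full": "full_refresh",
    "incremental": "incremental",
    "historical": "historical",
}


def parse_s3_key(key: str) -> dict:
    """Parse a key produced by the path convention into its components."""
    parts = key.strip("/").split("/")
    domain, report_name, segment = parts[:3]
    strategy = _SEGMENT_TO_STRATEGY[segment]
    info = {"domain": domain, "report_name": report_name,
            "load_strategy": strategy}
    if strategy == "historical":
        # historical keys carry both extraction time and data date
        info["extraction_ts"], info["data_date"] = parts[3], parts[4]
    else:
        info["partition"] = parts[3]
    return info
```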
XO Sources Classification¶
Current Sources¶
| Source | Type | Load Strategy | Rationale |
|---|---|---|---|
| Gladly ContactTimestampsReport | API | full_refresh | Daily snapshot, immutable |
| Gladly ConversationTimestampsReport | API | full_refresh | Daily snapshot, immutable |
| Gladly ContactExportReport | API | full_refresh | Daily snapshot, immutable |
| Gladly Rolling Metrics | API | historical | Last N days, changes daily (⚠️ avoid) |
| Google Sheets - Contacts | Spreadsheet | incremental | Full pull, identify changes in warehouse |
| Google Sheets - Metrics | Spreadsheet | incremental | Full pull, identify changes in warehouse |
| Gmail Attachments | | full_refresh | Static files, immutable |
| S3 Files | Storage | full_refresh | Static files, immutable |
Recommendations¶
- Gladly Rolling Metrics: Work with client to switch to daily snapshots (full_refresh)
- Google Sheets: Always use incremental pattern
- New Sources: Default to full_refresh when possible
YAML Configuration Examples¶
Full Refresh (Gladly Timestamps)¶
```yaml
sources:
  contact_timestamps:
    load_strategy: full_refresh
    source_type: gladly_api
    extractor:
      metric_set: ContactTimestampsReport
      date_field: report_date  # Fixed date
    snowflake:
      target_table: GLADLY_CONTACT_TIMESTAMPS
      load_method: truncate_insert  # Safe for full_refresh
```
Incremental (Google Sheets)¶
```yaml
sources:
  contacts_sheet:
    load_strategy: incremental
    source_type: gsheet
    extractor:
      spreadsheet_id: "abc123xyz"
      range: "Contacts!A:Z"
      # No watermark - pull full sheet every time
    snowflake:
      target_table: GSHEET_CONTACTS
      load_method: append  # Keep all extractions
    deduplication:
      unique_columns:
        - row_id
        - email
```
Historical (Avoid)¶
```yaml
sources:
  rolling_metrics:
    load_strategy: historical
    source_type: gladly_api
    extractor:
      metric_set: RollingMetricsReport
      lookback_days: 30
    snowflake:
      target_table: GLADLY_ROLLING_METRICS
      load_method: append
      scd_type: 2
    deduplication:
      unique_columns:
        - contact_id
        - metric_date
```
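A small validator over the parsed YAML can enforce the conventions above, including the warn-on-`historical` rule from the checklist. A sketch using the proposed key names; the function itself is hypothetical, not shipped code:

```python
def validate_source_config(name: str, cfg: dict) -> list[str]:
    """Return warnings/errors for one parsed source config dict."""
    problems = []
    strategy = cfg.get("load_strategy")
    if strategy not in ("full_refresh", "incremental", "historical", "cdc"):
        problems.append(f"{name}: unknown load_strategy {strategy!r}")
    if strategy == "historical":
        # Best practice above: negotiate daily snapshots instead
        problems.append(f"{name}: 'historical' is discouraged; prefer daily snapshots")
    if strategy == "incremental" and not cfg.get("deduplication", {}).get("unique_columns"):
        # Incremental dedup in SILVER needs a RECORD_KEY definition
        problems.append(f"{name}: incremental sources need deduplication.unique_columns")
    return problems
```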
Implementation Checklist¶
- Update YAML schema to use `load_strategy` instead of `refresh_type`
- Implement S3 path builder with load_strategy support
- Update extraction tasks to pass load_strategy
- Update staging tasks to use correct S3 paths
- Add load_strategy to BRONZE table metadata
- Document dbt patterns for each load_strategy
- Add validation: warn if `historical` is used
Migration Path¶
Existing Code¶
New Code¶
Backward Compatibility¶
Support both during transition:
```python
load_strategy = config.get("load_strategy") or config.get("refresh_type", "full_refresh")

# Map old values to new
if load_strategy == "static":
    load_strategy = "full_refresh"
elif load_strategy == "refreshed":
    load_strategy = "historical"
```
Best Practices¶
- Default to full_refresh: Simplest, most reliable
- Use incremental for live documents: Google Sheets, spreadsheets
- Avoid historical: Negotiate with clients for daily snapshots instead
- Document the strategy: Make it explicit in YAML and table comments
- S3 path reflects strategy: Path structure makes strategy obvious