Skip to content

Data Refresh Patterns - Industry Standards & XO Implementation

Status: Reference Document Created: 2025-12-02 Purpose: Define standardized terminology and implementation for data refresh strategies

Industry-Standard Terminology

Research Findings

After analyzing industry best practices from Airflow, dbt, Fivetran, and modern data platforms, here are the accepted terms:

1. Full Refresh (Your "Static Refresh")

  • Industry Terms: Full Refresh, Complete Refresh, Snapshot, Full Load
  • Definition: Complete dataset extracted at a point in time, never changes after extraction
  • Common Use Cases:
  • Daily report snapshots (Gladly ContactTimestampsReport)
  • Historical point-in-time exports
  • Transactional data exports (closed transactions)
  • Characteristics:
  • Same date → Same data (immutable)
  • No need for change tracking
  • TRUNCATE + INSERT pattern in warehouse

2. Incremental Refresh (Your "Incremental Refresh")

  • Industry Terms: Incremental Load, Delta Load, Append-Only
  • Definition: Extract full dataset, identify new/changed records after loading
  • Common Use Cases:
  • Google Sheets (no API watermarking, pull full sheet)
  • APIs without change tracking
  • Files with full exports but only new rows matter
  • Characteristics:
  • Pull entire dataset each time
  • Identify changes after load (in warehouse)
  • Append new records, update changed records

3. Change Data Capture (CDC) - Not in your list but relevant

  • Industry Terms: CDC, Log-Based Replication, Real-time Sync
  • Definition: Stream only changed records from source
  • Common Use Cases:
  • Database replication (Postgres WAL, MySQL binlog)
  • SaaS platforms with webhook events
  • APIs with modified_since parameters
  • Characteristics:
  • Source provides only deltas
  • Highly efficient
  • Requires source system support

4. Refreshed Historical Data (Your "Refreshed Sources")

  • Industry Terms: Late-Arriving Data, Historical Refresh, Slowly Changing Dimension (SCD) Type 2
  • Definition: Historical data changes over time, same period returns different values on different extraction dates
  • Common Use Cases:
  • Rolling aggregates (last 30 days)
  • Late-arriving transactions
  • Data corrections/adjustments
  • SaaS platforms with retroactive updates
  • Characteristics:
  • Same date range → Different data on different pull dates
  • Requires SCD Type 2 historization
  • Complex to manage
  • Best Practice: Avoid when possible

XO Implementation Proposal

Terminology Alignment

XO Term (Current) Industry Standard XO Term (Proposed) Use In Code
Static Refresh Full Refresh Full Refresh full_refresh
Incremental Refresh Incremental Load Incremental Load incremental
Refreshed Sources Historical Refresh / SCD Type 2 Historical Refresh historical
N/A Change Data Capture CDC (future) cdc
# In YAML configs
source:
  load_strategy: full_refresh  # Options: full_refresh, incremental, historical, cdc

Rationale for naming: - full_refresh - Clear, matches dbt materialization terminology - incremental - Standard across all platforms - historical - Describes the problem (history changes) - cdc - Industry standard abbreviation

Implementation Details

1. Full Refresh Pattern

Source Behavior: Gladly API, static reports Extraction: Complete dataset for a specific date S3 Path: stage-bucket/{domain}/{report}/full/{date}/ Load Strategy: TRUNCATE + INSERT (daily partition)

source:
  load_strategy: full_refresh
  extractor:
    metric_set: ContactTimestampsReport
    date_param: report_date  # Fixed date, immutable

S3 Example:

stage-bucket/warbyparker/contact_timestamps/full/2025-12-01/contact_timestamps_20251201.csv
stage-bucket/warbyparker/contact_timestamps/full/2025-12-02/contact_timestamps_20251202.csv

dbt Downstream:

-- Simple SELECT, no deduplication needed
{{ config(materialized='table') }}
SELECT * FROM {{ source('bronze', 'gladly_contact_timestamps') }}

2. Incremental Pattern (Google Sheets Case)

Source Behavior: Google Sheets API Extraction: Full spreadsheet every time (no API support for deltas) Identification: Detect changes AFTER loading to warehouse S3 Path: stage-bucket/{domain}/{report}/incremental/{timestamp}/ Load Strategy: APPEND (keep all extractions), deduplicate in SILVER

source:
  load_strategy: incremental
  extractor:
    type: gsheet
    spreadsheet_id: "abc123"
    range: "Sheet1!A:Z"
  deduplication:
    # These define RECORD_KEY for change detection
    unique_columns:
      - row_id
      - email

S3 Example:

stage-bucket/warbyparker/contacts_sheet/incremental/2025-12-01T07:00:00/contacts_20251201_070000.csv
stage-bucket/warbyparker/contacts_sheet/incremental/2025-12-01T14:30:00/contacts_20251201_143000.csv

BRONZE Table: Append all extractions

-- Multiple rows per row_id across different extraction times
row_id | email          | name  | extraction_timestamp
1      | a@example.com  | Alice | 2025-12-01 07:00:00
2      | b@example.com  | Bob   | 2025-12-01 07:00:00
1      | a@example.com  | Alice | 2025-12-01 14:30:00  -- duplicate
2      | b@example.com  | Bob   | 2025-12-01 14:30:00  -- duplicate
3      | c@example.com  | Carol | 2025-12-01 14:30:00  -- NEW

dbt SILVER: Deduplicate, keep latest

{{ config(
    materialized='table',
    unique_key='row_id'
) }}

WITH latest_extraction AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY row_id
      ORDER BY extraction_timestamp DESC
    ) as rn
  FROM {{ source('bronze', 'contacts_sheet') }}
)
SELECT * EXCLUDE(rn)
FROM latest_extraction
WHERE rn = 1

Alternative dbt Approach: Use RECORD_HASH to detect changes

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY row_id, RECORD_HASH
      ORDER BY extraction_timestamp DESC
    ) as rn,
    LEAD(RECORD_HASH) OVER (
      PARTITION BY row_id
      ORDER BY extraction_timestamp
    ) as next_hash
  FROM {{ source('bronze', 'contacts_sheet') }}
)
-- Only keep records where data actually changed
SELECT * EXCLUDE(rn, next_hash)
FROM ranked
WHERE rn = 1 AND (next_hash IS NULL OR RECORD_HASH != next_hash)

3. Historical Refresh Pattern (Avoid When Possible)

Source Behavior: Gladly rolling reports (last 30 days) Extraction: Same date range returns different data on different extraction dates S3 Path: stage-bucket/{domain}/{report}/historical/{extraction_ts}/{data_date}/ Load Strategy: APPEND with SCD Type 2

source:
  load_strategy: historical
  extractor:
    metric_set: RollingMetricsReport
    lookback_days: 30
  scd:
    type: 2
    unique_columns:
      - contact_id
    valid_from_column: extraction_timestamp

S3 Example:

# First extraction on Dec 1
stage-bucket/warbyparker/rolling/historical/2025-12-01T07:00:00/2025-11-01/data.csv
stage-bucket/warbyparker/rolling/historical/2025-12-01T07:00:00/2025-11-02/data.csv

# Second extraction on Dec 2 - same dates, different data
stage-bucket/warbyparker/rolling/historical/2025-12-02T07:00:00/2025-11-01/data.csv
stage-bucket/warbyparker/rolling/historical/2025-12-02T07:00:00/2025-11-02/data.csv

BRONZE Table: Keep all historical versions

contact_id | metric_value | data_date  | extraction_timestamp
101        | 50           | 2025-11-01 | 2025-12-01 07:00:00
101        | 55           | 2025-11-01 | 2025-12-02 07:00:00  -- Updated value

dbt SILVER: SCD Type 2 with versioning

{{ config(materialized='table') }}

WITH versioned AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY contact_id, data_date, extraction_timestamp
      ORDER BY extraction_timestamp DESC
    ) as version,
    LEAD(extraction_timestamp) OVER (
      PARTITION BY contact_id, data_date
      ORDER BY extraction_timestamp
    ) as valid_to
  FROM {{ source('bronze', 'rolling_metrics') }}
)
SELECT
  *,
  extraction_timestamp as valid_from,
  COALESCE(valid_to, '9999-12-31'::timestamp) as valid_to,
  CASE WHEN valid_to IS NULL THEN TRUE ELSE FALSE END as is_current
FROM versioned

S3 Path Convention

Proposed Standard

s3://{bucket}/
├── ingest-bucket/           # Raw from source
│   └── {domain}/
│       └── {report_name}/
│           ├── full/                    # Full refresh
│           │   └── {YYYY-MM-DD}/
│           │       └── {filename}
│           ├── incremental/             # Incremental load
│           │   └── {YYYY-MM-DD}T{HH:MM:SS}/
│           │       └── {filename}
│           └── historical/              # Historical refresh
│               └── {extraction_timestamp}/
│                   └── {data_date}/
│                       └── {filename}
└── stage-bucket/            # Standardized columns
    └── {domain}/
        └── {report_name}/
            ├── full/{YYYY-MM-DD}/
            ├── incremental/{YYYY-MM-DD}T{HH:MM:SS}/
            └── historical/{extraction_ts}/{data_date}/

Path Builder Logic

def build_s3_path(domain, report_name, load_strategy, execution_date, **kwargs):
    """Build S3 path based on load strategy."""

    base = f"{domain}/{report_name}/{load_strategy}"

    if load_strategy == "full_refresh":
        # Date partition only
        date_str = execution_date.strftime("%Y-%m-%d")
        return f"{base}/{date_str}/"

    elif load_strategy == "incremental":
        # Timestamp for each extraction
        ts_str = execution_date.strftime("%Y-%m-%dT%H:%M:%S")
        return f"{base}/{ts_str}/"

    elif load_strategy == "historical":
        # Extraction timestamp + data date
        extraction_ts = execution_date.strftime("%Y-%m-%dT%H:%M:%S")
        data_date = kwargs.get("data_date", execution_date).strftime("%Y-%m-%d")
        return f"{base}/{extraction_ts}/{data_date}/"

    else:
        raise ValueError(f"Unknown load_strategy: {load_strategy}")

XO Sources Classification

Current Sources

Source Type Load Strategy Rationale
Gladly ContactTimestampsReport API full_refresh Daily snapshot, immutable
Gladly ConversationTimestampsReport API full_refresh Daily snapshot, immutable
Gladly ContactExportReport API full_refresh Daily snapshot, immutable
Gladly Rolling Metrics API historical Last N days, changes daily (⚠️ avoid)
Google Sheets - Contacts Spreadsheet incremental Full pull, identify changes in warehouse
Google Sheets - Metrics Spreadsheet incremental Full pull, identify changes in warehouse
Gmail Attachments Email full_refresh Static files, immutable
S3 Files Storage full_refresh Static files, immutable

Recommendations

  1. Gladly Rolling Metrics: Work with client to switch to daily snapshots (full_refresh)
  2. Google Sheets: Always use incremental pattern
  3. New Sources: Default to full_refresh when possible

YAML Configuration Examples

Full Refresh (Gladly Timestamps)

sources:
  contact_timestamps:
    load_strategy: full_refresh
    source_type: gladly_api

    extractor:
      metric_set: ContactTimestampsReport
      date_field: report_date  # Fixed date

    snowflake:
      target_table: GLADLY_CONTACT_TIMESTAMPS
      load_method: truncate_insert  # Safe for full_refresh

Incremental (Google Sheets)

sources:
  contacts_sheet:
    load_strategy: incremental
    source_type: gsheet

    extractor:
      spreadsheet_id: "abc123xyz"
      range: "Contacts!A:Z"
      # No watermark - pull full sheet every time

    snowflake:
      target_table: GSHEET_CONTACTS
      load_method: append  # Keep all extractions
      deduplication:
        unique_columns:
          - row_id
          - email

Historical (Avoid)

sources:
  rolling_metrics:
    load_strategy: historical
    source_type: gladly_api

    extractor:
      metric_set: RollingMetricsReport
      lookback_days: 30

    snowflake:
      target_table: GLADLY_ROLLING_METRICS
      load_method: append
      scd_type: 2
      deduplication:
        unique_columns:
          - contact_id
          - metric_date

Implementation Checklist

  • Update YAML schema to use load_strategy instead of refresh_type
  • Implement S3 path builder with load_strategy support
  • Update extraction tasks to pass load_strategy
  • Update staging tasks to use correct S3 paths
  • Add load_strategy to BRONZE table metadata
  • Document dbt patterns for each load_strategy
  • Add validation: warn if historical is used

Migration Path

Existing Code

# Old
merge:
  refresh_type: static

New Code

# New
source:
  load_strategy: full_refresh

Backward Compatibility

Support both during transition:

load_strategy = config.get("load_strategy") or config.get("refresh_type", "full_refresh")

# Map old values to new
if load_strategy == "static":
    load_strategy = "full_refresh"
elif load_strategy == "refreshed":
    load_strategy = "historical"

Best Practices

  1. Default to full_refresh: Simplest, most reliable
  2. Use incremental for live documents: Google Sheets, spreadsheets
  3. Avoid historical: Negotiate with clients for daily snapshots instead
  4. Document the strategy: Make it explicit in YAML and table comments
  5. S3 path reflects strategy: Path structure makes strategy obvious