Data Refresh Patterns - Industry Standards & XO Implementation¶
- **Status:** Reference Document
- **Created:** 2025-12-02
- **Purpose:** Define standardized terminology and implementation for data refresh strategies
Industry-Standard Terminology¶
Research Findings¶
After analyzing industry best practices from Airflow, dbt, Fivetran, and modern data platforms, here are the accepted terms:
1. Full Refresh (Your "Static Refresh")¶
- Industry Terms: Full Refresh, Complete Refresh, Snapshot, Full Load
- Definition: Complete dataset extracted at a point in time, never changes after extraction
- Common Use Cases:
- Daily report snapshots (Gladly ContactTimestampsReport)
- Historical point-in-time exports
- Transactional data exports (closed transactions)
- Characteristics:
- Same date → Same data (immutable)
- No need for change tracking
- TRUNCATE + INSERT pattern in warehouse
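The TRUNCATE + INSERT pattern can be sketched as a pair of generated statements. A minimal sketch; the table and stage names are illustrative, not XO's actual objects:

```python
def full_refresh_statements(table: str, stage_path: str) -> list[str]:
    """Generate the TRUNCATE + INSERT (COPY) statements for a full refresh.

    Hypothetical helper: wipe the target, then reload the immutable
    snapshot from the staged file path.
    """
    return [
        f"TRUNCATE TABLE {table}",
        f"COPY INTO {table} FROM @stage/{stage_path}",
    ]
```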
2. Incremental Refresh (Your "Incremental Refresh")¶
- Industry Terms: Incremental Load, Delta Load, Append-Only
- Definition: Extract full dataset, identify new/changed records after loading
- Common Use Cases:
- Google Sheets (no API watermarking, pull full sheet)
- APIs without change tracking
- Files with full exports but only new rows matter
- Characteristics:
- Pull entire dataset each time
- Identify changes after load (in warehouse)
- Append new records, update changed records
3. Change Data Capture (CDC) - Not in your list but relevant¶
- Industry Terms: CDC, Log-Based Replication, Real-time Sync
- Definition: Stream only changed records from source
- Common Use Cases:
- Database replication (Postgres WAL, MySQL binlog)
- SaaS platforms with webhook events
- APIs with `modified_since` parameters
- Characteristics:
- Source provides only deltas
- Highly efficient
- Requires source system support
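A minimal sketch of consuming CDC deltas: the source ships only changed records, and the consumer merges them into current state. The record key and `_deleted` tombstone flag are illustrative assumptions, not a real API contract:

```python
def apply_deltas(state: dict, deltas: list[dict], key: str = "id") -> dict:
    """Merge CDC delta records into the current state.

    Assumes each delta carries the record key and an optional
    `_deleted` flag; both are hypothetical, real sources differ.
    """
    for rec in deltas:
        if rec.get("_deleted"):
            state.pop(rec[key], None)   # tombstone: drop the record
        else:
            state[rec[key]] = rec       # insert or update in place
    return state
```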
4. Refreshed Historical Data (Your "Refreshed Sources")¶
- Industry Terms: Late-Arriving Data, Historical Refresh, Slowly Changing Dimension (SCD) Type 2
- Definition: Historical data changes over time, same period returns different values on different extraction dates
- Common Use Cases:
- Rolling aggregates (last 30 days)
- Late-arriving transactions
- Data corrections/adjustments
- SaaS platforms with retroactive updates
- Characteristics:
- Same date range → Different data on different pull dates
- Requires SCD Type 2 historization
- Complex to manage
- Best Practice: Avoid when possible
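The "same range, different data" behavior is what SCD Type 2 historization resolves: each observed version is valid from its extraction until the next one. A minimal in-memory sketch for a single key (field names are illustrative):

```python
def scd2_intervals(rows):
    """Turn (value, extraction_ts) observations for one key into
    SCD Type 2 intervals. A sketch only, not warehouse code.
    """
    rows = sorted(rows, key=lambda r: r[1])  # order by extraction time
    out = []
    for i, (value, ts) in enumerate(rows):
        # Each version is superseded by the next extraction, if any
        valid_to = rows[i + 1][1] if i + 1 < len(rows) else None
        out.append({
            "value": value,
            "valid_from": ts,
            "valid_to": valid_to,           # None marks the current version
            "is_current": valid_to is None,
        })
    return out
```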
XO Implementation Proposal¶
Terminology Alignment¶
| XO Term (Current) | Industry Standard | XO Term (Proposed) | Use In Code |
|---|---|---|---|
| Static Refresh | Full Refresh | Full Refresh | full_refresh |
| Incremental Refresh | Incremental Load | Incremental Load | incremental |
| Refreshed Sources | Historical Refresh / SCD Type 2 | Historical Refresh | historical |
| N/A | Change Data Capture | CDC (future) | cdc |
Recommended Values¶
```yaml
# In YAML configs
source:
  load_strategy: full_refresh  # Options: full_refresh, incremental, historical, cdc
```
Rationale for naming:
- `full_refresh` - Clear, matches dbt materialization terminology
- `incremental` - Standard across all platforms
- `historical` - Describes the problem (history changes)
- `cdc` - Industry-standard abbreviation
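These four values can be locked down as an enum so typos in YAML fail fast. A sketch, assuming Python-side config parsing (the enum and helper are hypothetical, not existing XO code):

```python
from enum import Enum


class LoadStrategy(str, Enum):
    # The four proposed load_strategy values from the table above
    FULL_REFRESH = "full_refresh"
    INCREMENTAL = "incremental"
    HISTORICAL = "historical"
    CDC = "cdc"


def parse_load_strategy(value: str) -> LoadStrategy:
    """Validate a raw YAML value against the allowed strategies."""
    try:
        return LoadStrategy(value)
    except ValueError:
        raise ValueError(
            f"Unknown load_strategy: {value!r}; "
            f"expected one of {[s.value for s in LoadStrategy]}"
        )
```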
Implementation Details¶
1. Full Refresh Pattern¶
Source Behavior: Gladly API, static reports
Extraction: Complete dataset for a specific date
S3 Path: stage-bucket/{domain}/{report}/full/{date}/
Load Strategy: TRUNCATE + INSERT (daily partition)
```yaml
source:
  load_strategy: full_refresh
  extractor:
    metric_set: ContactTimestampsReport
    date_param: report_date  # Fixed date, immutable
```
S3 Example:
```
stage-bucket/warbyparker/contact_timestamps/full/2025-12-01/contact_timestamps_20251201.csv
stage-bucket/warbyparker/contact_timestamps/full/2025-12-02/contact_timestamps_20251202.csv
```
dbt Downstream:
```sql
-- Simple SELECT, no deduplication needed
{{ config(materialized='table') }}

SELECT * FROM {{ source('bronze', 'gladly_contact_timestamps') }}
```
2. Incremental Pattern (Google Sheets Case)¶
Source Behavior: Google Sheets API
Extraction: Full spreadsheet every time (no API support for deltas)
Identification: Detect changes AFTER loading to warehouse
S3 Path: stage-bucket/{domain}/{report}/incremental/{timestamp}/
Load Strategy: APPEND (keep all extractions), deduplicate in SILVER
```yaml
source:
  load_strategy: incremental
  extractor:
    type: gsheet
    spreadsheet_id: "abc123"
    range: "Sheet1!A:Z"
  deduplication:
    # These define RECORD_KEY for change detection
    unique_columns:
      - row_id
      - email
```
S3 Example:
```
stage-bucket/warbyparker/contacts_sheet/incremental/2025-12-01T07:00:00/contacts_20251201_070000.csv
stage-bucket/warbyparker/contacts_sheet/incremental/2025-12-01T14:30:00/contacts_20251201_143000.csv
```
BRONZE Table: Append all extractions
```
-- Multiple rows per row_id across different extraction times
row_id | email         | name  | extraction_timestamp
1      | a@example.com | Alice | 2025-12-01 07:00:00
2      | b@example.com | Bob   | 2025-12-01 07:00:00
1      | a@example.com | Alice | 2025-12-01 14:30:00  -- duplicate
2      | b@example.com | Bob   | 2025-12-01 14:30:00  -- duplicate
3      | c@example.com | Carol | 2025-12-01 14:30:00  -- NEW
```
dbt SILVER: Deduplicate, keep latest
```sql
{{ config(
    materialized='table',
    unique_key='row_id'
) }}

WITH latest_extraction AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY row_id
            ORDER BY extraction_timestamp DESC
        ) AS rn
    FROM {{ source('bronze', 'contacts_sheet') }}
)

SELECT * EXCLUDE (rn)
FROM latest_extraction
WHERE rn = 1
```
Alternative dbt Approach: Use RECORD_HASH to detect changes
```sql
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY row_id, RECORD_HASH
            ORDER BY extraction_timestamp DESC
        ) AS rn,
        LEAD(RECORD_HASH) OVER (
            PARTITION BY row_id
            ORDER BY extraction_timestamp
        ) AS next_hash
    FROM {{ source('bronze', 'contacts_sheet') }}
)

-- Only keep records where the data actually changed
SELECT * EXCLUDE (rn, next_hash)
FROM ranked
WHERE rn = 1 AND (next_hash IS NULL OR RECORD_HASH != next_hash)
```
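`RECORD_HASH` above is assumed to be a deterministic hash of the payload columns. One hypothetical way to derive it at extraction time (the real pipeline may hash differently: column order, types, null handling):

```python
import hashlib


def record_hash(row: dict, exclude: tuple = ("extraction_timestamp",)) -> str:
    """Compute a deterministic hash of a row's payload columns.

    Sketch only: sorts keys for stability and skips metadata columns
    so re-extractions of unchanged data hash identically.
    """
    payload = "|".join(
        f"{k}={row[k]}" for k in sorted(row) if k not in exclude
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```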
3. Historical Refresh Pattern (Avoid When Possible)¶
Source Behavior: Gladly rolling reports (last 30 days)
Extraction: Same date range returns different data on different extraction dates
S3 Path: stage-bucket/{domain}/{report}/historical/{extraction_ts}/{data_date}/
Load Strategy: APPEND with SCD Type 2
```yaml
source:
  load_strategy: historical
  extractor:
    metric_set: RollingMetricsReport
    lookback_days: 30
  scd:
    type: 2
    unique_columns:
      - contact_id
    valid_from_column: extraction_timestamp
```
S3 Example:
```
# First extraction on Dec 1
stage-bucket/warbyparker/rolling/historical/2025-12-01T07:00:00/2025-11-01/data.csv
stage-bucket/warbyparker/rolling/historical/2025-12-01T07:00:00/2025-11-02/data.csv

# Second extraction on Dec 2 - same dates, different data
stage-bucket/warbyparker/rolling/historical/2025-12-02T07:00:00/2025-11-01/data.csv
stage-bucket/warbyparker/rolling/historical/2025-12-02T07:00:00/2025-11-02/data.csv
```
BRONZE Table: Keep all historical versions
```
contact_id | metric_value | data_date  | extraction_timestamp
101        | 50           | 2025-11-01 | 2025-12-01 07:00:00
101        | 55           | 2025-11-01 | 2025-12-02 07:00:00  -- Updated value
```
dbt SILVER: SCD Type 2 with versioning
```sql
{{ config(materialized='table') }}

WITH versioned AS (
    SELECT
        *,
        -- Version number per (contact_id, data_date) across extractions
        ROW_NUMBER() OVER (
            PARTITION BY contact_id, data_date
            ORDER BY extraction_timestamp
        ) AS version,
        -- The next extraction supersedes this one
        LEAD(extraction_timestamp) OVER (
            PARTITION BY contact_id, data_date
            ORDER BY extraction_timestamp
        ) AS next_extraction_ts
    FROM {{ source('bronze', 'rolling_metrics') }}
)

SELECT
    * EXCLUDE (next_extraction_ts),
    extraction_timestamp AS valid_from,
    COALESCE(next_extraction_ts, '9999-12-31'::TIMESTAMP) AS valid_to,
    (next_extraction_ts IS NULL) AS is_current
FROM versioned
```
S3 Path Convention¶
Proposed Standard¶
```
s3://{bucket}/
├── ingest-bucket/                  # Raw from source
│   └── {domain}/
│       └── {report_name}/
│           ├── full/               # Full refresh
│           │   └── {YYYY-MM-DD}/
│           │       └── (unknown)
│           ├── incremental/        # Incremental load
│           │   └── {YYYY-MM-DD}T{HH:MM:SS}/
│           │       └── (unknown)
│           └── historical/         # Historical refresh
│               └── {extraction_timestamp}/
│                   └── {data_date}/
│                       └── (unknown)
│
└── stage-bucket/                   # Standardized columns
    └── {domain}/
        └── {report_name}/
            ├── full/{YYYY-MM-DD}/
            ├── incremental/{YYYY-MM-DD}T{HH:MM:SS}/
            └── historical/{extraction_ts}/{data_date}/
```
Path Builder Logic¶
```python
def build_s3_path(domain, report_name, load_strategy, execution_date, **kwargs):
    """Build an S3 key prefix based on load strategy."""
    # Path segment follows the convention above ("full", not "full_refresh")
    segment = "full" if load_strategy == "full_refresh" else load_strategy
    base = f"{domain}/{report_name}/{segment}"
    if load_strategy == "full_refresh":
        # Date partition only
        date_str = execution_date.strftime("%Y-%m-%d")
        return f"{base}/{date_str}/"
    elif load_strategy == "incremental":
        # Timestamp for each extraction
        ts_str = execution_date.strftime("%Y-%m-%dT%H:%M:%S")
        return f"{base}/{ts_str}/"
    elif load_strategy == "historical":
        # Extraction timestamp + data date
        extraction_ts = execution_date.strftime("%Y-%m-%dT%H:%M:%S")
        data_date = kwargs.get("data_date", execution_date).strftime("%Y-%m-%d")
        return f"{base}/{extraction_ts}/{data_date}/"
    else:
        raise ValueError(f"Unknown load_strategy: {load_strategy}")
```
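For validation and backfills it can help to parse keys back into their parts. A hypothetical inverse of the builder, assuming the `{domain}/{report_name}/{strategy}/...` layout shown above (`parse_s3_key` and the segment map are not existing XO code):

```python
# Path segment on S3 -> load_strategy value in config
_SEGMENT_TO_STRATEGY = {
    "full": "full_refresh",
    "incremental": "incremental",
    "historical": "historical",
}


def parse_s3_key(key: str) -> dict:
    """Parse a key produced by the path convention into its components."""
    parts = key.strip("/").split("/")
    domain, report_name, segment = parts[:3]
    strategy = _SEGMENT_TO_STRATEGY[segment]
    info = {"domain": domain, "report_name": report_name,
            "load_strategy": strategy}
    if strategy == "historical":
        # historical keys carry both extraction time and data date
        info["extraction_ts"], info["data_date"] = parts[3], parts[4]
    else:
        info["partition"] = parts[3]
    return info
```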
XO Sources Classification¶
Current Sources¶
| Source | Type | Load Strategy | Rationale |
|---|---|---|---|
| Gladly ContactTimestampsReport | API | full_refresh | Daily snapshot, immutable |
| Gladly ConversationTimestampsReport | API | full_refresh | Daily snapshot, immutable |
| Gladly ContactExportReport | API | full_refresh | Daily snapshot, immutable |
| Gladly Rolling Metrics | API | historical | Last N days, changes daily (⚠️ avoid) |
| Google Sheets - Contacts | Spreadsheet | incremental | Full pull, identify changes in warehouse |
| Google Sheets - Metrics | Spreadsheet | incremental | Full pull, identify changes in warehouse |
| Gmail Attachments | | full_refresh | Static files, immutable |
| S3 Files | Storage | full_refresh | Static files, immutable |
Recommendations¶
- Gladly Rolling Metrics: Work with client to switch to daily snapshots (full_refresh)
- Google Sheets: Always use incremental pattern
- New Sources: Default to full_refresh when possible
YAML Configuration Examples¶
Full Refresh (Gladly Timestamps)¶
```yaml
sources:
  contact_timestamps:
    load_strategy: full_refresh
    source_type: gladly_api
    extractor:
      metric_set: ContactTimestampsReport
      date_field: report_date  # Fixed date
    snowflake:
      target_table: GLADLY_CONTACT_TIMESTAMPS
      load_method: truncate_insert  # Safe for full_refresh
```
Incremental (Google Sheets)¶
```yaml
sources:
  contacts_sheet:
    load_strategy: incremental
    source_type: gsheet
    extractor:
      spreadsheet_id: "abc123xyz"
      range: "Contacts!A:Z"
      # No watermark - pull full sheet every time
    snowflake:
      target_table: GSHEET_CONTACTS
      load_method: append  # Keep all extractions
    deduplication:
      unique_columns:
        - row_id
        - email
```
Historical (Avoid)¶
```yaml
sources:
  rolling_metrics:
    load_strategy: historical
    source_type: gladly_api
    extractor:
      metric_set: RollingMetricsReport
      lookback_days: 30
    snowflake:
      target_table: GLADLY_ROLLING_METRICS
      load_method: append
      scd_type: 2
    deduplication:
      unique_columns:
        - contact_id
        - metric_date
```
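A small validator over the parsed YAML can enforce the conventions above, including the warn-on-`historical` rule from the checklist. A sketch using the proposed key names; the function itself is hypothetical, not shipped code:

```python
def validate_source_config(name: str, cfg: dict) -> list[str]:
    """Return warnings/errors for one parsed source config dict."""
    problems = []
    strategy = cfg.get("load_strategy")
    if strategy not in ("full_refresh", "incremental", "historical", "cdc"):
        problems.append(f"{name}: unknown load_strategy {strategy!r}")
    if strategy == "historical":
        # Best practice above: negotiate daily snapshots instead
        problems.append(f"{name}: 'historical' is discouraged; prefer daily snapshots")
    if strategy == "incremental" and not cfg.get("deduplication", {}).get("unique_columns"):
        # Incremental dedup in SILVER needs a RECORD_KEY definition
        problems.append(f"{name}: incremental sources need deduplication.unique_columns")
    return problems
```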
Implementation Checklist¶
- Update YAML schema to use `load_strategy` instead of `refresh_type`
- Implement S3 path builder with load_strategy support
- Update extraction tasks to pass load_strategy
- Update staging tasks to use correct S3 paths
- Add load_strategy to BRONZE table metadata
- Document dbt patterns for each load_strategy
- Add validation: warn if `historical` is used
Migration Path¶
Existing Code¶
New Code¶
Backward Compatibility¶
Support both during transition:
```python
load_strategy = config.get("load_strategy") or config.get("refresh_type", "full_refresh")

# Map old values to new
if load_strategy == "static":
    load_strategy = "full_refresh"
elif load_strategy == "refreshed":
    load_strategy = "historical"
```
Best Practices¶
- Default to full_refresh: Simplest, most reliable
- Use incremental for live documents: Google Sheets, spreadsheets
- Avoid historical: Negotiate with clients for daily snapshots instead
- Document the strategy: Make it explicit in YAML and table comments
- S3 path reflects strategy: Path structure makes strategy obvious