ADR 004: Metadata Columns and Hashing Strategy¶
Status¶
Accepted
Context¶
When loading data into Snowflake via the enhanced COPY INTO pattern, we need metadata columns for tracking, deduplication, and change detection. Two key columns serve different purposes:
- RECORD_KEY: Unique identifier for deduplication
- RECORD_HASH: Hash of all columns for change detection
The implementation required decisions about: 1. How to handle NULL values in hash calculations 2. What RECORD_KEY should represent (raw value vs hash) 3. What RECORD_HASH should represent 4. How to make these strategies configurable
Decision¶
RECORD_KEY Strategy¶
RECORD_KEY behavior is controlled by the use_hash_for_single_field toggle in deduplication config:
When use_hash_for_single_field: true (default):
When use_hash_for_single_field: false:
RECORD_HASH Strategy¶
RECORD_HASH ALWAYS hashes ALL columns for full-row change detection:
RECORD_HASH = MD5(CONCAT(
COALESCE(CAST($1 AS STRING), ''),
COALESCE(CAST($2 AS STRING), ''),
COALESCE(CAST($3 AS STRING), ''),
...
))
Key characteristics:
1. NULL-safe: Uses COALESCE(CAST(column AS STRING), '') to convert NULLs to empty strings
2. Type-safe: Casts all columns to STRING before concatenation
3. Always includes all columns: Independent of deduplication strategy
4. Purpose: Detect if ANY field changed between runs
NULL Handling Rationale¶
Problem: In Snowflake, if ANY argument to CONCAT() is NULL, the entire result is NULL:
Solution: Wrap each column with COALESCE:
-- This always returns a hash, even with NULLs
MD5(CONCAT(
COALESCE(CAST($1 AS STRING), ''),
COALESCE(CAST($2 AS STRING), '')
))
Why this approach: - Ensures RECORD_HASH is never NULL - Distinguishes between "column is empty string" and "column is NULL" (both become empty string) - Enables change detection even on sparse data - Consistent hash values across different data types
Consequences¶
Positive¶
- Flexibility:
use_hash_for_single_fieldtoggle allows both human-readable and hashed RECORD_KEY - Reliability: NULL-safe hashing prevents unexpected NULL values in metadata columns
- Change Detection: RECORD_HASH enables identifying changed records by comparing hashes
- Type Safety: Explicit CAST ensures consistent behavior across different column types
- Debugging: When
use_hash_for_single_field: false, RECORD_KEY is human-readable
Negative¶
- Hash Collisions: MD5 can theoretically produce collisions (extremely rare in practice)
- NULL Semantics: NULL and empty string are treated identically in hashes
- Performance: COALESCE and CAST add minor overhead to COPY INTO operation
- Verbosity: Generated SQL is longer due to COALESCE wrapping
Neutral¶
- Configuration Required: Users must understand
use_hash_for_single_fieldtoggle - Hash Immutability: Changing column order or adding columns changes RECORD_HASH (expected behavior)
Implementation¶
Code Location¶
packages/xo-foundry/src/xo_foundry/tasks/snowflake_tasks.py(lines 111-123)packages/xo-foundry/src/xo_foundry/schemas/dag_config.py(DeduplicationConfig)packages/xo-foundry/src/xo_foundry/schemas/pipeline_config.py(MetadataConfig)
Configuration Example¶
sources:
sprout_messages:
snowflake:
deduplication:
strategy: single_field
unique_columns: [GUID]
use_hash: true # Controls RECORD_KEY only
Result:
- RECORD_KEY: cee3536dfb5de30c4cc2afa9e9d67eda (MD5 of GUID)
- RECORD_HASH: a1b2c3d4e5f6... (MD5 of all 14 columns, NULL-safe)
Generated SQL Example¶
COPY INTO table (col1, col2, col3, RECORD_KEY, RECORD_HASH, ...)
FROM (
SELECT
$1, -- col1
$2, -- col2
$3, -- col3
MD5($1), -- RECORD_KEY (when use_hash_for_single_field=true)
MD5(CONCAT(
COALESCE(CAST($1 AS STRING), ''),
COALESCE(CAST($2 AS STRING), ''),
COALESCE(CAST($3 AS STRING), '')
)), -- RECORD_HASH (always all columns, NULL-safe)
...
FROM @stage/
)
Alternatives Considered¶
Alternative 1: Skip NULL Columns in RECORD_HASH¶
Approach: Only include non-NULL columns in hash calculation
Rejected because: - Dynamic SQL generation complexity (need to check each value) - Inconsistent hash values (same record with different NULLs = different hashes) - Cannot use in Snowflake COPY INTO SELECT (no conditional logic)
Alternative 2: Use Snowflake HASH Function¶
Approach: Use HASH($1, $2, $3) instead of MD5(CONCAT(...))
Rejected because: - Still requires NULL handling - Less portable (Snowflake-specific) - MD5 is more widely recognized/standardized
Alternative 3: Always Use Raw Values for RECORD_KEY¶
Approach: Remove use_hash_for_single_field toggle, always use raw values
Rejected because: - Composite keys require concatenation/hashing - Inconsistent format across different key types - Loses benefits of uniform hash-based keys
Related Decisions¶
- ADR 001: Load Strategy Terminology - Deduplication strategies
- ADR 003: DAG Generation Strategy - YAML-driven configuration
References¶
- Issue: RECORD_HASH returning NULL due to NULL columns
- PR: Fixed with COALESCE wrapping (2025-12-11)
- Discussion: User requirement for full-row change detection