Skip to content

ADR 004: Metadata Columns and Hashing Strategy

Status

Accepted

Context

When loading data into Snowflake via the enhanced COPY INTO pattern, we need metadata columns for tracking, deduplication, and change detection. Two key columns serve different purposes:

  • RECORD_KEY: Unique identifier for deduplication
  • RECORD_HASH: Hash of all columns for change detection

The implementation required decisions about: 1. How to handle NULL values in hash calculations 2. What RECORD_KEY should represent (raw value vs hash) 3. What RECORD_HASH should represent 4. How to make these strategies configurable

Decision

RECORD_KEY Strategy

RECORD_KEY behavior is controlled by the use_hash_for_single_field toggle in deduplication config:

When use_hash_for_single_field: true (default):

RECORD_KEY = MD5(unique_field)
- Stores an MD5 hash of the unique field(s) - Consistent format regardless of unique field type - Better for composite keys or long identifiers

When use_hash_for_single_field: false:

RECORD_KEY = CAST(unique_field AS STRING)
- Stores the actual unique field value - Human-readable in Snowflake queries - Simpler for debugging and data validation

RECORD_HASH Strategy

RECORD_HASH ALWAYS hashes ALL columns for full-row change detection:

RECORD_HASH = MD5(CONCAT(
    COALESCE(CAST($1 AS STRING), ''),
    COALESCE(CAST($2 AS STRING), ''),
    COALESCE(CAST($3 AS STRING), ''),
    ...
))

Key characteristics: 1. NULL-safe: Uses COALESCE(CAST(column AS STRING), '') to convert NULLs to empty strings 2. Type-safe: Casts all columns to STRING before concatenation 3. Always includes all columns: Independent of deduplication strategy 4. Purpose: Detect if ANY field changed between runs

NULL Handling Rationale

Problem: In Snowflake, if ANY argument to CONCAT() is NULL, the entire result is NULL:

-- This returns NULL if $2 is NULL
MD5(CONCAT($1, $2, $3))

Solution: Wrap each column with COALESCE:

-- This always returns a hash, even with NULLs
MD5(CONCAT(
    COALESCE(CAST($1 AS STRING), ''),
    COALESCE(CAST($2 AS STRING), '')
))

Why this approach: - Ensures RECORD_HASH is never NULL - Distinguishes between "column is empty string" and "column is NULL" (both become empty string) - Enables change detection even on sparse data - Consistent hash values across different data types

Consequences

Positive

  • Flexibility: use_hash_for_single_field toggle allows both human-readable and hashed RECORD_KEY
  • Reliability: NULL-safe hashing prevents unexpected NULL values in metadata columns
  • Change Detection: RECORD_HASH enables identifying changed records by comparing hashes
  • Type Safety: Explicit CAST ensures consistent behavior across different column types
  • Debugging: When use_hash_for_single_field: false, RECORD_KEY is human-readable

Negative

  • Hash Collisions: MD5 can theoretically produce collisions (extremely rare in practice)
  • NULL Semantics: NULL and empty string are treated identically in hashes
  • Performance: COALESCE and CAST add minor overhead to COPY INTO operation
  • Verbosity: Generated SQL is longer due to COALESCE wrapping

Neutral

  • Configuration Required: Users must understand use_hash_for_single_field toggle
  • Hash Immutability: Changing column order or adding columns changes RECORD_HASH (expected behavior)

Implementation

Code Location

  • packages/xo-foundry/src/xo_foundry/tasks/snowflake_tasks.py (lines 111-123)
  • packages/xo-foundry/src/xo_foundry/schemas/dag_config.py (DeduplicationConfig)
  • packages/xo-foundry/src/xo_foundry/schemas/pipeline_config.py (MetadataConfig)

Configuration Example

sources:
  sprout_messages:
    snowflake:
      deduplication:
        strategy: single_field
        unique_columns: [GUID]
        use_hash: true  # Controls RECORD_KEY only

Result: - RECORD_KEY: cee3536dfb5de30c4cc2afa9e9d67eda (MD5 of GUID) - RECORD_HASH: a1b2c3d4e5f6... (MD5 of all 14 columns, NULL-safe)

Generated SQL Example

COPY INTO table (col1, col2, col3, RECORD_KEY, RECORD_HASH, ...)
FROM (
    SELECT
        $1,  -- col1
        $2,  -- col2
        $3,  -- col3
        MD5($1),  -- RECORD_KEY (when use_hash_for_single_field=true)
        MD5(CONCAT(
            COALESCE(CAST($1 AS STRING), ''),
            COALESCE(CAST($2 AS STRING), ''),
            COALESCE(CAST($3 AS STRING), '')
        )),  -- RECORD_HASH (always all columns, NULL-safe)
        ...
    FROM @stage/
)

Alternatives Considered

Alternative 1: Skip NULL Columns in RECORD_HASH

Approach: Only include non-NULL columns in hash calculation

Rejected because: - Dynamic SQL generation complexity (need to check each value) - Inconsistent hash values (same record with different NULLs = different hashes) - Cannot use in Snowflake COPY INTO SELECT (no conditional logic)

Alternative 2: Use Snowflake HASH Function

Approach: Use HASH($1, $2, $3) instead of MD5(CONCAT(...))

Rejected because: - Still requires NULL handling - Less portable (Snowflake-specific) - MD5 is more widely recognized/standardized

Alternative 3: Always Use Raw Values for RECORD_KEY

Approach: Remove use_hash_for_single_field toggle, always use raw values

Rejected because: - Composite keys require concatenation/hashing - Inconsistent format across different key types - Loses benefits of uniform hash-based keys

References

  • Issue: RECORD_HASH returning NULL due to NULL columns
  • PR: Fixed with COALESCE wrapping (2025-12-11)
  • Discussion: User requirement for full-row change detection