Skip to content

dbt Models — Design & Architecture

Status: Canonical source of truth for dbt model design in XO-Data. Supersedes: elt-layer-architecture-best-practices.md (partially superseded by ADR 019; this document is the authoritative replacement for day-to-day model design). Last updated: 2026-04-27

This document defines how every dbt model in XO-Data is built, layer by layer. It is the source of truth for Claude and for humans. When this document conflicts with older docs, this wins. When it conflicts with an ADR, the ADR wins and this document must be updated.

dbt project location: apps/dbt/xo-medallion/


1. The Layers, at a Glance

Layer Owner Purpose Materialization Shape of transformation
Bronze xo-foundry Raw landing — data as-is from source Table (batch_replace / truncate_insert) Column-name sanitization + metadata columns. Nothing else.
Silver dbt Data QA, type casting, deduplication Incremental table (merge + RECORD_HASH) Data cleaning, not business transformation. Tests fail on non-conforming data.
Gold int_ dbt Shared logic between 2+ Gold models Ephemeral CTEs only — no Snowflake object.
Gold fct_ dbt Business model of the source (not a 1:1 copy of Silver) Table Bulk of transformations. Business logic + business metadata. Granular data ready for KPI computation.
Gold dim_ dbt Current-state reference entities Table SCD Type 1 descriptive attributes (agent, inbox, team).
Gold agg_ dbt KPI computed layer — sums + counts at daily grain Table Pre-computed KPIs. Channel-level minimum grain. Never stores averages.
Gold rpt_ dbt Tableau-ready presentation views (stable, SLA-backed) View (zero storage) Projections, cross-channel rollups, WoW/MoM derivations.
Gold exp_ dbt Exploratory/pilot views — no SLA View (zero storage) Same rules as rpt_. Promote to rpt_ when stable. See ADR 021.

2. Bronze — Data As-Is From Source

Owner: xo-foundry. dbt does not build Bronze.

Rules

  • Column names sanitized only. UPPERCASE, underscores. Values untouched.
  • All columns VARCHAR (ADR 013). No type inference. No casting.
  • No transformations. No filtering. No deduplication. No JOINs.
  • Six metadata columns always present: RECORD_KEY, RECORD_HASH, DATE_TO_WAREHOUSE, SOURCE_FILE, BATCH_ID, PIPELINE_RUN_ID.
  • Naming: {SOURCE}_{OBJECT} — e.g., GLADLY_CONTACT_TIMESTAMPS, SPROUT_MESSAGES.

Loading Strategies

Bronze loading strategy depends on the source's refresh behavior. See ADR 001 and ADR 011 for canonical definitions. These are not interchangeable — picking the wrong one silently corrupts history.

Strategy When to use Bronze behavior
full_refresh Source re-delivers the complete dataset each run (small reference data, Google Sheets, glossaries). Replace all rows on every run.
incremental Source delivers only the changed window. New rows appended; corrections via batch_replace of the affected batch. Append a window of new rows; history accumulates across batches.
historical One-time backfill from a historical source. Bulk load, then handed off to the incremental pipeline.
batch_replace The physical mechanic used with incremental pipelines — DELETE WHERE BATCH_ID=... THEN COPY INTO. Preserves cross-batch history. Default for all new pipelines. Idempotent per batch; full history retained.
truncate_insert Legacy physical mechanic for older pipelines only. Bronze keeps only the current batch. Do not use for new pipelines. Silver is the only historical record.

Note: Source-type classification (which sources are full_refresh vs incremental vs historical) is tracked per-pipeline in the Airflow config. A proper inventory review is pending; when done, update this table and link the inventory here.


3. Silver — Data Quality & Cleaning

Owner: dbt. Source: Bronze tables only.

Silver is a data quality layer. It takes raw VARCHAR data from Bronze, casts it to proper types, and deduplicates. It is not a business-transformation layer.

What Silver Does

  • Type casting via Snowflake safe functions — TRY_TO_TIMESTAMP_NTZ, TRY_TO_NUMBER, IFF(UPPER(col)='TRUE',...). IDs and free text stay VARCHAR.
  • Column standardization — UPPERCASE, underscores (inherited from Bronze).
  • Deduplication by RECORD_KEY (handled by incremental merge strategy).
  • Data cleaning — removing literal garbage values produced by the source API (e.g. the sentinel, empty-string-as-null normalization). These are fixes for broken source data, not business rules.
  • Data quality tests (schema.yml) — not_null, unique, accepted_values. Tests must fail (error severity, not warn) when data doesn't conform, so bad data surfaces in CI instead of silently propagating to Gold.

What Silver Does NOT Do

  • No business filtering. Non-XO agents, terminated employees, exclusion lists — none of that happens here. Those are Gold concerns.
  • No enrichment. No JOINs to rosters, glossaries, or other tables.
  • No aggregation. Same grain as Bronze.
  • No business metadata. No IS_XO_AGENT, no tenure buckets, no wave periods.

Incremental Strategy — Canonical

All new Silver models use incremental_strategy='merge' with RECORD_HASH change detection (ADR 020).

{{ config(
    materialized='incremental',
    unique_key='RECORD_KEY',
    incremental_strategy='merge',
    on_schema_change='sync_all_columns',
) }}

The merge strategy is required because Bronze batch_replace re-emits corrected rows for the same RECORD_KEY. append silently drops those corrections. RECORD_HASH comparison detects changed rows and updates them.

Naming

{client}_{object} — e.g., wbp_contact_timestamps, cnd_messages. Schema is SILVER.


4. Gold — Business Layer

Gold is the business layer. It takes clean typed Silver data and produces analytics-ready tables and views. It uses a five-type architecture: int_, fct_, dim_, agg_, rpt_.

4.1 Gold Intermediates (int_)

  • Materialization: ephemeral — compiled as inline subquery, no Snowflake object.
  • Purpose: Shared logic used by 2+ downstream Gold models (e.g., hold-time computation using LAG).
  • Rule: If logic is used by only one downstream model, inline it as a CTE. Don't create an int_ for it.
  • Naming: int_{client}_{description} — e.g., int_wbp_contacts_hold.

4.2 Gold Facts (fct_) — The Business Model

Key principle: fct_ tables are not 1:1 copies of Silver. They are the business model of the source, with business logic applied and business metadata attached. This is where the bulk of transformations live.

fct_ delivers clean, granular data ready for KPI computation. Silver gives you raw typed rows; Gold gives you the business-meaningful rows you actually want to count, aggregate, and report on.

What fct_ does:

  • Business logic applied — status normalization, timezone conversion, contact-vs-work-session resolution, exclusion of test records, derived fields (handle time components, hold time, SLA eligibility, FCR eligibility).
  • Business metadata attachedIS_XO_AGENT flag, tenure buckets, wave periods, roster EID join.
  • Roster enrichment — JOIN to CORE_DB.SILVER.ROSTER_{CLIENT} to attach EID. Descriptive agent attributes (name, team) live in dim_employee_{client} and are joined at the rpt_ layer, not denormalized into every fact row.
  • Grain: Most granular available form of the source. This may be event-level (e.g., contact-level from GLADLY_CONTACT_TIMESTAMPS) or pre-aggregated if that's the source's native grain (e.g., Gladly Agent Summary API arrives pre-aggregated daily — fct_wbp_agent_channel_summary preserves that grain).
  • Channel as first-class dimension — if the source has a channel (email / chat / voice / messaging), the fact must carry it. Never collapse channel here. See §5.

Join-to-roster strategy is selected per ADR 019 source category:

Category Join type When
1 LEFT JOIN roster XO-only source (internal tools, QA sheets). NULL EID is a DQ warning.
2 Filter-then-LEFT JOIN Mixed source with an in-source XO identifier (e.g., tu. email prefix).
3 INNER JOIN Mixed source with no in-source XO identifier. Non-XO rows drop naturally.

Every new fct_ must document its source category in the model header and in schema.yml (source_category, xo_filter_strategy, join_key).

Naming: fct_{client}_{object} — e.g., fct_wbp_contacts, fct_cnd_conversations.

4.3 Gold Dimensions (dim_)

  • Purpose: Current-state reference entities (SCD Type 1).
  • Content: Descriptive attributes only (agent name, team, leader, hire date, inbox category).
  • Source: Silver roster/glossary tables.
  • Materialization: Table.
  • Naming: dim_{entity} — e.g., dim_employee_warbyparker, dim_inbox_warbyparker.

Descriptive attributes live here and are joined at the rpt_ layer, so fct_ and agg_ stay narrow (keys + metrics + core business metadata).

4.4 Gold Aggregates (agg_) — The KPI Layer

agg_ IS the KPI computed layer. There is no separate "KPI" layer. When you need a computed metric, you build an agg_. The agg_ prefix is the KPI signal.

What agg_ does:

  • Compute KPIs from facts — sums, counts, distinct counts.
  • Daily grain only as the base. Weekly / monthly / quarterly rollups happen in the rpt_ view via DATE_TRUNC + GROUP BY. We do not materialize a daily, weekly, and monthly version of the same metric.
  • Channel as minimum grain (see §5) — every agg_ must include channel if the underlying fact has channel.
  • XO filter: WHERE IS_XO_AGENT = TRUE (inherited from the fact's flag).
  • Store sums and counts only — never pre-computed averages. Downstream computes avg = SUM(x) / SUM(count) at any rollup grain. Storing averages kills composability.
  • From fct_ only. Never directly from Silver. Never from Bronze. Never from a roster. The fact owns the business logic; the agg just aggregates.
  • Standardized columns across clients — each KPI domain has a standard column set; NULLs for metrics a client doesn't have.

When not to create an agg_:

  • Trivial dimension collapses (e.g., "sum channel-level CSAT to agent-daily") → inline as a CTE inside the consuming rpt_.
  • Only one downstream model needs it → inline as a CTE.
  • Only create a new agg_ table when the metric is consumed by 2+ downstream reports or the computation is expensive enough that materialization is cheaper than repeated view execution.

Naming: agg_{client}_{subject}_{grain} — grain is always the last token. E.g., agg_wbp_qa_channel_daily, agg_cnd_csat_channel_daily.

4.5 Gold Reports (rpt_) — The Presentation Layer

  • Always views. Zero storage. Never materialize as table or dynamic table.
  • Tableau-ready. Final column names, pre-computed averages, display-formatted values.
  • Source: Gold aggregates (ideally). Reaching down into fct_ or Silver from a rpt_ is a layering violation and should be fixed by introducing the missing agg_.
  • Grain: The most granular dimension desired by the consumer. Everything else (WoW/MoM/QoQ, cross-channel totals, rolling averages) is derived from this view, not materialized as a separate report.

What rpt_ owns:

  • Cross-channel rollupsGROUP BY date, eid over a channel-grained agg_. This is explicitly an rpt_ concern, not an agg_ concern. Channel-collapsed versions of a metric should never be stored as separate aggregates.
  • Period rollups — WoW/MoM/QoQ via DATE_TRUNC + GROUP BY. Period-over-period deltas via LAG() window functions on the same view.
  • Rate computation by grain: account-level (date × channel, no agent dimension) views pre-compute final rates (AHT_SEC, SLA_PCT, CSAT_AVG, QA_AVG, FCR_PCT) because the output is consumed as-is with no further aggregation in Tableau. Agent-grained (date × eid × channel) views keep raw sums and counts — Tableau must re-aggregate across agents without dividing by a pre-computed denominator. Pre-computing averages at agent grain silently breaks team-level or date-range aggregations downstream.
  • Descriptive joins — join dim_ tables to attach names, teams, leader info.

Multi-source daily report — the spine pattern

When a rpt_ joins multiple aggregates that may not share the same (date × EID × channel) combinations (e.g., a CSAT survey received on a weekend when the agent had no interactions), always build a spine CTE using UNION across all contributing aggregates. Never use one aggregate as the primary FROM; it silently drops rows present only in the other aggregates.

WITH spine AS (
    SELECT date, eid, channel FROM {{ ref('agg_..._channel_daily') }}
    UNION
    SELECT date, eid, interaction_type AS channel FROM {{ ref('agg_..._csat_channel_daily') }}
    UNION
    SELECT date, eid, channel_formatted AS channel FROM {{ ref('agg_..._qa_channel_daily') }}
)
SELECT s.date, s.eid, s.channel, ...
FROM spine s
LEFT JOIN {{ ref('agg_..._channel_daily') }} a ON ...
LEFT JOIN {{ ref('agg_..._csat_channel_daily') }} c ON ...

Use UNION (not UNION ALL) so duplicates collapse to a single row.

Channel column normalization in the spine: Different agg_ sources name the channel dimension with different column aliases (interaction_type in CSAT aggs, channel_formatted in QA aggs, channel in agent summary / FCR aggs). Alias all to a single common name (channel) inside the spine UNION. In the outer LEFT JOIN predicates, use the source-native column name — e.g., s.channel = c.interaction_type, not s.channel = c.channel.

Naming: rpt_{client}_{purpose} — e.g., rpt_wbp_agent_performance, rpt_wbp_agent_performance_channel. Cross-business reports: rpt_master_{purpose} (e.g., rpt_master_performance), also always views.

4.6 Gold Exploration (exp_) — Pilot / Ad-hoc Views

Exploration views follow the same layering rules as rpt_ — always view, same GOLD schema, sourced from agg_ (or fct_ when individual record identity matters). The distinction is intent and SLA:

rpt_ exp_
Tableau dashboard Yes — formally adopted No — or not yet adopted
SLA / stability Yes — column contract is public None — may change or drop without notice
Tags ['report'] ['exp']
Naming rpt_{client}_{purpose} exp_{client}_{purpose}

Promotion path: When an exp_ view stabilizes, rename it to rpt_ (one-liner git mv + schema.yml update). Old Snowflake views persist after dbt rename — drop RPT_WBP_* views manually after verifying the EXP_WBP_* replacements.

See ADR 021 for the full decision record.


5. Channel Is a First-Class Dimension ⭐

This is a cross-cutting rule that applies to every fct_ and every agg_. It is important enough to call out separately.

When the underlying source exposes a channel (email, chat, voice, messaging, SMS, etc.), channel must be preserved as a dimension at both the fct_ and agg_ layers. Cross-channel aggregation belongs in rpt_ views, not in agg_ tables.

Why this matters

Contact-center metrics are almost always analyzed at channel level or lower (queue, brand, inbox). Collapsing channels at the aggregate layer makes the aggregate unusable for any channel-level analysis downstream and wrong to use as a building block for channel-grained views. A channel-collapsed aggregate is not reusable — it's a pre-computed projection that only serves one specific report.

The rule

  • Bronze / Silver: carry whatever channel field the source provides, as-is (after sanitization in Bronze; after type-cast in Silver).
  • fct_: if channel is present in the source, it must be a column on the fact.
  • agg_: channel (or its canonical equivalent: interaction_type, channel_formatted, etc.) is a required grain dimension. Do not store a channel-collapsed aggregate.
  • rpt_: if a report needs a cross-channel total, derive it inline via GROUP BY date, eid over the channel-grained agg_ — either as a CTE in the same view, or as a second view that projects the same underlying aggregate.

Exception

Only collapse channel in an agg_ when the user explicitly approves it for a specific model, and the reason is documented in the model header. The default is always channel-preserved.

Example — correct pattern

-- ✅ agg_wbp_csat_channel_daily — channel preserved
SELECT date, eid, interaction_type, SUM(stars) AS csat_stars_sum, COUNT(*) AS csat_survey_count
FROM {{ ref('fct_wbp_csat_survey') }}
GROUP BY date, eid, interaction_type;

-- ✅ rpt_wbp_agent_performance — cross-channel rollup inline
WITH csat_total AS (
    SELECT date, eid,
           SUM(csat_stars_sum) / NULLIF(SUM(csat_survey_count), 0) AS csat_stars,
           SUM(csat_survey_count) AS csat_survey_count
    FROM {{ ref('agg_wbp_csat_channel_daily') }}
    GROUP BY date, eid
)
SELECT ... FROM spine LEFT JOIN csat_total USING (date, eid);
-- ❌ agg_wbp_csat_daily — channel collapsed at agg layer (deleted 2026-04-20)
-- This was wrong: destroyed reusability and violated the channel-level minimum grain rule.

Exception — rate metrics over total active time

Rate metrics whose denominator is total active time (ACW %, EPH, MPH, CPH) cannot be broken down by channel without double-counting the denominator. Active time is a property of the agent, not the channel.

These metrics live at agent-day grain (date × eid) in a dedicated productivity view, parallel to the channel-grained performance view. See the semantic split in §5a below.


5a. Performance vs Productivity — Semantic Categories

Two semantic categories divide the rpt_ layer. The category encodes whether rows are safe to sum across channels:

Category Typical grain Channel-additive? Example metrics
performance date × eid × channel or date × channel ✅ Yes Contacts handled, CSAT, QA, SLA, handle time, ACW time
productivity date × eid ❌ No — never sum rates across channels ACW %, EPH, MPH, CPH, active_time_hours

Why two views, not one

The denominator of productivity rates (total active time from Gladly agent durations) is a single value per agent per day. It is not channel-specific. Replicating it across channel rows and summing would double-count.

Rather than rely on Tableau FIXED LODs to work around this, the architecture separates the two categories by grain:

  • rpt_{client}_agent_performance_channel[_daily] — channel-grained. Rows are channel-additive.
  • rpt_{client}_agent_productivity_daily — agent-day grained. One row per agent per day, rates pre-computed.
  • rpt_{client}_account_performance_channel_daily — account-level channel-grained (no agent dimension).

When to build which

Pick the view grain by answering "what's the natural grain of this metric's denominator":

  • If the denominator is a channel-specific count (e.g., contacts_fulfilled per channel) → performance (channel grain).
  • If the denominator is an agent-day total that does not split by channel (e.g., total_active_time_sec) → productivity (agent-day grain).
  • If the question has no agent dimension (account-level SLA, ASA) → performance (account grain, no agent).

Naming and grain suffix

Per ADR 019 §5a, all new rpt_ views carry a grain suffix as the last token (_daily, _weekly, _monthly, _quarterly). Period-over-period views use the short form (_wow, _mom, _qoq, _yoy). Grandfathered views without suffix continue to operate until renamed through a migration ticket.

Example — correct mapping

Metric Natural grain Goes in
total_handle_time_sec channel (sum of per-contact times) rpt_wbp_agent_performance_channel
contacts_fulfilled channel rpt_wbp_agent_performance_channel
csat_stars_sum, csat_survey_count channel rpt_wbp_agent_performance_channel
total_active_time_sec agent-day (from durations) rpt_wbp_agent_productivity_daily
acw_pct (durations formula) agent-day rpt_wbp_agent_productivity_daily
sms_messages_per_hour (MPH) agent-day (numerator is channel-filtered, denominator is not) rpt_wbp_agent_productivity_daily
sla_pct at account level channel (no agent) rpt_wbp_account_performance_channel_daily
asa_sec at account level channel (no agent) rpt_wbp_account_performance_channel_daily

Note on the agent-performance-channel extension

rpt_wbp_agent_performance_channel was extended in 2026-04 with durations-sourced ACW columns (*_agent_day suffix). These columns intentionally repeat across channel rows for the same agent-day. Tableau consumers must use MIN() / MAX() / FIXED LOD on these columns to avoid inflation. Going forward, prefer the productivity view for ACW % and other rate metrics — the agent-day columns on the channel view are a compatibility shim.

Rename migration — pass-through pattern

When a rpt_ view needs the _daily grain suffix (correct naming per §5a) but Tableau still points to the old name, use the pass-through pattern to maintain backward compatibility without duplicating SQL:

Phase 1 — Tableau not yet updated (current state for some views): 1. The old view (without suffix) retains the full SQL so Tableau continues to work. 2. The new _daily view is SELECT * FROM {{ ref('old_view') }} — it creates the correctly-named Snowflake object with zero SQL duplication.

Phase 2 — after Tableau is updated to point at _daily: 3. Move the full SQL into the _daily view (single source of truth going forward). 4. Change the old view to SELECT * FROM {{ ref('new_daily_view') }} to preserve any remaining connections. 5. Drop the old view once all consumers are confirmed migrated.

Never maintain two full copies of the SQL. At all times, exactly one view owns the logic.

Example: rpt_wbp_agent_performance_channel currently owns the SQL (Tableau source). rpt_wbp_agent_performance_channel_daily is SELECT * FROM {{ ref('rpt_wbp_agent_performance_channel') }} until the Tableau workbook is migrated.


6. ref() vs source() — Dependency Graph Rule

Always use ref() for models built by this dbt project. Only use source() for tables not built here.

Situation Use
Referencing a Silver model from Gold ref('model_name')
Referencing a Gold model from rpt_ ref('model_name')
Reading from a Bronze table source('client_bronze', 'TABLE_NAME')
Reading from a non-dbt upstream (BambooHR, legacy OPERATIONS) source('source_name', 'TABLE_NAME')

ref() creates a dependency edge in the dbt DAG. Astronomer Cosmos reads this graph to order tasks in Airflow — Silver tasks will run before Gold tasks that ref() them. source() has no dependency edge, so Gold models using source() on the same physical table a Silver model produces can race ahead of that Silver model.


7. Database Targeting

Database selection is driven by target.name (from profiles.yml or the --target flag). See .claude/rules/dbt.md for the full block-style selector pattern. Never hardcode database names. Never set database= in individual model SQL config blocks — it belongs in dbt_project.yml at the client-folder level only.

target.name Database used
dev WBP_DB_DEV, CND_DB_DEV, CORE_DB_DEV
prod WBP_DB_PROD, CND_DB_PROD, CORE_DB_PROD
anything else invalid_environment (compile fails loudly)

8. Materialization Decision

Default is table. Two exceptions: int_ is ephemeral, rpt_ is view.

dynamic_table is not a default. It is appropriate only when Snowflake should own the refresh schedule (continuous/streaming pipelines, sub-hourly lag requirements). For all daily Airflow-orchestrated models, use table. If you believe a model warrants dynamic_table, document the evaluation in the model header before using it.


9. Testing Expectations

Every model must have a schema.yml entry with:

  • Description with grain stated explicitly.
  • Primary key column: unique + not_null tests (error severity).
  • Foreign key columns: not_null tests. For source-category-1 facts, not_null on EID is warn severity (NULL means DQ gap, not a hard failure).
  • Enumerated columns: accepted_values tests.
  • Silver data cleaning: tests that check for known source garbage (e.g., no literal '∅' remaining) fail as errors.

Tests run in CI. A failing test blocks the PR. This is the mechanism that enforces the "Silver throws on non-conforming data" rule.


10. Project Layout

Domain-first structure (ADR 019):

apps/dbt/xo-medallion/
├── models/
│   ├── warbyparker/
│   │   ├── _warbyparker__sources.yml
│   │   ├── silver/
│   │   ├── intermediate/
│   │   └── gold/
│   ├── condenast/
│   │   ├── _condenast__sources.yml
│   │   ├── silver/
│   │   └── gold/
│   └── core/
│       ├── _core__sources.yml
│       ├── silver/
│       └── gold/                # dim_employee_*, rpt_master_*
├── macros/
├── seeds/
└── dbt_project.yml

Client codes (3-char, global uniqueness): wbp (Warby Parker), cnd (Condé Nast), core (shared). New clients get a 3-letter code at onboarding.


11. Quick Decision Reference

Question Answer
"Should this transformation go in Silver or Gold?" If it's type casting or data cleaning → Silver. If it's business logic, filtering, or enrichment → Gold.
"Should I drop channel in this aggregate?" No. Channel stays. Collapse in the rpt_ view.
"Do I need a new agg_ table for this metric?" Only if 2+ reports consume it, or the computation is expensive. Otherwise, CTE in the rpt_.
"Do I need a weekly version of my daily aggregate?" No. Derive it in the rpt_ via DATE_TRUNC('week', date) + GROUP BY.
"Can I read from fct_ in a rpt_?" It's a layering violation. Build the missing agg_ instead.
"Can I read from Silver in a Gold model that isn't fct_ or int_?" No. Only fct_ and int_ read from Silver. agg_ reads from fct_. rpt_ reads from agg_ + dim_.
"Should this rpt_ pre-compute averages (AHT, CSAT avg, etc.)?" Only if the view is account-grained (no agent dimension). Agent-grained views keep raw sums so Tableau can re-aggregate across agents without breaking team-level calculations.
"My rpt_ joins multiple agg_ sources — what's the pattern?" Build a spine CTE with UNION across all contributing aggs, then LEFT JOIN each agg back. This preserves rows present in only one source (e.g., weekend CSAT with no interactions). See §4.5.
"Should this be a table or a dynamic table?" Table. Dynamic tables require explicit evaluation.