Skip to content

ADR 015 — SANDBOX Schema for Ad-Hoc and Supplementary Data

Status

Accepted

Context

Some datasets are complementary sources for analyst use — they do not have dbt consumers, do not need the full BRONZE → SILVER → GOLD pipeline overhead, and have no production SLA. The immediate trigger was onboarding WBP QA evaluation data (issue #568): a manually-loaded Google Sheet used for analyst evaluation that supplements, but does not replace, the existing GSHEETS_QA Bronze table.

Loading these datasets directly into BRONZE creates ambiguity:

  • BRONZE implies a production pipeline owns the table (xo-foundry + Airflow).
  • BRONZE tables have metadata columns (RECORD_KEY, RECORD_HASH, BATCH_ID, etc.) populated by the loader. Manual analyst loads would lack these, breaking downstream assumptions.
  • Mixing governed pipeline tables with ad-hoc analyst tables makes Bronze schema inspection unreliable as a proxy for "what pipelines exist."

Without a formal schema, analysts resort to loading into arbitrary locations (OPERATIONS, one-off schemas, etc.), losing traceability.

Decision

Create a SANDBOX schema in every client database (WBP_DB, CND_DB, CORE_DB and their _DEV equivalents). SANDBOX is provisioned by schemachange alongside BRONZE/SILVER/GOLD, making it available to analysts from day one of client onboarding.

Characteristics:

  • Owner: Analyst (not a pipeline). Tables are manually created and loaded.
  • Naming convention: {SOURCE}_{OBJECT}_{ISSUE_NUMBER} — e.g., GSHEET_XO_EXTERNAL_QA_568. The issue number is a metadata suffix, not a topic prefix. It provides a direct pointer to the context behind the table and ensures names stay unique across loads.
  • No pipeline automation: No Airflow DAGs, no xo-foundry tasks own SANDBOX tables.
  • No dbt models: SANDBOX tables are not referenced by any dbt model. If analyst convenience views are needed, they are written as Snowflake R__ repeatable scripts via schemachange — not dbt models.
  • No SLA: SANDBOX is best-effort. Tables may be stale, incomplete, or temporary.
  • No metadata columns required: RECORD_KEY, RECORD_HASH, etc. are optional; their absence distinguishes SANDBOX from BRONZE.

Promotion path: If a SANDBOX table proves permanent and needs pipeline ownership:

  1. Create a proper Bronze table in BRONZE schema (rename without issue suffix, add metadata columns).
  2. Build the full xo-foundry pipeline and dbt Silver model.
  3. Retain the SANDBOX copy briefly for traceability, then drop it.

Consequences

Gets easier: - Analysts have a safe, tracked location for supplementary loads without polluting BRONZE. - Every SANDBOX table name points to a GitHub issue with full context. - New client onboarding automatically gets SANDBOX via the V1.0.0 template migration. - Clear promotion path from ad-hoc to production pipeline.

Gets harder / trade-offs: - Analysts must follow the {SOURCE}_{OBJECT}_{ISSUE_NUMBER} naming convention consistently. - Governance is lower by design — no automated freshness checks or data quality tests. - SANDBOX tables are not visible in the dbt model registry or lineage graph.

Options Considered

WORKSPACE schema — Rejected. "Workspace" implies a persistent, curated space; it would attract permanent objects that should stay in BRONZE or GOLD. The name doesn't signal lower governance.

BRONZE_EXT schema — Rejected. Not an industry-standard term. Would be confused with BRONZE by new team members. The "EXT" suffix is ambiguous (external? extended?).

Flat dump into OPERATIONS — Rejected. OPERATIONS is a legacy multi-client database being deprecated. Routing new analyst work there reinforces technical debt and loses per-client isolation.

No schema, issue-specific schemas (e.g., ISSUE_568) — Rejected. One schema per issue would create schema sprawl and make permissions unmanageable. A single SANDBOX schema is simpler.