Skip to content

Data Governance Fundamentals - From Basics to Implementation

Date: 2025-11-19 Audience: Data engineers learning governance Context: Setting up governance for Snowflake medallion architecture


What is Data Governance? (ELI5)

Simple definition: Rules about who can see, change, or use your data.

Think of it like a library: - Librarian (admin): Can add/remove books, manage everything - Staff (developer): Can organize books, update catalog - Members (analysts): Can read books, can't change them - Guest (viewer): Can only look at the catalog, not read books

In data terms: - Who can see data? (Read access) - Who can change data? (Write access) - Who can create new databases/tables? (Admin access) - Who can delete data? (Destructive access)


Why Governance Matters (Real-World Problems)

Without Governance:

Scenario 1: Accidental Deletion

Developer testing in "dev" → accidentally connected to prod →
runs DELETE FROM customers → deletes production data
Cost: Hours of downtime, angry customers, data recovery headache

Scenario 2: Data Leak

Junior analyst has access to all databases →
accidentally shares client data in Slack → compliance violation
Cost: GDPR fine ($50K+), client trust lost

Scenario 3: Messy Permissions

Everyone is admin → everyone can do everything →
no audit trail → can't trace who deleted the table
Cost: Impossible to debug, security nightmare

With Governance:

Scenario 1 Prevention:

Developer has dev-only access → can't connect to prod →
physically impossible to delete prod data

Scenario 2 Prevention:

Analyst only sees aggregate data → never sees raw PII →
can't leak what they can't access

Scenario 3 Prevention:

Clear roles: admin, developer, analyst →
audit log shows "John (analyst) ran SELECT on GOLD.METRICS"


Core Concepts (The Building Blocks)

1. Authentication vs Authorization

Authentication = "Who are you?" - Login with username/password - SSO (Single Sign-On) - API keys

Authorization = "What can you do?" - Roles (admin, developer, analyst) - Permissions (read, write, delete) - Access control

Example:

Authentication: "Hi, I'm John (password: ****)"
Snowflake: "Confirmed, you're John"

Authorization: "Can John read PROD_DB?"
Snowflake: "John has ANALYST role → ANALYST can read GOLD → Yes"


2. Principle of Least Privilege

Definition: Give people the minimum access needed to do their job.

Bad Example (too much access):

Analyst needs to read reports →
Given ACCOUNTADMIN role →
Can delete databases, change permissions, see all data

Good Example (least privilege):

Analyst needs to read reports →
Given ANALYST role →
Can only SELECT from GOLD layer tables
Can't see BRONZE (raw PII), can't DELETE, can't create databases

Why it matters: - Limits damage from accidents - Limits damage from malicious actors - Limits data exposure (compliance)


3. Role-Based Access Control (RBAC)

Definition: Group permissions into "roles", assign roles to people.

Without RBAC (permission chaos):

John: Can read TABLE_A, TABLE_B, TABLE_C, write TABLE_D...
Sarah: Can read TABLE_A, TABLE_C, write TABLE_E...
Mike: Can read TABLE_B, TABLE_D...

(Nightmare to manage 100 tables × 50 people)

With RBAC (organized):

ANALYST role: Can read all GOLD tables
DEVELOPER role: Can read GOLD, write BRONZE/SILVER
ADMIN role: Can do everything

John → ANALYST
Sarah → DEVELOPER
Mike → ANALYST

(Easy to manage: 3 roles cover everyone)


4. Data Layers and Access Patterns

Different data = different sensitivity = different access

Layer Data Type Sensitivity Who Needs Access
BRONZE Raw from APIs, all VARCHAR, has PII HIGH Only ETL pipelines + admins
SILVER Cleaned, typed, still has PII MEDIUM-HIGH Developers + data engineers
GOLD Aggregated metrics, no PII LOW Everyone (analysts, BI tools)

Example - Customer Support Data:

BRONZE (HIGH sensitivity):

-- Raw customer data
SELECT
    customer_email,        -- PII!
    customer_phone,        -- PII!
    conversation_text      -- May contain sensitive info
FROM CND_DB.BRONZE.GLADLY_CONVERSATIONS
Who should see: Only data engineers building pipelines

SILVER (MEDIUM sensitivity):

-- Cleaned data, still has identifiable info
SELECT
    customer_id,           -- Pseudonymized
    agent_id,
    conversation_id,
    handle_time_seconds
FROM CND_DB.SILVER.WH_CONVERSATIONS
Who should see: Data engineers + senior analysts

GOLD (LOW sensitivity):

-- Aggregated metrics, no PII
SELECT
    date,
    agent_name,            -- OK, internal employee
    total_conversations,
    avg_handle_time
FROM CND_DB.GOLD.DM_METRICS_DAILY
Who should see: Everyone (analysts, managers, BI tools)


Snowflake RBAC Hierarchy (How It Actually Works)

Snowflake Objects Hierarchy

ACCOUNT
├── USER (person or service)
├── ROLE (collection of privileges)
├── DATABASE
│   ├── SCHEMA
│   │   ├── TABLE
│   │   ├── VIEW
│   │   └── STORED PROCEDURE
│   └── SCHEMA
└── WAREHOUSE (compute)

How Permissions Flow

Rule: To access a table, you need permission on every level:

Access TABLE requires:
✅ USAGE on ACCOUNT
✅ USAGE on DATABASE
✅ USAGE on SCHEMA
✅ SELECT on TABLE

Example:

-- User wants to: SELECT * FROM CND_DB.GOLD.DM_SCORECARD

-- Snowflake checks:
1. Does user have a ROLE?  Check user's roles
2. Does ROLE have USAGE on CND_DB?  Check grants
3. Does ROLE have USAGE on GOLD schema?  Check grants
4. Does ROLE have SELECT on DM_SCORECARD?  Check grants

-- If ANY step fails → Permission denied


Standard Snowflake Roles (Built-in)

Snowflake comes with default roles:

Role Power Level Use Case
ACCOUNTADMIN God mode Billing, account settings, create roles
SECURITYADMIN Security Manage users, roles, grants
SYSADMIN Infrastructure Create databases, warehouses
PUBLIC Minimal Default for everyone

Best Practice: - ❌ Don't give ACCOUNTADMIN to developers - ❌ Don't use SYSADMIN for daily work - ✅ Create custom roles for your needs


Custom Role Strategy (What You Should Build)

Role Hierarchy Pattern

ACCOUNTADMIN (built-in, emergency only)
    ├─ SECURITYADMIN (built-in)
    │      │
    │      └─ DATA_GOVERNANCE_ADMIN (custom)
    └─ SYSADMIN (built-in)
           ├─ DATA_ENGINEER (custom)
           │      │
           │      ├─ DATA_ENGINEER_DEV (custom)
           │      └─ DATA_ENGINEER_PROD (custom)
           ├─ DATA_ANALYST (custom)
           │      │
           │      └─ DATA_ANALYST_VIEWER (custom)
           └─ SERVICE_ACCOUNT (custom)
                  ├─ AIRFLOW_LOADER (custom)
                  └─ DBT_TRANSFORMER (custom)

Hierarchy means: Parent roles inherit child permissions


Typical Role Definitions

1. DATA_ENGINEER

Purpose: Build and maintain data pipelines

Permissions: - ✅ USAGE on all databases (DEV and PROD) - ✅ CREATE TABLE in BRONZE schemas - ✅ SELECT, INSERT, UPDATE, DELETE on BRONZE tables - ✅ SELECT on SILVER/GOLD (to validate transformations) - ❌ DELETE databases - ❌ DROP tables in PROD (only in DEV)

Use case: Daily pipeline development and debugging


2. DATA_ANALYST

Purpose: Query data for insights, create reports

Permissions: - ✅ USAGE on databases - ✅ SELECT on GOLD schemas (aggregated, no PII) - ✅ Maybe SELECT on SILVER (if needed for advanced analysis) - ❌ No access to BRONZE (raw PII) - ❌ No INSERT/UPDATE/DELETE anywhere - ❌ No CREATE TABLE

Use case: Business analysts building dashboards in Tableau/Power BI


3. SERVICE_ACCOUNT (e.g., AIRFLOW_LOADER)

Purpose: Automated pipeline loading data

Permissions: - ✅ USAGE on specific databases only (e.g., *_DEV or *_PROD) - ✅ INSERT into BRONZE tables - ✅ SELECT on BRONZE (to check data loaded) - ❌ No UPDATE or DELETE (pipelines should append, not modify) - ❌ No access to SILVER/GOLD (dbt handles that)

Use case: Airflow DAG loading raw data from APIs to BRONZE


4. DBT_TRANSFORMER

Purpose: dbt transforming BRONZE → SILVER → GOLD

Permissions: - ✅ USAGE on databases - ✅ SELECT on BRONZE (read source data) - ✅ CREATE TABLE, VIEW in SILVER/GOLD schemas - ✅ INSERT, UPDATE, DELETE in SILVER/GOLD (build models) - ❌ No write access to BRONZE (read-only)

Use case: dbt running transformations


Access Patterns by Layer

BRONZE Layer (Restricted)

Who can access: - ✅ DATA_ENGINEER (read/write for pipeline dev) - ✅ AIRFLOW_LOADER (write only, load data) - ✅ DBT_TRANSFORMER (read only, source for transformations) - ❌ DATA_ANALYST (no access, has PII)

Why restricted: Raw PII, sensitive data, compliance


SILVER Layer (Developer Access)

Who can access: - ✅ DATA_ENGINEER (read/write for development) - ✅ DBT_TRANSFORMER (read/write for transformations) - ✅ Senior DATA_ANALYST (read only, advanced analysis) - ❌ Junior analysts (too sensitive)

Why semi-restricted: Still has some identifiable data, but cleaned


GOLD Layer (Open Access)

Who can access: - ✅ DATA_ENGINEER (read/write) - ✅ DATA_ANALYST (read only) - ✅ BI Tools (read only) - ✅ Executives (read only via BI tools)

Why open: Aggregated metrics, no PII, safe for broad access


Real-World Example: Your Use Case

Your Scenario

Company: BPO analytics for customer support Clients: CondeNast, Warby Parker, etc. Data: Customer support conversations, agent metrics Sensitive data: Customer emails, phone numbers, conversation text

Your Role Requirements

1. You (Data Engineer)

Need: - Build pipelines (Airflow) - Develop dbt models - Debug issues - Test in DEV, deploy to PROD

Role: DATA_ENGINEER

Access: - ✅ Full access to *_DEV databases (BRONZE/SILVER/GOLD) - ✅ Read access to *_PROD databases (to debug) - ✅ Write access to *_PROD BRONZE (if emergency pipeline fixes) - ⚠️ Ideally no direct write to PROD (use CI/CD)


2. Client Stakeholders (e.g., CondeNast Manager)

Need: - See their client's metrics only - Dashboard with daily/weekly reports - No access to other clients' data

Role: ANALYST_CONDENAST

Access: - ✅ SELECT on CND_DB.GOLD.* (CondeNast metrics only) - ❌ No access to WBP_DB (Warby Parker data) - ❌ No access to BRONZE/SILVER (raw data)


3. Your CEO/Leadership

Need: - See cross-client aggregate metrics - High-level dashboards - No client-specific details (privacy)

Role: ANALYST_EXECUTIVE

Access: - ✅ SELECT on CORE_DB.GOLD.* (cross-client aggregates) - ✅ SELECT on _DB.GOLD. (high-level metrics per client) - ❌ No BRONZE/SILVER


4. Airflow Service Account

Need: - Load raw data from APIs to BRONZE - Run on schedule - No human interaction

Role: AIRFLOW_LOADER_DEV and AIRFLOW_LOADER_PROD

Access: - ✅ INSERT into _DEV.BRONZE. (dev pipelines) - ✅ INSERT into _PROD.BRONZE. (prod pipelines) - ❌ No UPDATE or DELETE (append-only) - ❌ No SILVER/GOLD access


5. dbt Service Account

Need: - Read BRONZE - Transform to SILVER/GOLD - Create/replace tables

Role: DBT_TRANSFORMER_DEV and DBT_TRANSFORMER_PROD

Access: - ✅ SELECT on _DEV.BRONZE. - ✅ CREATE TABLE, INSERT, UPDATE on _DEV.SILVER. and _DEV.GOLD. - ✅ SELECT on _PROD.BRONZE. - ✅ CREATE TABLE, INSERT, UPDATE on _PROD.SILVER. and _PROD.GOLD.


Client Data Isolation (Multi-Tenancy)

Your Challenge: Multiple Clients in Same Account

Problem: CondeNast shouldn't see Warby Parker data

Solutions:

Option 1: Database-Level Isolation (What You're Doing) ✅

CND_DB    → Only CondeNast data
WBP_DB    → Only Warby Parker data
CORE_DB   → Shared/aggregate only

Grant strategy:

-- CondeNast analyst can only see their DB
GRANT USAGE ON DATABASE CND_DB TO ROLE ANALYST_CONDENAST;
REVOKE USAGE ON DATABASE WBP_DB FROM ROLE ANALYST_CONDENAST;

-- Warby Parker analyst can only see their DB
GRANT USAGE ON DATABASE WBP_DB TO ROLE ANALYST_WARBYPARKER;
REVOKE USAGE ON DATABASE CND_DB FROM ROLE ANALYST_WARBYPARKER;

This is the best approach for your architecture


Option 2: Row-Level Security (Alternative, More Complex)

If you had single database:

-- Enable row-level security
CREATE ROW ACCESS POLICY client_isolation AS (client_code VARCHAR) RETURNS BOOLEAN ->
    CASE
        WHEN CURRENT_ROLE() = 'ANALYST_CONDENAST' THEN client_code = 'CND'
        WHEN CURRENT_ROLE() = 'ANALYST_WARBYPARKER' THEN client_code = 'WBP'
        ELSE FALSE
    END;

ALTER TABLE conversations ADD ROW ACCESS POLICY client_isolation ON (client_code);

Not recommended for you: Database isolation is simpler and more secure


Key Governance Principles (Remember These)

1. Separation of Duties

Different people/systems have different access: - Developers can build, but can't approve - Airflow can load, but can't transform - dbt can transform, but can't load raw data - Analysts can read, but can't write

Why: Prevents single point of failure, limits blast radius


2. Audit Trail

Every action is logged: - Who accessed what data? - When did they access it? - What query did they run?

Snowflake provides:

-- See query history
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE USER_NAME = 'john@company.com'
ORDER BY START_TIME DESC;

-- See access grants
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES
WHERE GRANTEE_NAME = 'DATA_ANALYST';


3. Environment Separation

DEV and PROD have different permissions:

DEV (loose): - Developers can CREATE/DROP tables - Experiments are safe - Sample data only

PROD (strict): - Developers can only SELECT (read) - Only CI/CD can write - Real data, must be protected


4. Default Deny

Principle: Start with NO access, explicitly grant what's needed

Bad:

-- Give everyone access by default
GRANT ALL ON DATABASE CND_DB TO ROLE PUBLIC;
-- Then try to revoke specific things (error-prone)

Good:

-- Start with nothing
-- Explicitly grant only what's needed
GRANT USAGE ON DATABASE CND_DB TO ROLE DATA_ENGINEER;
GRANT SELECT ON SCHEMA CND_DB.GOLD TO ROLE DATA_ANALYST;


Next Steps: Applying to Your Project

Now that you understand the fundamentals, we'll design:

  1. Role hierarchy for your team
  2. Permission matrix (who can do what)
  3. Snowflake SQL to create roles and grants
  4. Migration files to deploy governance
  5. CI/CD integration for automated deployment

Ready to design your specific governance model?