Data Governance Fundamentals - From Basics to Implementation¶
Date: 2025-11-19 Audience: Data engineers learning governance Context: Setting up governance for Snowflake medallion architecture
What is Data Governance? (ELI5)¶
Simple definition: Rules about who can see, change, or use your data.
Think of it like a library: - Librarian (admin): Can add/remove books, manage everything - Staff (developer): Can organize books, update catalog - Members (analysts): Can read books, can't change them - Guest (viewer): Can only look at the catalog, not read books
In data terms: - Who can see data? (Read access) - Who can change data? (Write access) - Who can create new databases/tables? (Admin access) - Who can delete data? (Destructive access)
Why Governance Matters (Real-World Problems)¶
Without Governance:¶
Scenario 1: Accidental Deletion
Developer testing in "dev" → accidentally connected to prod →
runs DELETE FROM customers → deletes production data
Scenario 2: Data Leak
Junior analyst has access to all databases →
accidentally shares client data in Slack → compliance violation
Scenario 3: Messy Permissions
Cost: Impossible to debug, security nightmareWith Governance:¶
Scenario 1 Prevention:
Scenario 2 Prevention:
Scenario 3 Prevention:
Clear roles: admin, developer, analyst →
audit log shows "John (analyst) ran SELECT on GOLD.METRICS"
Core Concepts (The Building Blocks)¶
1. Authentication vs Authorization¶
Authentication = "Who are you?" - Login with username/password - SSO (Single Sign-On) - API keys
Authorization = "What can you do?" - Roles (admin, developer, analyst) - Permissions (read, write, delete) - Access control
Example:
Authentication: "Hi, I'm John (password: ****)"
Snowflake: "Confirmed, you're John"
Authorization: "Can John read PROD_DB?"
Snowflake: "John has ANALYST role → ANALYST can read GOLD → Yes"
2. Principle of Least Privilege¶
Definition: Give people the minimum access needed to do their job.
Bad Example (too much access):
Analyst needs to read reports →
Given ACCOUNTADMIN role →
Can delete databases, change permissions, see all data
Good Example (least privilege):
Analyst needs to read reports →
Given ANALYST role →
Can only SELECT from GOLD layer tables
Can't see BRONZE (raw PII), can't DELETE, can't create databases
Why it matters: - Limits damage from accidents - Limits damage from malicious actors - Limits data exposure (compliance)
3. Role-Based Access Control (RBAC)¶
Definition: Group permissions into "roles", assign roles to people.
Without RBAC (permission chaos):
John: Can read TABLE_A, TABLE_B, TABLE_C, write TABLE_D...
Sarah: Can read TABLE_A, TABLE_C, write TABLE_E...
Mike: Can read TABLE_B, TABLE_D...
(Nightmare to manage 100 tables × 50 people)
With RBAC (organized):
ANALYST role: Can read all GOLD tables
DEVELOPER role: Can read GOLD, write BRONZE/SILVER
ADMIN role: Can do everything
John → ANALYST
Sarah → DEVELOPER
Mike → ANALYST
(Easy to manage: 3 roles cover everyone)
4. Data Layers and Access Patterns¶
Different data = different sensitivity = different access
| Layer | Data Type | Sensitivity | Who Needs Access |
|---|---|---|---|
| BRONZE | Raw from APIs, all VARCHAR, has PII | HIGH | Only ETL pipelines + admins |
| SILVER | Cleaned, typed, still has PII | MEDIUM-HIGH | Developers + data engineers |
| GOLD | Aggregated metrics, no PII | LOW | Everyone (analysts, BI tools) |
Example - Customer Support Data:
BRONZE (HIGH sensitivity):
-- Raw customer data
SELECT
customer_email, -- PII!
customer_phone, -- PII!
conversation_text -- May contain sensitive info
FROM CND_DB.BRONZE.GLADLY_CONVERSATIONS
SILVER (MEDIUM sensitivity):
-- Cleaned data, still has identifiable info
SELECT
customer_id, -- Pseudonymized
agent_id,
conversation_id,
handle_time_seconds
FROM CND_DB.SILVER.WH_CONVERSATIONS
GOLD (LOW sensitivity):
-- Aggregated metrics, no PII
SELECT
date,
agent_name, -- OK, internal employee
total_conversations,
avg_handle_time
FROM CND_DB.GOLD.DM_METRICS_DAILY
Snowflake RBAC Hierarchy (How It Actually Works)¶
Snowflake Objects Hierarchy¶
ACCOUNT
├── USER (person or service)
├── ROLE (collection of privileges)
├── DATABASE
│ ├── SCHEMA
│ │ ├── TABLE
│ │ ├── VIEW
│ │ └── STORED PROCEDURE
│ └── SCHEMA
└── WAREHOUSE (compute)
How Permissions Flow¶
Rule: To access a table, you need permission on every level:
Example:
-- User wants to: SELECT * FROM CND_DB.GOLD.DM_SCORECARD
-- Snowflake checks:
1. Does user have a ROLE? → Check user's roles
2. Does ROLE have USAGE on CND_DB? → Check grants
3. Does ROLE have USAGE on GOLD schema? → Check grants
4. Does ROLE have SELECT on DM_SCORECARD? → Check grants
-- If ANY step fails → Permission denied
Standard Snowflake Roles (Built-in)¶
Snowflake comes with default roles:
| Role | Power Level | Use Case |
|---|---|---|
| ACCOUNTADMIN | God mode | Billing, account settings, create roles |
| SECURITYADMIN | Security | Manage users, roles, grants |
| SYSADMIN | Infrastructure | Create databases, warehouses |
| PUBLIC | Minimal | Default for everyone |
Best Practice: - ❌ Don't give ACCOUNTADMIN to developers - ❌ Don't use SYSADMIN for daily work - ✅ Create custom roles for your needs
Custom Role Strategy (What You Should Build)¶
Role Hierarchy Pattern¶
ACCOUNTADMIN (built-in, emergency only)
│
├─ SECURITYADMIN (built-in)
│ │
│ └─ DATA_GOVERNANCE_ADMIN (custom)
│
└─ SYSADMIN (built-in)
│
├─ DATA_ENGINEER (custom)
│ │
│ ├─ DATA_ENGINEER_DEV (custom)
│ └─ DATA_ENGINEER_PROD (custom)
│
├─ DATA_ANALYST (custom)
│ │
│ └─ DATA_ANALYST_VIEWER (custom)
│
└─ SERVICE_ACCOUNT (custom)
│
├─ AIRFLOW_LOADER (custom)
└─ DBT_TRANSFORMER (custom)
Hierarchy means: Parent roles inherit child permissions
Typical Role Definitions¶
1. DATA_ENGINEER¶
Purpose: Build and maintain data pipelines
Permissions: - ✅ USAGE on all databases (DEV and PROD) - ✅ CREATE TABLE in BRONZE schemas - ✅ SELECT, INSERT, UPDATE, DELETE on BRONZE tables - ✅ SELECT on SILVER/GOLD (to validate transformations) - ❌ DELETE databases - ❌ DROP tables in PROD (only in DEV)
Use case: Daily pipeline development and debugging
2. DATA_ANALYST¶
Purpose: Query data for insights, create reports
Permissions: - ✅ USAGE on databases - ✅ SELECT on GOLD schemas (aggregated, no PII) - ✅ Maybe SELECT on SILVER (if needed for advanced analysis) - ❌ No access to BRONZE (raw PII) - ❌ No INSERT/UPDATE/DELETE anywhere - ❌ No CREATE TABLE
Use case: Business analysts building dashboards in Tableau/Power BI
3. SERVICE_ACCOUNT (e.g., AIRFLOW_LOADER)¶
Purpose: Automated pipeline loading data
Permissions: - ✅ USAGE on specific databases only (e.g., *_DEV or *_PROD) - ✅ INSERT into BRONZE tables - ✅ SELECT on BRONZE (to check data loaded) - ❌ No UPDATE or DELETE (pipelines should append, not modify) - ❌ No access to SILVER/GOLD (dbt handles that)
Use case: Airflow DAG loading raw data from APIs to BRONZE
4. DBT_TRANSFORMER¶
Purpose: dbt transforming BRONZE → SILVER → GOLD
Permissions: - ✅ USAGE on databases - ✅ SELECT on BRONZE (read source data) - ✅ CREATE TABLE, VIEW in SILVER/GOLD schemas - ✅ INSERT, UPDATE, DELETE in SILVER/GOLD (build models) - ❌ No write access to BRONZE (read-only)
Use case: dbt running transformations
Access Patterns by Layer¶
BRONZE Layer (Restricted)¶
Who can access: - ✅ DATA_ENGINEER (read/write for pipeline dev) - ✅ AIRFLOW_LOADER (write only, load data) - ✅ DBT_TRANSFORMER (read only, source for transformations) - ❌ DATA_ANALYST (no access, has PII)
Why restricted: Raw PII, sensitive data, compliance
SILVER Layer (Developer Access)¶
Who can access: - ✅ DATA_ENGINEER (read/write for development) - ✅ DBT_TRANSFORMER (read/write for transformations) - ✅ Senior DATA_ANALYST (read only, advanced analysis) - ❌ Junior analysts (too sensitive)
Why semi-restricted: Still has some identifiable data, but cleaned
GOLD Layer (Open Access)¶
Who can access: - ✅ DATA_ENGINEER (read/write) - ✅ DATA_ANALYST (read only) - ✅ BI Tools (read only) - ✅ Executives (read only via BI tools)
Why open: Aggregated metrics, no PII, safe for broad access
Real-World Example: Your Use Case¶
Your Scenario¶
Company: BPO analytics for customer support Clients: CondeNast, Warby Parker, etc. Data: Customer support conversations, agent metrics Sensitive data: Customer emails, phone numbers, conversation text
Your Role Requirements¶
1. You (Data Engineer)¶
Need: - Build pipelines (Airflow) - Develop dbt models - Debug issues - Test in DEV, deploy to PROD
Role: DATA_ENGINEER
Access: - ✅ Full access to *_DEV databases (BRONZE/SILVER/GOLD) - ✅ Read access to *_PROD databases (to debug) - ✅ Write access to *_PROD BRONZE (if emergency pipeline fixes) - ⚠️ Ideally no direct write to PROD (use CI/CD)
2. Client Stakeholders (e.g., CondeNast Manager)¶
Need: - See their client's metrics only - Dashboard with daily/weekly reports - No access to other clients' data
Role: ANALYST_CONDENAST
Access: - ✅ SELECT on CND_DB.GOLD.* (CondeNast metrics only) - ❌ No access to WBP_DB (Warby Parker data) - ❌ No access to BRONZE/SILVER (raw data)
3. Your CEO/Leadership¶
Need: - See cross-client aggregate metrics - High-level dashboards - No client-specific details (privacy)
Role: ANALYST_EXECUTIVE
Access: - ✅ SELECT on CORE_DB.GOLD.* (cross-client aggregates) - ✅ SELECT on _DB.GOLD. (high-level metrics per client) - ❌ No BRONZE/SILVER
4. Airflow Service Account¶
Need: - Load raw data from APIs to BRONZE - Run on schedule - No human interaction
Role: AIRFLOW_LOADER_DEV and AIRFLOW_LOADER_PROD
Access: - ✅ INSERT into _DEV.BRONZE. (dev pipelines) - ✅ INSERT into _PROD.BRONZE. (prod pipelines) - ❌ No UPDATE or DELETE (append-only) - ❌ No SILVER/GOLD access
5. dbt Service Account¶
Need: - Read BRONZE - Transform to SILVER/GOLD - Create/replace tables
Role: DBT_TRANSFORMER_DEV and DBT_TRANSFORMER_PROD
Access: - ✅ SELECT on _DEV.BRONZE. - ✅ CREATE TABLE, INSERT, UPDATE on _DEV.SILVER. and _DEV.GOLD. - ✅ SELECT on _PROD.BRONZE. - ✅ CREATE TABLE, INSERT, UPDATE on _PROD.SILVER. and _PROD.GOLD.
Client Data Isolation (Multi-Tenancy)¶
Your Challenge: Multiple Clients in Same Account¶
Problem: CondeNast shouldn't see Warby Parker data
Solutions:
Option 1: Database-Level Isolation (What You're Doing) ✅¶
Grant strategy:
-- CondeNast analyst can only see their DB
GRANT USAGE ON DATABASE CND_DB TO ROLE ANALYST_CONDENAST;
REVOKE USAGE ON DATABASE WBP_DB FROM ROLE ANALYST_CONDENAST;
-- Warby Parker analyst can only see their DB
GRANT USAGE ON DATABASE WBP_DB TO ROLE ANALYST_WARBYPARKER;
REVOKE USAGE ON DATABASE CND_DB FROM ROLE ANALYST_WARBYPARKER;
This is the best approach for your architecture ✅
Option 2: Row-Level Security (Alternative, More Complex)¶
If you had single database:
-- Enable row-level security
CREATE ROW ACCESS POLICY client_isolation AS (client_code VARCHAR) RETURNS BOOLEAN ->
CASE
WHEN CURRENT_ROLE() = 'ANALYST_CONDENAST' THEN client_code = 'CND'
WHEN CURRENT_ROLE() = 'ANALYST_WARBYPARKER' THEN client_code = 'WBP'
ELSE FALSE
END;
ALTER TABLE conversations ADD ROW ACCESS POLICY client_isolation ON (client_code);
Not recommended for you: Database isolation is simpler and more secure
Key Governance Principles (Remember These)¶
1. Separation of Duties¶
Different people/systems have different access: - Developers can build, but can't approve - Airflow can load, but can't transform - dbt can transform, but can't load raw data - Analysts can read, but can't write
Why: Prevents single point of failure, limits blast radius
2. Audit Trail¶
Every action is logged: - Who accessed what data? - When did they access it? - What query did they run?
Snowflake provides:
-- See query history
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE USER_NAME = 'john@company.com'
ORDER BY START_TIME DESC;
-- See access grants
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.GRANTS_TO_ROLES
WHERE GRANTEE_NAME = 'DATA_ANALYST';
3. Environment Separation¶
DEV and PROD have different permissions:
DEV (loose): - Developers can CREATE/DROP tables - Experiments are safe - Sample data only
PROD (strict): - Developers can only SELECT (read) - Only CI/CD can write - Real data, must be protected
4. Default Deny¶
Principle: Start with NO access, explicitly grant what's needed
Bad:
-- Give everyone access by default
GRANT ALL ON DATABASE CND_DB TO ROLE PUBLIC;
-- Then try to revoke specific things (error-prone)
Good:
-- Start with nothing
-- Explicitly grant only what's needed
GRANT USAGE ON DATABASE CND_DB TO ROLE DATA_ENGINEER;
GRANT SELECT ON SCHEMA CND_DB.GOLD TO ROLE DATA_ANALYST;
Next Steps: Applying to Your Project¶
Now that you understand the fundamentals, we'll design:
- Role hierarchy for your team
- Permission matrix (who can do what)
- Snowflake SQL to create roles and grants
- Migration files to deploy governance
- CI/CD integration for automated deployment
Ready to design your specific governance model?