# Extractor Testing Workflow

## Overview

The `xo-foundry test-extractor` CLI command lets you test extractors locally without spinning up Airflow, which significantly speeds up development and debugging.
## What It Does

The `test-extractor` tool:

- Extracts data from the source API using your configuration
- Uploads raw data to S3 at `{bucket}/raw/{domain}/{report}/{date}/`
- Standardizes column names for Snowflake compatibility
- Uploads clean data to S3 at `{bucket}/clean/{domain}/{report}/{date}/`
- Provides detailed logging of the entire process
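The raw and clean prefixes follow a fixed layout. As a plain-shell sketch of how a day's prefixes are assembled (the bucket, domain, and report values here are illustrative, not defaults):

```shell
# Illustrative values; substitute your own bucket/domain/report
bucket=dev-bucket
domain=condenast
report=sprout_messages
date=2025-12-01

raw_prefix="s3://${bucket}/raw/${domain}/${report}/${date}/"
clean_prefix="s3://${bucket}/clean/${domain}/${report}/${date}/"

echo "$raw_prefix"    # s3://dev-bucket/raw/condenast/sprout_messages/2025-12-01/
echo "$clean_prefix"  # s3://dev-bucket/clean/condenast/sprout_messages/2025-12-01/
```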
## Prerequisites

### 1. AWS Credentials

Configure AWS credentials for S3 access:

```ini
# Option 1: AWS credentials file (~/.aws/credentials)
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
```

```bash
# Option 2: Environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
```
### 2. API Credentials

Set environment variables for the API you're testing:

```bash
# For Sprout Social
export CONDENAST_SPROUT_API_KEY=your_api_key
export CONDENAST_SPROUT_CLIENT_ID=your_client_id

# For Gladly
export GLADLY_BASE_URL=https://warbyparker.gladly.com
export GLADLY_EMAIL=your_email@example.com
export GLADLY_TOKEN=your_token
```
**Best Practice:** Use a `.env` file:

```bash
# .env
CONDENAST_SPROUT_API_KEY=abc123
CONDENAST_SPROUT_CLIENT_ID=2105997
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=secret...
```
Then load it:
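One way to load it into the current shell (a POSIX-shell sketch; `set -a` auto-exports every variable the sourced file defines):

```shell
# Load .env into the current shell, exporting each variable
if [ -f .env ]; then
  set -a   # auto-export all variables defined from here on
  . ./.env
  set +a
fi
```

Tools such as `direnv` or `python-dotenv` achieve the same thing automatically.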
### 3. S3 Bucket

Ensure your S3 bucket exists; if it does not, create it (e.g. `aws s3 mb s3://dev-bucket`).
## Usage

### Basic Usage

Test a Sprout messages extractor:

```bash
uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/test-sprout-messages.yaml \
  --date 2025-12-01 \
  --bucket dev-bucket
```
### All Options

```bash
uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/test-sprout-messages.yaml \
  --date 2025-12-01 \
  --source sprout_messages \
  --bucket dev-bucket \
  --aws-profile dev \
  --dry-run \
  --output-dir ./test-output
```
### Command Options

| Option | Required | Default | Description |
|---|---|---|---|
| `--config` | ✅ | - | Path to YAML configuration file |
| `--date` | ✅ | - | Date to extract (`YYYY-MM-DD`) |
| `--source` | ❌ | First source | Source name from config |
| `--bucket` | ❌ | `dev-bucket` | S3 bucket name |
| `--aws-profile` | ❌ | `default` | AWS profile to use |
| `--dry-run` | ❌ | `false` | Skip S3 upload (local testing) |
| `--output-dir` | ❌ | - | Save files locally |
## Example Workflows

### 1. Quick Test (Dry Run)

Test extraction without uploading to S3:

```bash
uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/test-sprout-messages.yaml \
  --date 2025-12-01 \
  --dry-run \
  --output-dir ./test-output
```
Output:
```text
================================================================================
XO-Foundry Extractor Testing Tool
================================================================================
✅ Loaded config: condenast_sprout_messages_test
Using first source: sprout_messages
Source type: sprout_api
Load strategy: full_refresh
--------------------------------------------------------------------------------
STEP 1: Extract Data
--------------------------------------------------------------------------------
Extracting Sprout messages for 2025-12-01...
✅ Extracted 150 messages from Sprout API
✅ Extracted 12345 bytes from Sprout API
--------------------------------------------------------------------------------
STEP 2: Upload Raw Data (SKIPPED - Dry Run)
--------------------------------------------------------------------------------
Would upload to: s3://dev-bucket/raw/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv
--------------------------------------------------------------------------------
STEP 3: Standardize Data
--------------------------------------------------------------------------------
Original columns: 25
Standardized columns: 25
Example transformations:
  'from.screen_name' → 'FROM_SCREEN_NAME'
  'activity_metadata.first_reply.actor.id' → 'ACTIVITY_METADATA_FIRST_REPLY_ACTOR_ID'
✅ Standardized CSV: 25 columns, 150 rows
--------------------------------------------------------------------------------
STEP 4: Upload Clean Data (SKIPPED - Dry Run)
--------------------------------------------------------------------------------
Would upload to: s3://dev-bucket/clean/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv
📁 Saved to local directory: ./test-output
  Raw: ./test-output/raw_sprout_messages_20251201.csv
  Clean: ./test-output/clean_sprout_messages_20251201.csv
================================================================================
✅ TESTING COMPLETE
================================================================================
Date: 2025-12-01
Source: sprout_messages (sprout_api)
Columns: 25
Rows: 150
================================================================================
```
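The standardization in step 3 flattens nested field paths into Snowflake-friendly names. A rough sketch of the rule (dots become underscores, letters are uppercased; this is an approximation, not the tool's exact implementation):

```shell
# Approximate the column standardization: '.' -> '_', lowercase -> UPPERCASE
standardize() {
  printf '%s\n' "$1" | tr '.' '_' | tr '[:lower:]' '[:upper:]'
}

standardize 'from.screen_name'
# -> FROM_SCREEN_NAME
standardize 'activity_metadata.first_reply.actor.id'
# -> ACTIVITY_METADATA_FIRST_REPLY_ACTOR_ID
```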
### 2. Full Test with S3 Upload

Test and upload to the dev bucket:

```bash
uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/test-sprout-messages.yaml \
  --date 2025-12-01 \
  --bucket dev-bucket
```
Then verify in S3:
```bash
# Check raw data
aws s3 ls s3://dev-bucket/raw/condenast/sprout_messages/2025-12-01/

# Download and inspect
aws s3 cp s3://dev-bucket/raw/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv ./raw.csv
aws s3 cp s3://dev-bucket/clean/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv ./clean.csv

# Compare headers
head -1 raw.csv
head -1 clean.csv
```
### 3. Test Multiple Dates

Use a loop to test multiple dates:

```bash
for date in 2025-11-28 2025-11-29 2025-11-30; do
  echo "Testing $date..."
  uv run xo-foundry test-extractor \
    --config packages/xo-foundry/configs/test-sprout-messages.yaml \
    --date "$date" \
    --bucket dev-bucket
done
```
### 4. Test Different Sources

If your config has multiple sources:
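For illustration, such a config might declare two sources side by side (the names and values below are assumptions; see Configuration Files for the full structure):

```yaml
sources:
  sprout_messages:
    source_type: sprout_api
    load_strategy: full_refresh
    # extractor / paths / snowflake blocks as usual
  sprout_cases:
    source_type: sprout_api
    load_strategy: full_refresh
    # extractor / paths / snowflake blocks as usual
```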
Test each separately:
```bash
# Test messages
uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/multi-source.yaml \
  --source sprout_messages \
  --date 2025-12-01

# Test cases
uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/multi-source.yaml \
  --source sprout_cases \
  --date 2025-12-01
```
## Configuration Files

### Creating Test Configs

Test configs follow the same structure as production DAG configs:

```yaml
dag:
  domain: client_name
  pipeline_name: test_pipeline
  description: "Test configuration"
  schedule: "0 10 * * *"
  pipeline_type: snowflake_load
  default_args:
    start_date: "2025-01-01"

globals:
  snowflake:
    database: CLIENT_DB_DEV
    schema: BRONZE

sources:
  source_name:
    source_type: sprout_api  # or gladly_api
    load_strategy: full_refresh
    extractor:
      # API credentials (read from environment)
      api_key_var: "ENV_VAR_NAME"
      client_id_var: "ENV_VAR_NAME"
      # Extractor-specific config
      group_id: "12345"
      profile_ids: ["profile1"]
    paths:
      report_name: report_name
      filename_pattern: "report_{date}.csv"
    snowflake:
      target_table: TABLE_NAME
      deduplication:
        strategy: single_field
        unique_columns: [ID]
```
### Minimal Config Example

```yaml
dag:
  domain: test
  pipeline_name: quick_test
  description: "Minimal test"
  schedule: "0 0 * * *"
  default_args:
    start_date: "2025-01-01"

globals:
  snowflake:
    database: TEST_DB
    schema: BRONZE

sources:
  test_source:
    source_type: sprout_api
    load_strategy: full_refresh
    extractor:
      api_key_var: "SPROUT_API_KEY"
      client_id_var: "SPROUT_CLIENT_ID"
      group_id: "12345"
      profile_ids: ["profile1"]
    paths:
      report_name: test
      filename_pattern: "test_{date}.csv"
    snowflake:
      target_table: TEST_TABLE
```
## Troubleshooting

### Error: Missing required environment variable

```text
❌ Missing required environment variable: CONDENAST_SPROUT_API_KEY
   Set it in .env or export it: export CONDENAST_SPROUT_API_KEY=value
```

**Solution:** Set the variable, e.g. `export CONDENAST_SPROUT_API_KEY=your_api_key`, or add it to your `.env` file.
### Error: S3 upload failed

**Solution:** Create the S3 bucket (e.g. `aws s3 mb s3://dev-bucket`) and confirm your AWS credentials can write to it.
### Error: Invalid date format

**Solution:** Pass the date as `YYYY-MM-DD`, e.g. `--date 2025-12-01`.
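If you want to catch this before the tool runs, a quick pre-flight check in bash (a sketch; the tool's own validation is authoritative):

```shell
# Validate a YYYY-MM-DD date string before passing it to --date (bash)
d="2025-12-01"
if [[ "$d" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]]; then
  echo "ok: $d"
else
  echo "invalid: $d (expected YYYY-MM-DD)" >&2
fi
```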
### Error: Source not found in config

**Solution:** Use the correct source name, or omit `--source` to fall back to the first source in the config:

```bash
# Option 1: Use the correct name
--source sprout_messages

# Option 2: Omit the flag
# (no --source flag → first source is used)
```
## Best Practices

### 1. Use Separate Test Configs

Don't test with production configs. Create separate `test-*.yaml` configs:

```text
packages/xo-foundry/configs/
├── warbyparker-timestamps.yaml   # Production
├── test-sprout-messages.yaml     # Testing
└── test-sprout-cases.yaml        # Testing
```
### 2. Use dev-bucket for Testing

Always point `--bucket` at a separate development bucket (such as `dev-bucket`), never at a production bucket.
### 3. Test with Recent Dates

Use recent dates that are likely to have data; very old dates may return empty results.
### 4. Validate Output

After extraction, inspect the files:

```bash
# Check row count
wc -l test-output/clean_*.csv

# Check columns
head -1 test-output/clean_*.csv

# Spot check data
head -10 test-output/clean_*.csv
```
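Beyond spot checks, one cheap structural check is that every row has the same field count as the header. A helper sketch (naive comma split — embedded commas in quoted fields will miscount, so use a real CSV parser for those):

```shell
# check_csv FILE: print how many rows have a field count different
# from the header row (0 means the file is structurally consistent)
check_csv() {
  awk -F',' 'NR==1 { n = NF } NF != n { bad++ } END { print bad + 0 }' "$1"
}
```

For example, `check_csv test-output/clean_sprout_messages_20251201.csv` should print `0`.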
### 5. Iterate Quickly

Use `--dry-run` and `--output-dir` for the fastest iteration:

```bash
# Fast iteration loop
uv run xo-foundry test-extractor \
  --config test.yaml \
  --date 2025-12-01 \
  --dry-run \
  --output-dir ./output

# Inspect results
head output/clean_*.csv
```
## Integration with Development Workflow

### Typical Development Flow

1. Create the extractor in xo-core
2. Create a test config in `packages/xo-foundry/configs/test-*.yaml`
3. Set credentials in `.env`
4. Run the test with `--dry-run`
5. Inspect the output in `./test-output/`
6. Fix issues and repeat from step 4
7. Run the test again with S3 upload enabled
8. Verify the files in S3
9. Create the production config and deploy the DAG
### Before Deploying to Airflow

Always test extractors with this tool before deploying to Airflow:

```bash
# 1. Test extraction
uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/test-sprout-messages.yaml \
  --date 2025-12-01 \
  --bucket dev-bucket

# 2. Verify data quality
aws s3 cp s3://dev-bucket/clean/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv - | head -100

# 3. Check for common issues
#    - Are all expected columns present?
#    - Are column names Snowflake-compatible (UPPERCASE, no special chars)?
#    - Is the data format correct?

# 4. If everything looks good, create production config and deploy DAG
```
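The column-name check in step 3 can be scripted: print any header that is not uppercase letters, digits, and underscores (the allowed character set here is an assumption about what "Snowflake-compatible" means):

```shell
# check_headers FILE: print header names that are NOT of the form
# [A-Z_][A-Z0-9_]* ; no output means all headers look compatible
check_headers() {
  head -1 "$1" | tr -d '\r' | tr ',' '\n' | grep -Ev '^[A-Z_][A-Z0-9_]*$' || true
}
```

For example, `check_headers clean.csv` prints nothing when every header is already standardized.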
## S3 Bucket Structure

The tool creates the following structure in your S3 bucket:

```text
dev-bucket/
├── raw/                      # Raw data from API
│   └── {domain}/
│       └── {report_name}/
│           └── {date}/
│               └── {filename}
│
└── clean/                    # Standardized data
    └── {domain}/
        └── {report_name}/
            └── {date}/
                └── {filename}
```
Example:
```text
dev-bucket/
├── raw/
│   └── condenast/
│       ├── sprout_messages/
│       │   ├── 2025-12-01/
│       │   │   └── sprout_messages_20251201.csv
│       │   └── 2025-12-02/
│       │       └── sprout_messages_20251202.csv
│       └── sprout_cases/
│           └── 2025-12-01/
│               └── sprout_cases_20251201.csv
│
└── clean/
    └── condenast/
        ├── sprout_messages/
        │   ├── 2025-12-01/
        │   │   └── sprout_messages_20251201.csv
        │   └── 2025-12-02/
        │       └── sprout_messages_20251202.csv
        └── sprout_cases/
            └── 2025-12-01/
                └── sprout_cases_20251201.csv
```
## Supported Extractors

Currently supported:

- ✅ Gladly API (`source_type: gladly_api`)
- ✅ Sprout Social API (`source_type: sprout_api`)
    - Messages endpoint
    - Cases endpoint

Coming soon:

- Gmail
- Google Sheets
- S3-to-S3 copy
## Advanced Usage

### Custom AWS Profile

If you have multiple AWS profiles:

```ini
# ~/.aws/credentials
[default]
aws_access_key_id = ...
aws_secret_access_key = ...

[dev]
aws_access_key_id = ...
aws_secret_access_key = ...

[prod]
aws_access_key_id = ...
aws_secret_access_key = ...
```
Select one with the `--aws-profile` flag, e.g. `--aws-profile dev`.
### Save to Local Directory Only

Skip S3 entirely and only save locally:

```bash
uv run xo-foundry test-extractor \
  --config test.yaml \
  --date 2025-12-01 \
  --dry-run \
  --output-dir ./my-test-data
```
## Summary

The `test-extractor` tool provides a fast, efficient way to:
- ✅ Test extractors without Airflow
- ✅ Validate API credentials and configurations
- ✅ Inspect raw and standardized data
- ✅ Iterate quickly during development
- ✅ Verify S3 integration before deployment
This significantly reduces the time from "I want to add a new extractor" to "I have validated data in S3".