Extractor Testing Workflow

Overview

The xo-foundry test-extractor CLI command lets you test extractors locally without spinning up Airflow, which significantly speeds up development and debugging.

What It Does

The test-extractor tool:

  1. Extracts data from the source API using your configuration
  2. Uploads raw data to S3 {bucket}/raw/{domain}/{report}/{date}/
  3. Standardizes column names for Snowflake compatibility
  4. Uploads clean data to S3 {bucket}/clean/{domain}/{report}/{date}/
  5. Provides detailed logging of the entire process
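The standardization in step 3 flattens nested field paths into upper-case, underscore-separated names. A minimal sketch of that rule in shell (illustrative only — the actual tool may normalize additional characters):

```shell
# Sketch of the column-standardization rule: dots become underscores,
# and the whole name is upper-cased for Snowflake compatibility.
standardize_column() {
  echo "$1" | tr '.' '_' | tr '[:lower:]' '[:upper:]'
}

standardize_column "from.screen_name"
# → FROM_SCREEN_NAME
```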

Prerequisites

1. AWS Credentials

Configure AWS credentials for S3 access:

# Option 1: AWS credentials file (~/.aws/credentials)
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

# Option 2: Environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1

2. API Credentials

Set environment variables for the API you're testing:

# For Sprout Social
export CONDENAST_SPROUT_API_KEY=your_api_key
export CONDENAST_SPROUT_CLIENT_ID=your_client_id

# For Gladly
export GLADLY_BASE_URL=https://warbyparker.gladly.com
export GLADLY_EMAIL=your_email@example.com
export GLADLY_TOKEN=your_token

Best Practice: Use a .env file:

# .env
CONDENAST_SPROUT_API_KEY=abc123
CONDENAST_SPROUT_CLIENT_ID=2105997
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=secret...

Then load it:

# Load .env file
export $(cat .env | xargs)
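The xargs one-liner works for simple KEY=value files but mishandles comments and quoted values containing spaces. A more robust loader, sketched here as a hypothetical load_env helper (assumes a POSIX shell):

```shell
# Export every variable assigned while sourcing the given file.
# Unlike `export $(cat .env | xargs)`, this handles quoted values,
# comments, and blank lines correctly.
load_env() {
  [ -f "$1" ] || return 0   # silently skip if the file is absent
  set -a                    # auto-export all assignments
  . "$1"
  set +a
}

load_env .env
```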

3. S3 Bucket

Ensure your S3 bucket exists:

# Check if bucket exists
aws s3 ls s3://dev-bucket/

# Create if needed
aws s3 mb s3://dev-bucket

Usage

Basic Usage

Test a Sprout messages extractor:

uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/test-sprout-messages.yaml \
  --date 2025-12-01 \
  --bucket dev-bucket

All Options

Each option is described in the table below (inline comments after a line-continuation backslash would break the command, so the flags are listed bare):

uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/test-sprout-messages.yaml \
  --date 2025-12-01 \
  --source sprout_messages \
  --bucket dev-bucket \
  --aws-profile dev \
  --dry-run \
  --output-dir ./test-output

Command Options

Option         Required  Default       Description
--config       Yes       -             Path to YAML configuration file
--date         Yes       -             Date to extract (YYYY-MM-DD)
--source       No        First source  Source name from config
--bucket       No        dev-bucket    S3 bucket name
--aws-profile  No        default       AWS profile to use
--dry-run      No        false         Skip S3 upload (local testing)
--output-dir   No        -             Save files locally

Example Workflows

1. Quick Test (Dry Run)

Test extraction without uploading to S3:

uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/test-sprout-messages.yaml \
  --date 2025-12-01 \
  --dry-run \
  --output-dir ./test-output

Output:

================================================================================
XO-Foundry Extractor Testing Tool
================================================================================
✅ Loaded config: condenast_sprout_messages_test
Using first source: sprout_messages
Source type: sprout_api
Load strategy: full_refresh
--------------------------------------------------------------------------------
STEP 1: Extract Data
--------------------------------------------------------------------------------
Extracting Sprout messages for 2025-12-01...
✅ Extracted 150 messages from Sprout API
✅ Extracted 12345 bytes from Sprout API
--------------------------------------------------------------------------------
STEP 2: Upload Raw Data (SKIPPED - Dry Run)
--------------------------------------------------------------------------------
Would upload to: s3://dev-bucket/raw/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv
--------------------------------------------------------------------------------
STEP 3: Standardize Data
--------------------------------------------------------------------------------
Original columns: 25
Standardized columns: 25
Example transformations:
  'from.screen_name' → 'FROM_SCREEN_NAME'
  'activity_metadata.first_reply.actor.id' → 'ACTIVITY_METADATA_FIRST_REPLY_ACTOR_ID'
✅ Standardized CSV: 25 columns, 150 rows
--------------------------------------------------------------------------------
STEP 4: Upload Clean Data (SKIPPED - Dry Run)
--------------------------------------------------------------------------------
Would upload to: s3://dev-bucket/clean/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv
📁 Saved to local directory: ./test-output
   Raw: ./test-output/raw_sprout_messages_20251201.csv
   Clean: ./test-output/clean_sprout_messages_20251201.csv
================================================================================
✅ TESTING COMPLETE
================================================================================
Date: 2025-12-01
Source: sprout_messages (sprout_api)
Columns: 25
Rows: 150
================================================================================

2. Full Test with S3 Upload

Test and upload to dev bucket:

uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/test-sprout-messages.yaml \
  --date 2025-12-01 \
  --bucket dev-bucket

Then verify in S3:

# Check raw data
aws s3 ls s3://dev-bucket/raw/condenast/sprout_messages/2025-12-01/

# Download and inspect
aws s3 cp s3://dev-bucket/raw/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv ./raw.csv
aws s3 cp s3://dev-bucket/clean/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv ./clean.csv

# Compare headers
head -1 raw.csv
head -1 clean.csv
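To see exactly which names changed, the raw and clean headers can be compared field by field. A small sketch (diff_headers is a hypothetical helper, not part of the tool):

```shell
# Print raw and clean header fields side by side, one column per line.
diff_headers() {
  head -1 "$1" | tr ',' '\n' > /tmp/hdr_raw.$$
  head -1 "$2" | tr ',' '\n' > /tmp/hdr_clean.$$
  paste /tmp/hdr_raw.$$ /tmp/hdr_clean.$$   # tab-separated raw/clean pairs
  rm -f /tmp/hdr_raw.$$ /tmp/hdr_clean.$$
}

# Usage (after the aws s3 cp downloads above):
#   diff_headers raw.csv clean.csv
```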

3. Test Multiple Dates

Use a loop to test multiple dates:

for date in 2025-11-28 2025-11-29 2025-11-30; do
  echo "Testing $date..."
  uv run xo-foundry test-extractor \
    --config packages/xo-foundry/configs/test-sprout-messages.yaml \
    --date $date \
    --bucket dev-bucket
done
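Hard-coding the dates is fine for a handful of days; for longer backfills the range can be generated programmatically. A sketch using GNU date (on macOS, use gdate from coreutils; date_range is a hypothetical helper):

```shell
# Print every date from $1 through $2 inclusive (requires GNU date).
date_range() {
  local d="$1" stop
  stop=$(date -I -d "$2 + 1 day")
  while [ "$d" != "$stop" ]; do
    echo "$d"
    d=$(date -I -d "$d + 1 day")
  done
}

for date in $(date_range 2025-11-28 2025-11-30); do
  echo "Testing $date..."
  # uv run xo-foundry test-extractor --config ... --date "$date" --bucket dev-bucket
done
```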

4. Test Different Sources

If your config has multiple sources:

sources:
  sprout_messages:
    source_type: sprout_api
    ...
  sprout_cases:
    source_type: sprout_api
    ...

Test each separately:

# Test messages
uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/multi-source.yaml \
  --source sprout_messages \
  --date 2025-12-01

# Test cases
uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/multi-source.yaml \
  --source sprout_cases \
  --date 2025-12-01

Configuration Files

Creating Test Configs

Test configs follow the same structure as production DAG configs:

dag:
  domain: client_name
  pipeline_name: test_pipeline
  description: "Test configuration"
  schedule: "0 10 * * *"
  pipeline_type: snowflake_load
  default_args:
    start_date: "2025-01-01"

globals:
  snowflake:
    database: CLIENT_DB_DEV
    schema: BRONZE

sources:
  source_name:
    source_type: sprout_api  # or gladly_api
    load_strategy: full_refresh
    extractor:
      # API credentials (read from environment)
      api_key_var: "ENV_VAR_NAME"
      client_id_var: "ENV_VAR_NAME"

      # Extractor-specific config
      group_id: "12345"
      profile_ids: ["profile1"]

    paths:
      report_name: report_name
      filename_pattern: "report_{date}.csv"

    snowflake:
      target_table: TABLE_NAME
      deduplication:
        strategy: single_field
        unique_columns: [ID]

Minimal Config Example

dag:
  domain: test
  pipeline_name: quick_test
  description: "Minimal test"
  schedule: "0 0 * * *"
  default_args:
    start_date: "2025-01-01"

globals:
  snowflake:
    database: TEST_DB
    schema: BRONZE

sources:
  test_source:
    source_type: sprout_api
    load_strategy: full_refresh
    extractor:
      api_key_var: "SPROUT_API_KEY"
      client_id_var: "SPROUT_CLIENT_ID"
      group_id: "12345"
      profile_ids: ["profile1"]
    paths:
      report_name: test
      filename_pattern: "test_{date}.csv"
    snowflake:
      target_table: TEST_TABLE

Troubleshooting

Error: Missing required environment variable

❌ Missing required environment variable: CONDENAST_SPROUT_API_KEY
Set it in .env or export it: export CONDENAST_SPROUT_API_KEY=value

Solution: Set the required environment variable:

export CONDENAST_SPROUT_API_KEY=your_key
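All required variables can be verified in one pass before a run. A sketch (check_env is a hypothetical helper; substitute the variable names your extractor needs):

```shell
# Return non-zero and name every variable that is unset or empty.
check_env() {
  missing=0
  for var in "$@"; do
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "Missing required environment variable: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_env CONDENAST_SPROUT_API_KEY CONDENAST_SPROUT_CLIENT_ID ||
  echo "Set the variables above before running test-extractor."
```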

Error: S3 upload failed

❌ S3 upload failed: NoSuchBucket

Solution: Create the S3 bucket:

aws s3 mb s3://dev-bucket

Error: Invalid date format

❌ Invalid date format: 12-01-2025. Use YYYY-MM-DD

Solution: Use the correct date format:

--date 2025-12-01  # ✅ Correct
--date 12-01-2025  # ❌ Wrong
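The format can also be validated up front in shell. A sketch of a simple shape check (not a full calendar validation; valid_date is a hypothetical helper):

```shell
# Accept only strings shaped like YYYY-MM-DD; rejects other orderings.
valid_date() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

valid_date 2025-12-01 && echo "ok: 2025-12-01"
valid_date 12-01-2025 || echo "rejected: 12-01-2025"
```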

Error: Source not found in config

❌ Source 'wrong_name' not found in config. Available: ['sprout_messages']

Solution: Use the correct source name or omit --source to use the first one:

# Option 1: Use correct name
--source sprout_messages

# Option 2: Omit to use first source
# (no --source flag)

Best Practices

1. Use Separate Test Configs

Don't test with production configs. Create separate test-*.yaml configs:

packages/xo-foundry/configs/
├── warbyparker-timestamps.yaml      # Production
├── test-sprout-messages.yaml        # Testing
└── test-sprout-cases.yaml           # Testing

2. Use dev-bucket for Testing

Always use a separate development bucket:

--bucket dev-bucket  # ✅ Good
--bucket prod-data   # ❌ Bad

3. Test with Recent Dates

Use recent dates that have data:

--date 2025-12-01  # ✅ Recent date with data
--date 2020-01-01  # ❌ Old date, might have no data

4. Validate Output

After extraction, inspect the files:

# Check row count
wc -l test-output/clean_*.csv

# Check columns
head -1 test-output/clean_*.csv

# Spot check data
head -10 test-output/clean_*.csv
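The row and column checks above can be combined into a single awk pass. A sketch (assumes a simple CSV with no commas inside quoted fields; csv_summary is a hypothetical helper):

```shell
# Report header column count and data-row count for a simple CSV.
# Quoted fields containing commas would need a real CSV parser.
csv_summary() {
  awk -F',' 'NR==1 { cols=NF } END { print cols " columns, " NR-1 " rows" }' "$1"
}

# Usage: csv_summary test-output/clean_sprout_messages_20251201.csv
```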

5. Iterate Quickly

Use --dry-run and --output-dir for fastest iteration:

# Fast iteration loop
uv run xo-foundry test-extractor \
  --config test.yaml \
  --date 2025-12-01 \
  --dry-run \
  --output-dir ./output

# Inspect results
cat output/clean_*.csv | head

Integration with Development Workflow

Typical Development Flow

  1. Create extractor in xo-core
  2. Create test config in packages/xo-foundry/configs/test-*.yaml
  3. Set credentials in .env
  4. Run test:
    uv run xo-foundry test-extractor \
      --config packages/xo-foundry/configs/test-sprout-messages.yaml \
      --date 2025-12-01 \
      --dry-run \
      --output-dir ./test-output
    
  5. Inspect output in ./test-output/
  6. Fix issues and repeat step 4
  7. Test with S3 upload:
    uv run xo-foundry test-extractor \
      --config packages/xo-foundry/configs/test-sprout-messages.yaml \
      --date 2025-12-01 \
      --bucket dev-bucket
    
  8. Verify in S3:
    aws s3 ls s3://dev-bucket/raw/
    aws s3 ls s3://dev-bucket/clean/
    
  9. Create production config and deploy DAG

Before Deploying to Airflow

Always test extractors with this tool before deploying to Airflow:

# 1. Test extraction
uv run xo-foundry test-extractor \
  --config packages/xo-foundry/configs/test-sprout-messages.yaml \
  --date 2025-12-01 \
  --bucket dev-bucket

# 2. Verify data quality
aws s3 cp s3://dev-bucket/clean/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv - | head -100

# 3. Check for common issues
# - Are all expected columns present?
# - Are column names Snowflake-compatible (UPPERCASE, no special chars)?
# - Is the data format correct?

# 4. If everything looks good, create production config and deploy DAG
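The Snowflake-compatibility check above can be scripted: flag any header field that is not an upper-case identifier. A sketch (check_header is a hypothetical helper; the pattern reflects the UPPERCASE convention shown earlier, not Snowflake's full quoting rules):

```shell
# Print every header column that is not an UPPERCASE identifier
# (letters, digits, underscores; must not start with a digit).
check_header() {
  head -1 "$1" | tr ',' '\n' | grep -vE '^[A-Z_][A-Z0-9_]*$' || true
}

# Usage: check_header clean.csv
# No output means all column names look Snowflake-compatible.
```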

S3 Bucket Structure

The tool creates the following structure in your S3 bucket:

dev-bucket/
├── raw/                              # Raw data from API
│   └── {domain}/
│       └── {report_name}/
│           └── {date}/
│               └── {filename}
└── clean/                            # Standardized data
    └── {domain}/
        └── {report_name}/
            └── {date}/
                └── {filename}

Example:

dev-bucket/
├── raw/
│   └── condenast/
│       ├── sprout_messages/
│       │   ├── 2025-12-01/
│       │   │   └── sprout_messages_20251201.csv
│       │   └── 2025-12-02/
│       │       └── sprout_messages_20251202.csv
│       └── sprout_cases/
│           └── 2025-12-01/
│               └── sprout_cases_20251201.csv
└── clean/
    └── condenast/
        ├── sprout_messages/
        │   ├── 2025-12-01/
        │   │   └── sprout_messages_20251201.csv
        │   └── 2025-12-02/
        │       └── sprout_messages_20251202.csv
        └── sprout_cases/
            └── 2025-12-01/
                └── sprout_cases_20251201.csv
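The layout maps onto a single key template, so expected paths can be computed before checking them with aws s3 ls. A sketch (s3_key is a hypothetical helper; actual filenames come from the config's filename_pattern):

```shell
# Build the S3 key for a layer (raw or clean), mirroring the layout above.
s3_key() {
  # args: bucket layer domain report_name date filename
  echo "s3://$1/$2/$3/$4/$5/$6"
}

s3_key dev-bucket raw condenast sprout_messages 2025-12-01 sprout_messages_20251201.csv
# → s3://dev-bucket/raw/condenast/sprout_messages/2025-12-01/sprout_messages_20251201.csv
```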

Supported Extractors

Currently supported:

  • Gladly API (source_type: gladly_api)
  • Sprout Social API (source_type: sprout_api)
      • Messages endpoint
      • Cases endpoint

Coming soon:

  • Gmail
  • Google Sheets
  • S3-to-S3 copy

Advanced Usage

Custom AWS Profile

If you have multiple AWS profiles:

# ~/.aws/credentials
[default]
aws_access_key_id = ...
aws_secret_access_key = ...

[dev]
aws_access_key_id = ...
aws_secret_access_key = ...

[prod]
aws_access_key_id = ...
aws_secret_access_key = ...

Use specific profile:

uv run xo-foundry test-extractor \
  --config test.yaml \
  --date 2025-12-01 \
  --aws-profile dev

Save to Local Directory Only

Skip S3 entirely and only save locally:

uv run xo-foundry test-extractor \
  --config test.yaml \
  --date 2025-12-01 \
  --dry-run \
  --output-dir ./my-test-data

Summary

The test-extractor tool provides a fast, efficient way to:

  • ✅ Test extractors without Airflow
  • ✅ Validate API credentials and configurations
  • ✅ Inspect raw and standardized data
  • ✅ Iterate quickly during development
  • ✅ Verify S3 integration before deployment

This significantly reduces the time from "I want to add a new extractor" to "I have validated data in S3".