Skill v1.0.1
currentAutomated scan100/1003 files
version: "1.0.1" name: warehouse-init description: Initialize warehouse schema discovery. Generates .astro/warehouse.md with all table metadata for instant lookups. Run once per project, refresh when schema changes. Use when user says "/astronomer-data:warehouse-init" or asks to set up data discovery.
Initialize Warehouse Schema
Generate a comprehensive, user-editable schema reference file for the data warehouse.
Scripts: ../analyzing-data/scripts/ — All CLI commands below are relative to the analyzing-data skill's directory. Before running any scripts/cli.py command, cd to ../analyzing-data/ relative to this file.
What This Does
- Discovers all databases, schemas, tables, and columns from the warehouse
- Enriches with codebase context (dbt models, gusty SQL, schema docs)
- Records row counts and identifies large tables
- Generates
.astro/warehouse.md- a version-controllable, team-shareable reference - Enables instant concept→table lookups without warehouse queries
Process
Step 1: Read Warehouse Configuration
cat ~/.astro/agents/warehouse.yml
Get the list of databases to discover (e.g., databases: [HQ, ANALYTICS, RAW]).
Step 2: Search Codebase for Context (Parallel)
Launch a subagent to find business context in code:
Task(subagent_type="Explore",prompt="""Search for data model documentation in the codebase:1. dbt models: **/models/**/*.yml, **/schema.yml- Extract table descriptions, column descriptions- Note primary keys and tests2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter- Parse frontmatter for: description, primary_key, tests- Note schema mappings3. AGENTS.md or CLAUDE.md files with data layer documentationReturn a mapping of:table_name -> {description, primary_key, important_columns, layer}""")
Step 3: Parallel Warehouse Discovery
Launch one subagent per database using the Task tool:
For each database in configured_databases:Task(subagent_type="general-purpose",prompt="""Discover all metadata for database {DATABASE}.Use the CLI to run SQL queries:# Scripts are relative to ../analyzing-data/uv run scripts/cli.py exec "df = run_sql('...')"uv run scripts/cli.py exec "print(df)"1. Query schemas:SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA2. Query tables with row counts:SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENTFROM {DATABASE}.INFORMATION_SCHEMA.TABLESORDER BY TABLE_SCHEMA, TABLE_NAME3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENTFROM {DATABASE}.INFORMATION_SCHEMA.COLUMNSWHERE TABLE_SCHEMA = 'X'Return a structured summary:- Database name- List of schemas with table counts- For each table: name, row_count, key columns- Flag any tables with >100M rows as "large"""")
Run all subagents in parallel (single message with multiple Task calls).
Step 4: Discover Categorical Value Families
For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:
uv run cli.py exec "df = run_sql('''SELECT DISTINCT column_name, COUNT(*) as occurrencesFROM tableWHERE column_name IS NOT NULLGROUP BY column_nameORDER BY occurrences DESCLIMIT 50''')"uv run cli.py exec "print(df)"
Group related values into families by common prefix/suffix (e.g., Export* for ExportCSV, ExportJSON, ExportParquet).
Step 5: Merge Results
Combine warehouse metadata + codebase context:
- Quick Reference table - concept → table mappings (pre-populated from code if found)
- Categorical Columns - value families for key filter columns
- Database sections - one per database
- Schema subsections - tables grouped by schema
- Table details - columns, row counts, descriptions from code, warnings
Step 6: Generate warehouse.md
Write the file to:
.astro/warehouse.md(default - project-specific, version-controllable)~/.astro/agents/warehouse.md(if--globalflag)
Output Format
# Warehouse Schema> Generated by `/astronomer-data:warehouse-init` on {DATE}. Edit freely to add business context.## Quick Reference| Concept | Table | Key Column | Date Column ||---------|-------|------------|-------------|| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |<!-- Add your concept mappings here -->## Categorical ColumnsWhen filtering on these columns, explore value families first (values often have variants):| Table | Column | Value Families ||-------|--------|----------------|| {TABLE} | {COLUMN} | `{PREFIX}*` ({VALUE1}, {VALUE2}, ...) |<!-- Populated by /astronomer-data:warehouse-init from actual warehouse data -->## Data Layer HierarchyQuery downstream first: `reporting` > `mart_*` > `metric_*` > `model_*` > `IN_*`| Layer | Prefix | Purpose ||-------|--------|---------|| Reporting | `reporting.*` | Dashboard-optimized || Mart | `mart_*` | Combined analytics || Metric | `metric_*` | KPIs at various grains || Model | `model_*` | Cleansed sources of truth || Raw | `IN_*` | Source data - avoid |## {DATABASE} Database### {SCHEMA} Schema#### {TABLE_NAME}{DESCRIPTION from code if found}| Column | Type | Description ||--------|------|-------------|| COL1 | VARCHAR | {from code or inferred} |-**Rows:** {ROW_COUNT}-**Key column:** {PRIMARY_KEY from code or inferred}{IF ROW_COUNT > 100M: - **⚠️ WARNING:** Large table - always add date filters}## Relationships
{Inferred relationships based on column names like *_ID}
Command Options
| Option | Effect | |
|---|---|---|
/astronomer-data:warehouse-init | Generate .astro/warehouse.md | |
/astronomer-data:warehouse-init --refresh | Regenerate, preserving user edits | |
/astronomer-data:warehouse-init --database HQ | Only discover specific database | |
/astronomer-data:warehouse-init --global | Write to ~/.astro/agents/ instead |
Step 7: Pre-populate Cache
After generating warehouse.md, populate the concept cache:
# Scripts are relative to ../analyzing-data/uv run cli.py concept import -p .astro/warehouse.mduv run cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
Step 8: Offer CLAUDE.md Integration (Ask User)
Ask the user:
Would you like to add the Quick Reference table to your CLAUDE.md file?This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.Options:1. Yes, add to CLAUDE.md (Recommended) - Append Quick Reference section2. No, skip - Use warehouse.md and cache only
If user chooses Yes:
- Check if
.claude/CLAUDE.mdorCLAUDE.mdexists - If exists, append the Quick Reference section (avoid duplicates)
- If not exists, create
.claude/CLAUDE.mdwith just the Quick Reference
Quick Reference section to add:
## Data Warehouse Quick ReferenceWhen querying the warehouse, use these table mappings:| Concept | Table | Key Column | Date Column ||---------|-------|------------|-------------|{rows from warehouse.md Quick Reference}**Large tables (always filter by date):** {list tables with >100M rows}> Auto-generated by `/astronomer-data:warehouse-init`. Run `/astronomer-data:warehouse-init --refresh` to update.
If yes: Append the Quick Reference section to .claude/CLAUDE.md or CLAUDE.md.
After Generation
Tell the user:
Generated .astro/warehouse.mdSummary:- {N} databases, {N} schemas, {N} tables- {N} tables enriched with code descriptions- {N} concepts cached for instant lookupNext steps:1. Edit .astro/warehouse.md to add business context2. Commit to version control3. Run /astronomer-data:warehouse-init --refresh when schema changes
Refresh Behavior
When --refresh is specified:
- Read existing warehouse.md
- Preserve all HTML comments (
<!-- ... -->) - Preserve Quick Reference table entries (user-added)
- Preserve user-added descriptions
- Update row counts and add new tables
- Mark removed tables with
<!-- REMOVED -->comment
Cache Staleness & Schema Drift
The runtime cache has a 7-day TTL by default. After 7 days, cached entries expire and will be re-discovered on next use.
When to Refresh
Run /astronomer-data:warehouse-init --refresh when:
- Schema changes: Tables added, renamed, or removed
- Column changes: New columns added or types changed
- After deployments: If your data pipeline deploys schema migrations
- Weekly: As a good practice, even if no known changes
Signs of Stale Cache
Watch for these indicators:
- Queries fail with "table not found" errors
- Results seem wrong or outdated
- New tables aren't being discovered
Manual Cache Reset
If you suspect cache issues:
# Scripts are relative to ../analyzing-data/uv run scripts/cli.py cache statusuv run scripts/cli.py cache clear --stale-onlyuv run scripts/cli.py cache clear
Codebase Patterns Recognized
| Pattern | Source | What We Extract | |
|---|---|---|---|
**/models/**/*.yml | dbt | table/column descriptions, tests | |
**/dags/**/*.sql | gusty | YAML frontmatter (description, primary_key) | |
AGENTS.md, CLAUDE.md | docs | data layer hierarchy, conventions | |
**/docs/**/*.md | docs | business context |
Example Session
User: /astronomer-data:warehouse-initAgent:→ Reading warehouse configuration...→ Found 1 warehouse with databases: HQ, PRODUCT→ Searching codebase for data documentation...Found: AGENTS.md with data layer hierarchyFound: 45 SQL files with YAML frontmatter in dags/declarative/→ Launching parallel warehouse discovery...[Database: HQ] Discovering schemas...[Database: PRODUCT] Discovering schemas...→ HQ: Found 29 schemas, 401 tables→ PRODUCT: Found 1 schema, 0 tables→ Merging warehouse metadata with code context...Enriched 45 tables with descriptions from code→ Generated .astro/warehouse.mdSummary:- 2 databases- 30 schemas- 401 tables- 45 tables enriched with code descriptions- 8 large tables flagged (>100M rows)Next steps:1. Review .astro/warehouse.md2. Add concept mappings to Quick Reference3. Commit to version control4. Run /astronomer-data:warehouse-init --refresh when schema changes