Architecture: 3-Tier DWH Audit Architecture

dwh-auditor is built on a three-tier architecture that strictly enforces Separation of Concerns. This design centralizes all BigQuery-specific code in one place, so support for other DWHs such as Snowflake or Redshift can be added later with minimal changes.

Overall Processing Flow

┌─────────────────────────────────────────────┐
│  CLI (main.py / Typer)                      │
│  dwh-auditor analyze --project ... --days 30│
└─────────────────────┬───────────────────────┘
                      │
           ┌──────────▼──────────┐
           │  Extractor Tier     │  ← google.cloud.bigquery EXCLUSIVELY here
           │  extractor/         │
           │  bigquery.py        │
           └──────────┬──────────┘
                      │ [QueryJob, TableStorage]
           ┌──────────▼──────────┐
           │  Analyzer Tier     │  ← Pure Python, zero external API dependencies
           │  analyzer/          │
           │  cost.py / scan.py  │
           │  zombie.py          │
           └──────────┬──────────┘
                      │ [AuditResult]
           ┌──────────▼──────────┐
           │  Reporter Tier     │
           │  reporter/          │
           │  console.py         │  ← Rich terminal output
           │  markdown.py        │  ← Generates report.md
           └─────────────────────┘

Responsibilities of Each Tier

Extractor Tier (src/dwh_auditor/extractor/)

This is the only point of contact with BigQuery. Its sole responsibility is to issue queries against INFORMATION_SCHEMA and convert the results into Pydantic models.

Note

Importing the google.cloud.bigquery library is allowed only in extractor/bigquery.py; no other module may import it.
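This boundary can also be checked mechanically in CI. A minimal sketch of such a check (the helper below is hypothetical, not part of dwh-auditor) that scans source files for the forbidden import:

```python
import pathlib

FORBIDDEN = "google.cloud.bigquery"


def find_forbidden_imports(src_root: pathlib.Path) -> list[str]:
    """Return source files outside extractor/bigquery.py that mention the BigQuery SDK."""
    offenders = []
    for py_file in src_root.glob("**/*.py"):
        # extractor/bigquery.py is the single allowed location
        if py_file.name == "bigquery.py" and py_file.parent.name == "extractor":
            continue
        if FORBIDDEN in py_file.read_text(encoding="utf-8"):
            offenders.append(str(py_file))
    return offenders
```

A pytest that asserts `find_forbidden_imports(SRC_ROOT) == []` would then fail the build whenever the layering rule is violated.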

Data Sources Retrieved:

| INFORMATION_SCHEMA View | Content Retrieved |
|---|---|
| INFORMATION_SCHEMA.JOBS | Query execution history over the past N days |
| INFORMATION_SCHEMA.TABLE_STORAGE | Storage usage per table |
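The exact SQL is internal to extractor/bigquery.py; as a rough illustration only, a job-history query against INFORMATION_SCHEMA.JOBS might be assembled like this (the column selection and filters are assumptions, though the columns shown exist in BigQuery's documented JOBS schema):

```python
def build_jobs_query(region: str, days: int) -> str:
    """Sketch of a job-history query; illustrative, not dwh-auditor's actual SQL."""
    return f"""
        SELECT job_id, user_email, query, total_bytes_billed, creation_time
        FROM `{region}`.INFORMATION_SCHEMA.JOBS
        WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days} DAY)
          AND job_type = 'QUERY'
          AND state = 'DONE'
    """
```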

Provided Function Interfaces:

extractor = BigQueryExtractor(project_id="my-project", region="region-us")
jobs: list[QueryJob] = extractor.get_job_history(days=30)
tables: list[TableStorage] = extractor.get_table_storage()

Analyzer Tier (src/dwh_auditor/analyzer/)

This tier is pure Python logic: it diagnoses the Pydantic models received from the Extractor against the thresholds defined in config.yaml.

Because it performs no external API communication, its unit tests complete in milliseconds.

| Module | Diagnostic Logic |
|---|---|
| analyzer/cost.py | High-cost query detection (scan bytes → USD conversion, Top-N ranking) |
| analyzer/scan.py | Full-scan detection (checks for the presence of a WHERE clause / partition filter using regular expressions) |
| analyzer/zombie.py | Zombie table detection (cross-references tables against those referenced in job history) |
| analyzer/runner.py | Calls the three modules above and aggregates their findings into an AuditResult |
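To give a flavor of this tier's logic, here is a minimal sketch of the bytes-to-USD ranking and the WHERE-clause heuristic. The names (`Job`, `top_n_by_cost`, `looks_like_full_scan`) and the $6.25/TiB on-demand rate are illustrative assumptions, not dwh-auditor's actual API; note the functions are pure and need no BigQuery connection.

```python
import re
from dataclasses import dataclass

# Assumed on-demand rate: $6.25 per TiB scanned (verify against current BigQuery pricing).
USD_PER_TIB = 6.25


@dataclass
class Job:
    """Stand-in for the real QueryJob Pydantic model."""
    job_id: str
    total_bytes_billed: int


def top_n_by_cost(jobs: list[Job], n: int = 10) -> list[tuple[str, float]]:
    """Rank jobs by estimated USD cost, most expensive first."""
    def cost(job: Job) -> float:
        return job.total_bytes_billed / 2**40 * USD_PER_TIB

    ranked = sorted(jobs, key=cost, reverse=True)
    return [(j.job_id, round(cost(j), 4)) for j in ranked[:n]]


_WHERE_RE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def looks_like_full_scan(sql: str) -> bool:
    """Crude heuristic: a query with no WHERE clause scans the whole table."""
    return _WHERE_RE.search(sql) is None
```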

Reporter Tier (src/dwh_auditor/reporter/)

This tier takes an AuditResult and formats it for presentation to the user.

| Module | Output Destination |
|---|---|
| reporter/console.py | Colored terminal output via the Rich library |
| reporter/markdown.py | Generates report.md (can be saved as a CI/CD artifact) |
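A minimal sketch of what a Markdown renderer in this tier could look like (the function and field names below are hypothetical, not reporter/markdown.py's real interface):

```python
def render_markdown(project: str, findings: list[dict]) -> str:
    """Render cost findings as a Markdown report; field names are illustrative."""
    lines = [f"# DWH Audit Report: {project}", ""]
    lines.append("| Job ID | Estimated Cost (USD) |")
    lines.append("|---|---|")
    for finding in findings:
        lines.append(f"| {finding['job_id']} | {finding['cost_usd']:.2f} |")
    return "\n".join(lines)
```

Because the output is a plain string, the result can be written to report.md and uploaded as a CI/CD artifact without any extra dependencies.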

Internal Data Model

Pydantic models are used to pass data between the tiers. Rather than passing raw dicts around, explicit type definitions prevent field-name bugs and enable editor completion.

Extractor โ†’ Analyzer:
  - QueryJob           (1 BQ job history record)
  - TableStorage       (1 table storage info record)

Analyzer โ†’ Reporter:
  - CostInsight        (High-cost query analysis result)
  - FullScanInsight    (Full scan detection result)
  - ZombieTableInsight (Zombie table detection result)
  - AuditResult        (Aggregated result of the above)

See dwh_auditor.models (the internal Pydantic data model for DWH auditing) for details.
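To make the hand-off concrete, here is a hedged sketch of what a few of these models might look like. The field names are assumptions for illustration, not the actual dwh_auditor.models definitions (TableStorage and the other insight models are omitted for brevity):

```python
from datetime import datetime

from pydantic import BaseModel


class QueryJob(BaseModel):
    """One BigQuery job-history record (field names are illustrative)."""
    job_id: str
    user_email: str
    query: str
    total_bytes_billed: int
    creation_time: datetime


class CostInsight(BaseModel):
    """One high-cost query finding."""
    job_id: str
    estimated_usd: float


class AuditResult(BaseModel):
    """Aggregated findings handed from Analyzer to Reporter."""
    cost_insights: list[CostInsight] = []
```

Validation happens at construction time, so a malformed row from the Extractor fails loudly instead of propagating a bad dict through the Analyzer.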

Testing Strategy: Why Fast Tests Are Possible

The biggest advantage of the 3-tier separation is testability.

Testing the Analyzer Tier

Simply populate the Pydantic models with dummy data and call the function. No BigQuery mocks are required, and the suite of 46 tests completes in under 0.35 seconds.

# Analyzer can be tested without connecting to BQ
jobs = [QueryJob(job_id="j1", user_email="u@e.com", ...)]
result = analyze_cost(jobs, config=AppConfig())
assert len(result) == 1

Testing the Extractor Tier

Just patch google.cloud.bigquery.Client with pytest-mock.

def test_get_job_history(mocker):
    mock_client = mocker.patch("dwh_auditor.extractor.bigquery.bq.Client")
    mock_client.return_value.query.return_value.result.return_value = [...]  # one fake result row (elided)

    extractor = BigQueryExtractor(project_id="p", region="region-us")
    jobs = extractor.get_job_history(days=30)
    assert len(jobs) == 1

Extensibility: Supporting Other DWHs

You can add support for other data warehouses simply by swapping out the Extractor Tier.

extractor/
├── bigquery.py    ← Current implementation
├── snowflake.py   ← Future extension example
└── redshift.py    ← Future extension example

The Analyzer Tier and Reporter Tier require no changes.
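One way to make the swap point explicit (a sketch, not existing dwh-auditor code) is a structural Protocol that every warehouse extractor satisfies; the method names follow the BigQueryExtractor interface shown above:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Extractor(Protocol):
    """Interface a warehouse-specific extractor is expected to satisfy.

    A future snowflake.py or redshift.py would implement the same signatures,
    leaving the Analyzer and Reporter tiers untouched.
    """

    def get_job_history(self, days: int) -> list:
        ...

    def get_table_storage(self) -> list:
        ...


class FakeExtractor:
    """Minimal stand-in implementation, e.g. for tests."""

    def get_job_history(self, days: int) -> list:
        return []

    def get_table_storage(self) -> list:
        return []
```

Because Protocol matching is structural, BigQueryExtractor would conform without inheriting from anything, and the Analyzer can be typed against Extractor rather than any concrete warehouse.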