Architecture: 3-Tier DWH Audit Architecture

dwh-auditor is built on a three-tier architecture that strictly enforces Separation of Concerns. This design centralizes all BigQuery-specific code in one place, so support for other DWHs such as Snowflake or Redshift can be added later with minimal changes.

Overall Processing Flow

┌─────────────────────────────────────────────┐
│  CLI (main.py / Typer)                      │
│  dwh-auditor analyze --project ... --days 30│
└─────────────────────┬───────────────────────┘
                      │
           ┌──────────▼──────────┐
           │  Extractor Tier     │  ← google.cloud.bigquery EXCLUSIVELY here
           │  extractor/         │
           │  bigquery.py        │
           └──────────┬──────────┘
                      │ [QueryJob, TableStorage]
           ┌──────────▼──────────┐
           │  Analyzer Tier     │  ← Pure Python, zero external API dependencies
           │  analyzer/          │
           │  cost.py / scan.py  │
           │  zombie.py          │
           └──────────┬──────────┘
                      │ [AuditResult]
           ┌──────────▼──────────┐
           │  Reporter Tier     │
           │  reporter/          │
           │  console.py         │  ← Rich terminal output
           │  markdown.py        │  ← Generates report.md
           └─────────────────────┘

Responsibilities of Each Tier

Extractor Tier (src/dwh_auditor/extractor/)

This is the only point of contact with BigQuery. Its sole responsibility is to issue queries against INFORMATION_SCHEMA and convert the results into Pydantic models.

Note

Importing the google.cloud.bigquery library is allowed only in extractor/bigquery.py; no other module may import it.
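This boundary can also be checked mechanically in CI. A minimal sketch of such a check (the helper below is hypothetical, not part of dwh-auditor) that scans source files for the forbidden import:

```python
import pathlib

FORBIDDEN = "google.cloud.bigquery"


def find_forbidden_imports(src_root: pathlib.Path) -> list[str]:
    """Return source files outside extractor/bigquery.py that mention the BigQuery SDK."""
    offenders = []
    for py_file in src_root.glob("**/*.py"):
        # extractor/bigquery.py is the single allowed location
        if py_file.name == "bigquery.py" and py_file.parent.name == "extractor":
            continue
        if FORBIDDEN in py_file.read_text(encoding="utf-8"):
            offenders.append(str(py_file))
    return offenders
```

A pytest that asserts `find_forbidden_imports(SRC_ROOT) == []` would then fail the build whenever the layering rule is violated.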

Data Sources Retrieved:

| INFORMATION_SCHEMA View | Content Retrieved |
|---|---|
| INFORMATION_SCHEMA.JOBS | Query execution history over the past N days |
| INFORMATION_SCHEMA.TABLE_STORAGE | Storage usage per table |
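The exact SQL is internal to extractor/bigquery.py; as a rough illustration only, a job-history query against INFORMATION_SCHEMA.JOBS might be assembled like this (the column selection and filters are assumptions, though the columns shown exist in BigQuery's documented JOBS schema):

```python
def build_jobs_query(region: str, days: int) -> str:
    """Sketch of a job-history query; illustrative, not dwh-auditor's actual SQL."""
    return f"""
        SELECT job_id, user_email, query, total_bytes_billed, creation_time
        FROM `{region}`.INFORMATION_SCHEMA.JOBS
        WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days} DAY)
          AND job_type = 'QUERY'
          AND state = 'DONE'
    """
```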

Provided Function Interfaces:

extractor = BigQueryExtractor(project_id="my-project", region="region-us")
jobs: list[QueryJob] = extractor.get_job_history(days=30)
tables: list[TableStorage] = extractor.get_table_storage()

Analyzer Tier (src/dwh_auditor/analyzer/)

This tier is pure Python logic: it diagnoses the Pydantic models received from the Extractor against the thresholds defined in config.yaml.

Because it performs no external API communication, its unit tests complete in milliseconds.

| Module | Diagnostic Logic |
|---|---|
| analyzer/cost.py | High-cost query detection (scan bytes → USD conversion, Top-N ranking) |
| analyzer/scan.py | Full-scan detection (checks for the presence of a WHERE clause / partition filter using regular expressions) |
| analyzer/zombie.py | Zombie table detection (cross-references tables against those referenced in job history) |
| analyzer/runner.py | Calls the three modules above and aggregates their findings into an AuditResult |
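To give a flavor of this tier's logic, here is a minimal sketch of the bytes-to-USD ranking and the WHERE-clause heuristic. The names (`Job`, `top_n_by_cost`, `looks_like_full_scan`) and the $6.25/TiB on-demand rate are illustrative assumptions, not dwh-auditor's actual API; note the functions are pure and need no BigQuery connection.

```python
import re
from dataclasses import dataclass

# Assumed on-demand rate: $6.25 per TiB scanned (verify against current BigQuery pricing).
USD_PER_TIB = 6.25


@dataclass
class Job:
    """Stand-in for the real QueryJob Pydantic model."""
    job_id: str
    total_bytes_billed: int


def top_n_by_cost(jobs: list[Job], n: int = 10) -> list[tuple[str, float]]:
    """Rank jobs by estimated USD cost, most expensive first."""
    def cost(job: Job) -> float:
        return job.total_bytes_billed / 2**40 * USD_PER_TIB

    ranked = sorted(jobs, key=cost, reverse=True)
    return [(j.job_id, round(cost(j), 4)) for j in ranked[:n]]


_WHERE_RE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def looks_like_full_scan(sql: str) -> bool:
    """Crude heuristic: a query with no WHERE clause scans the whole table."""
    return _WHERE_RE.search(sql) is None
```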

Reporter Tier (src/dwh_auditor/reporter/)

This tier takes an AuditResult and formats it for presentation to the user.

| Module | Output Destination |
|---|---|
| reporter/console.py | Colored terminal output via the Rich library |
| reporter/markdown.py | Generates report.md (can be saved as a CI/CD artifact) |
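A minimal sketch of what a Markdown renderer in this tier could look like (the function and field names below are hypothetical, not reporter/markdown.py's real interface):

```python
def render_markdown(project: str, findings: list[dict]) -> str:
    """Render cost findings as a Markdown report; field names are illustrative."""
    lines = [f"# DWH Audit Report: {project}", ""]
    lines.append("| Job ID | Estimated Cost (USD) |")
    lines.append("|---|---|")
    for finding in findings:
        lines.append(f"| {finding['job_id']} | {finding['cost_usd']:.2f} |")
    return "\n".join(lines)
```

Because the output is a plain string, the result can be written to report.md and uploaded as a CI/CD artifact without any extra dependencies.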

Internal Data Model

Pydantic models are used to pass data between the tiers. Rather than passing raw dicts around, explicit type definitions prevent field-name bugs and enable editor completion.

Extractor โ†’ Analyzer:
  - QueryJob           (1 BQ job history record)
  - TableStorage       (1 table storage info record)

Analyzer โ†’ Reporter:
  - CostInsight        (High-cost query analysis result)
  - FullScanInsight    (Full scan detection result)
  - ZombieTableInsight (Zombie table detection result)
  - AuditResult        (Aggregated result of the above)

See dwh_auditor.models (the internal Pydantic data model for DWH auditing) for details.
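To make the hand-off concrete, here is a hedged sketch of what a few of these models might look like. The field names are assumptions for illustration, not the actual dwh_auditor.models definitions (TableStorage and the other insight models are omitted for brevity):

```python
from datetime import datetime

from pydantic import BaseModel


class QueryJob(BaseModel):
    """One BigQuery job-history record (field names are illustrative)."""
    job_id: str
    user_email: str
    query: str
    total_bytes_billed: int
    creation_time: datetime


class CostInsight(BaseModel):
    """One high-cost query finding."""
    job_id: str
    estimated_usd: float


class AuditResult(BaseModel):
    """Aggregated findings handed from Analyzer to Reporter."""
    cost_insights: list[CostInsight] = []
```

Validation happens at construction time, so a malformed row from the Extractor fails loudly instead of propagating a bad dict through the Analyzer.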

Testing Strategy: Why Fast Tests Are Possible

The biggest advantage of the 3-tier separation is testability.

Testing the Analyzer Tier

Simply populate the Pydantic models with dummy data and call the function. No BigQuery mocks are required, and the suite of 46 tests completes in under 0.35 seconds.

# Analyzer can be tested without connecting to BQ
jobs = [QueryJob(job_id="j1", user_email="u@e.com", ...)]
result = analyze_cost(jobs, config=AppConfig())
assert len(result) == 1

Testing the Extractor Tier

Just patch google.cloud.bigquery.Client with pytest-mock.

def test_get_job_history(mocker):
    mock_client = mocker.patch("dwh_auditor.extractor.bigquery.bq.Client")
    mock_client.return_value.query.return_value.result.return_value = [...]  # one fake result row (elided)

    extractor = BigQueryExtractor(project_id="p", region="region-us")
    jobs = extractor.get_job_history(days=30)
    assert len(jobs) == 1

Extensibility: Supporting Other DWHs

You can add support for other data warehouses simply by swapping out the Extractor Tier.

extractor/
├── bigquery.py    ← Current implementation
├── snowflake.py   ← Future extension example
└── redshift.py    ← Future extension example

The Analyzer Tier and Reporter Tier require no changes.
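One way to make the swap point explicit (a sketch, not existing dwh-auditor code) is a structural Protocol that every warehouse extractor satisfies; the method names follow the BigQueryExtractor interface shown above:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Extractor(Protocol):
    """Interface a warehouse-specific extractor is expected to satisfy.

    A future snowflake.py or redshift.py would implement the same signatures,
    leaving the Analyzer and Reporter tiers untouched.
    """

    def get_job_history(self, days: int) -> list:
        ...

    def get_table_storage(self) -> list:
        ...


class FakeExtractor:
    """Minimal stand-in implementation, e.g. for tests."""

    def get_job_history(self, days: int) -> list:
        return []

    def get_table_storage(self) -> list:
        return []
```

Because Protocol matching is structural, BigQueryExtractor would conform without inheriting from anything, and the Analyzer can be typed against Extractor rather than any concrete warehouse.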