Architecture โ 3-Tier DWH Audit Architectureยถ
dwh-auditor is designed with a three-tier architecture that strictly enforces Separation of Concerns. This design centralizes BigQuery-specific code and allows you to expand to DWHs like Snowflake and Redshift in the future with minimal changes.
Overall Processing Flowยถ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ CLI (main.py / Typer) โ
โ dwh-auditor analyze --project ... --days 30 โ
โโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโผโโโโโโโโโโโ
โ Extractor Tier โ โ google.cloud.bigquery EXCLUSIVELY here
โ extractor/ โ
โ bigquery.py โ
โโโโโโโโโโโฌโโโโโโโโโโโ
โ [QueryJob, TableStorage]
โโโโโโโโโโโผโโโโโโโโโโโ
โ Analyzer Tier โ โ Pure Python, zero external API dependencies
โ analyzer/ โ
โ cost.py / scan.py โ
โ zombie.py โ
โโโโโโโโโโโฌโโโโโโโโโโโ
โ [AuditResult]
โโโโโโโโโโโผโโโโโโโโโโโ
โ Reporter Tier โ
โ reporter/ โ
โ console.py โ โ Rich terminal output
โ markdown.py โ โ Generates report.md
โโโโโโโโโโโโโโโโโโโโโโ
Responsibilities of Each Tierยถ
Extractor Tier (src/dwh_auditor/extractor/)ยถ
This is the only point of contact with BigQuery. It is only responsible for issuing queries to INFORMATION_SCHEMA and converting the results to Pydantic models.
Note
Importing the google.cloud.bigquery library is allowed only in extractor/bigquery.py. It should never be imported from other modules.
Data Sources Retrieved:
INFORMATION_SCHEMA Views |
Content Retrieved |
|---|---|
|
Query execution history over the past N days |
|
Storage usage per table |
Provided Function Interfaces:
extractor = BigQueryExtractor(project_id="my-project", region="region-us")
jobs: list[QueryJob] = extractor.get_job_history(days=30)
tables: list[TableStorage] = extractor.get_table_storage()
Analyzer Tier (src/dwh_auditor/analyzer/)ยถ
It is pure Python logic that performs diagnostics by matching the Pydantic model received from Extractor with the thresholds in config.yaml.
It doesnโt involve any external API communication, so unit tests complete in milliseconds.
Module |
Diagnostic Logic |
|---|---|
|
Detection of high cost queries (scan bytes โ USD conversion, Top-N ranking) |
|
Full scan detection (determines the presence or absence of WHERE clause/partition filter using regular expressions) |
|
Zombie table detection (match with job history reference table) |
|
Call the above three and aggregate them into |
Reporter Tier (src/dwh_auditor/reporter/)ยถ
It takes an AuditResult and formats it to be shown to the user.
Module |
Output Destination |
|---|---|
|
Color output to terminal with Rich library |
|
Generate |
Internal Data Modelยถ
We use the Pydantic model to pass data between each layer. Rather than passing the dict as is, having a type definition prevents bugs and enables editor completion.
Extractor โ Analyzer:
- QueryJob (1 BQ job history record)
- TableStorage (1 table storage info record)
Analyzer โ Reporter:
- CostInsight (High-cost query analysis result)
- FullScanInsight (Full scan detection result)
- ZombieTableInsight (Zombie table detection result)
- AuditResult (Aggregated result of the above)
See dwh_auditor.models โ Internal data model for DWH auditing (Pydantic) for details.
Testing Strategy โ Why Fast Testing is Possibleยถ
The biggest advantage of the 3-tier separation is testability.
Testing the Analyzer Tierยถ
Just put dummy data in the Pydantic model and call the function. No BigQuery mocks required and 46 tests completed in less than 0.35 seconds.
# Analyzer can be tested without connecting to BQ
jobs = [QueryJob(job_id="j1", user_email="u@e.com", ...)]
result = analyze_cost(jobs, config=AppConfig())
assert len(result) == 1
Testing the Extractor Tierยถ
Just patch google.cloud.bigquery.Client with pytest-mock.
def test_get_job_history(mocker):
mock_client = mocker.patch("dwh_auditor.extractor.bigquery.bq.Client")
mock_client.return_value.query.return_value.result.return_value = [...]
extractor = BigQueryExtractor(project_id="p", region="region-us")
jobs = extractor.get_job_history(days=30)
assert len(jobs) == 1
Extensibility โ Supporting other DWHsยถ
Extractor Tierใๅทฎใๆฟใใใ ใใงใไปใฎใใผใฟใฆใงใขใใฆในใซๅฏพๅฟใงใใพใใ
extractor/
โโโ bigquery.py โ Current implementation
โโโ snowflake.py โ Future extension example
โโโ redshift.py โ Future extension example
Analyzer Tierใจ Reporter Tierใฏ ๅคๆดไธ่ฆ ใงใใ