Data Quality Gates for Regulatory Reporting — What Great Expectations Catches That dbt Tests Miss

dbt tests catch transformation bugs. Great Expectations catches bad source data before dbt runs. Build two quality layers for COREP regulatory reporting with full Python code.
📅 Day 8 of 18  ·  COREP Governance Pipeline Series  ·  Data Quality

You finished Day 7 with four dbt mart models that map your bank data to EBA DPM 4.0 templates. dbt tests verify that cet1_ratio >= 0.045 and lcr_ratio >= 1.0 — and they run inside the transformation step. That sounds thorough. It isn’t.

A Basel III capital ratio test failing inside dbt means the problem surfaces only after you’ve already transformed potentially tainted data into your mart. The raw data has already been loaded, staged, and joined. The bad number is already in the lineage graph.

Regulators don’t accept “we caught it after transformation.” BCBS 239 Principle 3 (accuracy and integrity) and Principle 4 (completeness) apply to risk data from the point of origination. You need a quality gate before dbt runs, and a separate one after the mart is built.

This post builds both layers using Great Expectations (GX) and wires them into your pipeline via quality.py.

1. Why You Need Two Quality Layers

Think of data quality in a regulatory pipeline the same way a manufacturer thinks about quality control on a production line: you inspect raw materials before they enter the machine, and you inspect finished goods before they ship. Inspecting only at the end doesn’t tell you which raw material batch was the problem.

| Layer | When it runs | What it validates | Framework | Failure effect |
| --- | --- | --- | --- | --- |
| Layer 1 (raw gate) | After ingest.py loads CSVs into raw.* | Column presence, nulls, domain values, numeric ranges on source data | Great Expectations | Pipeline halts before dbt runs |
| Layer 2 (mart gate) | After dbt run completes | Regulatory floors (Basel III ratios), cross-table consistency, XBRL decimal precision | Great Expectations + dbt tests | Pipeline halts before XBRL generation |

⚠ What Happens Without Layer 1

Without a raw gate, a single CSV row where risk_weight_pct = 1500 (a data-entry typo) reaches the filter in stg_rwa_exposures.sql (BETWEEN 0 AND 12.5). The string “1500” casts to a decimal, fails the filter, and the row is silently dropped, so your RWA total is understated. The mart model still produces a number. dbt tests pass. You file an incorrect COREP report.
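
The failure mode can be reproduced without a database. The following is a minimal sketch, assuming a hypothetical three-row extract and a staging filter that behaves like the BETWEEN clause described above; `stage_rows`, `total_rwa`, and `raw_gate` are illustrative names, not functions from the pipeline.

```python
from decimal import Decimal

# Hypothetical raw extract: the third row carries the typo (1500 instead of 1.5).
RAW_ROWS = [
    {"exposure_id": "E1", "ead": Decimal("1000"), "risk_weight_pct": "0.5"},
    {"exposure_id": "E2", "ead": Decimal("2000"), "risk_weight_pct": "1.0"},
    {"exposure_id": "E3", "ead": Decimal("4000"), "risk_weight_pct": "1500"},
]

MAX_RISK_WEIGHT = Decimal("12.5")  # 1250% expressed as a decimal


def stage_rows(rows):
    """Mimic the staging filter: cast to decimal, keep rows within 0..12.5."""
    staged = []
    for row in rows:
        weight = Decimal(row["risk_weight_pct"])
        if Decimal("0") <= weight <= MAX_RISK_WEIGHT:
            staged.append({**row, "risk_weight_pct": weight})
    return staged  # out-of-range rows vanish without any error


def total_rwa(rows):
    """Sum ead x risk weight over the staged rows."""
    return sum(row["ead"] * row["risk_weight_pct"] for row in rows)


def raw_gate(rows):
    """Return the IDs of rows a raw gate would flag before staging runs."""
    return [
        row["exposure_id"]
        for row in rows
        if not Decimal("0") <= Decimal(row["risk_weight_pct"]) <= MAX_RISK_WEIGHT
    ]


staged = stage_rows(RAW_ROWS)
print(len(staged), total_rwa(staged))   # E3 is missing from the total
print(raw_gate(RAW_ROWS))               # the raw gate names the offending row
```

The staged total is computed from two rows instead of three, and nothing downstream knows a row went missing. The raw gate, by contrast, points at E3 before any transformation has run.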

2. What dbt Tests Do — and Where They Stop

dbt ships four built-in generic tests plus a rich ecosystem of packages (dbt_utils, dbt_expectations). They are excellent for structural invariants on transformed models. Here is what they do well:

| dbt test | What it checks | Where it runs |
| --- | --- | --- |
| not_null | Column has no NULL rows | On any model, any layer |
| unique | No duplicate values | On any model |
| accepted_values | Value in a fixed list | On staging / mart |
| relationships | FK referential integrity | Between models |
| dbt_utils.expression_is_true | Arbitrary SQL expression | On any model |

Here is what dbt tests cannot do:

| Requirement | dbt can do it? | Why not |
| --- | --- | --- |
| Statistical distribution check (mean ± 3σ on capital ratios) | No | No built-in stats; dbt_utils has no distribution tests |
| Row count within expected band (e.g. 50–5000 rows) | Partial | dbt_utils.expression_is_true can do it but is awkward |
| Column-level completeness % (≥ 95% populated) | No | Built-in not_null is all-or-nothing |
| Cross-column conditional logic (if tier=CET1 then amount must be positive) | Hard | Requires custom macro + SQL injection risk |
| Profiling: min, max, mean, p5, p95 stored as metadata | No | dbt tests are pass/fail, no metrics storage |
| HTML data docs for human audit review | No | dbt docs don’t include per-column statistics |
| Expectation suites versioned separately from SQL models | No | dbt tests live in schema.yml tied to model versions |
| Re-run quality check without re-running transformation | No | dbt test queries the transformed table but has no separate validation run |

✓ The Right Mental Model

dbt tests = transformation correctness. They confirm your SQL logic is right. Great Expectations = data correctness. It confirms your data — independent of your SQL — meets regulatory business rules. Both are mandatory. Neither replaces the other.

3. Great Expectations Architecture in the COREP Pipeline

  CSV files                            PostgreSQL
  /data/source/                        corep database
  ┌──────────────────┐                 ┌────────────────────────────────────┐
  │ capital_          │   ingest.py     │  raw.*          staging.*           │
  │ instruments.csv   │ ──────────────► │  capital_       stg_capital_        │
  │ rwa_exposures.csv │                 │  instruments    instruments         │
  │ ...               │                 │  rwa_exposures  stg_rwa_exposures   │
  └──────────────────┘                 │  ...            ...                 │
                                        │                                    │
  ◄── GX LAYER 1 (raw gate) runs here  │  intermediate.* mart.*             │
      Suite: raw_capital_suite           │  int_capital_   corep_c0100         │
      Suite: raw_rwa_suite               │  by_tier        corep_c0200         │
      Suite: raw_liquidity_suite         │  ...            ...                 │
                                        │                                    │
      PASS → dbt run proceeds           │  ◄── GX LAYER 2 (mart gate)        │
      FAIL → pipeline halts             │      Suite: mart_corep_suite       │
             audit log written           │                                    │
             Airflow branch: skip_xbrl  └────────────────────────────────────┘
                                                        │
                                          PASS → xbrl_gen.py runs
                                          FAIL → BranchPythonOperator
                                                 routes to quarantine

GX uses five concepts you need to understand:

| GX concept | What it is | Analogy |
| --- | --- | --- |
| Expectation | A single assertion: “column X has values between A and B” | A single test case |
| Expectation Suite | A named collection of expectations stored as JSON | A test class / spec file |
| Checkpoint | Binds a suite to a data source and runs it, producing a ValidationResult | A test runner invocation |
| Data Docs | HTML report generated from ValidationResults; human-readable audit evidence | Test coverage report |
| Data Context | Root configuration object that knows where your suites, results, and docs live | pytest configuration + fixtures |
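
To make the relationships concrete, here is a toy model of the first three concepts in plain Python. It is deliberately not the GX API: the class and function names below are assumptions chosen to mirror the table, and the real library adds batching, stores, and rendering on top.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Expectation:
    """A single assertion about a batch of rows."""
    name: str
    check: Callable[[List[Row]], bool]


@dataclass
class ExpectationSuite:
    """A named collection of expectations."""
    name: str
    expectations: List[Expectation] = field(default_factory=list)


@dataclass
class ValidationResult:
    """Outcome of running one suite against one batch."""
    suite_name: str
    outcomes: Dict[str, bool]

    @property
    def success(self) -> bool:
        return all(self.outcomes.values())


def run_checkpoint(suite: ExpectationSuite, rows: List[Row]) -> ValidationResult:
    """Bind a suite to a batch of data and evaluate every expectation."""
    outcomes = {e.name: e.check(rows) for e in suite.expectations}
    return ValidationResult(suite_name=suite.name, outcomes=outcomes)


suite = ExpectationSuite(
    name="toy_capital_suite",
    expectations=[
        Expectation(
            "tier_in_set",
            lambda rows: all(r["tier"] in {"CET1", "AT1", "T2"} for r in rows),
        ),
        Expectation(
            "amount_not_null",
            lambda rows: all(r["amount"] is not None for r in rows),
        ),
    ],
)
rows = [{"tier": "CET1", "amount": 100.0}, {"tier": "T3", "amount": 50.0}]
result = run_checkpoint(suite, rows)
print(result.success, result.outcomes)
```

The suite fails as a whole because one expectation fails, which is the same all-expectations-must-pass rule the pipeline relies on when it halts.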

4. Install Great Expectations

# Inside your Python virtual environment
pip install great-expectations==0.18.19 sqlalchemy psycopg2-binary
⚠ Version Pin Is Not Optional

GX made breaking API changes between 0.17, 0.18, and 1.x. Pin to 0.18.19, a stable release in the 0.18 line. The 1.x rewrite renamed most classes. This post uses the 0.18 API throughout.

Add it to requirements.txt:

great-expectations==0.18.19   # Day 8 — data quality gates
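
If you want the pipeline itself to refuse to start on the wrong release line, a small guard can back up the pin. This is a sketch under assumptions: the function names are hypothetical, and it only checks the major and minor components of the version string.

```python
def is_supported_gx_version(version: str) -> bool:
    """Return True only for the 0.18.x line that this post's API calls target."""
    parts = version.split(".")
    return len(parts) >= 2 and parts[0] == "0" and parts[1] == "18"


def check_installed_gx() -> str:
    """Read the installed version; raises if the package is absent or unsupported."""
    from importlib.metadata import version  # stdlib, Python 3.8+

    installed = version("great-expectations")
    if not is_supported_gx_version(installed):
        raise RuntimeError(f"Expected great-expectations 0.18.x, found {installed}")
    return installed


print(is_supported_gx_version("0.18.19"))
print(is_supported_gx_version("1.2.0"))
```

Calling `check_installed_gx()` at the top of the quality module would turn an accidental upgrade into an immediate, readable error instead of a confusing import failure later.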

5. GX Project Structure

corep-governance-pipeline/
└── gx/
    ├── great_expectations.yml     # Data Context config
    ├── expectations/
    │   ├── raw_capital_suite.json
    │   ├── raw_rwa_suite.json
    │   ├── raw_liquidity_suite.json
    │   └── mart_corep_suite.json
    ├── checkpoints/
    │   ├── raw_capital_checkpoint.yml
    │   ├── raw_rwa_checkpoint.yml
    │   ├── raw_liquidity_checkpoint.yml
    │   └── mart_corep_checkpoint.yml
    └── uncommitted/
        └── data_docs/
            └── local_site/        # HTML output → also uploaded to MinIO

6. great_expectations.yml — Data Context

# gx/great_expectations.yml
config_version: 3.0

datasources:
  corep_postgres:
    class_name: Datasource
    execution_engine:
      class_name: SqlAlchemyExecutionEngine
      connection_string: ${COREP_GX_DB_URL}         # injected from .env
    data_connectors:
      default_inferred_data_connector_name:
        class_name: InferredAssetSqlDataConnector
        include_schema_name: true

stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

  evaluation_parameter_store:
    class_name: EvaluationParameterStore

  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: false
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexPageRenderer

expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store
🔒 Security Note — Connection String

The ${COREP_GX_DB_URL} substitution tells GX to read from the environment variable. Never hardcode credentials in great_expectations.yml. Your .env file already contains COREP_GX_DB_URL=postgresql+psycopg2://corep_admin:${POSTGRES_PASSWORD}@localhost:5432/corep.
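
The same placeholder pattern is easy to exercise in isolation. The sketch below is not GX's own substitution code; it uses the standard library's `string.Template` to show the idea, and adds a hypothetical `redact` helper so the resolved URL can be logged without leaking the password.

```python
import os
from string import Template


def resolve_placeholders(raw: str, env=None) -> str:
    """Replace ${VAR} placeholders from the environment; fail loudly if one is unset."""
    env = os.environ if env is None else env
    try:
        return Template(raw).substitute(env)
    except KeyError as missing:
        raise RuntimeError(f"Environment variable {missing} is not set") from None


def redact(url: str) -> str:
    """Hide the password portion of a connection URL before logging it."""
    scheme, _, rest = url.partition("://")
    credentials, at, host = rest.rpartition("@")
    user = credentials.split(":", 1)[0]
    return f"{scheme}://{user}:***@{host}" if at else url


# Fake environment so the example never touches real credentials.
fake_env = {
    "COREP_GX_DB_URL": "postgresql+psycopg2://corep_admin:s3cret@localhost:5432/corep"
}
url = resolve_placeholders("${COREP_GX_DB_URL}", fake_env)
print(redact(url))
```

Failing on an unset variable is the important design choice here: a silently empty connection string is much harder to debug than an explicit error at startup.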

7. Layer 1 — Raw Expectation Suites

7.1 Capital Instruments Suite

# Builds gx/expectations/raw_capital_suite.json programmatically
# (GX serialises the suite to JSON when it is saved)

import great_expectations as gx

context = gx.get_context(context_root_dir="gx")
# add_or_update keeps this script re-runnable if the suite already exists
suite = context.add_or_update_expectation_suite("raw_capital_suite")

validator = context.get_validator(
    datasource_name="corep_postgres",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="raw.capital_instruments",
    expectation_suite_name="raw_capital_suite",
)

# ── Table-level ──────────────────────────────────────────
validator.expect_table_row_count_to_be_between(min_value=10, max_value=50_000)
validator.expect_table_columns_to_match_set(
    column_set=[
        "instrument_id", "name", "tier", "amount",
        "currency", "issuance_date", "maturity_date"
    ],
    exact_match=False   # allow extra columns — forward compatible
)

# ── Column: instrument_id ─────────────────────────────────
validator.expect_column_values_to_not_be_null("instrument_id")
validator.expect_column_values_to_be_unique("instrument_id")

# ── Column: tier ─────────────────────────────────────────
validator.expect_column_values_to_not_be_null("tier")
validator.expect_column_values_to_be_in_set(
    "tier", value_set=["CET1", "AT1", "T2"]
)

# ── Column: amount ───────────────────────────────────────
validator.expect_column_values_to_not_be_null("amount")
validator.expect_column_values_to_be_between(
    "amount", min_value=0, max_value=1_000_000_000_000,
    mostly=0.99   # allow 1% outliers — GX's "mostly" parameter
)
validator.expect_column_mean_to_be_between(
    "amount", min_value=1_000, max_value=100_000_000
)

# ── Column: currency ─────────────────────────────────────
validator.expect_column_values_to_match_regex(
    "currency", regex=r"^[A-Z]{3}$"   # ISO 4217 format
)

# ── Column: issuance_date ────────────────────────────────
validator.expect_column_values_to_match_regex(
    "issuance_date", regex=r"^\d{4}-\d{2}-\d{2}$"
)

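# ── Cross-column: CET1 tier amounts must be positive ─────
# Row-conditional expectation. On a SQL datasource GX 0.18 needs the
# experimental condition parser and col("...") syntax, not "pandas".
validator.expect_column_values_to_be_between(
    "amount",
    min_value=0, strict_min=True,
    row_condition='col("tier")=="CET1"',
    condition_parser="great_expectations__experimental__"
)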

validator.save_expectation_suite(discard_failed_expectations=False)
print("raw_capital_suite saved.")

7.2 What “mostly” Unlocks

The mostly parameter is GX’s most powerful feature for regulatory data. It lets you express completeness requirements as a percentage threshold rather than an all-or-nothing assertion.

| Field | dbt not_null test | GX with mostly=0.95 | Regulatory interpretation |
| --- | --- | --- | --- |
| amount | Fails on first NULL | Passes if ≥ 95% populated | Some instruments may legitimately have no amount at reporting date (e.g., contingent instruments) |
| maturity_date | Fails on first NULL | Passes if ≥ 80% populated | Perpetual instruments (AT1) have no maturity, so NULL is valid |
| lei | Fails on first NULL | Passes if ≥ 90% populated | LEI registration may be in progress for new counterparties |
| instrument_id | Correct: 100% required | Correct: mostly=1.0 | Primary key, zero tolerance |
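
The arithmetic behind mostly is simple enough to sketch in a few lines. This is a simplified stand-in, not GX's implementation: `passes_mostly` is a hypothetical helper, and treating an empty column as passing is an assumption made here for illustration.

```python
def passes_mostly(values, predicate, mostly=1.0):
    """Return (passed, fraction), where fraction is the share of values meeting predicate."""
    values = list(values)
    if not values:
        return True, 1.0  # assumption: an empty column passes vacuously
    fraction = sum(1 for v in values if predicate(v)) / len(values)
    return fraction >= mostly, fraction


def not_null(value):
    return value is not None


# 20 instruments, one of them perpetual (no maturity date).
maturity_dates = ["2030-01-01"] * 19 + [None]

print(passes_mostly(maturity_dates, not_null, mostly=1.0))   # all-or-nothing
print(passes_mostly(maturity_dates, not_null, mostly=0.80))  # threshold
```

With one NULL in twenty rows the populated share is 95%, so the all-or-nothing check fails while the 80% threshold passes. The same column, the same data, two different verdicts depending on how the rule is expressed.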

7.3 RWA Exposures Suite

# Key expectations for raw.rwa_exposures

validator.expect_table_row_count_to_be_between(min_value=5, max_value=100_000)

validator.expect_column_values_to_be_in_set(
    "exposure_class",
    value_set=[
        "central_governments", "institutions", "corporates",
        "retail", "real_estate", "equity", "other"
    ]
)

# CRR: risk_weight_pct is stored as a decimal (0–1250% → 0.0–12.5), not a percentage integer
# This catches the common ETL error of loading 0.75 as 75.0 (×100 shift)
validator.expect_column_values_to_be_between(
    "risk_weight_pct", min_value=0.0, max_value=12.5,  # 1250% = 12.5 in decimal
    mostly=0.999
)

validator.expect_column_mean_to_be_between(
    "risk_weight_pct", min_value=0.1, max_value=3.0
)
# If mean > 3.0 something has gone catastrophically wrong in the source system

validator.expect_column_values_to_be_between(
    "ead", min_value=0, max_value=500_000_000_000, mostly=0.99
)

# Cross-column: rwa must not exceed ead × 12.5 (maximum risk weight).
# GX 0.18 has no built-in pair expectation for a scaled comparison, so this
# rule needs a custom expectation or a derived column (rwa / ead) checked
# with expect_column_values_to_be_between. It is not expressed here.

# Row-conditional check: corporate exposures must always carry an rwa value.
# SQL datasources need the experimental condition parser, not "pandas".
validator.expect_column_values_to_not_be_null(
    "rwa",
    row_condition='col("exposure_class")=="corporates"',
    condition_parser="great_expectations__experimental__"
)
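
It is worth seeing why the suite carries both a per-row range and a column mean. The sketch below uses the two bounds from the suite and a hypothetical set of risk weights; the helper names are illustrative. A ×10 shift keeps every value inside 0–12.5, so only the mean check notices it.

```python
from statistics import mean

RANGE = (0.0, 12.5)      # per-row bound from the suite above
MEAN_BAND = (0.1, 3.0)   # column-mean bound from the suite above


def row_range_ok(weights, mostly=0.999):
    """True when at least `mostly` of the rows sit inside the per-row range."""
    in_range = sum(1 for w in weights if RANGE[0] <= w <= RANGE[1])
    return in_range / len(weights) >= mostly


def mean_ok(weights):
    """True when the column mean sits inside the expected band."""
    return MEAN_BAND[0] <= mean(weights) <= MEAN_BAND[1]


# Hypothetical risk weights as decimals: 0%, 20%, 50%, 75%, 100%.
correct = [0.0, 0.2, 0.5, 0.75, 1.0]
# The same weights after an accidental x10 scale shift in the source system.
shifted_x10 = [w * 10 for w in correct]

print(row_range_ok(correct), mean_ok(correct))
print(row_range_ok(shifted_x10), mean_ok(shifted_x10))
```

The shifted column passes the row-level range check because its largest value is 10.0, yet its mean lands well above the band. Range checks catch outliers; aggregate checks catch a whole column that has drifted.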

7.4 Liquidity Assets Suite

# Key expectations for raw.liquidity_assets

validator.expect_column_values_to_be_in_set(
    "hqla_level", value_set=["1", "2A", "2B"]
)

# haircut_rate must be a decimal fraction, not a whole number percent
validator.expect_column_values_to_be_between(
    "haircut_rate", min_value=0.0, max_value=1.0   # 0% to 100% as fraction
)
# Commission Delegated Regulation (EU) 2015/61 specifies haircut levels:
# Level 1: 0%, Level 2A: 15%, Level 2B: 25-50%
# This check ensures haircut is sane before LCR calculation

validator.expect_column_quantile_values_to_be_between(
    "market_value",
    quantile_ranges={
        "quantiles": [0.05, 0.50, 0.95],
        "value_ranges": [[1_000, None], [100_000, None], [None, 50_000_000_000]]
    }
)
# Quantile expectations — entirely impossible in dbt built-in tests
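
For intuition, the same kind of quantile bound can be evaluated with the standard library. This is a sketch, not GX's quantile algorithm (interpolation methods differ between engines); `quantile_check` is a hypothetical helper and the market values are made up.

```python
from statistics import quantiles


def quantile_check(values, bounds):
    """bounds maps a quantile on the 5% grid to (low, high); None means unbounded."""
    cuts = quantiles(values, n=20, method="inclusive")  # 19 cut points at 5% steps
    report = {}
    for q, (low, high) in bounds.items():
        observed = cuts[round(q * 20) - 1]
        ok = (low is None or observed >= low) and (high is None or observed <= high)
        report[q] = (observed, ok)
    return report


# Same bounds as the expectation above.
BOUNDS = {0.05: (1_000, None), 0.50: (100_000, None), 0.95: (None, 50_000_000_000)}

market_values = [50_000 * i for i in range(1, 101)]   # hypothetical: 50k .. 5m
in_thousands = [v / 1000 for v in market_values]      # unit error: loaded in thousands

print(all(ok for _, ok in quantile_check(market_values, BOUNDS).values()))
print(all(ok for _, ok in quantile_check(in_thousands, BOUNDS).values()))
```

Every individual value in the second list is still a plausible positive number, which is why a simple range check would not flag it. The lower quantiles falling below their floors is what gives the unit error away.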

8. Layer 2 — Mart Expectation Suite

After dbt runs, the mart tables contain the numbers that will go into your XBRL instance document. These are your last line of defence before regulatory submission.

# gx/expectations/mart_corep_suite.json — built programmatically

validator = context.get_validator(
    datasource_name="corep_postgres",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="mart.corep_c0300",  # Capital Ratios template
    expectation_suite_name="mart_corep_suite",
)

# ── Basel III regulatory minimums ────────────────────────────────────────
# CRR Article 92(1)(a): CET1 ≥ 4.5%
validator.expect_column_values_to_be_between(
    "cet1_ratio", min_value=0.045, max_value=1.0
)

# CRR Article 92(1)(b): Tier 1 ≥ 6.0%
validator.expect_column_values_to_be_between(
    "tier1_ratio", min_value=0.06, max_value=1.0
)

# CRR Article 92(1)(c): Total Capital ≥ 8.0%
validator.expect_column_values_to_be_between(
    "total_capital_ratio", min_value=0.08, max_value=1.0
)

# CRR Article 92(1)(d), added by CRR2: Leverage ratio ≥ 3.0%
validator.expect_column_values_to_be_between(
    "leverage_ratio", min_value=0.03, max_value=1.0
)

# ── Cross-ratio consistency ───────────────────────────────────────────────
# tier1_ratio ≥ cet1_ratio always (CET1 ⊂ Tier1)
# total_capital_ratio ≥ tier1_ratio always (T1 ⊂ Total Capital)
# dbt's built-in generic tests cannot compare two columns without a package or macro
# GX 0.18 expresses "≥" as ..._a_to_be_greater_than_b with or_equal=True
validator.expect_column_pair_values_a_to_be_greater_than_b(
    column_A="tier1_ratio", column_B="cet1_ratio", or_equal=True
)
validator.expect_column_pair_values_a_to_be_greater_than_b(
    column_A="total_capital_ratio", column_B="tier1_ratio", or_equal=True
)

# ── XBRL decimal precision check ─────────────────────────────────────────
# EBA DPM requires monetary values in thousands (decimals=-3)
# Ratios are stored with 6 decimal places (XBRL decimals=4 means accurate to 4 decimal places, not 4 significant figures)
# This ensures no floating-point garbage makes it into the XBRL document
validator.expect_column_values_to_match_regex(
    "cet1_ratio", regex=r"^\d+\.\d{6}$",
    meta={"notes": "EBA DPM requires 6dp for ratio values per xbrl decimals=4"}
)

# ── Table completeness ───────────────────────────────────────────────────
validator.expect_table_row_count_to_equal(1)
# C 03.00 always produces exactly one row — the reporting period totals

validator.save_expectation_suite(discard_failed_expectations=False)
📋 The Cross-Ratio Consistency Check Is Your Most Important Gate

The check tier1_ratio ≥ cet1_ratio is an identity that must hold because CET1 is a subset of Tier 1 capital. If it fails, one of three things has happened: a bug in your dbt aggregation logic, a sign error in the source data, or a schema mismatch between two mart models. Any of these would produce an incorrect COREP report. dbt’s four built-in generic tests cannot express this because none of them compares two columns; in dbt you would need dbt_utils.expression_is_true or a package test.
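
A short sketch shows why the ordering rules matter even when every floor check passes. The row values and the `ratio_ordering_violations` helper are hypothetical; the two rules are the ones stated above.

```python
def ratio_ordering_violations(row):
    """Return the names of any ordering rules the mart row breaks."""
    rules = {
        "tier1_ratio >= cet1_ratio": row["tier1_ratio"] >= row["cet1_ratio"],
        "total_capital_ratio >= tier1_ratio": (
            row["total_capital_ratio"] >= row["tier1_ratio"]
        ),
    }
    return [name for name, holds in rules.items() if not holds]


good_row = {"cet1_ratio": 0.142, "tier1_ratio": 0.158, "total_capital_ratio": 0.185}
# Hypothetical failure: an AT1 amount applied with the wrong sign.
bad_row = {"cet1_ratio": 0.142, "tier1_ratio": 0.131, "total_capital_ratio": 0.185}

print(ratio_ordering_violations(good_row))
print(ratio_ordering_violations(bad_row))
```

Note that the bad row clears every regulatory minimum: 13.1% Tier 1 is far above the 6% floor. Only the comparison between the two ratios reveals that the numbers cannot both be right.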

9. quality.py — The Module Implementation

"""
modules/quality.py — Run GX expectation suites as pipeline quality gates.

Layer 1 (raw gate)  : runs after ingest, before dbt
Layer 2 (mart gate) : runs after dbt, before xbrl_gen
"""

import logging
import os
from pathlib import Path

import great_expectations as gx
from great_expectations.core.batch import BatchRequest
from great_expectations.checkpoint import SimpleCheckpoint

from modules.base import BaseModule

log = logging.getLogger(__name__)

GX_DIR = Path(os.environ.get("GX_DIR", "gx"))


class QualityGateError(RuntimeError):
    """Raised when a GX checkpoint produces any failed expectations."""
    pass


class QualityModule(BaseModule):
    MODULE_NAME = "quality"

    # Ordered list of (checkpoint_name, suite_name, asset_name)
    # Layer 1 = raw tables, Layer 2 = mart tables
    _CHECKPOINTS = [
        # Layer 1 — run before dbt
        ("raw_capital_checkpoint",   "raw_capital_suite",   "raw.capital_instruments"),
        ("raw_rwa_checkpoint",        "raw_rwa_suite",        "raw.rwa_exposures"),
        ("raw_liquidity_checkpoint",  "raw_liquidity_suite",  "raw.liquidity_assets"),
        ("raw_outflows_checkpoint",   "raw_outflows_suite",   "raw.liquidity_outflows"),
        # Layer 2 — run after dbt mart build
        ("mart_corep_checkpoint",     "mart_corep_suite",     "mart.corep_c0300"),
    ]

    def input_check(self) -> None:
        """Verify GX context directory and expectation suite JSON files exist."""
        if not GX_DIR.exists():
            raise RuntimeError(
                f"[quality] GX directory not found: {GX_DIR}. "
                "Run: great_expectations init"
            )
        suites_dir = GX_DIR / "expectations"
        missing = [
            suite
            for _, suite, _ in self._CHECKPOINTS
            if not (suites_dir / f"{suite}.json").exists()
        ]
        if missing:
            raise RuntimeError(
                "[quality] Missing expectation suite JSON files: "
                + ", ".join(missing)
            )
        log.info("[quality] All %d expectation suites present.", len(self._CHECKPOINTS))

    def _execute(self) -> None:
        """Run all GX checkpoints. Raise QualityGateError on any failure."""
        context = gx.get_context(context_root_dir=str(GX_DIR))
        failed_suites = []
        results_summary = []

        for checkpoint_name, suite_name, asset_name in self._CHECKPOINTS:
            log.info(
                "[quality] Running checkpoint: %s → asset: %s",
                checkpoint_name, asset_name
            )
            result = context.run_checkpoint(
                checkpoint_name=checkpoint_name,
                batch_request=BatchRequest(
                    datasource_name="corep_postgres",
                    data_connector_name="default_inferred_data_connector_name",
                    data_asset_name=asset_name,
                ),
            )

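            passed = result.success
            # CheckpointResult.get_statistics() counts validations, not expectations;
            # expectation-level counts live on each validation result.
            validations = result.list_validation_results()
            evaluated  = sum(v.statistics["evaluated_expectations"] for v in validations)
            successful = sum(v.statistics["successful_expectations"] for v in validations)
            pct        = 100.0 * successful / evaluated if evaluated else 0.0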

            results_summary.append({
                "suite":      suite_name,
                "asset":      asset_name,
                "evaluated":  evaluated,
                "passed":     successful,
                "pct":        pct,
                "status":     "PASS" if passed else "FAIL",
            })

            log.info(
                "[quality] %s: %s (%d/%d expectations, %.1f%%)",
                suite_name, "PASS" if passed else "FAIL",
                successful, evaluated, pct
            )

            if not passed:
                failed_suites.append(suite_name)

        # Build data docs (HTML report)
        context.build_data_docs()
        self._upload_data_docs_to_minio(context)
        self._write_audit(results_summary)

        if failed_suites:
            raise QualityGateError(
                f"[quality] {len(failed_suites)} suite(s) failed: "
                + ", ".join(failed_suites)
                + ". Pipeline halted — see GX data docs for details."
            )

    def _upload_data_docs_to_minio(self, context) -> None:
        """Upload HTML data docs to MinIO for persistent audit evidence."""
        try:
            from minio import Minio
            docs_dir = GX_DIR / "uncommitted" / "data_docs" / "local_site"
            client = Minio(
                os.environ.get("MINIO_ENDPOINT", "minio:9000"),
                access_key=os.environ.get("MINIO_ROOT_USER", "minioadmin"),
                secret_key=os.environ.get("MINIO_ROOT_PASSWORD", "minioadmin"),
                secure=False,
            )
            bucket = "corep-gx-reports"
            if not client.bucket_exists(bucket):
                client.make_bucket(bucket)

            for html_file in docs_dir.rglob("*.html"):
                object_name = str(html_file.relative_to(docs_dir))
                client.fput_object(bucket, object_name, str(html_file), content_type="text/html")
                log.info("[quality] Uploaded data doc → minio://%s/%s", bucket, object_name)
        except Exception as exc:
            log.warning("[quality] MinIO data docs upload failed (non-fatal): %s", exc)

    def _write_audit(self, results: list) -> None:
        """Persist quality run summary to audit.pipeline_run_log."""
        import json
        from modules.base import _pg_conn
        conn = _pg_conn()
        try:
            cur = conn.cursor()
            cur.execute(
                """
                INSERT INTO audit.pipeline_run_log
                    (run_id, module_name, status, metadata, ran_at)
                VALUES (%s, 'quality', %s, %s, now())
                """,
                (
                    self._run_id,
                    "FAIL" if any(r["status"] == "FAIL" for r in results) else "PASS",
                    json.dumps(results),
                ),
            )
            conn.commit()
        finally:
            conn.close()

    def emit_lineage(self) -> None:
        # Quality runs are validation-only — no data written, no lineage event needed
        log.info("[quality] No lineage event emitted (read-only validation step).")

    def output_check(self) -> None:
        # If _execute completed without raising, all checkpoints passed
        log.info("[quality] output_check: all suites passed (no QualityGateError raised).")
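
Two design choices in the module are worth isolating: it runs every checkpoint before raising, so one report lists all failures, and the failure surfaces as an exception that a caller can turn into a non-zero exit code. The sketch below shows that pattern with stand-in checks; `run_gates` and `main` are hypothetical names, and how pipeline.py actually maps exceptions to exit codes is an assumption.

```python
class QualityGateError(RuntimeError):
    """Stand-in for the exception defined in modules/quality.py."""


def run_gates(checks):
    """Run every check, then raise once so the message lists all failures."""
    failed = [name for name, check in checks if not check()]
    if failed:
        raise QualityGateError(
            f"{len(failed)} suite(s) failed: " + ", ".join(failed)
        )


def main(checks) -> int:
    """Translate the gate outcome into a process exit code (0 = pass, 1 = fail)."""
    try:
        run_gates(checks)
    except QualityGateError as exc:
        print(f"ERROR {exc}")
        return 1
    return 0


checks = [
    ("raw_capital_suite", lambda: True),
    ("raw_rwa_suite", lambda: False),
    ("raw_liquidity_suite", lambda: False),
]
exit_code = main(checks)
print(exit_code)
```

Failing fast on the first broken suite would be simpler, but it forces a fix-and-rerun cycle per suite. Collecting first means one run tells you everything that is wrong with the batch.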

10. Airflow Branching on Quality Failure

The DAG you will build on Day 14 uses BranchPythonOperator. Here is how the quality gate hooks into the branch logic:

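# dags/corep_pipeline_dag.py: quality gate branch logic
from airflow.operators.python import BranchPythonOperator, PythonOperator


def _quality_branch(**context) -> str:
    """Return next task ID based on quality gate result stored in XCom."""
    # Assumes _run_quality_module catches QualityGateError and pushes "PASS" or
    # "FAIL" under the key "quality_status"; a task that fails outright never
    # reaches this branch.
    quality_status = context["task_instance"].xcom_pull(
        task_ids="run_quality_layer2", key="quality_status"
    )
    if quality_status == "PASS":
        return "run_xbrl_generation"
    return "quarantine_failed_run"    # writes to audit, alerts ops team

# Task ID matches the name used in the DAG flow below
run_quality_layer2 = PythonOperator(
    task_id="run_quality_layer2",
    python_callable=_run_quality_module,
)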

branch_on_quality = BranchPythonOperator(
    task_id="branch_on_quality",
    python_callable=_quality_branch,
)

# DAG flow
(
    run_ingest
    >> run_quality_layer1
    >> run_dbt_staging
    >> run_dbt_intermediate
    >> run_dbt_mart
    >> run_quality_layer2      # second gate after mart build
    >> branch_on_quality
)
branch_on_quality >> run_xbrl_generation
branch_on_quality >> quarantine_failed_run
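
The branch function reads quality_status from XCom, but a task that raises QualityGateError fails before anything is pushed. One way to close that gap is sketched below, under assumptions: `run_quality_and_record` is a hypothetical callable, the task ID is taken from the DAG flow above, and `FakeTaskInstance` exists only so the logic can be exercised without Airflow installed.

```python
class QualityGateError(RuntimeError):
    """Stand-in for modules.quality.QualityGateError."""


def run_quality_and_record(run_gate, task_instance) -> str:
    """Run the gate, push PASS/FAIL to XCom, and let the branch task decide."""
    try:
        run_gate()
        status = "PASS"
    except QualityGateError:
        status = "FAIL"   # swallow the error so the branch task still runs
    task_instance.xcom_push(key="quality_status", value=status)
    return status


def quality_branch(task_instance) -> str:
    """Pick the downstream task ID from the recorded status."""
    status = task_instance.xcom_pull(
        task_ids="run_quality_layer2", key="quality_status"
    )
    return "run_xbrl_generation" if status == "PASS" else "quarantine_failed_run"


class FakeTaskInstance:
    """Minimal in-memory XCom used only to exercise the logic outside Airflow."""

    def __init__(self):
        self._store = {}

    def xcom_push(self, key, value):
        self._store[key] = value

    def xcom_pull(self, task_ids=None, key=None):
        return self._store.get(key)


def failing_gate():
    raise QualityGateError("raw_rwa_suite failed")


ti = FakeTaskInstance()
run_quality_and_record(failing_gate, ti)
print(quality_branch(ti))
```

Because a missing status is treated the same as FAIL, the branch defaults to quarantine if anything unexpected happens upstream. For a regulatory submission that is the safer default.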

11. dbt Tests vs Great Expectations — Full Comparison

| Capability | dbt built-in tests | dbt_expectations package | Great Expectations 0.18 |
| --- | --- | --- | --- |
| Not null | Yes | Yes | Yes |
| Unique values | Yes | Yes | Yes |
| Accepted value set | Yes | Yes | Yes |
| Value range (between) | Via expression_is_true | Yes | Yes |
| Regex pattern match | No | Yes | Yes |
| Completeness % (mostly) | No | No | Yes (native) |
| Column mean / std-dev range | No | Yes | Yes |
| Quantile value ranges | No | Yes | Yes |
| Cross-column pair ordering | No | Yes | Yes |
| Row-conditional check (IF tier=CET1 THEN…) | No | Yes (row_condition) | Yes (row_condition) |
| Table row count band | Awkward | Yes | Yes |
| HTML audit report output | No | No | Yes (Data Docs) |
| Runs independently of transformation | No | No | Yes (checkpoint) |
| Suites versioned in JSON | No (schema.yml) | No | Yes |
| Runs on raw layer before dbt | No | No | Yes |

12. Data Docs — Your Audit Evidence

Every time GX runs a checkpoint, it updates an HTML report in gx/uncommitted/data_docs/local_site/. This report is your regulatory audit evidence that quality was checked before submission.

📋 BCBS 239 Principle 4: Completeness

“A bank should be able to capture and aggregate all material risk data across the banking group. Data should be available by business line, legal entity, asset type, industry, region and other groupings, as relevant for the risk in question.”

Your GX data docs prove that completeness was measured per column, per table, per pipeline run. The HTML file timestamped before submission is your evidence. Upload it to MinIO so it persists independently of the pipeline container.

To view the data docs locally after a pipeline run:

# From WSL / inside the pipeline container
great_expectations docs build --directory gx

# Open in browser (WSL path → Windows path)
explorer.exe "$(wslpath -w gx/uncommitted/data_docs/local_site/index.html)"

13. Mapping Quality Gates to BCBS 239 Principles

| BCBS 239 Principle | Requirement | How GX satisfies it |
| --- | --- | --- |
| P4: Completeness | Capture all material risk data | mostly=0.95 completeness checks on all raw tables |
| P5: Timeliness | Data available on time for risk decisions | Checkpoint runtime logged to audit.pipeline_run_log |
| P6: Adaptability | Risk data adaptable to varying scenarios | Separate suites per table, so one suite can change without touching others |
| P7: Accuracy (reporting) | Data reflects actual risk positions | Cross-ratio pair checks, statistical distribution bounds, regex format checks |
| P8: Comprehensiveness | All material risk positions in reports | Row count check on mart tables confirms all data points present |
| P9: Clarity and usefulness | Reconciliation between risk reports | Cross-column pair checks: tier1_ratio ≥ cet1_ratio always holds |

14. Run the Quality Gates

# Run the quality module on its own
python pipeline.py --module quality

# Run ingest through the quality step (exercises the raw gate during development)
python pipeline.py --from ingest --to quality

# Run only the quality step after dbt is already complete
python pipeline.py --from quality --to quality

# Check exit code — quality gate failures raise exit code 1
echo $?
# Expected output on PASS
INFO [quality] Running checkpoint: raw_capital_checkpoint → asset: raw.capital_instruments
INFO [quality] raw_capital_suite: PASS (12/12 expectations, 100.0%)
INFO [quality] Running checkpoint: raw_rwa_checkpoint → asset: raw.rwa_exposures
INFO [quality] raw_rwa_suite: PASS (9/9 expectations, 100.0%)
INFO [quality] Running checkpoint: raw_liquidity_checkpoint → asset: raw.liquidity_assets
INFO [quality] raw_liquidity_suite: PASS (11/11 expectations, 100.0%)
INFO [quality] Running checkpoint: mart_corep_checkpoint → asset: mart.corep_c0300
INFO [quality] mart_corep_suite: PASS (8/8 expectations, 100.0%)
INFO [quality] Data docs built at gx/uncommitted/data_docs/local_site/index.html
INFO [quality] output_check: all suites passed (no QualityGateError raised).

# Expected output on FAIL (e.g. risk weight typo in source data)
INFO [quality] raw_rwa_suite: FAIL (7/9 expectations, 77.8%)
ERROR [quality] 1 suite(s) failed: raw_rwa_suite. Pipeline halted — see GX data docs for details.
Traceback: QualityGateError: [quality] 1 suite(s) failed...

📚 Day 8 Key Takeaways

  • Two layers are mandatory — raw gate before dbt, mart gate before XBRL. dbt tests can only run on transformed tables.
  • mostly is the killer feature — it lets you express regulatory completeness requirements as a percentage rather than all-or-nothing, matching real-world data realities.
  • Cross-column pair checks (tier1_ratio ≥ cet1_ratio) are beyond dbt’s built-in generic tests but essential for Basel III consistency verification.
  • Statistical distribution expectations (mean, quantiles) catch scale/magnitude errors — the most common ETL bug with financial data (e.g. basis points vs decimal).
  • Data Docs are audit evidence — upload to MinIO, timestamp before submission. BCBS 239 auditors expect proof that quality was checked.
  • Quality module is read-only — no lineage event needed. OpenLineage lineage events are for data-writing steps only.
  • Next: Day 9 — OpenMetadata catalog: auto-discovery of all tables built so far, EBA glossary terms, PII tags, and lineage federation with Marquez.