Popular Now
Infographic illustrating production‑ready GKE architecture, showing Google Cloud services, Kubernetes clusters, DevOps/GitOps workflows, SRE practices, observability, security, and disaster recovery components.

Production-Ready GKE: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

Infographic showing best practices for production‑ready EKS deployments, illustrating AWS cloud architecture, Kubernetes clusters, GitOps automation, observability, security, and disaster recovery principles.

Production-Ready EKS: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

How I Built a COREP Regulatory Reporting Pipeline in 18 Days — From Data Governance Theory to Valid XBRL

Full retrospective: how I built a COREP XBRL pipeline with dbt, Ranger, Arelle and Airflow in 18 days — architecture decisions, hardest problems, and what’s next.
Day 17 of 18 — COREP Governance Pipeline · Capstone

The retrospective: what was built, what broke, what the architecture decision log looks like in hindsight, and what a real bank would need to add before going live.

Eighteen days ago I set a constraint: build a complete COREP regulatory reporting pipeline using only open-source tools, learn EU banking governance from scratch, and write a blog post every day documenting both the theory and the implementation. Today is the retrospective.

The end product is a working pipeline that ingests synthetic bank data, transforms it through a governed dbt layer, validates it against the EBA XBRL taxonomy with Arelle, and produces a submission-ready ZIP file — all orchestrated by Apache Airflow with column-level lineage to Marquez, row/column security via Apache Ranger, and a regulatory dashboard in Apache Superset. Every component is open source. The total cost of the infrastructure is the electricity to run Docker Desktop.

This post covers: what was actually built, the six governance pillars the project was designed around, the one architecture decision that changed everything (Trino as the single query gateway), the three hardest technical problems encountered, and what a production deployment at a real bank would need on top of all this.

What Was Actually Built

SOURCE DATA (CSV) │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ INGESTION LAYER │ │ IngestModule → raw.* schema (PostgreSQL) │ │ OpenLineage COMPLETE event → Marquez │ │ Source CSVs archived → MinIO corep-eba-source │ └──────────────────────────┬──────────────────────────────────────────┘ │ QualityModule(layer=”raw”) — GX gate │ FAIL → quarantine_raw_failure (Airflow) ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ TRANSFORMATION LAYER │ │ dbt staging → views (stg_capital_instruments, stg_rwa …) │ │ dbt intermediate → tables (int_capital_by_tier …) │ │ dbt mart → tables (corep_c0100, c0200, c0300, c4700) │ │ dbt-openlineage → column-level lineage → Marquez on every run │ └──────────────────────────┬──────────────────────────────────────────┘ │ QualityModule(layer=”mart”) — GX gate │ FAIL → quarantine_mart_failure (Airflow) ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ GOVERNANCE LAYER │ │ OpenMetadata 1.3.4 — catalog + EBA glossary + PII tags │ │ Apache Ranger — RBAC + column masks + row filters on Trino │ │ pg_audit — DDL/DML backstop at PostgreSQL level │ └──────────────────────────┬──────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ REPORTING LAYER │ │ XbrlGenModule — reads mart via Trino, generates XBRL instance │ │ XbrlValidModule — Arelle structural + calc + formula linkbase │ │ FAIL → quarantine_xbrl_failure (Airflow) │ │ PASS → submission ZIP (XBRL + validation report + GX data docs) │ │ → MinIO corep-xbrl-output/submissions/ │ └──────────────────────────┬──────────────────────────────────────────┘ │ ▼ Superset Dashboards (via Trino only) Capital Ratios + Governance Audit view

The complete codebase: 17 Docker services, 8 Python modules, 4 dbt mart models, 10 Ranger policies, 4 GX expectation suites, and a pytest suite with 15 adversarial test cases.

The Six Governance Pillars

The 18-day plan was structured around six pillars that the EBA and ECB specifically audit when reviewing a bank’s COREP data quality framework. Here is how each pillar was addressed:

1 · Data Lineage

Three layers: pipeline lineage (openlineage-airflow → Marquez), transformation lineage (dbt-openlineage → Marquez column facets), catalog lineage (OpenMetadata federating Marquez). A regulator can trace any XBRL fact to its source CSV row.

2 · Data Quality

Great Expectations 0.18 with four expectation suites. Two hard gates in the Airflow DAG: raw layer (structural validation) and mart layer (business rules including CET1 floor). Failures quarantine the run before any data reaches XBRL generation.

3 · Data Catalog

OpenMetadata 1.3.4 with auto-discovery of PostgreSQL schemas, dbt manifest ingestion for column descriptions, EBA COREP glossary with concept IDs, and GDPR PII tags on borrower name/LEI columns in the raw layer.

4 · Data Security

Apache Ranger with five roles (corep_reporting, risk_analyst, data_engineer, auditor, pipeline_service). Column masking on LEI/counterparty name, row filters on jurisdiction, DML restriction to pipeline_service only. Enforced at Trino, not at the application layer.

5 · Data Auditability

Every pipeline run writes a structured JSON audit record to audit.pipeline_runs (PostgreSQL). pg_audit captures DDL/DML at the database level. Ranger audit logs stream to PostgreSQL and MinIO. XBRL validation reports and GX HTML data docs archived to MinIO with object retention.

6 · Regulatory Output

Arelle 2.30.3 generates and validates XBRL instances against the EBA taxonomy. Three validation passes: structural (lxml), calculation linkbase (parent = sum of children), formula linkbase (EBA business rules including capital floor checks). Output is a submission ZIP file.

The Architecture Decision That Changed Everything: Trino as the Single Query Gateway

The most consequential architecture decision made on Day 4 was this: every consumer of data — dbt, GX, Superset, XbrlGenModule, all test code — connects to Trino, never directly to PostgreSQL. At the time it felt like a slightly awkward constraint. In retrospect it is the single decision that makes the entire governance stack coherent.

Here is what it actually delivered:

ConcernWithout Trino GatewayWith Trino Gateway
Ranger column maskingMust be implemented in every application (dbt, Superset, Python)Enforced once in Trino — all consumers get it automatically
Adding a new data source (Iceberg on MinIO)Every consumer needs a new connection + driverAdd a Trino catalog — all consumers query it with the same SQL
Audit trail completenesspg_audit only captures direct PostgreSQL connectionsTrino + Ranger audit logs capture every query regardless of consumer
Superset securitySuperset RLS must mirror Ranger policies — two systems to maintainSuperset inherits Ranger decisions via Trino — single enforcement point
Future query federationRequires application-level JOIN across two connectionsTrino federated JOIN across PostgreSQL + Iceberg in a single SQL statement

The cost was one additional hop in the query path (application → Trino → PostgreSQL) and the need to qualify every table as postgresql.schema.table rather than schema.table. That is a trivial cost relative to what it buys.

The principle: In a regulated environment, every security control that lives in the application layer is a control that can be bypassed by a developer connecting directly to the database. The only controls that count are the ones enforced at the data layer — and Trino is that layer.

The Three Hardest Technical Problems

1 · XBRL Formula Validation — Parents Must Equal Children

The calculation linkbase problem was not obvious until Day 12. The EBA taxonomy specifies that c0010 (Own Funds) = c0020 (CET1) + c0030 (AT1) + c0040 (T2) via a calculation linkbase arc with weight +1. Arelle enforces this with tolerance decimals=-3, meaning a EUR 1,000 rounding difference is acceptable but a EUR 1,001 difference is an xbrlCalcs:inconsistency error that fails validation.

The root cause: the dbt mart models computed each tier independently using separate CTEs, introducing rounding divergence at the EUR thousandths position. The fix required two changes:

  1. In dbt: all four COREP concepts (c0010, c0020, c0030, c0040) must be derived from the same CTE using a single SUM(amount_eur) grouped by tier, then assembled with FILTER(WHERE tier = 'CET1'). Parent = sum of same-CTE children by construction.
  2. In xbrl_gen.py: rounding (ROUND_HALF_UP, decimals=-3) is applied exactly once, terminally, at the point of writing the XBRL fact. Never round in intermediate dbt models — let the mart values carry full precision and round only when writing to XML.

The lesson: XBRL calculation linkbase validation is not a reporting concern — it is a data modelling concern. The constraint must be enforced in the mart model, not patched in the XBRL generator.

2 · Ranger Tag Synchronisation with OpenMetadata

The plan was to tag columns as GDPR.PII in OpenMetadata and have Ranger automatically pick up those tags via the Ranger Tag Sync service to enforce column masking. In practice, the open-source version of Ranger Tag Sync (Apache Atlas integration) does not support OpenMetadata as a tag source out of the box — that integration requires Ranger 2.4+ with a custom service definition, or the commercial Cloudera version.

The workaround adopted here: OpenMetadata PII tags are informational (they drive the catalog UI and column-level lineage annotations), while Ranger policies are defined explicitly by column name in column_mask_policies.json. The two systems agree on what is PII — but they do not sync automatically. A CI step in the pipeline validates that every column tagged GDPR.PII in OpenMetadata has a corresponding Ranger mask policy, and fails the build if the two diverge.

# tools/check_pii_ranger_sync.py
# Run in CI — compares OM PII tags against Ranger mask policy coverage

import requests, json, sys

OM_URL    = "http://localhost:8585"
RANGER_URL = "http://localhost:6080"

# 1. Collect all columns tagged GDPR.PII in OpenMetadata
resp = requests.get(
    f"{OM_URL}/api/v1/search/query",
    params={"q": "tag:GDPR.PII", "index": "column_search_index", "limit": 100},
    headers={"Authorization": "Bearer <token>"},
)
pii_columns = {
    f"{hit['_source']['table']}.{hit['_source']['name']}"
    for hit in resp.json()["hits"]["hits"]
}

# 2. Collect all columns covered by Ranger mask policies
ranger_resp = requests.get(
    f"{RANGER_URL}/service/plugins/policies",
    params={"serviceName": "corep-trino", "policyType": "1"},   # type 1 = masking
    auth=("admin", "<password>"),
)
masked_columns = {
    f"{res['table']}.{col}"
    for policy in ranger_resp.json()["policies"]
    for res in policy["resources"].get("table", {}).get("values", [])
    for col in res.get("column", {}).get("values", [])
}

unprotected = pii_columns - masked_columns
if unprotected:
    print(f"[FAIL] PII columns with no Ranger mask: {unprotected}")
    sys.exit(1)
print("[PASS] All PII columns covered by Ranger mask policies.")

3 · Synthetic Data Consistency Across Six Tables

The six source CSV tables (capital_instruments, rwa_exposures, loans, counterparties, liquidity_assets, liquidity_outflows) are individually easy to generate. Making them internally consistent — so that the COREP templates computed from them produce plausible ratios that exercise all dbt code paths without triggering the GX regulatory floor checks — required careful reverse-engineering.

The constraints that had to be simultaneously satisfied:

ConstraintValue UsedWhy
CET1 ratio ≥ 4.5%12.8%Clears GX mart gate; realistic for a mid-size European bank
Total Capital ratio ≥ 8%16.4%Basel III Pillar 1 minimum
LCR ≥ 100%143%CRR Article 412 — EBA formula linkbase will check this
RWA exposure classes cover all 7 standard classesAll populatedGX expects no NULL exposure_class values; exercises all dbt GROUP BY branches
c0010 = c0020 + c0030 + c0040 exactlyBuilt from bottom upCalculation linkbase — generate tiers first, sum to own funds
No counterparty.name or LEI in mart schemaDropped in stg_counterpartiesGDPR — staging model explicitly excludes PII columns

The resolution: generate the data bottom-up. Start with the tier amounts (CET1, AT1, T2), compute Own Funds as their exact sum, derive RWA from a target CET1 ratio, then generate loans and exposures consistent with that RWA. A single Python script enforces all constraints at generation time so the pipeline never has to work around bad input data.

Running the Complete Pipeline — What the End-to-End Looks Like

$ docker compose up -d # 17 services start $ python pipeline.py –module ingest # Load raw CSVs $ dbt run –profiles-dir dbt/ # Build mart tables $ python pipeline.py –module quality # GX gates $ python pipeline.py –module xbrl_gen # Generate XBRL instance $ python pipeline.py –module xbrl_valid # Validate with Arelle # → output/xbrl/corep_c0100_2025-12-31.xbrl # → output/xbrl/validation_report_2025-12-31.json # → MinIO: corep-xbrl-output/submissions/# Or trigger the full orchestrated run via Airflow: $ airflow dags trigger corep_governance_pipeline \ –conf ‘{“reporting_date”: “2025-12-31”}’ # DAG runs 19 tasks over ~45 minutes # Final artefact: corep-xbrl-output/submissions/corep_submission_2025-12-31.zip

The submission ZIP contains:

corep_submission_2025-12-31.zip
├── corep_c0100_2025-12-31.xbrl        ← XBRL instance (Own Funds)
├── corep_c0200_2025-12-31.xbrl        ← XBRL instance (RWA)
├── corep_c0300_2025-12-31.xbrl        ← XBRL instance (Capital Ratios)
├── corep_c4700_2025-12-31.xbrl        ← XBRL instance (LCR)
├── validation_report_2025-12-31.json  ← Arelle pass/fail with all error codes
├── gx_data_docs/                      ← GX HTML reports for raw + mart gates
└── manifest.json                      ← Pipeline run metadata, versions, hashes

What a Real Bank Would Need on Top of This

This project is a complete proof-of-concept. Moving it to a production environment at a regulated institution would require the following additions:

GapWhat to AddWhy It’s Out of Scope Here
Real EBA taxonomy downloadAutomate taxonomy update from EBA XBRL website on each release (quarterly)Using a pinned static copy; taxonomy changes require re-validation of all mart models
Multi-entity reportingAdd a legal_entity_id dimension to all mart models and XBRL contextsSynthetic data covers a single bank entity
Source system connectivityReplace CSV ingest with Kafka connectors or REST API extractors from core banking / risk systemsReal source systems require vendor contracts and VPN access
Kerberos / LDAP authenticationReplace Trino HTTP auth with Kerberos for Ranger to evaluate real user identitiesDev environment passes username as a plain string
Ranger tag syncBuild or buy the OpenMetadata → Ranger tag sync bridge (see hardest problems)Requires custom Ranger service definition or commercial Cloudera CDP
4-eyes principleAdd an approval step in the Airflow DAG (HumanInTheLoop sensor) before submission ZIP is finalisedRegulatory requirement in several EU jurisdictions — trivial to add
Immutable audit logStream audit.pipeline_runs to an append-only store (MinIO object lock, or a separate audit database with no UPDATE/DELETE rights)pg_audit provides database-level immutability; application-level audit table is mutable by admins
Secrets managementReplace .env file with HashiCorp Vault or Azure Key Vault; rotate credentials without redeployingDevelopment convenience; .env is explicitly in .gitignore

18-Day Retrospective: What the Journey Looked Like

DaysPhaseKey Output
1–4Theory foundationData governance framework, EU regulatory alphabet, DORA, open-source stack selection
5–7Data layerSynthetic data generation, dbt models (staging→intermediate→mart), column-level lineage
8–9Quality + catalogGX expectation suites, Great Expectations HTML data docs, OpenMetadata with EBA glossary
10SecurityApache Ranger RBAC, column masking, row filters, DML restrictions via Trino
11–12XBRL generation + validationArelle taxonomy parsing, XBRL instance generation, three-pass validation
13OrchestrationFull Airflow DAG with 19 tasks, branching on quality/XBRL failure, SLA callbacks
14End-to-end lineageMarquez lineage snapshot, ECB audit query answerer, 5-layer lineage chain
15DashboardsSuperset via Trino, 5 virtual datasets, capital + governance audit dashboards
16Adversarial testing15 pytest tests — bad data, bad roles, broken XBRL, branch routing
17CapstoneThis post
18Production checklistTomorrow: what it takes to go from this Docker stack to a real bank deployment

By the Numbers

MetricValue
Docker services17
Python modules8 (ingest, quality, catalog, security, xbrl_gen, xbrl_valid + lineage + audit scripts)
dbt models13 (5 staging + 4 intermediate + 4 mart)
Ranger policies10 (4 access-allow, 3 column-mask, 3 row-filter)
GX expectation suites4 (raw_capital, raw_rwa, raw_liquidity, mart_corep)
Airflow tasks19 (including 3 quarantine branches)
XBRL templates covered4 (C 01.00, C 02.00, C 03.00, C 47.00)
EBA XBRL concepts mapped26 (across all 4 templates)
Adversarial test cases15 (5 data quality, 7 security, 3 XBRL integrity)
Lines of application code~2,400
Infrastructure cost£0 (all open source)

Why Open Source Is Sufficient for Regulatory Reporting

The most common objection to this stack in the banking industry is: “we can’t use open-source tools for regulatory reporting — we need vendor support.” That objection conflates two different concerns.

Vendor support is a procurement and risk management concern — if Arelle has a bug that generates invalid XBRL, who do you call? The answer is: the EBA’s own XBRL rendering service runs Arelle. Every major European bank uses Arelle for XBRL generation. The EBA itself publishes the taxonomy in Arelle-compatible format. For dbt, the company (dbt Labs) offers enterprise support contracts. For Airflow, Astronomer does. Apache Ranger and OpenMetadata both have commercial distributions (Cloudera CDP, Collibra).

The tools themselves are fit for purpose. What this project demonstrates is that the open-source stack can produce a submission-ready, regulator-auditable XBRL package with full end-to-end lineage, role-based access control, and adversarially-tested governance gates — at zero infrastructure cost. A real bank adds vendor support contracts and production hardening on top of the same architecture. The architecture itself does not change.

Capstone Takeaways

  • The Trino-as-gateway pattern is the single most important architecture decision — it makes Ranger enforcement universal and eliminates per-application security logic
  • XBRL calculation linkbase integrity is a data modelling constraint, not an output formatting concern — build it into the mart model CTEs from day one
  • The Ranger/OpenMetadata tag sync gap is the most significant open-source limitation — bridge it with a CI sync check script until a native integration exists
  • Synthetic data must be generated bottom-up from business constraints (capital ratios, LCR thresholds) — not top-down from column definitions
  • The six governance pillars (lineage, quality, catalog, security, auditability, regulatory output) are not independent — a control gap in any one pillar undermines the others
  • An adversarial test suite that proves controls reject bad input is more valuable to a regulator than a test suite that proves the happy path works
  • Every governance control that lives in the application layer can be bypassed by a developer with direct database access — enforce at the data layer (Ranger + pg_audit)
  • The end product — a submission ZIP with XBRL instances, validation reports, GX data docs, and a manifest — is what a regulator actually sees: not your pipeline architecture, but the artefacts it produces

Tomorrow, the final blog: Day 18 — the production readiness checklist. What it takes to move from this Docker Compose proof-of-concept to a deployment that a Chief Data Officer would sign off on before a real ECB submission.


Previous Post

Testing a Regulatory Pipeline — Bad Data, Bad Roles, and What Should Fail

Next Post
Add a comment

Leave a Reply

Your email address will not be published. Required fields are marked *