How I Built a COREP Regulatory Reporting Pipeline in 18 Days — From Data Governance Theory to Valid XBRL

Full retrospective: how I built a COREP XBRL pipeline with dbt, Ranger, Arelle and Airflow in 18 days — architecture decisions, hardest problems, and what’s next.

Bhanu

May 8, 2026

10 min read

Day 17 of 18 — COREP Governance Pipeline · Capstone

The retrospective: what was built, what broke, what the architecture decision log looks like in hindsight, and what a real bank would need to add before going live.

Eighteen days ago I set a constraint: build a complete COREP regulatory reporting pipeline using only open-source tools, learn EU banking governance from scratch, and write a blog post every day documenting both the theory and the implementation. Today is the retrospective.

The end product is a working pipeline that ingests synthetic bank data, transforms it through a governed dbt layer, validates it against the EBA XBRL taxonomy with Arelle, and produces a submission-ready ZIP file — all orchestrated by Apache Airflow with column-level lineage to Marquez, row/column security via Apache Ranger, and a regulatory dashboard in Apache Superset. Every component is open source. The total cost of the infrastructure is the electricity to run Docker Desktop.

This post covers: what was actually built, the six governance pillars the project was designed around, the one architecture decision that changed everything (Trino as the single query gateway), the three hardest technical problems encountered, and what a production deployment at a real bank would need on top of all this.

What Was Actually Built

SOURCE DATA (CSV) │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ INGESTION LAYER │ │ IngestModule → raw.* schema (PostgreSQL) │ │ OpenLineage COMPLETE event → Marquez │ │ Source CSVs archived → MinIO corep-eba-source │ └──────────────────────────┬──────────────────────────────────────────┘ │ QualityModule(layer=”raw”) — GX gate │ FAIL → quarantine_raw_failure (Airflow) ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ TRANSFORMATION LAYER │ │ dbt staging → views (stg_capital_instruments, stg_rwa …) │ │ dbt intermediate → tables (int_capital_by_tier …) │ │ dbt mart → tables (corep_c0100, c0200, c0300, c4700) │ │ dbt-openlineage → column-level lineage → Marquez on every run │ └──────────────────────────┬──────────────────────────────────────────┘ │ QualityModule(layer=”mart”) — GX gate │ FAIL → quarantine_mart_failure (Airflow) ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ GOVERNANCE LAYER │ │ OpenMetadata 1.3.4 — catalog + EBA glossary + PII tags │ │ Apache Ranger — RBAC + column masks + row filters on Trino │ │ pg_audit — DDL/DML backstop at PostgreSQL level │ └──────────────────────────┬──────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ REPORTING LAYER │ │ XbrlGenModule — reads mart via Trino, generates XBRL instance │ │ XbrlValidModule — Arelle structural + calc + formula linkbase │ │ FAIL → quarantine_xbrl_failure (Airflow) │ │ PASS → submission ZIP (XBRL + validation report + GX data docs) │ │ → MinIO corep-xbrl-output/submissions/ │ └──────────────────────────┬──────────────────────────────────────────┘ │ ▼ Superset Dashboards (via Trino only) Capital Ratios + Governance Audit view

The complete codebase: 17 Docker services, 8 Python modules, 4 dbt mart models, 10 Ranger policies, 4 GX expectation suites, and a pytest suite with 15 adversarial test cases.

The Six Governance Pillars

The 18-day plan was structured around six pillars that the EBA and ECB specifically audit when reviewing a bank’s COREP data quality framework. Here is how each pillar was addressed:

1 · Data Lineage

Three layers: pipeline lineage (openlineage-airflow → Marquez), transformation lineage (dbt-openlineage → Marquez column facets), catalog lineage (OpenMetadata federating Marquez). A regulator can trace any XBRL fact to its source CSV row.

2 · Data Quality

Great Expectations 0.18 with four expectation suites. Two hard gates in the Airflow DAG: raw layer (structural validation) and mart layer (business rules including CET1 floor). Failures quarantine the run before any data reaches XBRL generation.

3 · Data Catalog

OpenMetadata 1.3.4 with auto-discovery of PostgreSQL schemas, dbt manifest ingestion for column descriptions, EBA COREP glossary with concept IDs, and GDPR PII tags on borrower name/LEI columns in the raw layer.

4 · Data Security

Apache Ranger with five roles (corep_reporting, risk_analyst, data_engineer, auditor, pipeline_service). Column masking on LEI/counterparty name, row filters on jurisdiction, DML restriction to pipeline_service only. Enforced at Trino, not at the application layer.

5 · Data Auditability

Every pipeline run writes a structured JSON audit record to audit.pipeline_runs (PostgreSQL). pg_audit captures DDL/DML at the database level. Ranger audit logs stream to PostgreSQL and MinIO. XBRL validation reports and GX HTML data docs archived to MinIO with object retention.

6 · Regulatory Output

Arelle 2.30.3 generates and validates XBRL instances against the EBA taxonomy. Three validation passes: structural (lxml), calculation linkbase (parent = sum of children), formula linkbase (EBA business rules including capital floor checks). Output is a submission ZIP file.

The Architecture Decision That Changed Everything: Trino as the Single Query Gateway

The most consequential architecture decision made on Day 4 was this: every consumer of data — dbt, GX, Superset, XbrlGenModule, all test code — connects to Trino, never directly to PostgreSQL. At the time it felt like a slightly awkward constraint. In retrospect it is the single decision that makes the entire governance stack coherent.

Here is what it actually delivered:

Concern	Without Trino Gateway	With Trino Gateway
Ranger column masking	Must be implemented in every application (dbt, Superset, Python)	Enforced once in Trino — all consumers get it automatically
Adding a new data source (Iceberg on MinIO)	Every consumer needs a new connection + driver	Add a Trino catalog — all consumers query it with the same SQL
Audit trail completeness	pg_audit only captures direct PostgreSQL connections	Trino + Ranger audit logs capture every query regardless of consumer
Superset security	Superset RLS must mirror Ranger policies — two systems to maintain	Superset inherits Ranger decisions via Trino — single enforcement point
Future query federation	Requires application-level JOIN across two connections	Trino federated JOIN across PostgreSQL + Iceberg in a single SQL statement

The cost was one additional hop in the query path (application → Trino → PostgreSQL) and the need to qualify every table as postgresql.schema.table rather than schema.table. That is a trivial cost relative to what it buys.

The principle: In a regulated environment, every security control that lives in the application layer is a control that can be bypassed by a developer connecting directly to the database. The only controls that count are the ones enforced at the data layer — and Trino is that layer.

The Three Hardest Technical Problems

1 · XBRL Formula Validation — Parents Must Equal Children

The calculation linkbase problem was not obvious until Day 12. The EBA taxonomy specifies that c0010 (Own Funds) = c0020 (CET1) + c0030 (AT1) + c0040 (T2) via a calculation linkbase arc with weight +1. Arelle enforces this with tolerance decimals=-3, meaning a EUR 1,000 rounding difference is acceptable but a EUR 1,001 difference is an xbrlCalcs:inconsistency error that fails validation.

The root cause: the dbt mart models computed each tier independently using separate CTEs, introducing rounding divergence at the EUR thousandths position. The fix required two changes:

In dbt: all four COREP concepts (c0010, c0020, c0030, c0040) must be derived from the same CTE using a single SUM(amount_eur) grouped by tier, then assembled with FILTER(WHERE tier = 'CET1'). Parent = sum of same-CTE children by construction.
In xbrl_gen.py: rounding (ROUND_HALF_UP, decimals=-3) is applied exactly once, terminally, at the point of writing the XBRL fact. Never round in intermediate dbt models — let the mart values carry full precision and round only when writing to XML.

The lesson: XBRL calculation linkbase validation is not a reporting concern — it is a data modelling concern. The constraint must be enforced in the mart model, not patched in the XBRL generator.

2 · Ranger Tag Synchronisation with OpenMetadata

The plan was to tag columns as GDPR.PII in OpenMetadata and have Ranger automatically pick up those tags via the Ranger Tag Sync service to enforce column masking. In practice, the open-source version of Ranger Tag Sync (Apache Atlas integration) does not support OpenMetadata as a tag source out of the box — that integration requires Ranger 2.4+ with a custom service definition, or the commercial Cloudera version.

The workaround adopted here: OpenMetadata PII tags are informational (they drive the catalog UI and column-level lineage annotations), while Ranger policies are defined explicitly by column name in column_mask_policies.json. The two systems agree on what is PII — but they do not sync automatically. A CI step in the pipeline validates that every column tagged GDPR.PII in OpenMetadata has a corresponding Ranger mask policy, and fails the build if the two diverge.

# tools/check_pii_ranger_sync.py
# Run in CI — compares OM PII tags against Ranger mask policy coverage

import requests, json, sys

OM_URL    = "http://localhost:8585"
RANGER_URL = "http://localhost:6080"

# 1. Collect all columns tagged GDPR.PII in OpenMetadata
resp = requests.get(
    f"{OM_URL}/api/v1/search/query",
    params={"q": "tag:GDPR.PII", "index": "column_search_index", "limit": 100},
    headers={"Authorization": "Bearer <token>"},
)
pii_columns = {
    f"{hit['_source']['table']}.{hit['_source']['name']}"
    for hit in resp.json()["hits"]["hits"]
}

# 2. Collect all columns covered by Ranger mask policies
ranger_resp = requests.get(
    f"{RANGER_URL}/service/plugins/policies",
    params={"serviceName": "corep-trino", "policyType": "1"},   # type 1 = masking
    auth=("admin", "<password>"),
)
masked_columns = {
    f"{res['table']}.{col}"
    for policy in ranger_resp.json()["policies"]
    for res in policy["resources"].get("table", {}).get("values", [])
    for col in res.get("column", {}).get("values", [])
}

unprotected = pii_columns - masked_columns
if unprotected:
    print(f"[FAIL] PII columns with no Ranger mask: {unprotected}")
    sys.exit(1)
print("[PASS] All PII columns covered by Ranger mask policies.")

3 · Synthetic Data Consistency Across Six Tables

The six source CSV tables (capital_instruments, rwa_exposures, loans, counterparties, liquidity_assets, liquidity_outflows) are individually easy to generate. Making them internally consistent — so that the COREP templates computed from them produce plausible ratios that exercise all dbt code paths without triggering the GX regulatory floor checks — required careful reverse-engineering.

The constraints that had to be simultaneously satisfied:

Constraint	Value Used	Why
CET1 ratio ≥ 4.5%	12.8%	Clears GX mart gate; realistic for a mid-size European bank
Total Capital ratio ≥ 8%	16.4%	Basel III Pillar 1 minimum
LCR ≥ 100%	143%	CRR Article 412 — EBA formula linkbase will check this
RWA exposure classes cover all 7 standard classes	All populated	GX expects no NULL exposure_class values; exercises all dbt GROUP BY branches
c0010 = c0020 + c0030 + c0040 exactly	Built from bottom up	Calculation linkbase — generate tiers first, sum to own funds
No counterparty.name or LEI in mart schema	Dropped in stg_counterparties	GDPR — staging model explicitly excludes PII columns

The resolution: generate the data bottom-up. Start with the tier amounts (CET1, AT1, T2), compute Own Funds as their exact sum, derive RWA from a target CET1 ratio, then generate loans and exposures consistent with that RWA. A single Python script enforces all constraints at generation time so the pipeline never has to work around bad input data.

Running the Complete Pipeline — What the End-to-End Looks Like

$ docker compose up -d # 17 services start $ python pipeline.py –module ingest # Load raw CSVs $ dbt run –profiles-dir dbt/ # Build mart tables $ python pipeline.py –module quality # GX gates $ python pipeline.py –module xbrl_gen # Generate XBRL instance $ python pipeline.py –module xbrl_valid # Validate with Arelle # → output/xbrl/corep_c0100_2025-12-31.xbrl # → output/xbrl/validation_report_2025-12-31.json # → MinIO: corep-xbrl-output/submissions/# Or trigger the full orchestrated run via Airflow: $ airflow dags trigger corep_governance_pipeline \ –conf ‘{“reporting_date”: “2025-12-31”}’ # DAG runs 19 tasks over ~45 minutes # Final artefact: corep-xbrl-output/submissions/corep_submission_2025-12-31.zip

The submission ZIP contains:

corep_submission_2025-12-31.zip
├── corep_c0100_2025-12-31.xbrl        ← XBRL instance (Own Funds)
├── corep_c0200_2025-12-31.xbrl        ← XBRL instance (RWA)
├── corep_c0300_2025-12-31.xbrl        ← XBRL instance (Capital Ratios)
├── corep_c4700_2025-12-31.xbrl        ← XBRL instance (LCR)
├── validation_report_2025-12-31.json  ← Arelle pass/fail with all error codes
├── gx_data_docs/                      ← GX HTML reports for raw + mart gates
└── manifest.json                      ← Pipeline run metadata, versions, hashes

What a Real Bank Would Need on Top of This

This project is a complete proof-of-concept. Moving it to a production environment at a regulated institution would require the following additions:

Gap	What to Add	Why It’s Out of Scope Here
Real EBA taxonomy download	Automate taxonomy update from EBA XBRL website on each release (quarterly)	Using a pinned static copy; taxonomy changes require re-validation of all mart models
Multi-entity reporting	Add a `legal_entity_id` dimension to all mart models and XBRL contexts	Synthetic data covers a single bank entity
Source system connectivity	Replace CSV ingest with Kafka connectors or REST API extractors from core banking / risk systems	Real source systems require vendor contracts and VPN access
Kerberos / LDAP authentication	Replace Trino HTTP auth with Kerberos for Ranger to evaluate real user identities	Dev environment passes username as a plain string
Ranger tag sync	Build or buy the OpenMetadata → Ranger tag sync bridge (see hardest problems)	Requires custom Ranger service definition or commercial Cloudera CDP
4-eyes principle	Add an approval step in the Airflow DAG (HumanInTheLoop sensor) before submission ZIP is finalised	Regulatory requirement in several EU jurisdictions — trivial to add
Immutable audit log	Stream audit.pipeline_runs to an append-only store (MinIO object lock, or a separate audit database with no UPDATE/DELETE rights)	pg_audit provides database-level immutability; application-level audit table is mutable by admins
Secrets management	Replace `.env` file with HashiCorp Vault or Azure Key Vault; rotate credentials without redeploying	Development convenience; .env is explicitly in .gitignore

18-Day Retrospective: What the Journey Looked Like

Days	Phase	Key Output
1–4	Theory foundation	Data governance framework, EU regulatory alphabet, DORA, open-source stack selection
5–7	Data layer	Synthetic data generation, dbt models (staging→intermediate→mart), column-level lineage
8–9	Quality + catalog	GX expectation suites, Great Expectations HTML data docs, OpenMetadata with EBA glossary
10	Security	Apache Ranger RBAC, column masking, row filters, DML restrictions via Trino
11–12	XBRL generation + validation	Arelle taxonomy parsing, XBRL instance generation, three-pass validation
13	Orchestration	Full Airflow DAG with 19 tasks, branching on quality/XBRL failure, SLA callbacks
14	End-to-end lineage	Marquez lineage snapshot, ECB audit query answerer, 5-layer lineage chain
15	Dashboards	Superset via Trino, 5 virtual datasets, capital + governance audit dashboards
16	Adversarial testing	15 pytest tests — bad data, bad roles, broken XBRL, branch routing
17	Capstone	This post
18	Production checklist	Tomorrow: what it takes to go from this Docker stack to a real bank deployment

By the Numbers

Metric	Value
Docker services	17
Python modules	8 (ingest, quality, catalog, security, xbrl_gen, xbrl_valid + lineage + audit scripts)
dbt models	13 (5 staging + 4 intermediate + 4 mart)
Ranger policies	10 (4 access-allow, 3 column-mask, 3 row-filter)
GX expectation suites	4 (raw_capital, raw_rwa, raw_liquidity, mart_corep)
Airflow tasks	19 (including 3 quarantine branches)
XBRL templates covered	4 (C 01.00, C 02.00, C 03.00, C 47.00)
EBA XBRL concepts mapped	26 (across all 4 templates)
Adversarial test cases	15 (5 data quality, 7 security, 3 XBRL integrity)
Lines of application code	~2,400
Infrastructure cost	£0 (all open source)

Why Open Source Is Sufficient for Regulatory Reporting

The most common objection to this stack in the banking industry is: “we can’t use open-source tools for regulatory reporting — we need vendor support.” That objection conflates two different concerns.

Vendor support is a procurement and risk management concern — if Arelle has a bug that generates invalid XBRL, who do you call? The answer is: the EBA’s own XBRL rendering service runs Arelle. Every major European bank uses Arelle for XBRL generation. The EBA itself publishes the taxonomy in Arelle-compatible format. For dbt, the company (dbt Labs) offers enterprise support contracts. For Airflow, Astronomer does. Apache Ranger and OpenMetadata both have commercial distributions (Cloudera CDP, Collibra).

The tools themselves are fit for purpose. What this project demonstrates is that the open-source stack can produce a submission-ready, regulator-auditable XBRL package with full end-to-end lineage, role-based access control, and adversarially-tested governance gates — at zero infrastructure cost. A real bank adds vendor support contracts and production hardening on top of the same architecture. The architecture itself does not change.

Capstone Takeaways

The Trino-as-gateway pattern is the single most important architecture decision — it makes Ranger enforcement universal and eliminates per-application security logic
XBRL calculation linkbase integrity is a data modelling constraint, not an output formatting concern — build it into the mart model CTEs from day one
The Ranger/OpenMetadata tag sync gap is the most significant open-source limitation — bridge it with a CI sync check script until a native integration exists
Synthetic data must be generated bottom-up from business constraints (capital ratios, LCR thresholds) — not top-down from column definitions
The six governance pillars (lineage, quality, catalog, security, auditability, regulatory output) are not independent — a control gap in any one pillar undermines the others
An adversarial test suite that proves controls reject bad input is more valuable to a regulator than a test suite that proves the happy path works
Every governance control that lives in the application layer can be bypassed by a developer with direct database access — enforce at the data layer (Ranger + pg_audit)
The end product — a submission ZIP with XBRL instances, validation reports, GX data docs, and a manifest — is what a regulator actually sees: not your pipeline architecture, but the artefacts it produces

Tomorrow, the final blog: Day 18 — the production readiness checklist. What it takes to move from this Docker Compose proof-of-concept to a deployment that a Chief Data Officer would sign off on before a real ECB submission.

Published: May 08, 2026

Updated: May 08, 2026

Testing a Regulatory Pipeline — Bad Data, Bad Roles, and What Should Fail

May 8, 2026

12 min read

From Docker Compose to Production: The CDO’s Checklist Before Your First Real ECB Submission

May 11, 2026

14 min read

Add a comment

You May Be Interested

Infographic explaining CRR, BCBS 239, DORA, and GDPR with icons for banking, data lineage, resilience, and privacy, showing how EU banking regulations connect and impact data engineering

The EU Banking Regulatory Alphabet Explained in Plain English

May 2, 2026

14 min read

“Illustration showing a banker, secure data shield, and compliance icons symbolizing data governance challenges in banks.

What is Data Governance and Why Do Banks Lose Sleep Over It?

May 1, 2026

7 min read

Navigating Table Formats in Data Engineering: A Practical Guide for 2026

February 11, 2026

6 min read

Production-Ready GKE: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

Production-Ready GKE: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

Production-Ready GKE: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

Production-Ready GKE: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

Production-Ready EKS: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

How I Built a COREP Regulatory Reporting Pipeline in 18 Days — From Data Governance Theory to Valid XBRL

What Was Actually Built

The Six Governance Pillars

1 · Data Lineage

2 · Data Quality

3 · Data Catalog

4 · Data Security

5 · Data Auditability

6 · Regulatory Output

The Architecture Decision That Changed Everything: Trino as the Single Query Gateway

The Three Hardest Technical Problems

1 · XBRL Formula Validation — Parents Must Equal Children

2 · Ranger Tag Synchronisation with OpenMetadata

3 · Synthetic Data Consistency Across Six Tables

Running the Complete Pipeline — What the End-to-End Looks Like

What a Real Bank Would Need on Top of This

18-Day Retrospective: What the Journey Looked Like

By the Numbers

Why Open Source Is Sufficient for Regulatory Reporting

Capstone Takeaways

Testing a Regulatory Pipeline — Bad Data, Bad Roles, and What Should Fail

From Docker Compose to Production: The CDO’s Checklist Before Your First Real ECB Submission

Leave a Reply Cancel reply

You May Be Interested

The EU Banking Regulatory Alphabet Explained in Plain English

What is Data Governance and Why Do Banks Lose Sleep Over It?

Navigating Table Formats in Data Engineering: A Practical Guide for 2026