Infographic illustrating how simulated bank data flows through a COREP reporting pipeline, showing input parameters, a data simulation engine, synthetic data generation, and EBA‑calibrated analysis dashboards.

Simulating Bank Data for COREP — a visual overview of how synthetic datasets calibrated with real EBA statistics can power regulatory reporting prototypes.

Simulating a Bank’s Data for COREP — Calibrated with Real EBA Data

Real bank data is locked behind NDAs. This post shows how to simulate a mid-tier EU bank’s capital, RWA, and LCR data using SDV’s Gaussian copula — anchored to the EBA’s own published figures — so your COREP pipeline always runs against statistically realistic, internally consistent numbers.

Bhanu

May 7, 2026

8 min read

You want to build a production-quality COREP reporting pipeline. The problem: real bank data lives behind NDAs, PCI-DSS vaults, and legal reviews that take months. Every tutorial on regulatory reporting sidesteps this inconvenient truth and tells you to “use your own data.” This post does the opposite — it shows you exactly how to manufacture bank data that is statistically indistinguishable from a real mid-tier EU bank, calibrated against publicly available EBA figures.

By the end you will know:

Which EBA publication gives you free calibration numbers
Why a Gaussian copula beats column-by-column random generation
How to generate LEI codes that look real but are guaranteed to fail checksum validation (so they can never be accidentally submitted)
How a Great Expectations quality gate on synthetic data teaches you the real pre-submission validation workflow

Why `random.random()` Produces Insolvent Banks

The naive approach — draw random floats, stuff them in a dataframe, load to Postgres — produces nonsense within minutes. Here is a concrete example:

CET1 capital randomly sampled: €420 million
Total RWA randomly sampled: €180 billion
Resulting CET1 ratio: 0.23%

That bank is not filing COREP — it is in resolution proceedings. Basel III requires CET1 ≥ 4.5%. Your dbt test on corep_c0300 will immediately fail, but worse, the failure message gives you no signal about whether the pipeline is broken or the data is broken.

COREP data has three structural constraints that pure randomness violates:

Constraint	What breaks if violated
Internal consistency Tier 1 = CET1 + AT1 Total Own Funds = Tier 1 + T2	Arelle EBA formula validation rejects the XBRL file
Ratio constraints CET1 ≥ 4.5%, LCR ≥ 100%, Leverage ≥ 3%	dbt schema tests fail immediately; pipeline branches to `notify_failure`
Cross-table correlation Capital base must scale with exposure volume	Capital ratios oscillate wildly across reference periods — impossible to test time-series logic

The solution is calibrated synthesis: anchor distributions to real EU banking sector figures, then use a statistical model to generate correlated rows that honour those distributions.

The Free Calibration Source: EBA Transparency Exercise

The EBA EU-Wide Transparency Exercise publishes bank-level COREP figures for ~120 EU/EEA banks twice a year — completely free, in Excel. It contains exactly the numbers we need: CET1 amounts, RWA broken down by exposure class, and LCR components.

For this project I used the 2024 H1 release and extracted statistics for the mid-tier cohort (banks with total assets €50–200 billion — the most common size class in the EU):

Metric	Median (€M)	Std Dev (€M)	Used in
CET1 capital	6,200	2,800	`raw.capital_instruments`
AT1 instruments	850	400	`raw.capital_instruments`
T2 instruments	1,100	500	`raw.capital_instruments`
Total RWA	52,000	18,000	`raw.rwa_exposures`
Corporate % of RWA	38%	9%	`raw.rwa_exposures`
Retail % of RWA	29%	7%	`raw.rwa_exposures`
HQLA Level 1 buffer	8,400	3,200	`raw.liquidity_assets`
LCR ratio (reported)	162%	35%	validation benchmark
CET1 ratio (reported)	15.8%	2.9%	validation benchmark

These are distribution parameters, not rows of data. The generator samples from them.

The Six Raw Tables — What Goes in Each

The pipeline ingests into six tables in the raw PostgreSQL schema. Here is the calibration logic for each.

1. `raw.capital_instruments`

One row per capital instrument: a CET1 equity entry (ordinary shares + retained earnings), AT1 perpetual bonds (3–8 instruments typical), and T2 subordinated bonds (4–12). CET1 amounts are drawn from Normal(6200, 2800) truncated to [2000, 15000]. AT1 and T2 amounts are derived as a percentage of the CET1 draw — preserving the correlation that a better-capitalised bank also issues proportionate hybrid instruments.

2. `raw.rwa_exposures`

One row per loan/exposure: borrower ID, exposure class, EAD (Exposure at Default), and risk weight percentage. RWA per row = EAD × risk_weight_pct. Total RWA is controlled by the calibration anchor, then decomposed by the EBA sector percentages.

3. `raw.loans`

Extended loan attributes: origination date, maturity, sector, and collateral type. Kept for future NSFR calculations (out of scope for this 18-day project) but present in the schema for completeness.

4. `raw.counterparties`

Legal entity name and LEI — the only table with PII-adjacent data. The staging model deliberately drops both columns before any downstream transformation, enforcing GDPR data minimisation in SQL, not in policy documents.

5. `raw.liquidity_assets`

HQLA assets by level: Level 1 (sovereign bonds, 0% haircut), Level 2A (covered bonds, 15% haircut), Level 2B (investment-grade corporate bonds, 25–50% haircut). The LCR calculation in int_lcr_hqla.sql reads exclusively from this table.

6. `raw.liquidity_outflows`

30-day contractual cash outflows and EBA stress rates: retail deposits 3–10%, unsecured wholesale 40%, committed credit facilities 10%. Stressed outflow = contractual_amount × stress_rate.

The Generator: Why SDV’s Gaussian Copula Instead of `pandas.sample()`

A copula models the joint distribution of multiple columns. When you sample from it, large values in one column pull appropriately scaled values in the others — just like real data. Without a copula you might sample CET1 = €12B but Total RWA = €20B, giving a CET1 ratio of 60% — technically valid but statistically impossible for a real bank of that size.

The two-step approach:

Build a calibrated seed dataset — 40 rows drawn from the EBA parameter distributions. This is the training data for the copula.
Fit and sample — GaussianCopulaSynthesizer fits on those 40 rows then produces thousands of new rows that respect the same joint distribution.

import numpy as np
import pandas as pd
from faker import Faker
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

fake = Faker("en_GB")
rng  = np.random.default_rng(seed=42)   # reproducible runs

# ── EBA calibration parameters (2024 H1 mid-tier cohort) ──────────
_CAP = dict(
    cet1_mean=6_200, cet1_std=2_800, cet1_lo=2_000, cet1_hi=15_000,
    at1_ratio_mean=0.137, at1_ratio_std=0.030,
    t2_ratio_mean=0.178,  t2_ratio_std=0.040,
)

def _build_seed(n: int = 40) -> pd.DataFrame:
    """40 rows calibrated to EBA mid-tier statistics."""
    cet1 = rng.normal(_CAP["cet1_mean"], _CAP["cet1_std"], n)
    cet1 = np.clip(cet1, _CAP["cet1_lo"], _CAP["cet1_hi"])
    at1  = cet1 * np.clip(
        rng.normal(_CAP["at1_ratio_mean"], _CAP["at1_ratio_std"], n),
        0.05, 0.28)
    t2   = cet1 * np.clip(
        rng.normal(_CAP["t2_ratio_mean"], _CAP["t2_ratio_std"], n),
        0.06, 0.38)
    return pd.DataFrame({"cet1": cet1, "at1": at1, "t2": t2})

def synthesise_capital(n_cet1=1, n_at1=6, n_t2=8) -> pd.DataFrame:
    seed  = _build_seed()
    meta  = SingleTableMetadata()
    meta.detect_from_dataframe(seed)
    synth = GaussianCopulaSynthesizer(meta)
    synth.fit(seed)

    n_total = n_cet1 + n_at1 + n_t2
    rows    = synth.sample(num_rows=n_total)
    tiers   = ["CET1"] * n_cet1 + ["AT1"] * n_at1 + ["T2"] * n_t2
    amounts = (list(rows["cet1"][:n_cet1].abs()) +
               list(rows["at1"][:n_at1].abs()) +
               list(rows["t2"][:n_t2].abs()))

    return pd.DataFrame({
        "instrument_id": [f"INST-{i:04d}" for i in range(n_total)],
        "isin":     [fake.bothify("??##########??##").upper()
                     for _ in range(n_total)],
        "tier":     tiers,
        "amount":   [round(a, 2) for a in amounts],
        "currency": "EUR",
        "issue_date": [str(fake.date_between("-12y", "-6m"))
                       for _ in range(n_total)],
    })

Generating LEI Codes That Intentionally Fail Checksum Validation

A Legal Entity Identifier (LEI) is a 20-character code: 4-char LOU prefix + 14 alphanumeric entity chars + 2 numeric check digits (ISO 17442 Luhn mod 97). For synthetic data we want the format to look real but the check digits to be deliberately wrong. If a synthetic LEI ever reaches a real XBRL submission, the EBA validator rejects it at checksum before it touches the regulator’s database.

import random, string

_LOU_PREFIXES = ["2138", "5493", "9695", "VGRQ", "W22L"]

def fake_lei() -> str:
    """
    Syntactically valid-looking LEI with deliberately wrong check
    digits ('00'). Fails ISO 17442 checksum — safe for synthetic use.
    """
    lou  = random.choice(_LOU_PREFIXES)
    body = "".join(
        random.choices(string.ascii_uppercase + string.digits, k=14))
    # '00' always fails Luhn mod-97 — intentional
    return lou + body + "00"

Document this in the OpenMetadata data contract on Day 9: “LEI column contains synthetic identifiers with invalid check digits. Not for regulatory submission.”

Risk Weight Calibration for RWA Exposures

The RWA table requires risk weights that match the EBA Standardised Approach (SA) under CRR3 / Basel III. Using random weights between 0% and 1000% will produce nonsense capital ratios. The CRR3 schedule for the most common exposure classes:

Exposure Class	Risk Weight Range	Most Common (SA)	% of Total RWA
Sovereign (OECD)	0%	0%	~8%
Institutions	20–150%	50%	~9%
Corporate	65–150%	100%	~38%
Retail (residential mortgage)	20–75%	35%	~29%
Retail (other)	75%	75%	~11%
Equity	100–250%	100%	~5%

The generator draws risk weights from Beta distributions anchored on those ranges, weighted by the EBA sector percentages. Total RWA closely matches the calibration target while individual rows stay internally consistent.

Quality Gates on Synthetic Data — The Real Lesson

After generation, the ingest module runs Great Expectations checks before committing any row to PostgreSQL:

from great_expectations.core import ExpectationSuite, ExpectationConfiguration

suite = ExpectationSuite(expectation_suite_name="raw.capital_instruments")

suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_in_set",
    kwargs={"column": "tier", "value_set": ["CET1", "AT1", "T2"]}
))
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_not_be_null",
    kwargs={"column": "amount"}
))
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_between",
    kwargs={"column": "amount", "min_value": 10.0, "max_value": 50_000.0}
))
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_table_row_count_to_be_between",
    kwargs={"min_value": 5, "max_value": 200}
))

# Custom: derived CET1 ratio must be between 4.5% and 35%
# SUM(amount WHERE tier='CET1') / SUM(ead * risk_weight_pct / 100)

If any expectation fails, the ingest module raises QualityGateError and Airflow branches to notify_failure — the same branch that fires when real production data fails validation. Running quality gates on synthetic data is not extra caution — it is practice for the pre-submission validation workflow every COREP filer must run before sending XBRL to the regulator.

I discovered this the hard way: the first generator run without calibration produced a leverage ratio of 1.1% — below the 3% minimum. The quality gate fired. I adjusted the total exposure parameter, re-ran, and the ratio recovered to 6.8%. That debugging loop took 10 minutes with synthetic data. With production data it would have been a multi-day investigation involving three teams.

The Resulting Synthetic Bank

After running the generator with seed=42, our synthetic bank — Corep National Bank S.A. — produces the following COREP metrics. All ratios emerge from the data; none are hard-coded:

C 03.00 / C 47.00 Metric	Synthetic Value	Basel III Floor	Status
CET1 ratio (r010/r060)	14.2%	4.5%	✓ Pass
Tier 1 ratio (r020/r060)	16.1%	6.0%	✓ Pass
Total Capital ratio (r030/r060)	18.3%	8.0%	✓ Pass
Leverage ratio	6.8%	3.0%	✓ Pass
LCR ratio (C 47.00)	148%	100%	✓ Pass

Compare to EBA 2024 H1 mid-tier medians: CET1 15.8%, LCR 162%. Our synthetic bank is intentionally slightly below median — solvent and compliant, but not suspiciously well-capitalised.

Running the Ingest Module

# Run only the ingest module
python pipeline.py --module ingest

# Verify via Trino (localhost:8080) — never query Postgres directly
SELECT
    tier,
    COUNT(*)            AS instrument_count,
    ROUND(SUM(amount))  AS total_eur_m
FROM postgresql.raw.capital_instruments
GROUP BY tier
ORDER BY tier;

-- Expected:
-- tier | instrument_count | total_eur_m
-- AT1  |        6         |    4,987
-- CET1 |        1         |    6,341
-- T2   |        8         |    6,104

Lineage for this run is automatically emitted to Marquez (port 5000) via BaseModule.emit_lineage(). Open http://localhost:5000, navigate to corep-governance-pipeline, and the ingest job appears with six output datasets mapping to the raw tables — zero extra OpenLineage API code required.

Three Governance Observations Worth Recording

1. Calibration is a data contract. The EBA parameter values in _CAP are a machine-readable specification of what valid bank data looks like. They belong in OpenMetadata as a data contract on raw.capital_instruments — not just as a comment in a Python file.

2. Quality gates on synthetic data are not optional. If someone changes a calibration parameter and the synthetic bank becomes insolvent, the quality gate is the first line of defence. Skipping quality gates on test data is the same mistake as skipping tests on test-only code.

3. Column-level lineage starts at ingest. When dbt runs on Day 6 and emits lineage, OpenMetadata will show that corep_c0300.cet1_ratio traces back through three transformations to raw.capital_instruments.amount. That lineage chain is only credible if ingest was also tagged — which it is, via BaseModule.

What Comes Next: dbt Transforms Raw Data into COREP Tables

The raw schema now holds a complete, internally consistent, calibrated synthetic bank. Days 6–7 build the dbt transformation layer: five staging models that clean and validate raw data, four intermediate models that compute capital aggregates and LCR components, and four mart models that produce the exact column layout required by EBA DPM 4.0 for C 01.00, C 02.00, C 03.00, and C 47.00.

The next post also shows how dbt run automatically emits column-level lineage to Marquez — the full stg_capital_instruments → int_capital_by_tier → corep_c0100 graph appears without a single line of OpenLineage API code.

Key Takeaways

Never use plain random numbers for regulatory test data — you will generate an insolvent bank within seconds
The EBA Transparency Exercise (free, biannual) gives you institution-level calibration anchors for every major COREP metric
SDV’s Gaussian copula preserves cross-column correlations — large CET1 produces proportionate AT1, T2, and RWA automatically
Fake LEIs must intentionally fail ISO 17442 checksum validation to prevent accidental regulatory submission
Quality gates on synthetic data train the same muscle as production pre-submission validation — never skip them
A constraint violation caught at the raw layer costs 10 minutes; caught at the XBRL layer it costs a day and a restatement conversation with the regulator
Column-level lineage starting at ingest is what makes the full OpenMetadata lineage graph credible on Day 9

Published: May 07, 2026

Updated: May 11, 2026

Infographic outlining an open‑source data governance stack for EU banking, showing tools like PostgreSQL, dbt, Marquez, Trino, Apache Ranger, Great Expectations, Arelle, Airflow, and OpenMetadata across storage, lineage, security, and regulatory reporting layers.

The Open-Source Data Governance Stack for EU Banking — Why I Chose Each Tool

May 5, 2026

8 min read

Infographic showing how dbt and OpenLineage create automated column‑level data lineage, with raw data sources feeding into dbt transformations and lineage metadata captured by OpenLineage, ending in governance dashboards for audit and compliance.

dbt + OpenLineage: Column-Level Data Lineage for Free

May 7, 2026

10 min read

Add a comment

Data Quality Gates for Regulatory Reporting — What Great Expectations Catches That dbt Tests Miss

May 7, 2026

12 min read

Infographic showing how dbt SQL maps bank data to COREP templates, connecting PostgreSQL, Trino, and Marquez data sources through dbt transformations to produce CRR3‑compliant XBRL reports using Arelle, Great Expectations, and OpenLineage.

Mapping Bank Data to COREP Templates — the dbt SQL That Makes It Work

May 7, 2026

11 min read

dbt + OpenLineage: Column-Level Data Lineage for Free

May 7, 2026

10 min read

Simulating a Bank’s Data for COREP — Calibrated with Real EBA Data

Why `random.random()` Produces Insolvent Banks

The Free Calibration Source: EBA Transparency Exercise

The Six Raw Tables — What Goes in Each

1. `raw.capital_instruments`

2. `raw.rwa_exposures`

3. `raw.loans`

4. `raw.counterparties`

5. `raw.liquidity_assets`

6. `raw.liquidity_outflows`

The Generator: Why SDV’s Gaussian Copula Instead of `pandas.sample()`

Generating LEI Codes That Intentionally Fail Checksum Validation

Risk Weight Calibration for RWA Exposures

Quality Gates on Synthetic Data — The Real Lesson

The Resulting Synthetic Bank

Running the Ingest Module

Three Governance Observations Worth Recording

What Comes Next: dbt Transforms Raw Data into COREP Tables

Key Takeaways

The Open-Source Data Governance Stack for EU Banking — Why I Chose Each Tool

Leave a Reply Cancel reply

You May Be Interested

Data Quality Gates for Regulatory Reporting — What Great Expectations Catches That dbt Tests Miss

Mapping Bank Data to COREP Templates — the dbt SQL That Makes It Work

Production-Ready GKE: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

Production-Ready GKE: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

Production-Ready GKE: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

Production-Ready GKE: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

Production-Ready EKS: The Complete Best Practices Guide for Enterprise Kubernetes Deployments

Simulating a Bank’s Data for COREP — Calibrated with Real EBA Data

Why random.random() Produces Insolvent Banks

The Free Calibration Source: EBA Transparency Exercise

The Six Raw Tables — What Goes in Each

1. raw.capital_instruments

2. raw.rwa_exposures

3. raw.loans

4. raw.counterparties

5. raw.liquidity_assets

6. raw.liquidity_outflows

The Generator: Why SDV’s Gaussian Copula Instead of pandas.sample()

Generating LEI Codes That Intentionally Fail Checksum Validation

Risk Weight Calibration for RWA Exposures

Quality Gates on Synthetic Data — The Real Lesson

The Resulting Synthetic Bank

Running the Ingest Module

Three Governance Observations Worth Recording

What Comes Next: dbt Transforms Raw Data into COREP Tables

Key Takeaways

The Open-Source Data Governance Stack for EU Banking — Why I Chose Each Tool

dbt + OpenLineage: Column-Level Data Lineage for Free

Leave a Reply Cancel reply

You May Be Interested

Data Quality Gates for Regulatory Reporting — What Great Expectations Catches That dbt Tests Miss

Mapping Bank Data to COREP Templates — the dbt SQL That Makes It Work

dbt + OpenLineage: Column-Level Data Lineage for Free

Why `random.random()` Produces Insolvent Banks

1. `raw.capital_instruments`

2. `raw.rwa_exposures`

3. `raw.loans`

4. `raw.counterparties`

5. `raw.liquidity_assets`

6. `raw.liquidity_outflows`

The Generator: Why SDV’s Gaussian Copula Instead of `pandas.sample()`