What is Data Governance and Why Do Banks Lose Sleep Over It?

This is Day 1 of an 18-day series building an open-source EU banking data governance pipeline. The end product is a valid EBA XBRL COREP submission produced by a fully governed, independently restartable data pipeline using only open-source tools.

I am spending the next 18 days building an end-to-end EU banking data governance pipeline that produces a real COREP regulatory report — the kind EU banks submit to the EBA every quarter. Before writing a single line of code, I want to answer the question that every non-banking data engineer asks when they first encounter this domain:

Why do banks treat data governance like a matter of life and death?

The short answer: because for them, it is.

What Data Governance Actually Is

Strip away the consulting jargon and data governance answers six questions about every piece of data in an organisation:

Who owns it? — which executive is accountable if it is wrong
Who maintains it? — which analyst monitors its quality day to day
What does it mean? — a single agreed definition, not seventeen variations
Is it good enough to use? — quality checks with defined thresholds
Who can see it? — access controls, masking, row-level restrictions
What happened to it? — an immutable audit trail of every read and write
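To make the six questions concrete, here is a minimal sketch of a metadata record that carries one field per question. The class name `GovernedColumn`, its field names, and the sample values are all hypothetical and exist only for illustration; a real catalog would store far more.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class GovernedColumn:
    """Hypothetical metadata record: one field per governance question."""
    name: str
    owner: str                 # Who owns it? (accountable executive)
    steward: str               # Who maintains it? (day-to-day analyst)
    definition: str            # What does it mean? (single agreed definition)
    max_null_rate: float       # Is it good enough? (quality threshold)
    allowed_roles: frozenset   # Who can see it? (access control)
    audit_log: list = field(default_factory=list)  # What happened to it?

    def record_access(self, user: str, role: str, action: str) -> bool:
        """Append an audit entry and report whether the access is permitted."""
        permitted = role in self.allowed_roles
        self.audit_log.append(
            (datetime.now(timezone.utc), user, role, action, permitted)
        )
        return permitted


# Illustrative values only.
cet1_capital = GovernedColumn(
    name="cet1_capital_eur",
    owner="Chief Financial Officer",
    steward="regulatory_reporting_analyst",
    definition="Common Equity Tier 1 capital after regulatory adjustments, in EUR",
    max_null_rate=0.0,
    allowed_roles=frozenset({"finance", "risk"}),
)

allowed = cet1_capital.record_access("alice", "finance", "read")
denied = cet1_capital.record_access("bob", "marketing", "read")
```

Denied attempts are logged as well as permitted ones, because an audit trail that only records successful reads cannot answer "who tried to see this?".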

In most industries these questions are answered informally, inconsistently, or not at all. Teams work around gaps. Problems get fixed quietly. Nobody goes to jail.

In banking, informal is not an option.

The Five Reasons Banks Are Different

1. Their Data Is Their Product

A bank does not manufacture anything physical. Its entire balance sheet — loans, deposits, derivatives, capital, liquidity positions — exists only as data. A loan is a number in a database. A capital ratio is a calculated figure. Liquidity is a model output.

When the data is wrong, the product is wrong. There is no physical inventory to count as a sanity check.

When Barclays reported incorrect capital ratios in 2012, the problem was not that they had less capital — it was that their data systems calculated it incorrectly. The restatement cost them billions in market capitalisation in a single day. The data was the capital, for all practical purposes.

2. Regulators Read Their Data Directly

In most industries, regulators audit companies periodically. In EU banking, regulators receive machine-readable data submissions on a fixed schedule and run automated validation checks against them immediately on receipt.

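COREP (Common Reporting) requires EU banks to submit structured capital and liquidity data in XBRL format on a fixed schedule. Banks file with their national competent authority (BaFin or DNB, for example), and the data flows on to the ECB and the EBA. Under the EBA reporting standards, quarterly figures are due roughly six weeks after quarter-end, with remittance dates of 12 May, 11 August, 11 November, and 11 February.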

The regulator is not reading a PDF. They are running automated validation rules against your numbers. If the CET1 ratio in template C 03.00 does not mathematically equal the CET1 capital in C 01.00 divided by the risk-weighted assets in C 02.00, their system flags it immediately.
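The cross-template check described above can be sketched in a few lines. This is a simplified illustration, not the EBA's actual validation rule: the function name, the tolerance, and the figures are assumptions of mine.

```python
from decimal import Decimal


def check_cet1_ratio(cet1_capital, total_rwa, reported_ratio,
                     tolerance=Decimal("0.0001")):
    """Return (passed, recomputed_ratio) for the consistency check.

    cet1_capital   -> the CET1 capital figure (C 01.00 in the text above)
    total_rwa      -> the risk-weighted assets figure (C 02.00)
    reported_ratio -> the CET1 ratio as reported (C 03.00)
    The tolerance is an assumed value, not a regulatory one.
    """
    if total_rwa <= 0:
        raise ValueError("risk-weighted assets must be positive")
    recomputed = cet1_capital / total_rwa
    return abs(recomputed - reported_ratio) <= tolerance, recomputed


# Illustrative figures in EUR.
ok, ratio = check_cet1_ratio(
    Decimal("12500000000"), Decimal("100000000000"), Decimal("0.125")
)
# Same inputs, but the reported ratio was keyed in wrongly.
bad, _ = check_cet1_ratio(
    Decimal("12500000000"), Decimal("100000000000"), Decimal("0.131")
)
```

`Decimal` is used instead of `float` so that the comparison is not disturbed by binary rounding noise.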

Your data quality is not internal hygiene. It is a live examination that happens every quarter, automatically, with no margin for “we will fix it in the next submission.”

3. The Penalties Are Existential

Data quality failures in banking do not result in a polite letter. They result in:

Capital add-ons: Under CRD IV, if supervisors do not trust your data, they can impose additional Pillar 2 capital requirements. Capital is expensive. A 1% add-on on €100B of risk-weighted assets ties up €1B of capital for as long as the add-on stays in place.
Dividend restrictions: Supervisors can block distributions to shareholders until data quality issues are remediated.
Management accountability: BCBS 239 Principle 1 makes the board and senior management responsible for risk data aggregation and reporting. When the ECB’s Joint Supervisory Team asks “who is responsible for the accuracy of your RWA data?”, a named person must stand up.
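Criminal liability: In Germany, the Banking Act (KWG) sanctions reporting breaches with administrative fines under §56, and §54a extends to criminal liability for management board members in the most serious cases of risk management failure.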
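The capital add-on arithmetic is simple enough to write down. The sketch below works in whole euros and basis points to avoid floating-point rounding. The function names are hypothetical, and the 10% cost of equity is an assumption chosen only to show the yearly cost.

```python
def pillar2_add_on_eur(rwa_eur: int, add_on_bps: int) -> int:
    """Extra capital required for an add-on expressed in basis points of RWA."""
    return rwa_eur * add_on_bps // 10_000


def annual_carry_cost_eur(capital_eur: int, cost_of_equity_bps: int) -> int:
    """Yearly cost of holding that capital at an assumed cost of equity."""
    return capital_eur * cost_of_equity_bps // 10_000


rwa = 100_000_000_000                                       # EUR 100B of RWA
extra_capital = pillar2_add_on_eur(rwa, 100)                # 1% add-on = 100 bps
yearly_cost = annual_carry_cost_eur(extra_capital, 1_000)   # assumed 10% cost of equity
```

Under these assumptions the add-on ties up €1B and costs about €100M a year to carry.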

Between 2015 and 2021, Deutsche Bank paid over $3.5 billion in global fines. A significant proportion traced back not to the underlying transactions being wrong, but to the data describing those transactions being inaccurate, incomplete, or reported late.

4. The 2008 Crisis Made It a Formal Requirement

Before 2013, banks governed data however they saw fit. The financial crisis exposed how badly most of them were doing it.

Regulators discovered that major global banks could not answer basic questions about their own risk exposure — not because they were hiding information, but because the data genuinely did not exist in a reliable, aggregated form. Getting a consolidated view of total counterparty exposure required weeks of manual spreadsheet work. By the time the number was ready, it was already stale.

The Basel Committee’s response was BCBS 239, published January 2013: 14 principles (11 addressed to banks, 3 to supervisors) requiring robust data infrastructure, governance frameworks, and reporting capabilities. The ECB now assesses BCBS 239 compliance as part of its annual SREP review. A low maturity score can feed directly into higher Pillar 2 capital requirements.

Poor data governance is not just a compliance risk. It costs real money every single year through higher capital requirements.

5. Their Data Is Legally Complicated

Banks handle more categories of sensitive, regulated data than almost any other type of organisation — simultaneously:

Personal data (GDPR) — customer names, credit history, income
Counterparty data (EMIR, CRR) — legal entity identifiers, exposure amounts, collateral
Prudential supervisory data (CRR/CRD) — capital, liquidity, risk figures submitted to regulators
AML/CTF data (AMLD6) — suspicious transaction reports, beneficial ownership
Market-sensitive data (MAR) — pre-trade information that cannot cross trading desks

Each category has different retention periods, access restrictions, and breach notification timelines. GDPR says minimise data. CRR Art. 74 says keep it for 7 years. AMLD6 says keep AML records for 5 years. All three clocks run simultaneously from different trigger events.
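One way to reason about overlapping retention clocks: a record can only be erased once every clock that applies to it has expired. The sketch below uses the retention periods quoted in this article (confirm them against the legal texts before relying on them). The trigger events, dates, and function names are hypothetical.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


# Retention periods as quoted in this article, in years.
RETENTION_YEARS = {"prudential": 7, "aml": 5}


def earliest_erasure_date(trigger_dates: dict) -> date:
    """Latest expiry across all retention clocks that apply to a record."""
    return max(
        add_years(trigger, RETENTION_YEARS[regime])
        for regime, trigger in trigger_dates.items()
    )


loan_record = {
    "prudential": date(2024, 3, 31),  # hypothetical trigger: reporting reference date
    "aml": date(2027, 6, 30),         # hypothetical trigger: end of customer relationship
}
erase_after = earliest_erasure_date(loan_record)
```

Here the shorter AML period still wins, because its clock started later. That is why retention cannot be managed with a single "keep for N years" setting per table.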

Managing this without a formal governance framework is not just operationally difficult. It is a compliance impossibility.

What Failure Looks Like in Practice

The Spreadsheet Problem

A mid-sized EU bank runs its quarterly COREP calculation in Excel. The regulatory reporting analyst manually copies figures from twelve source systems over 8 days. On Day 7 she finds a broken cell reference in the RWA tab. Nobody knows when the error was introduced. Three previous quarterly submissions may be wrong. The question is now whether to quietly fix it or formally restate — and under CRD IV, the correct answer is restate, which triggers a supervisory investigation, a capital add-on, and a management accountability review.

Root governance failure: No data lineage, no automated quality checks, no audit trail.

The Silo Problem

A large bank has 47 core banking systems from 23 acquisitions over 30 years. The term “corporate loan” means different things across 6 of those systems — different risk weight methodologies, different maturity conventions, different collateral valuation approaches. When the aggregation team tries to calculate total corporate exposure for the large exposures report, the numbers are incompatible. Three people who knew the manual adjustment factors left last year. The institutional memory left with them.

Root governance failure: No business glossary, no data ownership, no metadata management.

The Lineage Problem

The EBA issues a Q&A reclassifying a covered bond type from HQLA Level 2A to Level 2B. The haircut rises from 15% to 30%. The bank’s LCR drops from 118% to 97%, below the 100% regulatory minimum. The risk team needs to know within 24 hours which assets are affected and what the precise LCR impact is. It takes 11 days of manual investigation to find out. During those 11 days, the bank may be in regulatory breach without knowing it.

Root governance failure: No data lineage, no data catalog.
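With holdings tagged by asset type, the "which assets and what impact" question becomes a filter and a recalculation. The sketch below is deliberately simplified: it ignores the Level 2 composition caps, and every holding, haircut, and outflow figure is illustrative. Haircuts are passed in as inputs and should be taken from the applicable regulation, not from this example.

```python
from dataclasses import dataclass, replace
from decimal import Decimal


@dataclass(frozen=True)
class Holding:
    isin: str              # placeholder identifiers, not real ISINs
    asset_type: str
    market_value: Decimal
    haircut: Decimal


def lcr(holdings, net_outflows):
    """Liquidity buffer after haircuts divided by net cash outflows."""
    hqla = sum(h.market_value * (1 - h.haircut) for h in holdings)
    return hqla / net_outflows


def reclassify(holdings, asset_type, new_haircut):
    """Return (affected holdings, full portfolio with the new haircut applied)."""
    affected = [h for h in holdings if h.asset_type == asset_type]
    updated = [
        replace(h, haircut=new_haircut) if h.asset_type == asset_type else h
        for h in holdings
    ]
    return affected, updated


portfolio = [
    Holding("CASH-001", "central_bank_reserves", Decimal("80"), Decimal("0")),
    Holding("BOND-001", "covered_bond_type_x", Decimal("40"), Decimal("0.15")),
]
net_outflows = Decimal("100")

before = lcr(portfolio, net_outflows)
affected, updated = reclassify(portfolio, "covered_bond_type_x", Decimal("0.30"))
after = lcr(updated, net_outflows)
```

The hard part in practice is not this arithmetic but knowing, from lineage and catalog metadata, that `asset_type` is populated consistently across every source system.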

Data Governance as an Engineering Problem

For a data engineer, governance is not a management framework written in a PDF. It is a set of concrete, implementable engineering requirements:

Governance Need	Engineering Solution
What does this data mean?	OpenMetadata glossary + dbt column descriptions
Where did this data come from?	OpenLineage + Marquez lineage graph
Is this data accurate enough?	Great Expectations suites + dbt tests
Who is allowed to see it?	Apache Ranger RBAC + column masking
What happened and when?	pgAudit + pipeline audit log table

This is exactly the stack I am building over the next 18 days. Every governance pillar becomes a pipeline module. Every regulation becomes a data quality rule or an access policy. The COREP XBRL submission at the end is the proof that it all works.
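As a taste of what a pipeline audit log can look like, here is a minimal sketch of a hash-chained log, where each entry stores the hash of the one before it so that later edits become detectable. This is a generic technique I am using for illustration, not how pgAudit works, and the function and field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64
FIELDS = ("actor", "action", "target", "prev_hash")


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON rendering of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log: list, actor: str, action: str, target: str) -> None:
    """Append an entry that commits to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "action": action, "target": target,
            "prev_hash": prev_hash}
    log.append({**body, "hash": _digest(body)})


def verify(log: list) -> bool:
    """Recompute every hash and check that the chain is unbroken."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: entry[k] for k in FIELDS}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "etl_service", "write", "mart.rwa")
append_entry(audit_log, "alice", "read", "mart.rwa")
intact = verify(audit_log)
```

A chain like this detects tampering but does not prevent it. In a real pipeline the log would also need restricted write access and an independent copy.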

The Test That Matters

A well-governed bank can answer these five questions in under one hour, without manual investigation:

What is our current CET1 ratio and which exact source transactions contributed to it?
When was the LCR last calculated and is it based on today’s market values?
Which columns contain personal data and who accessed them in the last 30 days?
If the EBA changes the HQLA Level 2B definition tomorrow, which models would need to change and what is the estimated LCR impact?
Who approved the exposure classification of our 10 largest counterparties?

These are not hypothetical questions. They are the actual questions an ECB Joint Supervisory Team examiner asks on Day 1 of an on-site inspection. Banks that cannot answer quickly receive a supervisory finding. Findings have capital consequences.

The pipeline I am building can answer all five — through Marquez lineage, OpenMetadata ownership records, pipeline audit timestamps, Ranger access logs, and dbt model dependency graphs.
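The impact question (question four) is at heart a graph traversal. The sketch below walks a small hand-written lineage graph breadth-first to list everything downstream of a changed dataset. The dataset names are hypothetical, and the plain dictionary stands in for whatever a lineage tool would actually return.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> datasets built from it.
LINEAGE = {
    "src.securities_holdings": ["stg.hqla_classification"],
    "ref.hqla_haircuts": ["stg.hqla_classification"],
    "stg.hqla_classification": ["mart.hqla_buffer"],
    "mart.hqla_buffer": ["mart.lcr", "report.corep_lcr"],
    "src.loans": ["mart.rwa"],
    "mart.rwa": ["report.corep_own_funds"],
}


def downstream(graph: dict, start: str) -> list:
    """Breadth-first list of every dataset that depends on `start`."""
    seen, order = set(), []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream(LINEAGE, "ref.hqla_haircuts")
```

A change to the haircut reference table reaches the LCR models and report but leaves the RWA branch untouched. That is the kind of answer an examiner expects within the hour.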

That is what data governance looks like when it is built properly.

What’s Next

Tomorrow I go deep on the regulatory framework itself — CRR, CRD IV/V, BCBS 239, DORA, and GDPR — and exactly what data each regulation demands from a bank. By Day 3 I will be downloading the actual EBA COREP templates and walking through what source data feeds every row of a capital adequacy submission.

All code for this project will be open source on GitHub. Follow along for 17 more days.
