The Dark Data You Can't See. Part 1: How Can You Manage a Data Risk You Can't See?

David Houghton
May 6

David Houghton | Senior Consultant, BearingNode APAC

Most financial institutions have invested significantly in understanding their data. They have data warehouses, governance programmes, quality controls, and reporting frameworks. And yet, in my experience working with organisations across the region, there is a question that almost none of them can answer completely: what data do we actually hold, where is it, and what is it doing?

Not the data in the databases and data warehouses; that is usually reasonably well understood. I mean the data in the email archives, the document repositories, the contract management systems, the compliance knowledge bases, and the AI tools now embedded in everyday workflows. That data, which according to IDC and Gartner accounts for approximately 80% of everything an organisation holds, is for most institutions effectively invisible. It is collected, stored, and relied upon. But it is not seen, not governed, and not managed with anything close to the rigour applied to structured data.

That is not a technical footnote. It is one of the most significant and least acknowledged risk exposures in modern financial services.

In this first article, I want to set out the nature and scale of that risk: why the divide between visible and invisible data exists, why it is accelerating, and why financial institutions in particular face material consequences if they continue to treat unstructured data as beyond the reach of proper data management. In Part 2, I will turn to what a credible response looks like and how building genuine visibility across the full data estate restores the ability to manage, govern, and derive value from data assets of every kind.

More than just format

When we talk about structured versus unstructured data, we are not simply discussing file formats or database schemas. We are addressing fundamentally different approaches to data creation, storage, consumption, and governance, each demanding a distinct management strategy.

Structured data lives in our traditional comfort zone: relational databases, data warehouses, and well-defined schemas where every field has a purpose, every relationship is mapped, and every transformation is documented. The tools we use to monitor and manage this data excel here, tracking data lineage, monitoring quality metrics, and alerting on anomalies with precision.

Unstructured data, however, represents the wild frontier: emails, contracts, regulatory filings, audio recordings, images, and increasingly, AI-generated content. This data frequently contains the most commercially sensitive insights and the highest regulatory risks, yet it operates largely beyond the reach of conventional data management tools. The absence of predefined schema means that the systems built to monitor structured, typed data have no natural point of entry into the unstructured estate.

The result is a fundamental asymmetry. Organisations invest heavily in managing the 20% of their data they can see clearly, while the remaining 80% accumulates: unmonitored, ungoverned, and increasingly consequential.

Why this matters now: three converging forces

The pace of change in this area has accelerated markedly, and the numbers reflect it. Gartner reports that unstructured data is currently growing at three times the rate of structured data, with annual growth rates of 55 to 65%. The volume of unstructured data globally is expected to triple between 2023 and 2026. IDC puts the compound annual growth rate of unstructured data at 61%, with total global data projected to reach 180 zettabytes by 2025.

The deeper problem is that the gap between data collected and data understood is widening simultaneously. IBM research estimates that for most organisations, 80% of their data is dark: collected, but never analysed, governed, or fully understood. IDC puts the figure for unanalysed unstructured data even higher, at up to 90%. Institutions are accumulating data at extraordinary speed while their ability to see and manage that data stagnates.

Three forces are converging to make this invisible data a significantly more dangerous liability than it was even two years ago.

1. The regulatory perimeter is expanding

Regulatory obligations across every major financial services jurisdiction now extend well beyond the structured data estates where most compliance programmes have been focused, and the gap between what regulators expect and what most institutions can demonstrate is widening.

Despite BCBS 239 having been in force since 2016, industry surveys suggest that fewer than 10% of global banks are fully compliant with its principles. The standard requires comprehensive data lineage and quality monitoring across all data used in risk management: not just structured reporting data, but also the unstructured data that increasingly informs credit decisions, operational risk assessments, and regulatory examinations. The Basel Committee's own progress reports have consistently identified lineage and aggregation capability as the most persistent gaps across systemically important financial institutions.

The EU AI Act, now in force, classifies AI systems used to assess the creditworthiness of individuals as "high-risk" (systems used purely to detect financial fraud are carved out of that category), triggering explicit requirements on training data governance, data provenance, and audit logging that extend directly into unstructured data territory. Across the APAC region, operational resilience frameworks including APRA CPS 230 require institutions to maintain comprehensive visibility into all critical business services and their data dependencies, many of which are now substantially unstructured in nature.

GDPR and equivalent privacy regulations demand complete visibility into personal data processing, regardless of format. If a customer exercises their right to erasure, can you identify and remediate every reference to them across structured databases, email archives, document repositories, and AI model training datasets? For most institutions, the honest answer is: not reliably.
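To make the erasure question concrete, here is a minimal Python sketch that scans a few in-memory "repositories" for references to one customer. Everything in it is an assumption for illustration: the repository names, the sample records, and the `find_references` helper are hypothetical, and exact string matching stands in for the connectors and fuzzy name matching a real estate would need.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    repository: str  # which store the reference was found in
    item_id: str     # identifier of the document or message
    matched: str     # the customer identifier that matched


def find_references(repositories, identifiers):
    """Return every item whose text mentions any of the customer's identifiers.

    repositories: {repo_name: {item_id: text}}, a hypothetical in-memory stand-in
    identifiers:  strings that identify the customer (name, email, account number)
    """
    patterns = [(i, re.compile(re.escape(i), re.IGNORECASE)) for i in identifiers]
    hits = []
    for repo_name, items in repositories.items():
        for item_id, text in items.items():
            for identifier, pattern in patterns:
                if pattern.search(text):
                    hits.append(Hit(repo_name, item_id, identifier))
                    break  # one match is enough to flag the item for review
    return hits


# Illustrative data only: none of these records are real.
repositories = {
    "email_archive": {
        "msg-001": "Complaint received from Jane Example regarding fees.",
        "msg-002": "Quarterly newsletter draft.",
    },
    "contracts": {
        "doc-17": "Facility agreement. Contact: jane.example@example.com",
    },
    "rag_knowledge_base": {
        "kb-3": "Policy on fee disclosure.",
    },
}

hits = find_references(repositories, ["Jane Example", "jane.example@example.com"])
for hit in hits:
    print(hit.repository, hit.item_id, hit.matched)
```

The sketch only finds what it is pointed at. The hard part in practice is the inventory of repositories itself, which is exactly the visibility gap described above.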

2. Agentic AI has changed the risk profile

Financial institutions are now deploying AI systems that do not simply read data: they act on it. Unlike earlier machine learning models that consumed data passively, agentic AI workflows actively read, write, and modify unstructured content across document repositories, email archives, regulatory knowledge bases, and compliance records. They do so continuously, often autonomously, and at a speed and scale that no manual oversight process can match.

The implications for data risk are of a different order of magnitude. When an autonomous AI agent modifies a compliance knowledge base, incorporates a corrupted external feed into a credit decision, or acts on an outdated regulatory document, no conventional monitoring system will raise an alarm, because conventional monitoring systems are not watching that data. The risk is live, not theoretical, and it is growing with every new AI deployment.

The SR 11-7 model risk management guidance from the US Federal Reserve establishes that all model inputs must be subject to robust validation and ongoing monitoring. As unstructured data inputs become central to model performance, meeting that obligation requires a level of data visibility that most institutions do not yet have.

3. RAG deployments have created an invisible dependency

The proliferation of large language model deployments inside financial institutions, most commonly using Retrieval-Augmented Generation (RAG) to query internal document stores, has created a new class of data dependency that sits almost entirely outside conventional data management frameworks.

RAG systems work by retrieving relevant documents from a knowledge base and using them to inform model responses. The quality and integrity of that knowledge base directly determines the reliability of model outputs. Research has documented that corrupted, outdated, or deliberately manipulated documents in a RAG knowledge base can propagate directly into model outputs, a risk that is difficult to detect without active monitoring of the underlying content.
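The retrieval step described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular RAG product works: word overlap stands in for an embedding model, no language model is called, and the knowledge base entries, the `last_reviewed` field, and the cut-off date are all invented for illustration.

```python
from datetime import date

# Hypothetical knowledge base: texts and review dates are invented.
KNOWLEDGE_BASE = [
    {"id": "policy-v1", "last_reviewed": date(2019, 3, 1),
     "text": "Large cash transactions must be reported within 10 days."},
    {"id": "policy-v2", "last_reviewed": date(2024, 6, 1),
     "text": "Large cash transactions must be reported within 3 days."},
    {"id": "hr-guide", "last_reviewed": date(2024, 1, 15),
     "text": "Annual leave requests are approved by line managers."},
]


def tokens(text):
    """Lower-cased word set; a crude stand-in for an embedding model."""
    return {word.strip(".,?").lower() for word in text.split()}


def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question and keep the top_k."""
    q = tokens(question)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d["text"])), reverse=True)
    return ranked[:top_k]


def build_prompt(question, documents):
    """Assemble the context that would be handed to a language model."""
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in documents)
    return f"Context:\n{context}\n\nQuestion: {question}"


question = "When must large cash transactions be reported?"

# With no content check, the superseded policy is retrieved next to the current one.
unfiltered = retrieve(question, KNOWLEDGE_BASE)

# A simple freshness gate; the cut-off date is an arbitrary assumption.
cutoff = date(2023, 1, 1)
fresh = [d for d in KNOWLEDGE_BASE if d["last_reviewed"] >= cutoff]
filtered = retrieve(question, fresh, top_k=1)

print(build_prompt(question, unfiltered))
print(build_prompt(question, filtered))
```

In the unfiltered prompt the model is handed two contradictory reporting deadlines with nothing to tell them apart. The freshness gate only works because this toy knowledge base carries a review date, which is itself a form of monitoring metadata that many document stores lack.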

Emerging architectural approaches such as GraphRAG, which builds structured knowledge graphs over unstructured content to enable more relationship-aware retrieval, represent a more robust direction of travel. But GraphRAG itself depends on the quality and completeness of the content it indexes. Without active visibility into that content, the problem simply moves upstream.

Institutions are building consequential business processes, from regulatory Q&A tools to client advisory platforms and compliance monitoring systems, on data foundations they cannot clearly see.

The consequences of invisibility

You cannot trace what you cannot see

Good data management requires the ability to trace data from its origin through every transformation and use: to know where it came from, how it has been used, and what decisions it has influenced. For structured data in a well-managed warehouse, this is achievable. For unstructured data, it is rarely attempted.

Can your organisation trace how a customer complaint email influenced a risk assessment model? How a regulatory document update propagated through a compliance knowledge base? How an AI agent's recent modifications to an internal knowledge base affected the outputs it subsequently generated?

For most institutions, the answer is no. The IBM and IDC estimates cited earlier, that roughly 80% of organisational data is dark and that up to 90% of unstructured data goes entirely unanalysed, describe exactly this condition. The data is there. The ability to see it clearly is not.
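For a sense of what tracing would involve once lineage is actually recorded, the sketch below walks a small lineage graph to list everything downstream of a given document. The graph, the artefact names, and the `downstream` helper are hypothetical; the difficult part in practice is capturing these edges for unstructured content in the first place, which this sketch simply assumes has been done.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the artefacts listed against it.
LINEAGE = {
    "complaint-email-4411": ["complaint-summary-2025Q1"],
    "complaint-summary-2025Q1": ["conduct-risk-model-v3"],
    "reg-update-17": ["compliance-kb"],
    "compliance-kb": ["regulatory-qa-assistant", "conduct-risk-model-v3"],
}


def downstream(start, edges):
    """Breadth-first walk returning every artefact influenced by `start`."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(downstream("complaint-email-4411", LINEAGE))
print(downstream("reg-update-17", LINEAGE))
```

With edges like these in place, the questions above become simple graph queries. Without them, there is nothing to query.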

You cannot monitor quality you cannot reach

Effective data management requires the ability to detect when data quality degrades, that is, when something has changed in a way that matters. For structured data, this is well-established: automated alerts fire when a negative account balance appears, a date format is invalid, or a required field is missing.
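The three structured-data rules just mentioned are simple to express in code, which is part of why they are so widely automated. The following is a minimal sketch, assuming a hypothetical account record with `account_id`, `balance`, and `opened_on` fields and an ISO date convention; real rule sets are larger and usually live in a data quality tool rather than a script.

```python
from datetime import datetime

REQUIRED_FIELDS = ("account_id", "balance", "opened_on")  # assumed schema


def check_record(record):
    """Return a list of rule violations for one account record."""
    issues = []
    # Rule 1: required fields must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing field: {field}")
    # Rule 2: balances must not be negative (an assumed rule for this account type).
    balance = record.get("balance")
    if isinstance(balance, (int, float)) and balance < 0:
        issues.append("negative balance")
    # Rule 3: dates must follow the ISO format YYYY-MM-DD.
    opened_on = record.get("opened_on")
    if opened_on:
        try:
            datetime.strptime(opened_on, "%Y-%m-%d")
        except ValueError:
            issues.append("invalid date format")
    return issues


# Illustrative records only.
records = [
    {"account_id": "A1", "balance": 120.5, "opened_on": "2021-04-30"},
    {"account_id": "A2", "balance": -40.0, "opened_on": "30/04/2021"},
    {"account_id": "A3", "balance": 10.0},
]

for record in records:
    print(record["account_id"], check_record(record))
```

Each rule depends on a named, typed field. That is the point of entry the unstructured estate does not offer.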

For unstructured data, that capability is largely absent. Most organisations have no reliable means of detecting when:

- Critical regulatory documents are outdated or have been incorrectly modified
- Email and communications archives contain sensitive data that should be classified or restricted
- AI training datasets or RAG knowledge bases contain biased, conflicting, or inappropriate content
- Document repositories hold duplicate or contradictory information that is influencing downstream decisions
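Some of the conditions listed above can at least be approximated with basic checks. The sketch below, offered as an illustration rather than a recommended design, flags documents that have not been modified within an assumed age threshold, documents containing an email address, and documents whose normalised text is identical. The document records, the 365-day threshold, and the `scan` helper are hypothetical, and nothing here attempts the harder cases of bias or semantic contradiction.

```python
import hashlib
import re
from collections import defaultdict
from datetime import date

# Deliberately simple pattern; real sensitive-data detection needs far more than this.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def normalise(text):
    """Lower-case and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def scan(documents, today, max_age_days=365):
    """Return (document id or ids, finding) pairs for three basic checks."""
    findings = []
    by_hash = defaultdict(list)
    for doc in documents:
        age = (today - doc["modified"]).days
        if age > max_age_days:
            findings.append((doc["id"], f"stale: last modified {age} days ago"))
        if EMAIL_PATTERN.search(doc["text"]):
            findings.append((doc["id"], "contains an email address"))
        digest = hashlib.sha256(normalise(doc["text"]).encode("utf-8")).hexdigest()
        by_hash[digest].append(doc["id"])
    for ids in by_hash.values():
        if len(ids) > 1:
            findings.append((tuple(ids), "duplicate content"))
    return findings


# Illustrative documents only.
documents = [
    {"id": "reg-1", "modified": date(2022, 1, 10),
     "text": "Capital guidance, 2021 edition."},
    {"id": "reg-2", "modified": date(2025, 2, 1),
     "text": "Capital  guidance, 2021 EDITION."},
    {"id": "mail-9", "modified": date(2025, 3, 3),
     "text": "Please call j.doe@example.com about the account."},
]

findings = scan(documents, today=date(2025, 6, 1))
for finding in findings:
    print(finding)
```

Even this crude scan presupposes that someone has enumerated the documents and can read their text and modification dates, which is the capability most institutions lack across the unstructured estate.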

Research on AI reliability in financial applications confirms that the quality of source documents is a primary driver of output failure, and that unmonitored, degraded, or poisoned source content is a leading cause of model hallucination and error.

You cannot govern access you cannot track

Organisations can log and audit every query against a data warehouse. The equivalent capability for unstructured data repositories is frequently absent. Who is accessing these repositories? What are they extracting? How is that information being used, and by which automated systems?

This gap carries direct consequences under GDPR, CCPA, and emerging AI governance frameworks. It also means that governance and oversight functions are working without a complete picture.
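As a rough illustration of what the equivalent of warehouse query logging could look like for documents, the sketch below wraps a document store so that every read records who asked, for what, and why. The `AuditedRepository` class, the actor names, and the `purpose` field are hypothetical; a production system would write to tamper-evident storage rather than an in-memory list.

```python
import json
from datetime import datetime, timezone


class AuditedRepository:
    """Wraps a dict-like document store and records every read attempt."""

    def __init__(self, documents):
        self._documents = documents
        self.audit_log = []

    def read(self, doc_id, actor, purpose):
        """Return the document (or None) and log the access either way."""
        self.audit_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "actor": actor,        # a person or an automated system
            "doc_id": doc_id,
            "purpose": purpose,
            "found": doc_id in self._documents,
        })
        return self._documents.get(doc_id)

    def reads_by_actor(self):
        """Count logged read attempts per actor."""
        counts = {}
        for entry in self.audit_log:
            counts[entry["actor"]] = counts.get(entry["actor"], 0) + 1
        return counts


# Illustrative usage: a human analyst and a hypothetical AI agent share one store.
repo = AuditedRepository({"contract-7": "Facility agreement text"})
repo.read("contract-7", actor="analyst.jane", purpose="credit review")
repo.read("contract-7", actor="rag-agent-01", purpose="answering client query")
repo.read("missing-doc", actor="rag-agent-01", purpose="answering client query")

print(json.dumps(repo.reads_by_actor(), indent=2))
```

Recording the actor and purpose for automated readers as well as people is the design choice that matters here, since the questions above ask specifically which automated systems are consuming the content.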

Data that should be an asset is becoming a liability

The unstructured data estate is not neutral. Left unmanaged, it does not simply sit quietly; it actively generates risk. Regulatory exposure accumulates in ungoverned archives. AI systems make consequential decisions based on content that has never been validated. Personal data that should have been deleted remains in repositories that nobody is monitoring. Contradictory information shapes decisions without anyone knowing it is contradictory.

This is the core of the problem. Data that carries genuine value, including insights locked in contracts, intelligence embedded in communications, and knowledge encoded in regulatory filings, cannot be extracted or trusted without visibility. And data that carries genuine risk, including outdated documents, biased training data, and improperly retained personal information, cannot be managed without visibility either.

Without the ability to discover, monitor, and understand the full data estate, the distinction between data as asset and data as liability becomes impossible to manage. Both conditions exist simultaneously, invisibly, in the same repositories, and organisations have limited means of telling them apart.

What comes next

This is the shape of the problem. In Part 2 of this series, I explore what a credible response looks like: why conventional tools are architecturally unsuited to closing this gap, what a strategic framework for building genuine data visibility across the full estate looks like, and what the evidence says about the business case for acting now.

The central argument of Part 2 is straightforward. The ability to discover and monitor data across all formats, what practitioners call Data and Information Observability, is not a specialist technical discipline. It is the foundation on which every other data management capability depends. You cannot effectively track or govern data you cannot see, nor can you demonstrate compliance for it or extract value from it.

David Houghton is a Senior Consultant at BearingNode, a specialist data, analytics, risk and AI consultancy. BearingNode helps financial institutions build the data visibility and control capabilities they need to manage risk, meet regulatory obligations, and realise value from their full data estate.

References

[1] IDC, Global DataSphere Forecast

[2] Gartner, Market Research on Augmented Data Management

[4] IBM Research, The Problem of Dark Data

[5] European Parliament, EU Artificial Intelligence Act (Regulation 2024/1689)

[6] Yaron et al., RAG Security: Risks and Mitigation Strategies, Lasso Security, 2024

[7] MDPI Mathematics, Hallucination Mitigation for Retrieval-Augmented Large Language Models, 2025

[8] Industry surveys on BCBS 239 compliance rates (via DataBahn)

[9] Basel Committee on Banking Supervision, Progress in adopting the Principles for effective risk data aggregation, 2023

[10] European Parliament, GDPR, Regulation 2016/679

[11] APRA, Prudential Standard CPS 230, effective 1 July 2025

[14] US Federal Reserve / OCC, SR 11-7, 2011

[15] Edge, D. et al., From Local to Global: A Graph RAG Approach to Query-Focused Summarization, Microsoft Research, arXiv:2404.16130, 2024

[16] Goldstein, P. et al., Comprehensive Review of AI Hallucinations, Preprints.org, 2025

[17] California Consumer Privacy Act (CCPA)