The Dark Data You Can't See. Part 2: How Visibility Turns Data Risk Into Data Value

David Houghton
May 6

Part 2 of 2 — The Dark Data You Can't See | BearingNode

David Houghton | Senior Consultant, BearingNode APAC

In Part 1 of this series, I set out the scale and nature of an underappreciated data risk: approximately 80% of enterprise data, including the emails, contracts, documents, knowledge bases, and AI-generated content that sit outside formal structured data environments, is effectively invisible to most organisations' data management functions. Left unmanaged, this invisible data does not simply sit quietly. It generates regulatory exposure, compromises AI reliability, and turns what should be a data asset into an unmanaged liability.

In this second article, I want to turn to the response. Specifically: why the tools most organisations already have cannot simply be extended to solve this problem, what a strategic framework for building genuine data visibility across the full estate looks like, and what the evidence says about the business case for acting now.

Why conventional tools cannot simply be extended

Before discussing what good looks like, it is worth being clear about why the visibility gap has persisted. This is not primarily a question of organisational will. Most data leaders understand intuitively that unstructured data carries risk. The challenge is that the tools built to manage structured data are architecturally unsuited to the unstructured estate, and extending them is not straightforward.

Built for a different world

Most data management and monitoring platforms are built around the assumption of structured, schema-defined data. They excel at monitoring table structures, column relationships, and data type consistency. Unstructured data does not fit these paradigms. It is schema-less, format-diverse, and semantically complex. The question these platforms ask, whether data conforms to its expected structure, is simply not applicable to a contract, an email thread, or an AI-generated document summary.

Extending these tools to cover unstructured content is not a configuration exercise. It requires a different approach to data monitoring, one that can understand content, not just structure.

The missing metadata layer

Effective data management depends on metadata: the information about data that allows it to be understood, traced, and governed. For structured data, metadata is relatively straightforward: technical metadata such as table names, column definitions, and data types, and operational metadata such as job run histories.

For unstructured data, a third dimension of metadata is needed: semantic metadata, specifically what a document contains, who created it, how it is being used, what its quality characteristics are, and whether it has changed in ways that matter. Without this layer, even sophisticated modern retrieval architectures such as RAG and GraphRAG are operating without a quality foundation.

As Microsoft's GraphRAG research demonstrates, knowledge-graph-based retrieval significantly improves on standard RAG for complex, relationship-dependent queries — but neither approach resolves the fundamental dependency on content quality. Visibility into the underlying content is a prerequisite for any retrieval architecture to be trusted.
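To make the three metadata layers concrete, here is a minimal sketch of what a catalogue record for an unstructured asset might look like. The class names, field names, and the `is_retrieval_ready` rule are all hypothetical choices for illustration, not part of any particular product or of the D/I O11y framework itself.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class TechnicalMetadata:
    """Where the asset lives and what form it takes."""
    location: str
    file_format: str
    size_bytes: int


@dataclass
class OperationalMetadata:
    """How the asset is handled over time."""
    owner: str
    last_modified: datetime
    access_count_30d: int = 0


@dataclass
class SemanticMetadata:
    """What the asset actually contains: the layer structured tools lack."""
    document_type: str
    topics: list[str] = field(default_factory=list)
    contains_personal_data: bool = False


@dataclass
class CatalogueEntry:
    asset_id: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    semantic: Optional[SemanticMetadata] = None  # None until content is analysed

    def is_retrieval_ready(self) -> bool:
        """Hypothetical rule: only semantically described assets may feed RAG."""
        return self.semantic is not None
```

The point of the optional `semantic` field is that an asset can be registered long before anyone knows what it contains. A retrieval pipeline that checks for it has at least a basic quality gate.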

A scalability challenge

Monitoring unstructured data quality requires content analysis, natural language processing, and pattern recognition: computationally intensive operations that do not scale within architectures designed for structured data. This is a solvable problem, but it demands deliberate architectural investment, not an extension of existing tooling.

Building data visibility across the full estate: a strategic framework

Closing the visibility gap requires progress across five dimensions simultaneously. Together, these form the foundation of what BearingNode calls Data and Information Observability (D/I O11y): a framework for developing the capability to discover, monitor, and manage data assets across their full range of formats, structures, and uses.

For organisations that are new to this concept, the framing is straightforward. D/I O11y is the organisational capability to see your data clearly: to know what you hold, whether it is reliable, who is using it, whether it is being used appropriately, and what value or risk it carries. It is not a technology product or a one-time project. It is a set of connected capabilities that together give institutions the visibility they need to manage data as both an asset and a risk.

Those five capabilities are Value, Discover, Track, Comply, and Govern. The structured-unstructured visibility gap undermines all five. The framework for closing that gap addresses each in turn.

1. Build the ability to find and catalogue everything: Discover

The first and most fundamental capability is the ability to identify and catalogue all data assets across the organisation. Not just structured databases and data warehouses, but email archives, document repositories, contract management systems, AI knowledge bases, communications records, and every other repository in which consequential data resides.

This Discover capability establishes the baseline: a comprehensive, continuously updated inventory of what data the organisation holds, where it lives, who owns it, and what it contains. Without this foundation, every other data management activity is working from an incomplete picture. You cannot monitor the quality of data you have not found. You cannot govern access to a repository you do not know exists. You cannot trace the lineage of a document that is not in your catalogue.

Extending discovery to the full unstructured estate means developing the ability to scan and classify content, not just register its existence, so that semantic metadata can be captured alongside technical and operational metadata.
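As a rough illustration of the difference between registering a file and classifying it, the sketch below walks a folder and records both technical facts (path, size, hash) and a crude semantic label. The keyword rules and the `discover` and `classify` names are assumptions made for this example. A real implementation would handle binary formats and use far richer classification.

```python
import hashlib
from pathlib import Path

# Hypothetical keyword rules; real classification would use richer models.
CLASSIFICATION_RULES = {
    "contract": ("agreement", "hereinafter", "governing law"),
    "policy": ("policy", "must not", "retention"),
}


def classify(text: str) -> str:
    """Return the first label whose keywords appear in the text."""
    lowered = text.lower()
    for label, keywords in CLASSIFICATION_RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return "unclassified"


def discover(root: Path) -> list[dict]:
    """Walk a repository and build one inventory record per file."""
    inventory = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        # Assumes text-like content; undecodable bytes are skipped.
        text = raw.decode("utf-8", errors="ignore")
        inventory.append({
            "path": path.relative_to(root).as_posix(),
            "size_bytes": len(raw),
            "sha256": hashlib.sha256(raw).hexdigest(),
            "document_type": classify(text),
        })
    return inventory
```

Storing the content hash alongside the label means later scans can tell whether a document has changed without re-reading its full history.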

2. Monitor continuously for changes that matter: Track

Once the full data estate is catalogued, the next capability is continuous monitoring: the ability to detect, in near real time, when something has changed in a way that matters.

The Track capability for structured data is well-established. Automated alerts fire when data violates expected rules, volumes deviate from normal patterns, or pipeline processes fail. For unstructured data, an equivalent monitoring capability needs to be built, one that detects when:

Regulatory documents are modified or become outdated
AI knowledge bases are updated with content that may be biased or contradictory
Sensitive data appears in repositories where it should not be
Access patterns suggest inappropriate use

This matters particularly for institutions running RAG-based AI systems. The SR 11-7 model risk management guidance establishes that all model inputs must be subject to robust validation and ongoing monitoring. Continuous monitoring of the unstructured content feeding AI systems is, in effect, a model risk management requirement — not an optional enhancement.
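Two of the checks listed above can be sketched in a few lines: spotting modified documents by comparing content hashes between scans, and flagging sensitive data that turns up outside approved locations. Everything here is an assumption for illustration, including the function names, the idea of approved path prefixes, and the email regex, which stands in for whatever an organisation actually defines as sensitive.

```python
import re

# Illustrative pattern only: matches simple email addresses as a stand-in
# for whatever the organisation defines as sensitive data.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {path: content_hash} snapshots and report what changed."""
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "modified": sorted(
            path for path in current.keys() & previous.keys()
            if current[path] != previous[path]
        ),
    }


def find_sensitive(documents: dict[str, str], allowed_prefixes: tuple[str, ...]) -> list[str]:
    """Return paths outside approved locations whose text matches the pattern."""
    return sorted(
        path for path, text in documents.items()
        if not path.startswith(allowed_prefixes) and EMAIL_PATTERN.search(text)
    )
```

Run on a schedule against the knowledge base feeding a RAG system, the `modified` list becomes the trigger for revalidating content before the model sees it.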

3. Turn visibility into compliance confidence: Comply

Visibility is only valuable if it can be translated into evidence. The Comply capability is the bridge between operational data monitoring and regulatory reporting: the ability to aggregate what Track is observing across the full estate and convert it into compliance insights, audit evidence, and risk assessments that satisfy regulators.

For most institutions today, compliance reporting for unstructured data is largely manual, incomplete, and reactive, assembled under pressure when an examination is imminent rather than maintained continuously. Extending the Comply capability to cover unstructured data means that when a regulator asks whether personal data has been properly managed, whether AI model inputs have been validated, or whether document retention policies have been followed, the answer is supported by continuous, auditable evidence rather than a retrospective reconstruction.

The EU AI Act's requirements on data provenance and audit logging for high-risk AI systems make this not just good practice but a direct regulatory obligation for any institution deploying AI in credit, fraud, or KYC decisions.
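One common way to make monitoring output auditable is a hash-chained log, in which each record embeds the hash of the one before it so that later edits are detectable. This is a general technique offered as an illustration, not something the regulation prescribes or the D/I O11y framework mandates. The function names and event fields are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash preceding the first record


def _digest(previous_hash: str, event: dict) -> str:
    """Hash the previous record's hash together with this event's content."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((previous_hash + payload).encode()).hexdigest()


def append_evidence(log: list[dict], event: dict) -> dict:
    """Append an event, chaining it to the previous record's hash."""
    previous_hash = log[-1]["hash"] if log else GENESIS_HASH
    record = {
        "event": event,
        "previous_hash": previous_hash,
        "hash": _digest(previous_hash, event),
    }
    log.append(record)
    return record


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; an edited or reordered record breaks the chain."""
    previous_hash = GENESIS_HASH
    for record in log:
        if record["previous_hash"] != previous_hash:
            return False
        if record["hash"] != _digest(previous_hash, record["event"]):
            return False
        previous_hash = record["hash"]
    return True
```

When an examiner asks for evidence, the log can be handed over together with a verification result, rather than reconstructed after the fact.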

4. Measure whether governance is actually working: Govern

Governance frameworks, including policies, controls, stewardship responsibilities, and retention schedules, are only as effective as the organisation's ability to verify that they are being followed. The Govern capability is the ability to observe and measure governance effectiveness in practice, not just on paper.

Without visibility into the unstructured estate, governance assessments are necessarily partial. An institution may have policies governing document retention, data classification, and access controls that are rigorously applied to structured systems and simply unenforced for unstructured repositories, not through negligence, but because the monitoring capability to verify compliance does not exist.

The goal is cross-format governance: applying unified stewardship responsibilities, consistent classification standards, integrated access controls, and coordinated retention policies across both structured and unstructured assets. The Govern capability measures whether that framework is working, identifies where it is not, and provides the evidence base for continuous improvement.
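Measuring governance in practice can start with something as simple as applying one retention rule set to every catalogued asset, whatever its format, and reporting the share that complies. In the sketch below the retention periods, field names, and the decision to treat an unclassified asset as a breach are illustrative assumptions, not recommended values.

```python
from datetime import date

# Hypothetical retention limits in days, keyed by classification.
RETENTION_DAYS = {"contract": 7 * 365, "email": 2 * 365}


def retention_breaches(assets: list[dict], today: date) -> list[str]:
    """Return IDs of assets held longer than their classification allows."""
    breaches = []
    for asset in assets:
        limit = RETENTION_DAYS.get(asset["classification"])
        if limit is None:
            # No policy for this classification is itself a governance gap.
            breaches.append(asset["id"])
        elif (today - asset["created"]).days > limit:
            breaches.append(asset["id"])
    return breaches


def compliance_rate(assets: list[dict], today: date) -> float:
    """Share of assets that satisfy the retention policy (1.0 if none)."""
    if not assets:
        return 1.0
    return 1 - len(retention_breaches(assets, today)) / len(assets)
```

Because the same function runs over database tables and document stores alike, a falling compliance rate points directly at the repositories where policy exists on paper but is not enforced.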

5. Know what your data is worth and what it costs: Value

The final capability is arguably the most strategically significant: the ability to understand data as both an asset and a liability, and to manage it accordingly.

The Value capability is the ability to observe how effectively the organisation derives benefit from its data assets, and to identify where data is generating cost or risk rather than value. For unstructured data, this means understanding:

Which document repositories are actively used and trusted
Which AI knowledge bases are delivering reliable outputs
Which data assets carry regulatory or reputational risk that outweighs their utility
Where investment in better data management would generate a measurable return

This capability transforms the conversation from "how do we manage our data better?" to "how do we make better decisions about data as a strategic resource?", treating data explicitly and systematically as something that can be an asset or a liability depending on how well it is managed.

The business case for acting now

The evidence on the return from investing in data visibility is strong. An independent Total Economic Impact study by Forrester Consulting found that organisations implementing data and AI observability capabilities achieve a 357% return on investment over three years, with a payback period of under six months, and avoid an estimated $1.5 million annually in losses attributable to data quality failures and AI reliability incidents.
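For readers who want to sanity-check what a figure like 357% implies, the arithmetic is short. The sketch assumes the conventional definition of ROI as net benefits divided by costs. The study itself may apply present-value adjustments, and the cost figure used below is a placeholder, not a number from the report.

```python
def roi(total_benefits: float, total_costs: float) -> float:
    """Return ROI as a fraction: net benefits divided by costs."""
    if total_costs <= 0:
        raise ValueError("total_costs must be positive")
    return (total_benefits - total_costs) / total_costs


def benefits_for_roi(target_roi: float, total_costs: float) -> float:
    """Invert the formula: total benefits needed to reach a given ROI."""
    return total_costs * (1 + target_roi)


# Placeholder cost of 1.0 unit; a 357% ROI is expressed as 3.57.
implied_benefits = benefits_for_roi(3.57, 1.0)
print(f"Benefits per unit of cost: {implied_benefits:.2f}")
```

On that definition, a 357% return means each unit of cost is associated with roughly 4.57 units of benefit over the period measured.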

Beyond the direct financial return, institutions that build genuine visibility across their full data estate gain advantages that compound over time:

Risk management: Complete visibility enables more accurate risk assessment, faster detection of data quality issues, and more confident responses to regulatory enquiries.

AI programme integrity: A well-governed, actively monitored unstructured estate is the foundation on which reliable RAG systems, agentic AI workflows, and defensible model risk management can be built, not an optional enhancement to them.

Regulatory confidence: Continuous, auditable evidence of data lineage, quality, and access governance supports regulatory examinations and demonstrates the kind of governance maturity that regulators increasingly expect.

Strategic clarity: Institutions that can see their full data estate are better placed to make informed decisions about where data is generating value, where it is generating risk, and where investment will have the greatest impact.

From invisible risk to managed asset

Unstructured data already accounts for 80% of the enterprise data estate, is growing three times faster than structured data, and by IBM and IDC's estimates goes largely unanalysed and ungoverned. The gap between what organisations collect and what they can see is not closing. It is widening — driven by the pace of AI adoption, the proliferation of digital communications, and the expanding scope of regulatory obligations.

The solution is not to rebuild data management from scratch. Structured data management capabilities are mature, valuable, and worth protecting. The task is to extend the ability to see and manage data to encompass the full estate: to close, systematically and deliberately, the gap that most organisations have learned to work around.

That requires investment in the right tools, the development of new skills, and in many cases a redesign of governance frameworks to treat structured and unstructured data as what they actually are: a single, integrated estate carrying both value and risk in proportions that can only be understood through clear, comprehensive visibility.

The ability to discover data you cannot currently see is the starting point. From there, the ability to track it, govern it, comply with confidence, and understand its true value all follow. None of those capabilities are fully available to organisations operating with 80% of their data estate in the dark.

The question is not whether to build that visibility. It is how quickly you can do it and what it is costing you while you wait.

Working with BearingNode

BearingNode's Data and Information Observability (D/I O11y) framework provides a structured approach to building data visibility across the full estate. Across five core capabilities (Value, Discover, Track, Comply, and Govern), the framework helps organisations identify where their data is, understand what it contains and what it is worth, monitor it continuously for quality and compliance, and govern it effectively regardless of format.

Whether your organisation is beginning to map its data estate for the first time or looking to extend existing structured data management capabilities into the unstructured world, BearingNode can help you build a programme that delivers genuine, measurable visibility and the ability to manage data risk with confidence.

contact@bearingnode.com

References

[1] IDC, Global DataSphere Forecast

[2] Gartner, Market Research on Augmented Data Management

[4] IBM Research, The Problem of Dark Data

[5] European Parliament, EU Artificial Intelligence Act (Regulation 2024/1689)

[10] European Parliament, GDPR, Regulation 2016/679

[14] US Federal Reserve / OCC, SR 11-7, 2011

[15] Edge, D. et al., From Local to Global: A Graph RAG Approach to Query-Focused Summarization, Microsoft Research, arXiv:2404.16130, 2024

[18] Forrester Consulting, The Total Economic Impact of Monte Carlo's Data and AI Observability Platform, 2024