How is AI data readiness different from classical data governance?

Classical data governance answers 'who owns this data, who is allowed to query it, and is it accurate'. AI data readiness extends those questions with three new ones that classical governance does not cover. What is the licensing posture of the training data your models will consume, and does CC-BY-4.0 versus restricted-license content matter for the use case you intend? What is the provenance of the vectors in your embedding store, and can you reproduce the embedding from the source document under audit? What was the actual prompt context any given model response was built from, and can you replay it later when a regulator asks? The classical data-governance vendors are extending into these surfaces with uneven success; the question worth asking in procurement is which surfaces are real and which are roadmap.

Which data-readiness check usually surfaces the biggest cliff?

Access controls and the least-privilege question. Most enterprises have data that is technically accessible but operationally locked behind permission models built for a pre-AI world — quarterly access-review cycles, role-based access lists that have accumulated cruft for a decade, and ownership chains where the named owner has left the company. The classification and lineage gaps are usually fixable in weeks with focused effort; the access-control gap is the one that takes a quarter to remediate because it crosses into HR and identity-management territory the data team does not own. The strategy that assumes existing access patterns will work for AI workloads is the strategy that discovers, six months in, that the model cannot see the data the use case requires.

Do we need a data catalog tool, or can we run AI readiness on spreadsheets?

For the inventory and classification work at small scale, spreadsheets are sufficient and the catalog procurement is premature. The break-even point is roughly when the data inventory crosses 500 distinct datasets, or when the number of named data owners exceeds 20. Below that scale, the catalog tool is overhead; above it, the manual maintenance cost overtakes the licence cost. The procurement signal that decides which catalog to buy is whether the centre of gravity is discovery and search (Atlan, Select Star, Secoda lead) or governance and compliance (Collibra, Informatica, Alation lead); the archetype split matters more than the vendor choice.

Where does Databricks Unity Catalog or Snowflake Cortex fit against the standalone catalog vendors?

They are the hyperscaler-native answer to the same archetype question the governance-tools piece raises. Unity Catalog is the right primary catalog for Databricks-anchored estates; Snowflake Cortex extends Snowflake's existing governance surface with AI-specific features. Both work well inside their own platform's footprint and less well outside it. The procurement decision mirrors the wider cloud-versus-best-of-breed conversation — above 80% of AI workload inside a single data platform, the native catalog is the default; below 50%, a standalone catalog overlay produces better multi-platform coverage at higher complexity. The hyperscaler suites also have the same lock-in posture the cloud governance suites have.

What is the procurement category most teams underbuy in data readiness for AI?

Embedding-store provenance and prompt-context auditability. The classical data catalogs are still extending into the model-training and retrieval-augmented-generation surfaces, and the coverage is uneven. Enterprises building agentic systems on top of vector databases (Pinecone, Weaviate, Chroma, pgvector) discover at audit time that they have no way to reconstruct which source document produced which embedding, or which retrieved chunks formed the context for a model response that a customer or regulator is asking about. The procurement category that handles this is still emerging — it overlaps with LLM observability — and most teams do not budget for it until the audit question lands. The cost of producing the audit trail retroactively is materially larger than the cost of building it in from the start.

AI Data Readiness: The Data Layer That Pre-Dates Your Strategy Approval

Tom Prommer · CIO/CTO · Updated 2026-05-30 · 14 min read

Executive summary

The four data-readiness checks no AI strategy survives without — inventory, classification, lineage, access controls — extended with the AI-specific surfaces classical data governance does not cover (training-data licensing, embedding-store provenance, prompt-context auditability), and the named-vendor read on Alation, Atlan, Collibra, Informatica, Databricks Unity Catalog, Snowflake Cortex, Select Star, Secoda, and Cloudera against an honest practitioner rubric.

The audit finding I want to describe arrived in a Stuttgart insurer’s compliance review last September. The team had shipped an internal retrieval-augmented-generation assistant that summarised policy documents for underwriters, sat on a Databricks-anchored data estate, and had been in production for nine months. The internal audit asked a question the team had not prepared for: for a specific summary the assistant had produced six weeks earlier, which source documents had been retrieved, which chunks of those documents had formed the prompt context, and which embedding model version had been in use at the time. The team had logs of the model’s responses. They had no record of the retrieval step, the chunk selection, or the embedding version. The audit recommendation was unambiguous: until the prompt-context audit trail exists, the system cannot continue operating against regulated-document workflows. The team had three weeks. The retroactive build of the audit infrastructure took five. The assistant was offline for six weeks and the underwriting workflow reverted to manual document review — the hidden tax that effectively wiped out the ROI the project had been greenlit to deliver.

This is what AI data readiness looks like in 2026, and why the four-page enterprise AI readiness assessment on this site lists data accessibility as the first page rather than the fourth. The strategy that survives is the one whose data layer was ready before the use case shipped. The four classical data-readiness checks — inventory, classification and sensitivity tiering, lineage and provenance, access controls — are the floor, and the AI-specific surfaces (training-data licensing, embedding-store provenance, prompt-context auditability) are the extension that classical data governance does not cover. What follows is the deeper read on the data-layer specifically, with the named-vendor scorings against Alation, Atlan, Collibra, Informatica, Databricks Unity Catalog, Snowflake Cortex, Select Star, Secoda, and Cloudera, and the honest practitioner rubric the procurement teams I work with use.

This page extends rather than duplicates the readiness assessment piece, which covers the four-page diagnostic at the strategy level. This is the data-layer deep-dive that piece points at. Both pages assume the root hub’s four-question diagnostic has been answered; both treat the data layer as a prerequisite the strategy does not get to assume.

The four classical readiness checks, taken seriously

Every credible data-readiness assessment for AI work answers four questions, and the order matters because each question’s answer constrains the next. Treating them as independent — which the generic consultancy version of this work usually does — produces a maturity Christmas tree. Treating them as sequenced produces a project plan.

Inventory. What datasets exist, where they live, who owns them, and what they contain. This is the artefact most enterprises do not have at the level of fidelity AI workloads require. A typical enterprise discovers, when forced to produce a real inventory, that the formal catalog (if one exists) covers between 40% and 70% of the actual data footprint. The remainder lives in operational stores nobody catalogued, in shadow analytics environments the central data team does not see, and in SaaS-tool data exports that bypass the warehouse entirely. The inventory work is dull, takes two to four weeks of focused effort, and is the single highest-leverage activity in any AI data-readiness exercise because every subsequent question assumes the inventory exists.

The honest test of whether your inventory is good enough: pick a use case the AI strategy is likely to require — an underwriting assistant, a customer-service summariser, a sales-research agent — and ask which datasets the use case would need. If the team cannot answer that question in one meeting with confidence in the completeness of the answer, the inventory is not good enough yet, and the strategy is being approved against an inventory that does not exist.

Classification and sensitivity tiering. Every dataset on the inventory gets a sensitivity tier. The classification matters because the AI workload’s access pattern is different from the classical analytics access pattern — a model that ingests training data inherits the sensitivity posture of every record it saw, a retrieval-augmented application that pulls live documents into the prompt context exposes the sensitivity of those documents to whatever surface the response reaches, and an agentic system with tool-calling permissions inherits the sensitivity exposure of every system it can query. Classifying after the AI workload has shipped is the pattern that produces the audit finding the Stuttgart insurer above was working through.

The classification scheme that works in 2026 has at least three tiers (public, internal, restricted), with the restricted tier sub-tiered for PII, financial data, regulated industry data (HIPAA, PCI, etc.), and high-confidentiality strategic data. The mapping to the EU AI Act risk tiers (minimal, limited, high, unacceptable) is downstream of this classification; the Act’s risk-tier work is sharper when the data layer has been classified first.
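The tiering above can be sketched as a small data model. This is a minimal illustration rather than a reference schema: the names `Tier`, `RestrictedKind`, `Classification`, and `workload_exposure` are hypothetical, and the rule that a workload inherits the most sensitive tier it can read is a simplification of the inheritance argument made above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


class RestrictedKind(Enum):
    PII = "pii"
    FINANCIAL = "financial"
    REGULATED = "regulated"    # e.g. HIPAA, PCI
    STRATEGIC = "strategic"    # high-confidentiality strategic data


@dataclass(frozen=True)
class Classification:
    tier: Tier
    restricted_kind: Optional[RestrictedKind] = None

    def __post_init__(self):
        # A sub-tier is required for restricted data and meaningless otherwise.
        if (self.tier is Tier.RESTRICTED) != (self.restricted_kind is not None):
            raise ValueError("restricted_kind must be set if and only if tier is RESTRICTED")


def workload_exposure(sources):
    """Return the most sensitive tier among the datasets a workload can read."""
    return max((c.tier for c in sources), key=lambda t: t.value)
```

A retrieval application that can read one public dataset and one restricted PII dataset is, under this sketch, a restricted workload regardless of how rarely the PII dataset is retrieved.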

Lineage and provenance. For every dataset on the inventory, where did the data come from, what transformations has it been through, and what is the chain back to the originating source? Classical lineage tooling handles the warehouse-to-mart-to-report chain reasonably well; the AI-specific extension is the chain from source through transformation through model training or retrieval, which most classical tools cover incompletely. The procurement signal that distinguishes mature data-catalog vendors from immature ones in 2026 is the breadth and depth of the lineage coverage into the model-training and retrieval surfaces.

The honest test for lineage: pick a model response or a model prediction your AI system produced in the last month. Can you trace it back through the retrieval or training data to the originating source documents, including the transformations applied along the way? If the answer is no, the lineage gap is real, and the audit response to a regulator asking the same question will be unsatisfactory.
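The lineage test can be made concrete as a graph walk. The sketch below assumes lineage is available as a simple mapping from each artefact to the upstream artefacts it was derived from; the `LINEAGE` contents and all identifiers in it are invented for illustration, and real catalogs expose lineage through their own APIs.

```python
# Hypothetical lineage graph: each node maps to the upstream nodes it was derived from.
LINEAGE = {
    "response:r-881": ["chunk:policy-12#3", "chunk:policy-40#1"],
    "chunk:policy-12#3": ["doc:policy-12.pdf"],
    "chunk:policy-40#1": ["doc:policy-40.pdf"],
    "doc:policy-12.pdf": [],
    # "doc:policy-40.pdf" is deliberately absent to show a lineage gap.
}


def trace_to_sources(node, graph):
    """Walk upstream from `node`.

    Returns (originating sources, nodes that have no lineage record at all).
    """
    sources, gaps, seen, stack = set(), set(), set(), [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current not in graph:
            gaps.add(current)          # nothing recorded: the audit answer stops here
        elif not graph[current]:
            sources.add(current)       # recorded with no upstream: an originating source
        else:
            stack.extend(graph[current])
    return sources, gaps
```

A non-empty set of gaps is the signal the test is looking for: the response can be traced part of the way, but not to every originating document.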

Access controls. Who has technical authority to query which datasets, and what is the procurement lead time to add a new system (a model training environment, an embedding pipeline, an inference endpoint) to the access list? This is the check that surfaces the biggest cliff in most readiness exercises, because access controls have accumulated cruft for a decade in most enterprises and the cleanup is the work the data team does not own: it crosses into HR, identity management, and information security territory.

The honest test for access controls: pick the use case the strategy is most likely to require, identify the team that would build the AI workload, and ask whether they have technical access to the data the workload would need today. If the answer is 'we would need to start an access-review cycle that completes next quarter', the access-control gap is the cliff the strategy timeline will hit first, and the remediation work is a separately-budgeted workstream rather than a footnote to the data-readiness assessment.

The three AI-specific extensions

The four classical checks above are the floor. The three AI-specific extensions are the surfaces classical data governance does not cover, and the ones that have produced most of the AI-data audit findings I have seen through 2024–2026.

Training-data licensing. What is the legal posture of every dataset that flows into model training or fine-tuning? Open-source datasets carry licenses (CC-BY-4.0, CC0, MIT, Apache 2.0, restrictive non-commercial licenses) or, in the case of web-scraped content, no disclosed license at all; enterprise datasets carry contractual restrictions (the customer-data clause that says 'this data may be used to provide the service, not to train models that benefit other customers'); third-party API data carries terms-of-service clauses (most LLM API providers now exclude training on customer data by default, but the corollary obligation flows back to the enterprise buying the API). Most enterprises building AI applications in 2026 have not produced a training-data licensing register, and the audit risk is real and growing as the courts work through the model-training copyright cases.

The procurement signal that the licensing register is needed: the enterprise either trains or fine-tunes its own models, or uses third-party model providers under terms that flow data-handling obligations back to the enterprise. The mitigation is a licensing field in the dataset inventory that records the license, the source, and any contractual restrictions, and a deployment-gate check that the training pipeline never consumes datasets with incompatible licenses for the intended use.
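A minimal sketch of the licensing field and the deployment gate might look like the following. The `ALLOWED` policy table is purely hypothetical (which licenses are acceptable for which use is a legal decision, not an engineering one), as are the names `DatasetLicense` and `gate` and the "OWNED" and "UNKNOWN" markers. The sketch also assumes the conservative choice that any recorded contractual restriction blocks the run pending review.

```python
from dataclasses import dataclass, field

# Hypothetical policy table: license identifiers accepted per intended use.
# "OWNED" marks enterprise-owned data; "UNKNOWN" is never accepted.
ALLOWED = {
    "commercial_training": {"CC0-1.0", "CC-BY-4.0", "MIT", "Apache-2.0", "OWNED"},
    "internal_research": {"CC0-1.0", "CC-BY-4.0", "CC-BY-NC-4.0", "MIT", "Apache-2.0", "OWNED"},
}


@dataclass
class DatasetLicense:
    name: str
    license: str                                        # identifier, or "UNKNOWN"
    source: str
    restrictions: list = field(default_factory=list)    # contractual clauses, free text


def gate(datasets, intended_use):
    """Return the names of datasets that should block the training run.

    A dataset blocks if its license is not accepted for the intended use,
    or if it carries any contractual restriction that needs manual review.
    """
    allowed = ALLOWED[intended_use]
    return [d.name for d in datasets if d.license not in allowed or d.restrictions]
```

In a pipeline, a non-empty result from `gate` would fail the deployment step, so the licensing question is answered before the training job starts rather than after the audit.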

Embedding-store provenance. For every vector in your embedding store, what source document produced it, what chunking strategy was applied, what embedding model version was in use, and can you reproduce the embedding under audit? This is the surface the classical data catalogs are still extending into, and the coverage in 2026 is uneven. Enterprises building retrieval-augmented-generation applications on top of vector databases (Pinecone, Weaviate, Chroma, pgvector, the vector features in Snowflake and Databricks) discover at audit time that the embedding store is a black box from the perspective of provenance: vectors in, vectors out, and no chain back to the source.

The procurement category that handles this is still emerging. The classical catalog vendors are extending into it (Alation and Collibra both have vector-store integration roadmaps; Atlan’s product covers it for AWS-anchored estates; Unity Catalog handles it inside the Databricks estate); the LLM observability vendors handle it from a different angle (Arize, Fiddler, WhyLabs have been extending into retrieval-step auditing); some specialised tools (the open-source LangChain tracing layer, the various RAG-evaluation harnesses) cover narrower slices. The procurement signal is the size and criticality of the retrieval-augmented-generation footprint; for enterprises with a single small RAG application, manual auditability is sufficient; for enterprises with multiple production RAG applications against regulated workflows, the tooling procurement is necessary and most teams underbudget for it.
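For the single-small-application case where manual auditability is enough, the provenance fields named above (source document, chunking strategy, embedding model version) can be captured in a sidecar record per vector. The sketch below is an assumption about what such a record could hold, not any vendor's schema: `EmbeddingProvenance`, `record_for`, and `can_reproduce` are hypothetical names, and the check only verifies that the embedded chunk can be regenerated from the source, since re-running the embedding model is outside what a self-contained example can show.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingProvenance:
    vector_id: str
    source_doc_id: str
    source_sha256: str      # hash of the source document at embedding time
    chunker: str            # label for the chunking strategy (hypothetical format)
    chunk_index: int
    chunk_sha256: str       # hash of the exact chunk text that was embedded
    embedding_model: str    # model name plus version


def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def record_for(vector_id, doc_id, doc_text, chunk_text, chunk_index, chunker, model):
    """Build the provenance record at the moment the vector is written."""
    return EmbeddingProvenance(vector_id, doc_id, sha256(doc_text), chunker,
                               chunk_index, sha256(chunk_text), model)


def can_reproduce(rec, current_doc_text, rechunk):
    """True if re-chunking the current document yields the chunk that was embedded.

    `rechunk` is a caller-supplied function: text -> list of chunk strings.
    """
    if sha256(current_doc_text) != rec.source_sha256:
        return False  # source changed since embedding; the archived version is needed
    chunks = rechunk(current_doc_text)
    return (rec.chunk_index < len(chunks)
            and sha256(chunks[rec.chunk_index]) == rec.chunk_sha256)
```

Storing the record alongside the vector (as metadata or in a separate table keyed by `vector_id`) is what turns the store from vectors in, vectors out into something with a chain back to the source.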

Prompt-context auditability. For every model response your AI system produced, what was the full prompt context (system prompt, retrieved chunks, conversation history, tool-call results) that fed into the model when it generated the response? This is the audit trail the Stuttgart insurer above could not produce, and the one most production-facing AI applications cannot produce six weeks after the fact because the logging was not configured to retain it. The cost of producing this audit trail retroactively is the project the insurer ran to bring the assistant back online; the cost of building it in from the start is materially smaller.

The procurement category that handles this overlaps with LLM observability (Langfuse, LangSmith, Helicone, Arize, the vendor-native logging in OpenAI’s enterprise tier, Anthropic’s enterprise observability surface) and with the application-side instrumentation (custom logging into the data warehouse, the audit features in some application-development frameworks). The procurement signal is whether the application is in a regulated workflow or whether the response could ever be questioned in a forum the enterprise needs to defend in; if yes, the audit trail needs to exist before the application ships, not after.
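On the application-side instrumentation route, the audit trail amounts to writing one replayable record per response before the response leaves the service. The following is a sketch under assumptions: the field names, `build_audit_record`, and `to_jsonl_line` are hypothetical and do not reflect the schema of any observability product named above.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(response_id, system_prompt, retrieved_chunks, history,
                       tool_results, model, embedding_model, response_text):
    """Assemble one replayable record per model response.

    retrieved_chunks: list of dicts with "doc_id", "chunk_index", and "text".
    """
    context = {
        "system_prompt": system_prompt,
        "retrieved_chunks": retrieved_chunks,
        "history": history,
        "tool_results": tool_results,
    }
    # Canonical JSON so that identical contexts always produce the same digest.
    canonical = json.dumps(context, sort_keys=True, ensure_ascii=False)
    return {
        "response_id": response_id,
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "embedding_model": embedding_model,
        "context": context,
        "context_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "response_text": response_text,
    }


def to_jsonl_line(record):
    """Serialise a record as one line for an append-only log or warehouse load."""
    return json.dumps(record, sort_keys=True, ensure_ascii=False)
```

The record keeps the retrieved chunk text itself, not only references to it, so the context can be replayed even if the source documents or the embedding store have changed since. Retention and access to this log then need the same sensitivity tier as the documents it quotes.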

Data catalog procurement: governance-first versus discovery-first

The standalone data-catalog market in 2026 splits cleanly into two archetypes distinguished by the procurement question the vendor’s centre of gravity answers. The split is the same shape as the governance tooling piece’s four-archetype split — and just as in that piece, the highest-leverage move in catalog procurement is naming the archetype before naming the vendor.

Governance-first catalogs answer the question how do we govern data assets across the enterprise with policy, ownership, and compliance evidence the auditor will accept. The platforms in this archetype assume an existing data-governance programme, a mature policy library, named data owners, and a compliance audit cycle the catalog feeds into. Collibra leads this archetype; Informatica’s Cloud Data Governance and Catalog (CDGC) sits here; Alation straddles this and the discovery-first archetype with the centre of gravity closer to governance for enterprise deployments. The procurement strength is the depth of the governance surface — policy enforcement, data quality rules, regulatory mapping, audit-trail production. The procurement weakness is the operational tax — these platforms require a data-governance team to operate, and enterprises without that team underdeploy and the catalog quietly falls out of date.

Discovery-first catalogs answer the question how do our data consumers find, understand, and use the data assets they need. The platforms in this archetype optimise for the user experience of the analyst, data scientist, or AI engineer searching for which dataset has the customer transaction history I need for this model. Atlan leads this archetype with a particularly strong AI-engineer-friendly surface; Select Star and Secoda are the focused mid-market entrants; Cloudera’s data catalog work fits here for Cloudera-anchored estates. The procurement strength is adoption — these platforms are used by the people they were built for, which is the failure mode the governance-first archetype most often runs into. The procurement weakness is the governance depth — these tools record what exists and help users find it; they do not enforce policy at the depth a compliance-led organisation requires.

The procurement signal that tells you which archetype to buy is the question your data organisation and your AI strategy team are arguing about. If the argument is 'we cannot demonstrate to the auditor that our data is governed', you need governance-first. If the argument is 'our AI engineering teams cannot find the data they need and end up duplicating datasets across the estate', you need discovery-first. If the argument is both, you need one of the platforms that straddles (Alation, increasingly) or you accept the operational complexity of running two catalogs with a clear handoff between them, which is a real choice some large enterprises make consciously.

The procurement signal that tells you when to buy at all is the inventory scale. Below 500 distinct datasets or 20 named data owners, the catalog tool is operational overhead and a spreadsheet plus a manual review cadence does the work. Above that scale, the manual cost overtakes the licence cost. Most enterprises buy the catalog three years before they need it on a vendor-driven sales cycle and three years after they need it on an internally-driven one; the right time is in between.

Alation, Collibra, Informatica

These three are the most established enterprise data-catalog vendors and the ones most likely to appear on a governance-first shortlist.

Alation is the broadest of the three, with a centre of gravity that has moved through 2023–2026 from discovery-first toward governance-first as the enterprise customer base matured. The platform’s strength is the breadth of integrations across the warehouse and BI estate (Snowflake, Databricks, Tableau, Power BI, Looker, the standard set), the depth of the policy-and-stewardship workflow, and the AI-specific extensions in Alation Anywhere and the Open Data Quality Framework. The weakness is the operational tax — Alation deployments require a data-stewardship function to operate, and enterprises that buy the platform without committing the stewardship headcount underdeploy. The procurement signal Alation answers cleanly is we have a mature data organisation, a governance programme, and a heterogeneous data estate that requires a single catalog overlay. For mid-to-large enterprises with the operational capacity, this is the default broad-coverage answer.

Collibra is the governance-first specialist and the platform most often shortlisted on regulatory-driver procurement. The strength is the depth of the governance surface — policy enforcement, data lineage, compliance reporting, and regulatory mapping (financial services, healthcare, EU data regulations) are the centre of gravity. The weakness is the user-adoption surface — Collibra is built for the data-governance team, not for the AI engineer searching for a dataset, and the discovery experience requires deliberate complement (Collibra Data Marketplace addresses this partly). The procurement signal Collibra answers cleanly is we are in a regulated industry and we need the catalog to be load-bearing for compliance evidence. For finance, healthcare, and pharmaceutical enterprises with mature regulatory postures, this is the right answer.

Informatica is the data-integration incumbent extending into the catalog category through the Informatica Data Management Cloud (IDMC) suite, including Cloud Data Governance and Catalog. The strength is the breadth of the data-management estate the catalog plugs into — if you already run Informatica for ETL, master data management, or data quality, the catalog inherits the existing integration footprint and the procurement is incremental rather than greenfield. The weakness is the same shape as for any incumbent-extending-into-AI play; the platform is heavier than a focused catalog purchase requires, and the AI-specific extensions are roadmap rather than mature in 2026. The procurement signal Informatica answers cleanly is we are an Informatica-anchored data shop and we need the catalog inside the existing platform relationship. For Informatica-anchored enterprises, the procurement is straightforward; for everyone else, the standalone catalog vendors produce cleaner outcomes.

Atlan, Select Star, Secoda

These three lead the discovery-first archetype, and they are the catalogs most likely to be adopted by AI engineering teams that need to find data quickly.

Atlan is the broadest of the three and the most AI-engineer-friendly catalog in the market in 2026. The strength is the user experience — search, lineage visualisation, collaboration, and the workflows that an AI engineer uses to find and qualify a dataset are well-designed. The integration breadth (Snowflake, Databricks, dbt, Tableau, Power BI, and a long tail of operational sources) is strong, and the AI-specific extensions (column-level lineage into ML feature stores, embedding into Databricks and Snowflake workflows, the recently-shipped Atlan AI assistant) are real rather than roadmap. The weakness is the governance depth at the policy-enforcement level — Atlan is the catalog the data team adopts, not necessarily the catalog the chief data officer’s audit team builds policy in. The procurement signal Atlan answers cleanly is our AI engineering and data analytics teams need a catalog they will actually use. For organisations whose AI work is led by engineering and analytics functions, this is the right primary answer.

Select Star is the mid-market discovery-first specialist with a focus on automatic lineage and column-level documentation. The strength is the depth of the automatic-lineage surface — the platform infers lineage from query logs at a level of detail most competitors require manual configuration to match. The weakness is the smaller integration footprint and the narrower governance surface relative to the larger vendors. The procurement signal Select Star answers cleanly is we are a mid-sized data shop and we want strong automatic lineage at a price point the enterprise vendors do not meet. For mid-market deployments, this is the pragmatic discovery-first answer.

Secoda is the smallest of the three named here, with a focus on the data-discovery-and-documentation surface for smaller data teams. The strength is the price point and the speed of deployment; the weakness is the breadth and depth relative to the larger vendors. The procurement signal Secoda answers cleanly is we are a smaller organisation and we want a catalog that ships in weeks not months. For startups and mid-sized organisations, this is the entry-tier answer; for enterprise-scale deployments, the larger vendors produce better outcomes.

Databricks Unity Catalog, Snowflake Cortex

These two are the platform-native answers to the catalog question, mirroring the hyperscaler-native archetype in the governance tooling piece.

Databricks Unity Catalog is the governance and catalog surface inside the Databricks Lakehouse Platform, and the right primary catalog for Databricks-anchored enterprises. The strength is the integration depth — Unity Catalog sees the workspace’s tables, ML models, feature stores, notebooks, and AI artefacts natively, with policy enforcement that extends across the whole Lakehouse footprint. The platform has been extending into the AI-specific surfaces (model registry integration, embedding-store lineage, RAG-evaluation logging) ahead of most standalone catalogs because Databricks owns the surface the AI workloads run on. The weakness is the multi-platform coverage — Unity Catalog governs what is in Databricks, and the coverage outside Databricks (data in Snowflake, BigQuery, operational sources) is limited or requires the broader Databricks-native federation layer. The procurement signal Unity Catalog answers cleanly is we are 80%+ on Databricks and we need the catalog to be the Databricks-native answer. For Databricks-anchored enterprises, this is the default answer; for multi-platform enterprises, Unity Catalog covers the Databricks slice and a standalone catalog overlay covers the rest.

Snowflake Cortex is the AI surface inside Snowflake, including the catalog-adjacent features that extend Snowflake’s existing governance into model-and-AI workflows. Cortex includes model-serving primitives, embedding generation, and the governance hooks that bring AI workloads into Snowflake’s existing access-control and audit surface. The strength is the integration depth inside Snowflake-anchored estates and the procurement economics — Cortex is sold as an extension of an existing Snowflake relationship rather than a separate procurement. The weakness is the same shape as for Unity Catalog — strong inside Snowflake, less so outside it — and the maturity is earlier than Unity Catalog because Snowflake started the AI extension later. The procurement signal Cortex answers cleanly is we are Snowflake-anchored and we want AI workloads inside the Snowflake governance surface. For Snowflake-anchored enterprises, this is the procurement direction the platform is moving toward.

Cloudera

Cloudera is the Hadoop-incumbent answer to the catalog and data-readiness question for enterprises that still run substantial on-premise or hybrid data infrastructure. The Cloudera Data Platform includes catalog, governance, and AI-workflow features integrated into the broader Hadoop-and-Spark ecosystem the platform anchors.

The strength is the on-premise and hybrid coverage that the cloud-native catalog vendors do not address as well — for regulated industries, government, and large enterprises with substantial on-premise data estates that cannot move to cloud platforms, Cloudera remains the credible incumbent answer. The weakness is the broader trajectory — Cloudera’s market position has been shrinking relative to the cloud-native data platforms, and the procurement question is whether the next five years of catalog-and-AI investment will produce competitive features against the cloud-native vendors. The procurement signal Cloudera answers cleanly is we have substantial on-premise data infrastructure that is staying on-premise, and we need the catalog to live in that estate. For enterprises that fit that shape, the procurement is straightforward; for cloud-anchored or cloud-migrating enterprises, the cloud-native answers produce cleaner outcomes.

The four-criterion scoring rubric

The procurement methodology mirrors the structure of the governance tooling four-week PoC and the agent-security four-criterion sheet, with the criteria adjusted for the data-catalog character of the work.

Criterion one: inventory coverage against your actual data estate. The test is not whether the vendor’s marketing claims coverage; it is whether the platform produces a credible inventory of the test datasets you nominate within the proof-of-concept window. Five is coverage across the full inventory including the shadow-analytics environments and SaaS-export data; one is coverage of the central warehouse and nothing else.

Criterion two: AI-specific extension fit. Does the platform cover the three AI-specific surfaces (training-data licensing, embedding-store provenance, prompt-context auditability) at the depth your use cases require, or only on roadmap. Five is all three AI-specific surfaces covered with production-grade features against real test workloads; one is classical data-catalog coverage only, AI-specific work on roadmap.

Criterion three: archetype fit (governance-first versus discovery-first). Does the platform’s centre of gravity match the procurement question your data and AI organisations are arguing about. Five is clean fit with the archetype your inventory and your governance posture require; one is fit on the opposite archetype, requiring cultural change to deploy successfully.

Criterion four: three-year total cost against the consolidation prediction. The catalog market is consolidating slower than the governance-tools market but consolidating nonetheless — the move from standalone catalogs into platform-native suites is the structural pressure. The test is whether the three-year total cost is justified against the likelihood the platform will be acquired or absorbed during the contract term, and whether the data-portability and exit terms are honest. Five is cost at the low end of the archetype range with clean data-portability terms; one is high cost with five-year-commitment pricing and no defined exit path.

The full four-criterion scoring sheet is published under CC-BY-4.0 and linked from the capabilities hub.
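For teams that want to tabulate proof-of-concept results, the rubric reduces to four 1-to-5 scores per vendor. The sketch below is not the published scoring sheet: the criterion keys, the `VendorScore` class, and the equal default weighting are assumptions made for illustration, since the text does not specify how the four criteria are combined.

```python
from dataclasses import dataclass

CRITERIA = ("inventory_coverage", "ai_extension_fit", "archetype_fit", "three_year_cost")


@dataclass
class VendorScore:
    vendor: str
    scores: dict  # criterion name -> integer from 1 to 5

    def __post_init__(self):
        if set(self.scores) != set(CRITERIA):
            raise ValueError(f"expected exactly these criteria: {CRITERIA}")
        for name, value in self.scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    def total(self, weights=None):
        # Equal weights are an assumption; pass a dict to weight criteria differently.
        weights = weights or {c: 1.0 for c in CRITERIA}
        weighted = sum(self.scores[c] * weights[c] for c in CRITERIA)
        return weighted / sum(weights.values())


def rank(candidates, weights=None):
    """Sort candidates from highest to lowest weighted score."""
    return sorted(candidates, key=lambda v: v.total(weights), reverse=True)
```

A regulated enterprise might weight archetype fit and AI-extension fit more heavily than cost; the point of keeping the weights explicit is that the argument about them happens before the vendor demos rather than after.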

Where I would buy what, by enterprise shape

A pragmatic short list, scoped to enterprise shape rather than to vendor preference.

For a mid-to-large enterprise with a heterogeneous data estate and a mature data-governance organisation: Alation as the primary catalog (centre-of-gravity on governance with adequate discovery experience), with the AI-specific extensions delivered through a combination of Alation’s roadmap features and a complementary LLM observability tool for the prompt-context and embedding-provenance surfaces.

For a regulated-industry enterprise (finance, healthcare, pharma) with substantial compliance-driver procurement: Collibra as the primary catalog, paired with deliberate investment in the AI-specific surfaces through the platform’s evolving feature set or through complementary specialised tools.

For an Informatica-anchored data shop: Informatica IDMC’s catalog as the inside-the-relationship answer, with the AI-specific extension work modelled against Informatica’s product roadmap explicitly rather than assumed.

For an AI-engineering-led organisation prioritising discovery and adoption over governance theatre: Atlan as the primary catalog, with the governance-policy surface either accepted at Atlan’s level or extended through complementary tooling depending on the regulatory posture.

For an estate more than 80% anchored on Databricks: Unity Catalog as the primary catalog, with the standalone overlay only for the workloads that genuinely live outside Databricks.

For an estate more than 80% anchored on Snowflake: Snowflake’s native governance plus the Cortex AI surface, with the same multi-platform caveat as for Databricks.

For a mid-market organisation with a 500–2000 dataset inventory: Select Star or Secoda at the discovery-first archetype, or Alation at the lower enterprise tier if governance is the procurement driver.

For a Cloudera-anchored on-premise or hybrid estate: Cloudera Data Platform’s native catalog and governance, with the procurement caveats about the platform’s broader trajectory.
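The short list above can also be read as a lookup table with two numeric gates. The sketch below is illustrative only: the shape labels and function names are invented for this example, the values paraphrase the recommendations, and treating "above 80%" as a fraction of the estate on the platform is an assumption about how that share is measured.

```python
# Shape labels are invented for this sketch; values paraphrase the short list.
SHORTLIST = {
    "heterogeneous_mature_governance": "Alation, plus an LLM observability tool",
    "regulated_industry": "Collibra, plus investment in the AI-specific surfaces",
    "informatica_anchored": "Informatica IDMC catalog",
    "ai_engineering_led": "Atlan",
    "databricks_anchored": "Unity Catalog",
    "snowflake_anchored": "Snowflake native governance plus Cortex AI",
    "mid_market": "Select Star or Secoda (Alation if governance is the driver)",
    "cloudera_anchored": "Cloudera Data Platform native catalog and governance",
}

# "Above 80%" from the short list, expressed as a fraction (assumed measure).
PLATFORM_NATIVE_THRESHOLD = 0.80


def platform_native_fits(platform_share):
    """True when the platform's share of the estate is strictly above 80%."""
    if not 0.0 <= platform_share <= 1.0:
        raise ValueError("platform_share must be a fraction between 0 and 1")
    return platform_share > PLATFORM_NATIVE_THRESHOLD


def in_mid_market_band(dataset_count):
    """True for the 500-2000 dataset inventory named in the short list."""
    return 500 <= dataset_count <= 2000


def shortlist_for(shape):
    """Look up the recommendation for one of the shape labels above."""
    if shape not in SHORTLIST:
        raise ValueError(f"unknown enterprise shape: {shape!r}")
    return SHORTLIST[shape]
```

The table deliberately encodes no precedence between shapes; an organisation that matches more than one still has to make the archetype call itself.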

None of these recommendations come with a referral fee or a sponsorship. The honest signal of a working data-catalog deployment is that the data engineers, AI engineers, and analytics teams use it daily and the governance team produces compliance evidence from it monthly. The signal of a failing one is that the catalog exists, the dashboards are populated, and nobody opens them; the deployments that look fine at the six-month review and flat at the eighteen-month one are the ones where the cultural commitment to the archetype did not match the platform.

The data layer pre-dates the strategy approval

The structural argument of this page, and the reason it sits as a P1 in the capabilities cluster alongside the readiness assessment, is that the data layer is the prerequisite the AI strategy does not get to assume. Strategies that approve use cases their data layer cannot support are the strategies that slip their first milestone in month three and produce the audit findings in month nine that the Stuttgart insurer above worked through.

Goodhart’s Law applies to data readiness as much as to any other metric in this terrain. The data-maturity score in the consultancy assessment is the published metric the consultancy can measure. The realised data accessibility for the specific use case the strategy will require is the underlying outcome the published metric is a proxy for. The two correlate at the start of the assessment and diverge as the use case gets specific. The assessments that produce a green data-maturity score and a missed milestone six months later are the ones where the score measured the wrong thing.

Conway’s Law applies to the catalog procurement. The catalog you buy will be shaped by the organisation you have, and the catalog will, in turn, shape the data conversation your organisation can have. Governance-first catalogs produce governance-led data conversations; discovery-first catalogs produce engineering-led ones. Neither is wrong; the misalignment between the catalog’s archetype and the organisation’s cultural commitment to data is the procurement failure mode that produces the underdeployed platform.

The Brooks corollary applies to the AI-specific extension question. The temptation to bolt the training-data licensing, embedding-store provenance, and prompt-context auditability work onto an in-flight AI deployment as a remediation step at audit time is the same shape as adding people to a late project — it makes the deployment later and more expensive than building the audit infrastructure from the start. The cost of producing the audit trail retroactively, as the Stuttgart insurer paid, is materially larger than the cost of building it in before the use case ships.

What I would do, if starting from scratch on AI data readiness in mid-2026, is exactly this. Produce the inventory in week one — manually, on a spreadsheet, by interviewing the right twelve people with the right four questions. Classify and tier the inventory in week two. Map the lineage for the datasets the strategy is most likely to require in week three. Audit the access controls against the use case in week four. Identify the AI-specific extension surfaces the use case will need (licensing, embedding provenance, prompt auditability) and budget for them as separate workstreams that run in parallel with the use-case build rather than after it. Decide whether the catalog purchase belongs now (inventory above 500 datasets) or later (inventory below). Pick the catalog archetype before the catalog vendor. Run the four-week proof of concept against three shortlisted vendors inside the chosen archetype. Sign annual or two-year contracts. Plan the consolidation review at month twelve, and the data-layer refresh at the next strategy cycle.
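The sequence above can be written down as data, with the catalog-timing decision as a small function. This is a sketch under stated assumptions: the names are hypothetical, the 500-dataset and 20-owner thresholds come from the break-even rule earlier in this guide, and treating an inventory of exactly 500 as "later" is an interpretation of "above 500".

```python
# The first four weeks, in order, as described in the paragraph above.
WEEKLY_PLAN = [
    (1, "Produce the inventory manually, on a spreadsheet"),
    (2, "Classify and tier the inventory"),
    (3, "Map lineage for the datasets the strategy is most likely to require"),
    (4, "Audit the access controls against the use case"),
]

# Workstreams budgeted to run in parallel with the use-case build.
PARALLEL_WORKSTREAMS = (
    "training-data licensing",
    "embedding-store provenance",
    "prompt-context auditability",
)

DATASET_THRESHOLD = 500  # distinct datasets in the inventory
OWNER_THRESHOLD = 20     # named data owners


def catalog_purchase_timing(dataset_count, named_owners=0):
    """Return "now" if either break-even threshold is exceeded, else "later"."""
    if dataset_count > DATASET_THRESHOLD or named_owners > OWNER_THRESHOLD:
        return "now"
    return "later"


if __name__ == "__main__":
    for week, task in WEEKLY_PLAN:
        print(f"Week {week}: {task}")
    print("Catalog purchase:", catalog_purchase_timing(dataset_count=320, named_owners=12))
```

Keeping the thresholds as named constants makes it easy to revisit the decision at the month-twelve consolidation review with the inventory count of the day.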

The data layer is unglamorous, and the work is dull, and the procurement is harder to defend in a board meeting than the AI use case the board cares about. The use case the board cares about will not survive without it. The strategies that ship are the ones whose data layer was ready before the use case was approved. The strategies that slip are the ones that assumed the data layer would catch up. The pattern is consistent enough across the engagements I have audited that I would commit to publishing the next refresh of this page from the same position. The data layer pre-dates the strategy approval. Treat it that way, and the rest of the work has a chance.

Sources

NIST AI Risk Management Framework, v1.0 — the underlying risk framework against which AI data readiness maps
EU AI Act, Regulation (EU) 2024/1689 — Articles 10 (data and data governance) and 12 (record-keeping) for high-risk systems
ISO/IEC 42001:2023 — AI management system standard, with explicit data-governance clauses
Alation Data Intelligence Platform, Atlan Active Metadata Platform, Collibra Data Intelligence Platform, Informatica Cloud Data Governance and Catalog — primary vendor references for the standalone catalog archetypes
Databricks Unity Catalog and Snowflake Cortex — platform-native catalog references
Select Star, Secoda, Cloudera Data Platform — additional catalog references covering the mid-market, smaller enterprise, and hybrid-or-on-premise archetypes
Related: capabilities hub, enterprise AI readiness assessment, cost of failed AI projects, governance hub, AI governance tools

Methodology: the catalog scorings and the four-criterion rubric are drawn from fractional CTO and CIO engagements (2024–2026) where the catalog procurement either preceded an AI strategy approval and the strategy shipped on time, or did not and the strategy slipped on a data-layer cliff the catalog would have surfaced. The four-page data-readiness extension and the AI-specific surface assessment template are CC-BY-4.0 and linked from the capabilities hub. Fork them, change the criteria, publish a variant with different weights for your sector and send me the link; I will reference the fork from the next refresh.

Thomas Prommer CIO / CTO · 20 years · Practitioner, not consultant

Tom Prommer writes The AI Strategy Guide from the operator's seat — every tool covered, tested with real money before forming a view.