AI for Platform Engineering: The New Operating Model

The platform-engineering standup I sat in on in February ran for forty minutes on a single thread. Three months earlier, the team had built what was, by any reasonable architectural standard, a clean golden-path template for AI-augmented backend services. The template included an opinionated model-vendor abstraction, a tool-calling pattern, a structured logging surface, an evaluation harness stub, and a Helm chart that wired the whole thing into the team’s existing Kubernetes platform. The senior platform engineer who had led the work was visibly proud of it. Roughly nobody was using it. The two application teams that had adopted it in the first four weeks had quietly forked the template within a fortnight, ripped out the model-vendor abstraction because it was three releases behind Anthropic’s actual tool-use surface, and were now running a parallel stack the platform team could not observe and had not authorised. The platform lead’s question to the room was the right one: “Is the golden path the wrong shape, or is the model market moving faster than any platform team can absorb?” The answer was both, in a specific ratio, and the operating-model implications are the reason this page exists.

The platform-engineering operating model that worked through the 2020s — pre-AI, in the era when the platform team’s job was Kubernetes, observability, CI/CD, and the developer-experience layer that wrapped them — does not survive contact with the AI-augmented application stack without changes. The changes are not the ones most platform leads are making in 2026. The instinct has been to grow the platform team, expand its charter, build more golden paths, and abstract more of the AI infrastructure stack. The result is platform teams that are too large, charters that are too broad, golden paths that lag the market they are supposed to standardise, and abstractions that hide the capabilities the application teams need to ship. The pattern is consistent enough across the engagements I have walked into in the last eighteen months to be worth writing down, and the operating-model alternatives are concrete enough that they can be implemented without a strategy refresh.

This page covers the operating-model shifts specifically. The procurement decisions that sit one layer above — which LLM observability vendor, which orchestration framework, which inference platform — are covered at the capabilities cluster. The application-team-level decisions that sit one layer below — which tool the application engineer reaches for inside their IDE, how they structure their prompts, which pattern they use for a specific workflow — are covered at the agentic AI architecture patterns page. What is unique to this page is the platform team’s own charter, staffing model, build-versus-buy decisions, and the anti-patterns the team has to actively avoid.

The platform team’s expanded charter, named honestly

Pre-AI, the platform team’s charter was roughly stable across a generation of engineering organisations. The team owned the Kubernetes platform or its equivalent, the CI/CD pipeline, the observability stack, the developer-experience surface (the internal-developer-platform layer or its informal equivalent), the security and compliance plumbing that wrapped the other four, and the on-call rotation for the platform itself. Every line item on that charter had a vendor ecosystem behind it, a clear measurement framework, and a body of engineering literature to draw on. The platform-engineering discipline through the 2020s converged on a remarkably consistent shape across organisations.

The AI-augmented platform team’s charter is that same list plus four new line items, and the four new ones are the source of most of the operating-model trouble.

The model-vendor abstraction surface. Application teams need a consistent way to call models from inside the platform, with the operational concerns — credentials, retry policy, request and response logging, prompt-template versioning, cost attribution — handled centrally rather than re-implemented per application. This is the one layer of the AI stack the platform team should genuinely own, and it is also the layer that produces the most over-engineering. The abstraction that works covers the operational surface. The abstraction that fails tries to normalise the capability surface across model vendors that are genuinely different shapes.
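A minimal sketch of what "abstract the operational surface, pass the capability surface through" could look like. Everything here is hypothetical and for illustration only: `ModelGateway`, `CallRecord`, the `get_credential` lookup, and the `vendor_call` callable are assumed names, not part of any vendor SDK. The wrapper handles credentials, retries, and a per-team call record, and it never inspects or reshapes the request or the response.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class CallRecord:
    """One logged model call, kept for audit and cost attribution."""
    team: str
    vendor: str
    attempts: int
    latency_s: float
    ok: bool


@dataclass
class ModelGateway:
    """Hypothetical platform-owned wrapper around vendor SDK calls.

    Operational concerns only. The request and response are passed through
    untouched, so vendor-specific features stay available to the caller.
    """
    get_credential: Callable[[str], str]  # assumption: a secrets-manager lookup
    max_attempts: int = 3
    backoff_s: float = 0.5
    records: list[CallRecord] = field(default_factory=list)

    def call(self, team: str, vendor: str,
             vendor_call: Callable[[str, dict[str, Any]], Any],
             request: dict[str, Any]) -> Any:
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        api_key = self.get_credential(vendor)
        start = time.monotonic()
        for attempt in range(1, self.max_attempts + 1):
            try:
                response = vendor_call(api_key, request)  # passed through as-is
            except Exception:
                if attempt == self.max_attempts:
                    self._record(team, vendor, attempt, start, ok=False)
                    raise
                # Exponential backoff before the next attempt.
                time.sleep(self.backoff_s * 2 ** (attempt - 1))
                continue
            self._record(team, vendor, attempt, start, ok=True)
            return response

    def _record(self, team: str, vendor: str, attempts: int,
                start: float, ok: bool) -> None:
        self.records.append(
            CallRecord(team, vendor, attempts, time.monotonic() - start, ok))
```

The design choice worth noting is that `vendor_call` is supplied by the application team. When a vendor ships a new capability, the application team changes its own callable and the gateway does not need a release.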

The evaluation primitives. Every AI-augmented application team needs the ability to evaluate model outputs against labelled datasets, track regressions across model releases, and gate deployments on evaluation results. The application teams cannot build this surface from scratch each time; the cost would be prohibitive and the inconsistency would prevent the platform team from observing what is actually shipping. The platform team has to provide the evaluation primitives — the test-runner, the dataset-management surface, the regression-tracking dashboard, the deployment-gating integration — even though the actual evaluation datasets and metrics belong to the application teams. The split between the platform-owned primitives and the application-owned content is one of the harder lines to draw cleanly, and most platform teams draw it wrong on the first attempt.
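One way to picture the split between platform-owned primitives and application-owned content is a runner that takes the dataset and the metric as arguments. This is a sketch under stated assumptions: `Example`, `EvalResult`, `run_eval`, and the `tolerance` default are hypothetical names and values, not an existing tool's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """Application-owned: one labelled case."""
    prompt: str
    expected: str


@dataclass(frozen=True)
class EvalResult:
    score: float      # mean metric value across the dataset
    baseline: float   # score recorded for the previous model release
    regressed: bool


def run_eval(model: Callable[[str], str],
             dataset: Sequence[Example],
             metric: Callable[[str, str], float],
             baseline: float,
             tolerance: float = 0.02) -> EvalResult:
    """Platform-owned runner: it knows nothing about the task itself."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    scores = [metric(model(ex.prompt), ex.expected) for ex in dataset]
    score = sum(scores) / len(scores)
    # A regression is a drop below the baseline by more than the tolerance.
    return EvalResult(score, baseline, score < baseline - tolerance)
```

The platform team owns `run_eval` and the storage of baselines. The application team owns the `Example` rows and the `metric` function, for instance an exact-match comparison or a task-specific scorer.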

The observability surface for agentic and AI-augmented systems. Standard application observability does not cover the AI-specific failure modes: tool-selection drift, prompt-template regression, context-window exhaustion, hallucinated tool calls, and supervisor-worker context loss. These are the failure modes named on the agentic AI architecture patterns page. The platform team has to extend the observability surface to cover these failures, which usually means integrating a dedicated LLM observability vendor alongside the existing observability stack. The integration work is non-trivial. The build-versus-buy decision on this surface is one of the central decisions the platform team makes, and the answer is mostly buy.

The platform’s own AI-tooling rollout. Cursor, Claude Code, GitHub Copilot, Sonar, the inline-assist tools, the PR-bots — every one of these tools has a platform-side integration story (SSO, audit logging, data-handling, code-base indexing) that the platform team owns regardless of which application team is using the tool. The platform team is now responsible for the operational reliability of the AI coding tools across the engineering organisation. This is a real responsibility, with a real on-call surface attached, and it is the line item most platform teams underestimate at planning time.

Four new line items, each with a real workload behind it. The instinct to add headcount to the platform team to cover them is intuitive and wrong. The platform team’s effectiveness scales with senior density rather than with headcount, and the four new line items in particular reward senior judgement over additional hands. The headcount-versus-density question is the one the rest of this page leans on.

Headcount and staffing — scale by senior density, not by size

The platform team I worked with in February was a twelve-person team serving roughly 350 engineers. The composition was three senior engineers, three mid-level engineers, two junior engineers, two engineering managers, and two SREs split between the platform and a separate reliability team. By 2024 standards this was a reasonable composition. By the standards of what an AI-augmented platform team actually needs to do, it was wrongly weighted by a meaningful margin. The senior engineers were absorbed by the integration work, the model-vendor abstraction maintenance, and the evaluation-primitives design. The mid-level engineers were producing platform code at high individual throughput — they were heavy Cursor users themselves — but the code they were producing was queuing for review by the same three senior engineers, who were already overloaded. The platform team had built the throughput-versus-velocity gap into itself.

The pattern that works, across the engagements where the platform team has stabilised, is roughly two to four senior engineers per hundred engineers served, with a hard ceiling around twelve people for any single platform team regardless of organisation size. The composition is heavier on senior and staff engineers than the pre-AI platform team was, and lighter on mid-level engineers. The reasoning is the same code-review bottleneck reasoning that applies to the application teams, with a complicating amplifier: the platform team’s own code is more architecturally consequential than the application teams’ code, the senior-judgement review load is therefore proportionally higher, and the platform team can least afford the velocity loss that comes from a mid-level-heavy composition.

For organisations beyond about 500 engineers, the right pattern is multiple federated platform teams rather than one larger platform team. Each federated team owns a specific surface (the AI-infrastructure surface, the developer-experience surface, the reliability surface, the security-and-compliance surface) and the federation is held together by a thin coordinating function — usually one engineering manager and one principal engineer — rather than by a unified platform organisation. The instinct to grow the platform team past the twelve-person ceiling produces the platform-team-internal-coordination-cost failure mode that Brooks named in 1975 and that has not stopped applying since. The federation pattern works because each federated team stays under the coordination ceiling.
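The sizing rule above can be written down as arithmetic. This is a sketch, not a formula from any standard: it assumes the "two to four per hundred" ratio scales linearly, rounds up, and treats "about 500" as a hard threshold. The constant and function names are illustrative.

```python
import math

SENIORS_PER_HUNDRED = (2, 4)  # range quoted in the text
TEAM_CEILING = 12             # people in any single platform team
FEDERATE_ABOVE = 500          # engineers served; "about 500" in the text


def senior_range(engineers_served: int) -> tuple[int, int]:
    """Low and high senior-engineer counts implied by the ratio, rounded up."""
    low, high = SENIORS_PER_HUNDRED
    return (math.ceil(engineers_served * low / 100),
            math.ceil(engineers_served * high / 100))


def should_federate(engineers_served: int) -> bool:
    """True once the organisation is past the size where one team stops working."""
    return engineers_served > FEDERATE_ABOVE
```

For the February team, `senior_range(350)` gives a range of 7 to 14 senior engineers against the three it actually had. The upper end of that range already sits above `TEAM_CEILING`, which is the point at which the text's answer becomes federation rather than a larger single team.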

Brooks’ Mythical Man-Month is the foundational reference here and not in the loose buzzword sense. The specific Brooks insight that applies — that communication overhead in a team grows as n-squared and limits the team’s effective output — is the load-bearing claim behind the twelve-person ceiling. The platform team that ignores this and grows past the ceiling is the platform team that finds its internal coordination cost dominates its external delivery cost, which is the empirical state of about a third of the platform teams I have audited.
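The Brooks claim is easy to check by hand. If every pair of people in a team of n needs a communication path, the number of paths is n(n-1)/2. A short sketch follows; the function name is illustrative.

```python
def channels(n: int) -> int:
    """Pairwise communication paths in a team of n people: n(n-1)/2."""
    if n < 0:
        raise ValueError("team size cannot be negative")
    return n * (n - 1) // 2


single_team_of_12 = channels(12)      # 66
single_team_of_24 = channels(24)      # 276
two_federated_teams = 2 * channels(12)  # 132
```

Doubling a team from twelve to twenty-four people roughly quadruples the paths (66 to 276), while two federated teams of twelve carry 132 between them, plus whatever the thin coordinating function adds. Not every pair communicates in practice, so this is an upper bound on paths rather than a measurement of overhead.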

The hiring posture follows. The platform team in an AI-augmented organisation should hire more senior engineers than the pre-AI version would have hired, even at the cost of a smaller team overall. The compensation pressure on senior platform engineers is real — these are exactly the engineers AI-native firms are targeting hardest — and the retention work for them has to be deliberate. The fastest way I have seen platform teams hollow out in 2026 is by under-paying their three or four senior engineers while paying market for mid-level engineers, on the theory that the mid-level engineers are the ones doing the visible work. The senior engineers leave, the platform team’s effectiveness collapses, and the engineering organisation discovers what the platform team was actually doing roughly two months too late to fix it cleanly.

Build versus buy — the platform-layer version

The build-versus-buy decision at the platform layer is different from the build-versus-buy decision at the application layer, and it is more nuanced than most platform leads have time to think through. The principle that survives across the engagements where I have watched the decision get made well is: buy the operational surfaces where the vendor market has matured, build the integration glue and the layer that has to ground in your specific codebase, and never build what the model vendors themselves will eventually own.

Buy LLM observability. The vendor market here has matured enough that building it internally produces a worse surface at higher cost. Langfuse, Helicone, LangSmith, Datadog’s LLM observability work, and Honeycomb’s AI observability features all cover the production-side observability surface competently. The integration work is non-trivial but is glue work, not platform-engineering R&D. The platform team that builds its own LLM observability surface in 2026 is the platform team that will spend a quarter replicating what these vendors already do and a year maintaining the result against a moving model market.

Buy inference, mostly. The GPU inference market — covered in detail at the GPU inference page — has consolidated to the point where renting inference from one of the model vendors directly, or from one of the inference-API platforms (Together AI, Anyscale, Modal for serverless), is the right answer for almost all enterprise workloads. The exceptions are narrow and well-known: extreme-low-latency requirements, regulatory-compliance constraints requiring on-prem inference, or workload economics that genuinely justify the capex of a dedicated inference cluster. If you are not in one of those exception cases, do not build inference. The build path is more expensive than the operating-cost calculation suggests because the maintenance burden is higher than the surface area implies.

Build the model-vendor abstraction. The platform team should own this layer, and it is one of the few layers where building is genuinely the right answer. The reasoning has three parts. First, the abstraction has to be coupled tightly to the platform’s specific credentials management, logging surface, cost attribution, and compliance posture, none of which a vendor can provide cleanly for a single-tenant deployment. Second, the abstraction has to evolve with the model vendors’ own surfaces, which means continuous maintenance the vendor cannot anticipate. Third, the abstraction is small enough (three hundred to five hundred lines of code in a competently built version) that the build cost is bounded. Buy attempts here mostly produce abstractions that lag the model vendors’ actual capabilities by months, which is the opposite of what an abstraction is for.

Build the evaluation harness. Same reasoning as the model-vendor abstraction, with one addition: the evaluation harness has to integrate with the application teams’ deployment pipelines, which are organisation-specific in ways no vendor can anticipate. The buy options for evaluation harnesses are either over-priced (the dedicated eval-platform vendors) or under-shaped to the specific evaluation needs of a given application team (the eval features bundled into the LLM observability vendors). Build the eval harness as a thin layer on top of the bought LLM observability surface. The line between the two is the line between production observability (bought) and pre-deployment evaluation (built).

Build the golden-path templates, and accept they will be forked. The platform team has to provide opinionated templates for AI-augmented services, because the alternative is every application team rebuilding the same boilerplate from scratch. The templates will be forked by application teams when the model market moves faster than the template’s update cadence. This is fine and expected; the platform team’s job is to update the template back to incorporate the application teams’ forks, not to prevent the forks from happening. Templates that cannot be forked produce shadow stacks; templates that can be forked produce a feedback loop the platform team can learn from.

Do not build what the model vendors will own. This is the negative principle that catches the most platform-team time-waste. Tool-calling primitives, structured-output enforcement, prompt-template versioning at the application level, basic agentic orchestration for simple loops — all of these are surfaces the model vendors themselves are building, and the model vendors’ versions will be better than any platform-team version within a release cycle or two. The platform team that builds these surfaces is the platform team that will spend a year building something the vendors deliver for free six months later, and then spend another quarter migrating off its own implementation. Watch the model vendors’ roadmaps; do not build what is six months away from being part of the SDK.

The platform-engineering anti-patterns

The four anti-patterns below are the ones I see most often in platform-engineering engagements in 2026. They cluster.

Over-engineered orchestration that nobody uses. The single most common anti-pattern, by a meaningful margin. The platform team builds a sophisticated agentic orchestration layer — multiple model vendors abstracted behind a routing tier, complex tool-calling logic, retry-and-fallback policies, request-and-response logging integrated with the observability surface. The architecture is correct. The diagram is clean. The application teams route around it within a quarter, because the abstraction layer adds latency, hides the capabilities the application teams need, and is three model releases behind the actual capability surface. The fix is to build the smallest orchestration layer that solves the application teams’ actual problems, observe how they extend it, and add complexity only when an application team’s hack has stabilised into a pattern worth promoting. The principle is the same one Anthropic published in Building Effective Agents: the right architecture is the smallest one that works. Platform teams over-build for the same reason application teams over-build, with the additional pressure that platform teams’ over-builds get presented to the CTO as architectural achievements.

Evaluation harness divorced from the deployment pipeline. The platform team builds an evaluation surface that produces beautiful reports, tracks regressions across model releases, supports labelled datasets, and is integrated with the LLM observability vendor. Nobody runs evaluations as part of their deployment pipeline because the integration between the eval harness and the CI/CD layer was left for the application teams to implement. The application teams do not implement it because they have other priorities. The eval harness becomes a reporting surface for after-the-fact analysis rather than a gating surface for pre-deployment quality. The fix is to make the eval harness deployment-gating by default — the platform-team-owned default in the CI/CD pipeline is that no AI-augmented deployment ships without a passing eval run — and to make opting out require an explicit decision that gets logged.
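A sketch of what "deployment-gating by default, with a logged opt-out" might look like as a CI/CD step. The names `OptOut` and `may_deploy` and the logger name are hypothetical. The assumption is that the pipeline passes `None` when no eval run exists for the build.

```python
import logging
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("eval_gate")


@dataclass(frozen=True)
class OptOut:
    """An explicit bypass decision: who made it and why."""
    approver: str
    reason: str


def may_deploy(service: str, eval_passed: Optional[bool],
               opt_out: Optional[OptOut] = None) -> bool:
    """Default-deny gate: a missing eval run blocks, same as a failing one."""
    if eval_passed is True:
        return True
    if opt_out is not None:
        if not opt_out.approver.strip() or not opt_out.reason.strip():
            raise ValueError("opt-out needs a named approver and a reason")
        # The bypass is allowed, but it always leaves a log record.
        log.warning("eval gate bypassed for %s by %s: %s",
                    service, opt_out.approver, opt_out.reason)
        return True
    state = "failed" if eval_passed is False else "not run"
    log.error("deploy blocked for %s: eval %s", service, state)
    return False
```

The important property is that "eval not run" and "eval failed" are treated the same way. A gate that only blocks on failure lets teams pass it by never running the evaluation.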

Golden-path templates that lag the model market. The platform team publishes opinionated templates for AI-augmented services. The model market moves; the templates do not update; application teams quietly fork the templates; the platform team discovers the forks during an incident postmortem two months later. The fix is to commit to a fixed update cadence for the golden-path templates — monthly is the minimum that survives the 2026 model-release pace — and to make the update cadence a tracked metric for the platform team. The team that lets the templates drift past a quarter is the team that will discover its golden paths have become museum exhibits.
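Tracking the update cadence as a metric can be as simple as reporting which templates are older than the agreed window. This is an illustrative sketch: `stale_templates` is a hypothetical helper, the template names in the usage example are made up, and "monthly" is read as 31 days.

```python
from datetime import date

MAX_AGE_DAYS = 31  # assumption: "monthly" read as 31 days


def stale_templates(last_updated: dict[str, date], today: date,
                    max_age_days: int = MAX_AGE_DAYS) -> list[tuple[str, int]]:
    """Return (template, age_in_days) for templates past the cadence, oldest first."""
    ages = [(name, (today - updated).days)
            for name, updated in last_updated.items()]
    overdue = [item for item in ages if item[1] > max_age_days]
    return sorted(overdue, key=lambda item: item[1], reverse=True)


# Usage with made-up template names and dates.
report = stale_templates(
    {"rag-service": date(2026, 1, 5), "agent-worker": date(2026, 3, 1)},
    today=date(2026, 3, 10),
)
```

In a real setup the dates would likely come from the template repository's commit history, and the report would feed whatever dashboard the platform team already uses for its own metrics.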

The platform team’s own AI tooling is the last to be modernised. The platform team is staffing the AI coding tool rollout for the rest of the engineering organisation while running its own internal tooling on a 2023 stack. The pattern is so consistent it is almost a cultural feature: platform teams underinvest in their own developer experience because their work feels too foundational to disrupt. The result is that the platform team’s productivity lags the application teams’ by twelve to eighteen months, which compounds the senior-engineer scarcity problem the team already faces. Fix this aggressively. The platform team should be running the latest tooling, not the most stable.

What this connects to

The parent software-engineering hub covers the four discipline shifts of which this is one. The code review page is the most adjacent — the platform team’s own code review process is one of the highest-leverage applications of the discipline changes on that page, because the platform team’s code is the most architecturally consequential code in the engineering organisation. The testing strategy page covers the evaluation harness question from the application-team angle; this page covers it from the platform-team-owns-the-primitives angle, and the two pages should be read together if you are scoping the eval-harness investment.

The agentic AI architecture patterns page is the pattern-vocabulary the platform team should standardise on when application teams come asking for orchestration support. The AI coding tools hub covers the procurement frame for the tools the platform team is rolling out across the organisation. The LLM observability cluster is the vendor-procurement frame for the observability surface this page argues you should buy rather than build.

The AI for engineering teams page covers the throughput-versus-velocity argument that constrains the platform team’s own internal operating model in the same way it constrains the application teams. The platform team is not exempt from the bottleneck-moves-to-code-review dynamic; it is, if anything, more exposed because the consequences of unreviewed platform-team code are higher.

Conway’s law applies here in a specific form. The platform team’s organisational shape determines the shape of the platform it produces, and the shape of the platform determines what the application teams can build. A platform team weighted too heavily toward mid-level engineers produces a platform with too much abstraction and not enough capability. A platform team without a clear federation pattern produces a platform that becomes a coordination bottleneck. The operating-model decisions on this page are, in a meaningful sense, architectural decisions in disguise: the architecture lands eighteen months later as the consequence of the staffing and charter decisions made now.


Sources

Methodology: operating-model patterns drawn from fractional CTO and platform-advisory engagements across roughly fifteen engineering organisations adopting AI-augmented workflows (2024-2026), with platform-team sizes ranging from four to thirty engineers and served populations from forty to twelve hundred. The twelve-person ceiling reflects the pattern I have observed; the senior-density ratio reflects the ratio that has stabilised across the teams that have made the operating model work. Disagreement with your organisation’s observed reality is the most useful form of feedback — send it and I will incorporate it in the next refresh.

Frequently asked questions

How big should the platform team be in an AI-augmented organisation?
Smaller than most engineering leaders expect, and weighted heavier on senior density than the historical platform team. The pattern that works across the engagements where I have watched it work is roughly two to four senior engineers per hundred engineers served, with a hard ceiling around twelve people for any single platform team regardless of organisation size. Beyond twelve, the platform team's internal coordination cost exceeds the value it delivers to its application-team customers. The instinct to grow the platform team linearly with the engineering organisation is wrong. Scale by adding senior density and federating to multiple platform teams above the twelve-person ceiling, not by adding mid-level engineers to a single team.

Should the platform team build a model-vendor abstraction layer?
Yes, but only for the parts the application teams genuinely need abstracted, which is much less than the platform team's first instinct. The abstraction that earns its place is the one that hides credential management, prompt-template versioning, and the request-and-response logging surface — operational concerns the platform team owns. The abstraction that does not earn its place is the one that tries to normalise across model vendors' actual capabilities. Anthropic's tool-use surface and OpenAI's Responses API are genuinely different shapes; abstracting away the difference produces an abstraction that lags both vendors' actual capabilities and creates a maintenance burden the platform team cannot keep up with. Abstract the operational surface. Pass the capability surface through.

Build or buy LLM observability?
Buy, with one specific exception. The LLM observability vendors (Langfuse, Helicone, LangSmith, Datadog's LLM observability work, Honeycomb's AI observability features) all do something a platform team cannot reasonably build internally in a quarter, and the operational value of the observability surface is genuinely high. The exception is the evaluation harness specifically — the eval-as-CI surface that runs prompts against labelled datasets and tracks regressions. That surface has to be tightly coupled to the application team's deployment pipeline, and the buy options for it are either over-priced or under-shaped to your specific evaluation needs. Buy the production-side observability; build the evaluation-side harness. Treat them as separate procurement decisions.

What is the most common platform-engineering anti-pattern in AI-augmented organisations?
Over-engineered orchestration that nobody uses. The platform team builds a beautiful agentic orchestration layer with five model vendors, a routing tier, a tool-calling layer, a sophisticated retry-and-fallback surface, and the application teams quietly route around it within a quarter because the abstraction has become the bottleneck. The pattern is so common I see it in roughly two of every three platform-engineering engagements. The fix is to build the smallest orchestration layer that the application teams actually need, observe how they extend it, and add complexity only when an application team's hack has stabilised into a pattern worth promoting. Build for the observed need, not the imagined one.