GPU and Inference Platforms in 2026: The Procurement Read
The procurement meeting I sat through in October had landed on a verdict before the slides loaded. The CTO wanted Groq because the demo on the supported open-source model had been visibly faster than the team’s existing inference path. The platform lead wanted Together AI because the team’s actual workload was a fine-tuned model that Groq did not support, and the cost-per-token on Together at the team’s volume was roughly half what the team was paying on the bare-metal cluster the data-engineering org had stood up six months earlier. The CFO wanted bare-metal because the cluster was already a sunk cost. None of the three had asked which workload pattern actually fit which archetype, and the conversation went in circles for forty minutes until the platform lead drew the archetype split on the whiteboard and the room stopped arguing. The procurement-deciding question was not “which is best.” It was “which archetype fits the workload, and at what scale point does the answer change.” That is the question this page answers.
This is the operator-voice read on GPU cloud and AI inference procurement in 2026. It covers the four archetypes, the workload patterns that fit each, the cost reality at three scale points, the lock-in trade-offs, and the procurement-correct posture for an enterprise that wants to be honest about what it actually needs. The strategic conversation about whether to invest in inference capability lives at the capabilities hub; this page is for the engineering exec who has already decided to spend and now has to decide where.
The four archetypes
The single most common procurement mistake in this category is treating the GPU-and-inference market as one market. It is four, with bare-metal as an adjacent fifth. Each archetype has a different cost structure, a different operational commitment, a different lock-in shape, and a different workload pattern it fits cleanly.
Archetype one: hyperscaler-native. AWS (EC2 P5/P5e instances, SageMaker JumpStart, Bedrock for the managed-inference surface), Azure (ND-series, Azure AI Foundry, the OpenAI partnership models served through Azure), Google Cloud (A3 instances, Vertex AI Model Garden, the Gemini API). The procurement archetype: rent compute and managed-inference services from the cloud provider you already buy everything else from. The cost structure is on-demand or reserved instance pricing for raw compute, per-token pricing for managed inference. The lock-in is the broader cloud lock-in your organisation already lives with.
Archetype two: inference-API platform. Together AI, Replicate, Fireworks, Anyscale, Lepton. The procurement archetype: pay per token for a managed inference endpoint that runs open-source or fine-tuned models on the platform’s compute. The cost structure is per-token at competitive rates, typically materially cheaper than equivalent hyperscaler-managed inference for the supported model surface. The lock-in is at the API shape and the fine-tuning artefact format; commodity-GPU portability underneath means the model is portable, but the deployment surface requires rebuilding.
Archetype three: serverless GPU. Modal, RunPod, fal.ai. The procurement archetype: run your own container with GPU access on a pay-per-second basis, no instance provisioning, no idle cost. The cost structure is per-second or per-request consumption with rapid scale-to-zero. The lock-in is at the platform’s container and scheduling primitives; the model and the inference code are yours, the orchestration around them is the platform’s.
Archetype four: specialised silicon. Groq, SambaNova, Cerebras, Etched. The procurement archetype: deploy specific supported models onto purpose-built non-GPU inference hardware (LPUs in Groq’s case, reconfigurable dataflow units in SambaNova’s, wafer-scale engines in Cerebras’s, transformer-fixed-function silicon in Etched’s). The cost structure varies — Groq and Cerebras have moved toward per-token API pricing competitive with commodity-GPU inference at the latency they offer; SambaNova and Etched are more enterprise-deployment shaped. The lock-in is the most pronounced in the category: the deployment surface is purpose-built for the silicon, and porting away is real engineering work.
Archetype five (adjacent): bare-metal / dedicated. CoreWeave, Lambda Labs, Crusoe. The procurement archetype: rent dedicated GPU capacity at the rack level, typically on multi-month or multi-year commitments, at materially lower per-hour cost than hyperscaler on-demand. The cost structure is reserved-capacity pricing. The lock-in is the commitment term; the operational commitment is yours — you build the orchestration and the inference platform on top of the dedicated compute.
These archetypes are not interchangeable. Comparing Together AI to Groq to CoreWeave on a flat price-per-token matrix produces nonsense numbers because the three are answering different procurement questions. The right comparison happens within an archetype, after the archetype has been chosen against the workload pattern.
The workload patterns that fit each archetype
The procurement-deciding axis is workload pattern, not vendor feature depth. Five patterns cover the realistic shape of enterprise inference workloads in 2026.
Bursty, latency-tolerant. Batch jobs, evaluation runs, occasional fine-tuning, embedding generation for nightly re-indexing. The right archetype is serverless GPU — Modal, RunPod, fal.ai — because the scale-to-zero economics dominate. Hyperscaler on-demand works but the cold-start and provisioning friction is wrong-shaped for the pattern. Specialised silicon is wasted because the latency advantage does not pay back on a workload that does not care about latency.
Steady-state, mid-volume, frontier-model. Production agentic systems running against GPT-class or Claude-class models at hundreds of thousands of calls per day. The right archetype is the model vendor’s API directly (OpenAI, Anthropic, Google) or the hyperscaler-managed equivalent (Bedrock, Azure OpenAI, Vertex). The inference-API platforms (Together, Fireworks) are not in scope because the frontier closed-source models are not available there. The cost-deciding question is whether the per-token rate negotiation with the model vendor or with the hyperscaler is more favourable; in practice the hyperscaler deal is usually better for enterprises already in that estate.
Steady-state, mid-volume, open-source or fine-tuned. The same pattern but on Llama, Mistral, DeepSeek, or a fine-tune thereof. This is the heart of the inference-API platform market: Together AI, Fireworks, Replicate. The cost-per-token is materially lower than equivalent hyperscaler-managed inference and the developer experience is competitive. This is the procurement-correct default for most enterprises running fine-tuned models at scale.
Latency-critical, supported model. Real-time agentic systems, voice applications, low-latency search augmentation, where the per-token latency floor of commodity-GPU inference is the constraint. The right archetype is specialised silicon — Groq for the supported open-source models, Cerebras for the larger-context use cases the wafer-scale engine handles well. The trap is generalising; if your latency-critical workload uses a model the silicon vendor has not ported, the archetype does not fit.
High-volume, sustained, fine-tuned. A specific workload running tens of millions to billions of tokens per day on a fine-tuned model where the unit economics start to favour bare-metal over per-token API. The right archetype is bare-metal — CoreWeave, Lambda Labs, Crusoe — with an internal inference platform built on top. The procurement-correct trigger to make this move is unit economics, not strategic preference; most teams who move to bare-metal before the unit economics demand it underestimate the operational commitment.
Map the workload pattern first. Then pick the archetype. Then evaluate vendors within the archetype. The reverse sequence — picking a vendor and reverse-engineering the workload to fit — is the most common procurement failure mode in this category.
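The pattern-first sequence can be sketched as a small decision function. This is a minimal illustration of the five patterns above, not a scoring tool: the `Workload` fields, the `pick_archetype` name, and the 10-million-token threshold standing in for "tens of millions" are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Archetype(Enum):
    VENDOR_OR_HYPERSCALER_API = "model vendor API or hyperscaler-managed"
    INFERENCE_API = "inference-API platform"
    SERVERLESS_GPU = "serverless GPU"
    SPECIALISED_SILICON = "specialised silicon"
    BARE_METAL = "bare-metal"


@dataclass
class Workload:
    closed_frontier_model: bool      # GPT-class / Claude-class, not self-hostable
    bursty: bool                     # batch, eval runs, nightly re-indexing
    latency_critical: bool           # voice, real-time agents
    model_ported_to_silicon: bool    # has the silicon vendor ported this model?
    tokens_per_day: int
    bare_metal_cheaper_all_in: bool  # result of your own unit-economics check


# Assumption: lower bound of "tens of millions of tokens per day".
HIGH_VOLUME_TOKENS_PER_DAY = 10_000_000


def pick_archetype(w: Workload) -> Archetype:
    """Map a workload pattern to an archetype; vendors are evaluated afterwards."""
    if w.closed_frontier_model:
        return Archetype.VENDOR_OR_HYPERSCALER_API
    if w.latency_critical and w.model_ported_to_silicon:
        return Archetype.SPECIALISED_SILICON
    if w.bursty and not w.latency_critical:
        return Archetype.SERVERLESS_GPU
    if w.tokens_per_day >= HIGH_VOLUME_TOKENS_PER_DAY and w.bare_metal_cheaper_all_in:
        return Archetype.BARE_METAL
    return Archetype.INFERENCE_API


# The October meeting: a fine-tuned model the silicon vendor has not ported.
october = Workload(
    closed_frontier_model=False,
    bursty=False,
    latency_critical=True,
    model_ported_to_silicon=False,
    tokens_per_day=5_000_000,
    bare_metal_cheaper_all_in=False,
)
print(pick_archetype(october).value)  # inference-API platform
```

The ordering of the checks is the design choice worth arguing about: the model constraint comes first because it removes archetypes outright, and bare-metal comes last because it is only reachable through a unit-economics result the team has actually computed.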
The cost reality at three scale points
The cost comparison across the archetypes is sensitive to scale in ways the list prices do not capture. The three scale points below are the ones where the procurement-correct archetype changes.
One million tokens per day (single workload, small team). All four archetypes are economically viable. The all-in cost difference at this volume is materially smaller than the operational difference. Inference-API platforms (Together AI, Fireworks) are the procurement-correct default because the time-to-production is the shortest and the operational overhead is the lowest; the absolute cost is small enough that optimising for it is not worth the engineering time. Hyperscaler-managed inference works but the per-token pricing is typically 2x to 4x the inference-API platforms for the same open-source model. Serverless GPU works for bursty patterns; specialised silicon works if the latency requirement is real. Bare-metal is over-built — the unit economics do not justify it.
One hundred million tokens per day (multiple workloads, platform-engineering capacity). This is the scale point where the archetype choice starts to matter. Inference-API platforms remain competitive but the all-in cost relative to bare-metal starts to look meaningfully different — at this volume on a single fine-tuned model running steady-state, dedicated CoreWeave or Lambda capacity with an internal inference platform on top can land at 30% to 60% of the inference-API cost on the same workload. The trade-off is the platform-engineering FTE required to build and operate the internal inference layer; at one or two engineering-FTE of cost, the breakeven is real but not overwhelming. Specialised silicon for the supported workload pattern (latency-critical) becomes economically defensible at this scale. Hyperscaler-managed inference remains the most expensive per-token option and is usually only justified by the procurement convenience of buying everything from one cloud.
One billion tokens per day (mature platform, multiple workloads). Bare-metal economics can dominate for the workloads that fit the bare-metal pattern (steady-state, predictable volume, fine-tuned model), provided the platform-engineering headcount to absorb the hidden operational tax (orchestration, observability, capacity planning) is already on the team. The all-in cost can land materially below inference-API rates for the equivalent workload, but the operational commitment is substantial: a full platform-engineering team, a full orchestration stack, full observability. Specialised silicon is economically defensible across more workload patterns at this scale because the absolute cost savings on the latency-critical workloads justify the lock-in trade-off. Inference-API platforms remain the right default for the bursty, lower-volume, or experimental workloads where the bare-metal commitment does not fit. The procurement-correct posture at this scale is multi-archetype: bare-metal for the big workloads, inference-API for the long tail, specialised silicon for the specific latency-critical surfaces.
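The breakeven described at the hundred-million scale point can be worked through as arithmetic. The sketch below assumes bare-metal compute cost scales as a fixed fraction of the inference-API bill (the 30% to 60% range above) and that staffing is a flat monthly cost; both are simplifications. The function names and every input value are placeholders chosen for round arithmetic, not quoted vendor rates or salaries.

```python
def monthly_api_cost(tokens_per_day: float, usd_per_million_tokens: float,
                     days_per_month: int = 30) -> float:
    """Monthly inference-API bill for a steady daily volume."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * days_per_month


def monthly_bare_metal_cost(api_cost: float, compute_ratio: float,
                            fte_count: float, usd_per_fte_month: float) -> float:
    """All-in bare-metal cost: compute as a fraction of the API bill, plus staffing."""
    return api_cost * compute_ratio + fte_count * usd_per_fte_month


def breakeven_tokens_per_day(usd_per_million_tokens: float, compute_ratio: float,
                             fte_count: float, usd_per_fte_month: float,
                             days_per_month: int = 30) -> float:
    """Daily volume above which bare-metal plus staffing undercuts the API bill."""
    saving_per_million = usd_per_million_tokens * (1.0 - compute_ratio)
    if saving_per_million <= 0:
        raise ValueError("bare-metal compute must be cheaper than the API rate")
    staffing = fte_count * usd_per_fte_month
    return staffing / saving_per_million * 1_000_000 / days_per_month


# Placeholder inputs (assumptions): $10 per million tokens, bare-metal compute
# at 50% of the API bill, one FTE at $15,000 per month.
volume = breakeven_tokens_per_day(10.0, 0.5, 1, 15_000)
print(f"{volume:,.0f} tokens/day")  # 100,000,000 tokens/day
```

Halving the per-token rate doubles the breakeven volume, which is why the calculation has to be run on the rate actually negotiated for the specific workload rather than on a list price.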
The procurement-correct posture for 2026 is to start with the inference-API platform archetype for everything, measure realised unit economics per workload monthly, and migrate an individual workload to specialised compute (bare-metal or specialised silicon) only when the unit economics of that specific workload justify the migration cost. For bare-metal in particular, both the unit economics and the internal platform-engineering maturity have to reach a clear tipping point, regardless of long-term ambition. Starting with bare-metal because the long-term unit economics will favour it is the most common over-commitment mistake; the engineering capacity to operate it never quite materialises, and the bare-metal cluster runs at 30% utilisation while the inference-API line item grows in parallel.
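The utilisation point is easy to underestimate, so here is the arithmetic as a short sketch. A reserved cluster costs the same whether it is busy or idle, so the effective cost per token rises as utilisation falls. The function names and all figures are hypothetical; the cluster price and capacity are placeholders, not measurements.

```python
def effective_usd_per_million(monthly_cluster_usd: float,
                              capacity_million_tokens_per_month: float,
                              utilisation: float) -> float:
    """Cost per million tokens actually served on a fixed-price cluster."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    served = capacity_million_tokens_per_month * utilisation
    return monthly_cluster_usd / served


def breakeven_utilisation(monthly_cluster_usd: float,
                          capacity_million_tokens_per_month: float,
                          api_usd_per_million: float) -> float:
    """Utilisation at which the cluster's unit cost equals the API rate."""
    return monthly_cluster_usd / (capacity_million_tokens_per_month * api_usd_per_million)


# Placeholder cluster (assumption): $30,000 per month, able to serve
# 10,000 million tokens per month when fully utilised.
full = effective_usd_per_million(30_000, 10_000, 1.0)
idle = effective_usd_per_million(30_000, 10_000, 0.3)
print(round(full, 2), round(idle, 2))  # 3.0 10.0

# Against a placeholder API rate of $6 per million tokens:
print(breakeven_utilisation(30_000, 10_000, 6.0))  # 0.5
```

Under these placeholder numbers, a cluster that looks like half the API price on paper costs more than the API once utilisation drops below 50%, which is the mechanism behind the 30%-utilisation failure described above.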
The lock-in question
Lock-in shapes differ materially across the archetypes, and the procurement decision is sensitive to which lock-in is acceptable.
Hyperscaler-native. The lock-in is the broader cloud lock-in. The inference layer is portable in principle — the models are the models — but the integration surface (IAM, networking, observability, billing) is the cloud’s, and the cost of moving an inference workload off the cloud is the cost of moving anything off the cloud, which is to say substantial. This is acceptable lock-in for organisations already committed to the cloud; it is unacceptable for organisations that want optionality at the inference layer specifically.
Inference-API platform. The lock-in is at the API shape and the fine-tuning artefact format. The underlying compute is commodity GPU, the models are open-source or your fine-tunes, so the model itself is portable. Switching from Together AI to Fireworks is real engineering work — the API surfaces are similar but not identical, the fine-tuning artefacts have to be ported or retrained, the observability and cost-tracking dashboards have to be rebuilt — but the work is bounded and measured in weeks, not quarters. The procurement-correct read: acceptable lock-in for the archetype because the alternatives within the archetype are competitive and portable.
Serverless GPU. The lock-in is at the platform’s container and scheduling primitives. Modal’s job definitions, RunPod’s pod configurations, fal.ai’s serverless model deployments — each has a platform-specific shape. The model and the inference code are yours, but the orchestration around them is the platform’s. Switching is real work but not architecturally hard. Acceptable lock-in for the archetype.
Specialised silicon. The most pronounced lock-in in the category. The model surface that runs on Groq’s LPUs is purpose-built for Groq’s silicon; porting a fine-tuned variant from Groq to Cerebras to commodity GPU is non-trivial engineering work and sometimes requires retraining. The procurement-correct read in 2026: assume specialised-silicon deployments are workload-specific commitments, not platform commitments, and scope them accordingly. Do not assume the next generation of specialised silicon will be compatible with the current generation’s deployment artefacts.
Bare-metal. The lock-in is the commitment term. CoreWeave, Lambda Labs, Crusoe typically sell on multi-month to multi-year reserved-capacity contracts; the lock-in is the contract, not the technology. The underlying compute is commodity GPU and the inference platform is yours, so the technical portability is the highest in the category. The procurement-correct read: bare-metal lock-in is financial, not technical, and the financial commitment can be sized to the confidence in the workload’s steady-state volume.
The honest read on specialised silicon
Specialised silicon is the part of the GPU-and-inference market with the loudest marketing and the most pronounced gap between marketing and procurement reality. Here is where things stand in 2026.
The latency advantages are real on the supported workloads. Groq’s token-per-second numbers on Llama and Mixtral variants are materially above commodity-GPU inference; the public benchmarks have held up under independent measurement. Cerebras’s latency on long-context inference workloads is genuinely competitive in a different shape because the wafer-scale architecture handles long sequences differently than GPU-based inference. SambaNova’s reconfigurable dataflow architecture shows similar advantages on the model classes it has tuned for. Etched’s transformer-fixed-function silicon is the most aggressive bet on a single model class — the architecture is purpose-built for transformer inference at the silicon level, and the procurement-deciding question is whether the model class your workload uses will be supported and stable across the lifecycle of the deployment.
The trap, in every case, is generalising the advantage. The latency win on the supported model is not a latency win on every model. Most enterprises will run a mix of workloads, and specialised silicon will win on one of them and lose on the others. The procurement-correct posture is workload-specific deployment, not platform commitment.
The cost narrative is also more complicated than the marketing implies. Per-token pricing on Groq for the supported workloads is competitive with commodity-GPU inference at the latency Groq offers, which is to say the latency advantage is not paid for in a substantial cost premium. But the procurement-correct comparison is not Groq versus commodity-GPU per-token; it is Groq for the specific latency-critical workload versus the all-in inference architecture you would otherwise build. If the latency-critical workload is a small fraction of your total inference volume, the specialised-silicon deployment is a marginal cost optimisation on a small surface. If it is a large fraction, the deployment is material and the lock-in trade-off becomes a serious procurement question.
Buy specialised silicon for the specific workload it wins on. Do not buy it as a general inference layer. The procurement-correct posture is the same posture that applies to every adjacent question in this market: workload pattern first, archetype second, vendor third.
What I would procure in 2026, by stage
A pragmatic short list, scoped to the archetypes above and the realistic shape of an enterprise inference programme.
Stage one: experimental, small team, sub-million tokens per day per workload. Inference-API platforms as the default: Together AI, Fireworks, or Replicate for the steady-state workloads. Modal or RunPod for the bursty experimental workloads. Hyperscaler-managed inference (Bedrock, Vertex, Azure OpenAI) for the closed-source frontier models. No bare-metal, no specialised silicon. The engineering capacity does not exist to operate them and the unit economics do not justify them.
Stage two: production, platform-engineering capacity, hundreds of millions of tokens per day across workloads. Inference-API platforms remain the default for most workloads. Add specialised silicon for the specific latency-critical workload if the workload pattern fits Groq, Cerebras, SambaNova, or Etched. Begin evaluating bare-metal for the largest steady-state workload if the unit economics calculation justifies the FTE commitment.
Stage three: mature platform, multi-workload, billions of tokens per day in aggregate. Multi-archetype posture is the procurement-correct default. Bare-metal for the largest steady-state workloads (CoreWeave, Lambda Labs, Crusoe, with internal inference platform on top). Inference-API platforms for the long tail and the experimental workloads. Specialised silicon for the specific latency-critical surfaces. Hyperscaler-managed inference retained for the frontier closed-source models where the model vendor is the constraint. Multi-vendor by design, not by accident.
The honest signal of a working inference procurement strategy is that each workload sits in the archetype that fits its pattern and the team can articulate why. The signal of a failing one is that the procurement decisions accumulated by vendor convenience rather than workload fit, and the platform-engineering team is operating four overlapping inference surfaces because nobody made a deliberate choice between them.
None of this is sponsored, none of the vendors named pay for inclusion, and the scoring sheet behind the workload-to-archetype map is published under CC-BY-4.0 alongside the capabilities hub. The procurement-correct sequence is the same one that applies to every adjacent decision on this site: name the workload pattern honestly, pick the archetype the pattern fits, evaluate vendors within the archetype against the cost and lock-in axes that actually decide procurement. Get that sequence right and the four archetypes stop being a confusing market and start being a clean set of trade-offs.
Sources
- Anthropic — Building effective agents — minimum-viable agent design that informs the inference-workload pattern
- Groq — LPU inference architecture — primary vendor reference for the latency-first specialised silicon archetype
- Together AI documentation — primary inference-API platform reference for open-source and fine-tuned model serving
- Modal documentation — primary serverless-GPU archetype reference
- CoreWeave architecture overview — primary bare-metal-GPU reference for the dedicated-capacity archetype
- NIST AI Risk Management Framework, v1.0 — evidence-trail baseline that the inference observability surface has to satisfy
- Related: capabilities hub, ML platform comparison, LLM observability hub, scalable adoption, cost of failed projects
Methodology: archetype and cost analysis drawn from fractional CTO procurement engagements (2024–2026) on production inference programmes across all four archetypes, cross-checked against published vendor pricing and the realised unit-economics data the operating teams shared on the condition of anonymity. Cost comparisons reflect typical enterprise-scale figures at each of the three named volume points.
