Are autonomous voice agents real, or is the category still mostly demo?

Real in contact-centre voice for a narrow slice of workloads — appointment booking, simple lookup-and-confirm flows, status updates, basic triage on inbound calls. Still mostly demo for general-purpose autonomous voice agents that handle open-ended conversation across multiple intents. The honest 2026 read is that the contact-centre AI agent category has shipped real systems that pay back; the consumer-facing autonomous voice agent category has shipped impressive demos that fail under the long tail of real call shapes. Procure into the first; pilot, do not commit, in the second.

What is the realistic latency budget for a production voice agent?

Sub-300ms round-trip from end-of-user-speech to start-of-agent-speech is table stakes in the contact centre; anything longer and the conversation feels broken to the caller. Meeting AI and async voice workflows have much more slack: five to fifteen seconds is fine for note-takers, longer for transcription pipelines. The hardest latency tier is real-time consumer voice (Wispr Flow, Aqua Voice, Superwhisper-style dictation), where the user expects keystroke-equivalent responsiveness. Procure against the right tier; a contact-centre stack will be over-built for meeting AI and under-built for real-time consumer voice.

Should I buy a voice-AI platform or assemble one from voice infrastructure, voice models, and orchestration?

Buy the platform if your team does not have full-stack voice engineering experience and your workload fits one of the platform vendors' opinionated shapes (Vapi, Retell AI, Bland AI for general agents; PolyAI and Parloa for contact centres). Assemble from primitives if your workload is unusual, your scale justifies the engineering investment, or you need control over the latency budget. The assembled path is LiveKit or Twilio for infrastructure, ElevenLabs or Cartesia for the voice model, your existing LLM stack for reasoning, and your existing orchestration for tool calling. Most teams over-estimate their need for the assembled path.

Is meeting AI a real procurement category or just a feature of every video-conferencing tool?

Real and worth procuring deliberately, with the caveat that the consolidation pressure on this category is intense. Otter.ai, Fireflies.ai, Fathom, Granola, and the others compete with native features from Zoom, Google Meet, and Microsoft Teams. The standalone vendors win on transcription quality, cross-platform workflow, and the post-meeting actions layer; the native features win on default-on convenience. The procurement question is whether your enterprise gets more value from a best-of-breed standalone tool with deeper workflow integration, or from the native feature your team will use without being told to.

Voice AI in 2026: Four Procurement Categories, Named Verdicts

Tom Prommer · CIO/CTO · Updated 2026-05-30 · 16 min read

Executive summary

A practitioner's read on voice AI procurement in 2026 — contact-centre voice agents, meeting AI, voice infrastructure, voice-model providers. Named verdicts on ElevenLabs, Cartesia, LiveKit, Vapi, Retell AI, Bland AI, PolyAI, Parloa, Otter.ai, Fireflies.ai, Granola, and the rest of the field.

The contact-centre voice-AI procurement I reviewed last autumn had been approved at board level on a vendor demo where the agent booked a dental appointment, confirmed a delivery, and rescheduled a flight in three immaculate ninety-second conversations. The vendor was confident. The procurement team was excited. I watched the live pilot two months later and counted seventeen distinct call shapes the agent had to handle in the first hour: it handled four well, eight badly enough that the caller asked for a human, and five badly enough that the caller hung up. The demos had shown only the three happy-path shapes of the seventeen. The other fourteen were the long tail (the edge cases, the slang, the ambient noise) that kills voice-AI deployments in production, and the long tail is where most voice-AI deployments meet reality. The system was redeployed with a much narrower scope (appointment booking only, with a human escalation path for everything else), and at that scope it paid back inside six months. At the original scope it would have failed publicly.

That is the voice-AI question in 2026. The capability is real, the latency is achievable, the model quality has caught up to the production bar — and the procurement category is still the place teams make the most expensive mistakes, because the demos hide the long tail and the buying motions are not yet well-named. There are four genuine procurement categories, not one. Each has a different buyer, a different latency budget, a different vendor set, and a different definition of “works.” Conflating them produces the same shape of failure I have watched three times in the last year.

This page is the procurement-decision read on voice AI in 2026. It covers the four categories, the latency reality, the named vendor verdicts per category, and an honest assessment of where the market is real and where it is still mostly demo. Voice AI sits adjacent to the broader orchestration question covered at the orchestration architecture page; the LLM layer underneath is the same one the rest of the capabilities cluster is about. What is different, and what justifies a procurement category of its own, is the latency, the model-quality requirements, and the deployment shape.

The four categories

The mistake worth naming up front: treating “voice AI” as a single market. It is four markets, with different buyers, different success criteria, and largely non-overlapping vendor sets. The vendors who try to play in all four (a small but growing number) are usually strong in one and weak in the others.

Contact-centre voice agents. The buyer is CX, sometimes COO, occasionally CIO. The job is automating inbound or outbound voice calls that previously required a human agent — appointment booking, status lookups, simple triage, outbound surveys, follow-ups. The latency budget is brutal: sub-300ms round-trip from end-of-user-speech to start-of-agent-speech, or the call feels broken. The ROI calculation is concrete: cost-per-call before, cost-per-call after, containment rate (percentage of calls resolved without human escalation), CSAT delta. This is the most procurement-ready voice-AI category in 2026 and the one where the vendor verdicts are most defensible.
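The ROI arithmetic named above can be sketched in a few lines. This is a minimal illustration rather than a vendor formula: the class and function names, the per-call costs, and the call counts are all hypothetical, and it assumes that an abandoned call counts as not contained and that an escalated call pays for both the agent attempt and the human who finishes it.

```python
from dataclasses import dataclass


@dataclass
class CallStats:
    """Hypothetical pilot counts; replace with measured values."""
    total_calls: int
    escalated: int   # calls handed off to a human agent
    abandoned: int   # calls where the caller hung up unresolved


def containment_rate(stats: CallStats) -> float:
    """Share of calls resolved without human escalation.

    Assumption: abandoned calls count as not contained.
    """
    contained = stats.total_calls - stats.escalated - stats.abandoned
    return contained / stats.total_calls


def cost_per_call_after(stats: CallStats, agent_cost: float, human_cost: float) -> float:
    """Blended cost per call once the voice agent is live.

    Assumption: every call pays the agent cost, and each escalated call
    additionally pays the full human-handled cost.
    """
    total = stats.total_calls * agent_cost + stats.escalated * human_cost
    return total / stats.total_calls


# Illustrative numbers only, not taken from any real deployment.
pilot = CallStats(total_calls=1000, escalated=300, abandoned=50)
rate = containment_rate(pilot)
after = cost_per_call_after(pilot, agent_cost=0.40, human_cost=6.00)
saving = 6.00 - after  # cost-per-call before minus cost-per-call after
print(f"containment {rate:.0%}, cost per call {after:.2f}, saving {saving:.2f}")
```

The point of writing it down is that escalations dominate the blended cost: in a model like this, a small change in containment moves the saving far more than a small change in the per-minute agent price.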

Meeting AI and note-takers. The buyer is usually an individual (bottom-up adoption) or a department head (top-down rollout for sales or customer success). The job is transcription, summarisation, action-item extraction, and integration into CRM, project tools, or knowledge bases. The latency budget is forgiving: five to fifteen seconds for live summaries, much longer for post-meeting outputs. The ROI is fuzzier than in the contact centre, but the per-seat cost is lower, which makes the procurement decision lighter. Consolidation pressure from native features (Zoom AI Companion, Google Meet Gemini, Teams Copilot) is the biggest threat to the standalone vendors.

Voice infrastructure and SDKs. The buyer is CTO or platform engineering. The job is providing the real-time audio plumbing — telephony, WebRTC, audio routing, turn-taking, interruption handling — that the voice-AI workload sits on top of. The latency budget is the tightest in the category because everything else inherits it. The ROI is indirect; you are buying the substrate that makes the other categories possible.

Voice-model providers. The buyer is again CTO or ML/AI lead. The job is the speech-to-text and text-to-speech models themselves — the quality, latency, and language coverage that determine how good the voice experience can be. The procurement question is which model to commit to, given that the model layer is the place quality differences are most audible to the end user.

The conflation that most damages procurement is treating contact-centre voice agents and voice-model providers as the same purchase. They are not. The contact-centre vendors typically resell or wrap a voice-model layer; the voice-model providers do not solve the contact-centre workflow. Buying both as one decision usually produces overpayment for one and under-fit for the other.

The latency reality

Voice AI lives or dies on latency in a way text AI does not. A 1.5-second delay in a chat interface is annoying. A 1.5-second delay before a voice agent responds breaks the conversational protocol humans use to take turns, and the caller will start talking over the agent or hang up. The latency budgets, named:

Real-time consumer voice (Wispr Flow, Aqua Voice, Superwhisper for dictation; Siri for assistant). Sub-200ms end-to-end. The hardest tier. The user expects keystroke-equivalent responsiveness. Most LLM-based reasoning falls outside this budget unless heavily pipelined with streaming inference.
Contact-centre voice agents. Sub-300ms round-trip from end-of-user-speech to start-of-agent-speech. Achievable in 2026 with the right stack (LiveKit or Twilio infrastructure, Cartesia or ElevenLabs Turbo for TTS, a streaming LLM with prefill caching). Not achievable with a naive stack that calls a non-streaming LLM after a non-streaming transcription.
Meeting AI live features (live summaries, action items as they emerge). Five to fifteen seconds is acceptable. The user is not waiting on the AI’s response; they are seeing it surface alongside the conversation.
Async voice workflows (transcription pipelines, recorded-call analysis, batch summarisation). Minutes are fine. The latency budget is whatever fits the downstream workflow.
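The sub-300ms contact-centre budget in the list above is a sum across pipeline stages, which is why a non-streaming stack misses it. The sketch below adds up per-stage latencies and reports headroom against the budget. The stage names and every millisecond figure are illustrative assumptions, not measurements or vendor-published numbers.

```python
from __future__ import annotations

CONTACT_CENTRE_BUDGET_MS = 300  # end-of-user-speech to start-of-agent-speech


def budget_report(stages: dict[str, float], budget_ms: float) -> dict:
    """Sum per-stage latencies and compare the total with the budget."""
    total = sum(stages.values())
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "within_budget": total <= budget_ms,
        "largest_stage": max(stages, key=lambda name: stages[name]),
    }


# Hypothetical streaming stack: each stage starts on partial input.
streaming_stack = {
    "endpointing": 60,
    "stt_final": 40,
    "llm_first_token": 110,
    "tts_first_byte": 50,
    "network": 30,
}

# Hypothetical naive stack: each stage waits for the previous one to finish.
naive_stack = {
    "endpointing": 60,
    "stt_batch": 400,
    "llm_full_completion": 900,
    "tts_full_render": 500,
    "network": 30,
}

streaming = budget_report(streaming_stack, CONTACT_CENTRE_BUDGET_MS)
naive = budget_report(naive_stack, CONTACT_CENTRE_BUDGET_MS)
print(streaming)
print(naive)
```

In a pilot, the same report run on measured stage timings shows which layer to renegotiate or swap first.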

The mistake worth naming: procuring against the wrong tier. A team that buys a contact-centre-grade voice stack for an internal meeting-AI workload is over-paying by an order of magnitude. A team that buys a meeting-AI-grade stack for a contact-centre workload will ship a system that the callers experience as broken. Match the procurement to the tier.
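The tier-matching rule can be expressed as a small lookup. This is a sketch under stated assumptions: the enum and function names are hypothetical, the budgets are the ones named in the tier list, and the ten-minute figure for async workflows is an arbitrary stand-in for "minutes are fine".

```python
from enum import Enum


class Tier(Enum):
    """Latency tiers; each value is the budget in milliseconds."""
    REALTIME_CONSUMER = 200
    CONTACT_CENTRE = 300
    MEETING_LIVE = 15_000   # upper end of "five to fifteen seconds"
    ASYNC = 600_000         # assumption: ten minutes as a stand-in for "minutes"


def fit(stack: Tier, workload: Tier) -> str:
    """Compare the tier a stack was built for with the tier the workload needs."""
    if stack.value < workload.value:
        return "over-built"    # paying for latency the workload does not need
    if stack.value > workload.value:
        return "under-built"   # callers or users will experience it as broken
    return "match"


print(fit(Tier.CONTACT_CENTRE, Tier.MEETING_LIVE))
print(fit(Tier.MEETING_LIVE, Tier.CONTACT_CENTRE))
```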

Contact-centre voice agents: named verdicts

The category where the procurement question is most concrete and the vendor set most differentiated. Five vendors worth naming, plus the two that come up in every conversation but are mostly the wrong call.

PolyAI. The enterprise contact-centre vendor with the deepest production track record in the category. PolyAI’s positioning has been narrower than the rest of the field — they focus on inbound contact-centre voice specifically, with deep integration into existing CCaaS infrastructure (Genesys, Five9, NICE) — and the narrowness has been a feature, not a bug. Their agents handle a wider range of call shapes reliably than the general-purpose voice-agent platforms, because they have spent more years tuning against real call data. The trade-off is procurement weight: PolyAI sells enterprise contracts with implementation services, not a self-serve SaaS, and the deal cycle is correspondingly long.

Verdict: the right pick for large-enterprise inbound contact-centre voice where the integration with existing CCaaS infrastructure is the binding constraint. Wrong pick for teams that want to self-serve or pilot quickly.

Parloa. The European challenger, with strong recent enterprise traction and the cleanest GDPR-and-EU-AI-Act-compliant positioning in the category. Parloa’s platform approach lets enterprises build their own voice agents against their own knowledge bases and workflows, which sits between the deeply-implemented PolyAI model and the self-serve developer platforms. Their European data-residency story is the most defensible in the category for enterprises with regulatory pressure on that axis.

Verdict: the right pick for European enterprises with regulatory data-residency requirements and a workflow-customisation need that the developer platforms do not fit cleanly.

Vapi. The strongest developer-platform in the category. Vapi is what you procure if you want to build voice agents in the same way you build other LLM applications — API-first, model-agnostic, bring-your-own-prompts, with the voice infrastructure abstracted behind a thin layer. The platform handles telephony, real-time audio, turn-taking, and the model orchestration; you handle the application logic. Their pricing is per-minute and competitive.

Verdict: the right pick for engineering-led teams building voice agents into a product or an internal workflow. The cleanest API surface in the category.

Retell AI. The closest competitor to Vapi on positioning — developer-platform, API-first, model-agnostic. Retell AI’s edge is in the conversation-quality tuning (interruption handling, turn-taking, the conversational protocol details that distinguish a good voice agent from a competent one). The trade-off versus Vapi is largely stylistic at the platform level; the right choice often comes down to which API surface fits the team’s existing patterns better.

Verdict: the right pick if conversational quality is the binding constraint and the engineering team is willing to invest in tuning. Strong alternative to Vapi.

Bland AI. The outbound-voice-agent specialist. Bland’s positioning is narrower than Vapi or Retell AI — they focus on outbound calls (sales, surveys, follow-ups) — and at that narrower scope they have shipped some of the most production-grade voice agents in the category. Their infrastructure is purpose-built for outbound, with phone-number provisioning, compliance workflows, and call-pacing primitives that the general-purpose platforms treat as afterthoughts.

Verdict: the right pick for outbound voice workloads at scale. The general-purpose platforms (Vapi, Retell AI) can do outbound but trade depth for breadth; Bland is the depth pick.

The two non-recommendations. General-purpose autonomous voice agents from the major LLM vendors (the ChatGPT voice modes, Gemini Live, Claude’s voice surface where it exists) are excellent demos and mostly the wrong procurement for production contact-centre workloads, because they are optimised for open-ended conversation rather than narrow, reliable, workflow-bounded calls. They are the right pick for internal voice interfaces and consumer products; they are the wrong pick for the contact-centre buying motion.

Voice infrastructure and SDKs

The substrate. Three vendors worth naming.

LiveKit. The strongest open-source-rooted real-time audio infrastructure in 2026, and the substrate that most of the developer-platform voice agents (Vapi, Retell AI, parts of the new wave) build on top of. LiveKit handles WebRTC, real-time audio routing, turn-taking primitives, and the integration with voice models. They added telephony support in 2024 that has matured into a credible alternative to Twilio for new builds. Their managed cloud product is operationally tractable; the self-hosted option is real for teams that need it.

Verdict: the right pick when you are building a voice-AI workload from primitives and want a modern, well-designed real-time audio layer. The strongest infrastructure-layer pick for new builds in 2026.

Twilio. The category incumbent on telephony, with a credible Voice AI product (Twilio AutoPilot, the broader voice agent stack) layered on top of the telephony substrate. Twilio’s advantage is the telephony reach — global phone numbers, carrier relationships, compliance scaffolding — and the operational depth that comes from a decade of running this stack at scale. The trade-off is that the voice-AI surface on top of Twilio is less developer-friendly than LiveKit’s; the platform is older and the API surface shows it.

Verdict: the right pick when global telephony reach is the binding constraint, or when the existing Twilio footprint in the enterprise makes adding another infrastructure vendor untenable.

Daily.co. The third option, with strong WebRTC infrastructure and a developer-friendly API surface, increasingly competing with LiveKit in the new-build segment. Daily’s positioning is closer to video-first than LiveKit’s voice-first, which makes them a stronger pick for workloads that combine voice and video. For voice-only workloads, LiveKit is usually the cleaner pick.

Verdict: the right pick for combined voice-and-video workloads. Reasonable alternative to LiveKit for voice-only.

Voice-model providers

Two vendors that matter, both in active production at enterprise scale.

ElevenLabs. The category leader on text-to-speech quality and the broadest voice catalogue. Their Turbo model line is the right pick for low-latency conversational voice (sub-300ms TTS generation), and their voice-cloning surface is the most mature in the category. The trade-off is cost — ElevenLabs at scale is the most expensive option — and the lock-in implied by building against their voice catalogue. They sit at the centre of most production voice-AI stacks in 2026.

Verdict: the right pick when voice quality is the binding constraint and the cost trajectory at scale is acceptable. Default for most contact-centre and consumer voice workloads.

Cartesia. The strongest challenger on the latency axis specifically. Cartesia’s Sonic models are designed for sub-100ms first-byte TTS, which makes them the right pick for the most latency-sensitive contact-centre workloads, where ElevenLabs Turbo costs tens of milliseconds the budget cannot spare. The voice catalogue is narrower than ElevenLabs; the quality at the top end is competitive. Cartesia is the technical procurement choice when the latency budget is the binding constraint.

Verdict: the right pick when latency is the binding constraint. Strong alternative to ElevenLabs.

Meeting AI: a consolidating category

Nine vendors worth naming, and the consolidation pressure from native features is real enough that the procurement question is whether to buy any standalone vendor at all.

Otter.ai. The category incumbent with the broadest workflow integrations and the strongest mobile experience. Otter’s transcription quality is competitive, the meeting-summary and action-item extraction has improved through 2025 and 2026, and the integration with CRMs and knowledge bases is the deepest in the category for the standalone vendors.

Verdict: the right pick when cross-platform workflow integration is the binding constraint. Strong default for sales-led enterprises.

Fireflies.ai. The closest competitor to Otter on positioning, with stronger conversational-intelligence features (call coaching, talk-time analytics, topic detection) and a more deliberate sales-team focus. Fireflies’ edge over Otter is in the post-meeting analytics layer.

Verdict: the right pick for sales-team and customer-success workflows where conversational analytics are the binding constraint.

Fathom. The freemium player with strong product-market fit in the SMB and bottom-up enterprise segments. Fathom’s CRM integration (Salesforce, HubSpot) is competent, the transcription quality is fine, and the price point is aggressive. Worth procuring when the binding constraint is procurement simplicity at small scale.

Verdict: the right pick for SMB and bottom-up adoption at lower price points.

Granola. The interesting new entrant. Granola’s positioning is that the meeting AI should produce notes that look like the notes a human would have written, rather than a generic AI-generated transcript-summary. The product has gained meaningful traction in tech-leadership circles through 2025 and 2026; the bet is that the quality of the produced notes is high enough to displace human note-taking entirely for many workloads. The trade-off is the narrower enterprise feature set compared to Otter or Fireflies.

Verdict: the right pick when note quality is the binding constraint and the workflow is individual-led rather than enterprise-rolled-out.

Rev. The transcription incumbent with credible AI features layered on top of a strong base. The right pick when transcription accuracy is the binding constraint (legal, medical, regulated workflows).

Read AI, tl;dv, Krisp, Circleback. The next tier of standalone vendors, each with a particular wedge — Read AI on meeting-effectiveness analytics, tl;dv on async meeting-clip workflows, Krisp on noise cancellation plus AI features, Circleback on the action-item-to-CRM pipeline. Each is a defensible procurement at small scale; consolidation pressure means at least two of them are likely to be acquired or exit by 2027.

The native-feature question. Zoom AI Companion, Google Meet Gemini, and Microsoft Teams Copilot all provide meeting AI as a default-on feature of the video-conferencing platform. The transcription quality is competitive with most standalone vendors; the workflow integration is narrower; the cost is bundled. For enterprises whose meetings happen overwhelmingly on one platform, the native feature often replaces the standalone vendor entirely. The procurement question is whether your workflow needs cross-platform integration, deeper conversational analytics, or a particular workflow integration that the native feature does not provide.

Real-time consumer voice: the hardest tier

The narrower category of dictation, voice assistants, and real-time voice interfaces for individual productivity. Worth naming because the latency budget is the tightest in the field and because the consumer-facing vendors here are increasingly relevant to enterprise procurement as personal-productivity tools.

Wispr Flow, Aqua Voice, Superwhisper. The serious-dictation category, with sub-200ms responsiveness and quality that has displaced traditional dictation for a meaningful share of knowledge workers. Wispr Flow has emerged as the category leader through 2025 and 2026; Aqua and Superwhisper are credible alternatives. Enterprise procurement is bottom-up; the seat-license model fits this category well.

Siri (Apple), Google Assistant, Alexa. The consumer voice assistants. Largely outside enterprise procurement scope except where iOS or Android device-management policies touch them. Worth mentioning because the underlying voice infrastructure is among the most operationally proven in the category.

What I would buy in 2026, by archetype

For an enterprise contact-centre voice agent deployment, EU-data-residency required: Parloa as the platform, ElevenLabs Turbo or Cartesia Sonic for voice, with the existing CCaaS integrated through Parloa’s enterprise connectors. Narrow scope on launch — one workflow, with human escalation everywhere else — with scope expansion gated on measured containment rates.
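One way to make "gated on measured containment rates" concrete is to gate on a conservative estimate rather than the raw percentage, so that a small pilot sample cannot unlock a wider scope by luck. The sketch below uses the Wilson score lower bound, which is my choice of technique and not something a vendor or this archetype prescribes. The 0.80 threshold and the call counts are hypothetical.

```python
import math


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a proportion (z=1.96 is ~95%)."""
    if n == 0:
        return 0.0
    p = successes / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


def may_expand_scope(contained: int, total: int, threshold: float = 0.80) -> bool:
    """Allow a wider scope only if containment is credibly above the threshold."""
    return wilson_lower_bound(contained, total) >= threshold


# Similar headline containment, very different evidence (illustrative counts).
print(may_expand_scope(contained=850, total=1000))
print(may_expand_scope(contained=43, total=50))
```

With these inputs the large sample passes the gate and the small one does not, even though both report a containment rate in the mid-eighties.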

For an enterprise contact-centre voice agent, US, no special compliance pressure: PolyAI for the enterprise inbound use case if the integration with existing CCaaS is heavy, Vapi or Retell AI if the workload is greenfield and the team has engineering capacity.

For an outbound voice workload at scale: Bland AI. Built for outbound, with the compliance and pacing primitives the general-purpose platforms treat as afterthoughts.

For voice infrastructure for a new build: LiveKit. Modern, well-designed, with telephony support that has matured into a credible Twilio alternative for new work.

For voice models in a production agent: ElevenLabs Turbo as the default, Cartesia Sonic when sub-100ms TTS first-byte latency is the binding constraint. Treat the voice-model layer as swappable from day one; the model market is moving fast enough that hard-wiring one provider is a procurement risk.
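Treating the voice-model layer as swappable usually means putting a narrow interface between the agent and the provider. The sketch below shows the shape of that seam with two stub providers. The stubs are placeholders and do not call or imitate the ElevenLabs or Cartesia SDKs; every name here is hypothetical.

```python
from typing import Dict, Iterator, Protocol, Type


class TextToSpeech(Protocol):
    """The only surface the agent code is allowed to depend on."""
    name: str

    def stream(self, text: str) -> Iterator[bytes]:
        ...


class StubProviderA:
    """Placeholder adapter; a real one would wrap a vendor SDK."""
    name = "provider-a"

    def stream(self, text: str) -> Iterator[bytes]:
        for word in text.split():       # emits one chunk per word
            yield word.encode("utf-8")


class StubProviderB:
    """Second placeholder adapter with different chunking."""
    name = "provider-b"

    def stream(self, text: str) -> Iterator[bytes]:
        yield text.encode("utf-8")      # emits a single chunk


PROVIDERS: Dict[str, Type] = {
    StubProviderA.name: StubProviderA,
    StubProviderB.name: StubProviderB,
}


def make_tts(provider_name: str) -> TextToSpeech:
    """Select the provider from configuration, not from agent code."""
    if provider_name not in PROVIDERS:
        raise ValueError(f"unknown TTS provider: {provider_name}")
    return PROVIDERS[provider_name]()


def speak(tts: TextToSpeech, text: str) -> bytes:
    """Agent-side code: consumes chunks without knowing the vendor."""
    return b"".join(tts.stream(text))


print(speak(make_tts("provider-a"), "hello there"))
print(speak(make_tts("provider-b"), "hello there"))
```

Swapping providers then becomes a configuration change plus one new adapter, which is the property worth testing for before signing a multi-year voice-model contract.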

For enterprise meeting AI with cross-platform workflow integration: Otter.ai or Fireflies.ai, depending on whether workflow integration or sales analytics is the binding constraint. For tech-leadership and individual-led adoption, Granola.

For native-feature replacement of standalone meeting AI: if your enterprise runs overwhelmingly on Zoom, Google Meet, or Teams, run the cost-and-quality math against the native feature before procuring a standalone vendor. The native features are good enough in 2026 that the standalone vendors have to earn the seat licence on workflow integration, not on transcription quality.
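The cost side of that math is simple enough to sketch. The functions below compute the incremental annual cost of a standalone tool over a bundled native feature, and the minutes each seat must save per month to break even. The seat count, prices, and loaded hourly rate are hypothetical, and the model deliberately ignores quality differences, which have to be judged separately.

```python
def incremental_annual_cost(seats: int, standalone_per_seat_month: float,
                            native_per_seat_month: float = 0.0) -> float:
    """Extra yearly spend for the standalone tool over the native feature."""
    return seats * (standalone_per_seat_month - native_per_seat_month) * 12


def breakeven_minutes_per_seat_month(standalone_per_seat_month: float,
                                     native_per_seat_month: float,
                                     loaded_hourly_rate: float) -> float:
    """Minutes each user must save per month for the standalone seat to pay back."""
    cost_per_minute = loaded_hourly_rate / 60
    return (standalone_per_seat_month - native_per_seat_month) / cost_per_minute


# Illustrative inputs only.
extra = incremental_annual_cost(seats=200, standalone_per_seat_month=20.0)
minutes = breakeven_minutes_per_seat_month(20.0, 0.0, loaded_hourly_rate=60.0)
print(f"extra annual cost {extra:.0f}, break-even {minutes:.0f} min per seat per month")
```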

None of these recommendations come with a referral fee. The voice-AI procurement scorecard is CC-BY-4.0 and lives on the governance tooling page.

The honest signal of a working voice-AI deployment is that callers, meeting participants, or users prefer the voice experience to the human-only alternative — measured against containment rate, CSAT, or seat-licence renewal. The signal of a failing one is that the demo went well, the pilot looked clean, and the production deployment failed quietly because the long tail of real conversational shapes overwhelmed the agent. Match the procurement to the category. Match the latency budget to the tier. Pilot before committing. The voice-AI market is real in 2026 in a way it was not in 2024, but it is real in four shapes, not one.

Sources

LiveKit — Voice AI Quickstart — primary reference for the real-time voice agent infrastructure pattern
ElevenLabs — Turbo model latency benchmarks — reference for the sub-300ms TTS latency tier
Anthropic — Building effective agents — the orchestration baseline voice agents inherit from
Related: capabilities hub, orchestration architecture, RAG architecture, AIOps platforms, governance tooling

Methodology: voice-AI procurement verdicts drawn from fractional CTO engagements (2024–2026), cross-checked against published vendor latency and quality benchmarks. Where engagement experience and vendor-published numbers disagreed, the engagement number is reported.

Thomas Prommer CIO / CTO · 20 years · Practitioner, not consultant

Tom Prommer writes The AI Strategy Guide from the operator's seat: every tool covered, tested with real money before forming a view.