Exploring Voice AI
Problems, opportunities, prospects, and how I understand the space
I’m sick of cold calls.
Time has blossomed my relationship with email and shattered my relationship with phone calls. Unified inboxes, email filtering, and team delegation get me to inbox zero. But when my phone buzzes with an unknown number, I’m thrust into a game of Russian Roulette.
A customer? A dishwasher technician waiting at the door? Nope—a very thick-accented recruiter who'd dodge the Grim Reaper to get me to hire five South African SDRs.
And yet, the voice experience is compelling. (Can I call it VX?) It’s a comfortable mode of communication with high throughput, low latency, and low time-to-completion. In other words, it’s natural and effective.
If I put aside the thick-accented-SDR-recruiter (TASDRR) problem, I’m cautiously optimistic about the future of voice platforms. Today, I’ll show you the value proposition of AI in voice, the problems plaguing the industry, a breakdown of infrastructure vs. provider vs. application companies, the interfaces on which voice is being deployed, and for my Startup Navigator friends, some juicy, medium-rare prospects.
Big thanks to Alan (CEO of Hume), Jordan (CEO of Vapi), and Patrick’s team at Rime for letting me pick their brains. Makes my heart happy when today’s RTF superfans lend themselves in service of tomorrow’s. :)
BEFORE WE START
Get hyped, my investor friends
The best deal in VC, told in 3 questions
Question #1: What do Bessemer, Sequoia, a16z, Alumni, Accel, Techstars, NEA, Antler, Greylock, USV, Founders Fund, Lightspeed, Battery, Summit, and Khosla have in common? They all read Return the Fund.
I’m taking RTF to the next level, to help you return your fund. My mission is to help tech investors make informed decisions, and to have fun while doing it.
In a world saturated by AI word salads devoid of meaning, everything in Return the Fund is carefully hand-written. Diagrams are hand-drawn. I write every word of every edition to leave you smiling in thought—that’s my fun.
You’ve come to love the narratives—RTF has a mind-boggling 80% open rate. The regular editions won’t change, but it’s time to level up.
Question #2: How much do you value your growth as a tech investor?
“Before RTF, I lived in fear of being sucked into the noise of consensus jargon.” If you want to make decisions predicated on experience and first-principles, and if you care to deeply understand technology, RTF Startup Navigator is for YOU.
Frequent, shorter posts: Startups happen in real-time. As founders connect, as news hits the wire, as ideas arise, as tomorrow’s winners materialize, as companies raise… You’ll get it first. These shorter narratives are frequent, hand-written, and fun to read, helping you conquer every day.
Gated content: Love regular RTF? There’s more behind the scenes, powered by Harmonic. Startup Navigators unlock the exclusive section under regular narratives, including RTF startup prospect lists, insider interviews, founder intros, RTF market maps, and Harmonic’s pick of stealth founders making moves.
Warm intros to select startup prospects and fellow RTF investors
Grandfathered into the discount below
OG readers: remember RTF’s first edition on ContextAI and LLM observability? Just acquired by OpenAI. The AI cloud spotlight companies have tripled in valuation. You can see those before they happen.
Or… Remember when DeepSeek had the tech world in shambles? You could show up to work the next day understanding what really happened, and why it was actually bullish for Nvidia (RTF on Nvidia).
All this for just $249 a month.
Wait a second… scrap that. Try $9 for the first 350 members.
Yeah… One NYC latte, because why not. (I heard you can expense it…)
Once you sign up, you don’t have to do anything. If you enjoy reading RTF, you just get more of its value delivered to your inbox. In doing so, you level up as a tech investor.
Don’t bother squinting—here’s what’s important. Only 350 spots are open at $9. Get 10x more of what you already love from RTF and its Tier 1 partners. A couple clicks with Apple Pay, and you’re in the club.
Question #3: Are you ready for the first wave of exclusive VC content?
NOW FOR TODAY’S EDITION
The voice AI proposition
When I say that time has shattered my relationship with phone calls, I don’t think I’m alone. In fact, I’m certain of it.
Some arise from deep slumber in a cold sweat when their dream-selves are asked, “Do you mind if I put you on a brief hold?”
Consumers have been burned.
The brightest builders in voice AI are on a mission to rebuild consumers’ trust in voice interfaces, enabled by the real-time, personalized nature of AI. They’re building a future wherein consumers get solutions in seconds, and importantly, feel good afterward—as if they’ve spoken with a kind, empathetic, efficient human.
I think of AI distribution in the framework of metaphorical truths. These truths are disconnected from objective reality, but true insofar as they are observed. “Humans have spirits,” for example. True in observation; objectively unprovable. Some say LLMs have emotions, others say they don’t. I say it doesn’t matter, as long as they understand yours. Then, it is metaphorically true that you’re speaking with an emotionally intelligent being, and objectively true that you’re better off.
The best AI builders are selling outcomes, not technology. In the world of voice, I refer to platforms enabling companies to configure call-based VXs with their users. The “how” is unimportant. As a B2B2C platform, you can tell any story you want—all of them will be metaphorically true, if your customer’s customers reach their intended outcomes.
THE PLAYERS
Who’s building?

With so much activity in the voice AI space, it’s tough to boil my excitement down into 6 companies.
In other industries—DevOps, say—companies are highly specialized. They focus on one level of the stack, eating market share at that level alone. The voice AI industry is different.
Conversational orchestration
At a low level, operating a voice agent is extremely difficult. I learned this the hard way a couple years ago in building Jeeves. There are so many small things to get right. VX is real-time and intimate, so details matter a lot, and I never quite got them right. Built on Twilio webhooks, Jeeves would pass spoken phrases between two different API routes, calling OpenAI and ElevenLabs in between to generate responses.
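For the curious, here is roughly what that setup looked like. A simplified sketch, not Jeeves's actual code: I'm assuming Twilio's speech `<Gather>` webhook flow and using Twilio's built-in `<Say>` in place of ElevenLabs to keep it short.

```typescript
import express from "express";
import OpenAI from "openai";

const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded bodies
const openai = new OpenAI();

// One webhook round-trip per turn: transcribed phrase in, TwiML out.
// Every step blocks the caller, and nothing handles interruptions or mid-sentence pauses.
app.post("/voice", async (req, res) => {
  const userSpeech = req.body.SpeechResult ?? ""; // Twilio's transcription of the caller's last phrase

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: userSpeech }],
  });
  const reply = completion.choices[0].message.content ?? "Sorry, could you say that again?";

  // Speak the reply, then open another <Gather> and wait for the caller's next phrase.
  res.type("text/xml").send(
    `<Response>
       <Say>${reply}</Say>
       <Gather input="speech" action="/voice" method="POST" />
     </Response>`
  );
});

app.listen(3000);
```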
It was clunky. The voice feature only worked when you left seconds to respond, knew not to interrupt, and spoke in a consistent stream of words without pausing. In the voice AI world, these problems are known as latency, interrupt detection, and end-of-turn detection, respectively. Technically speaking, these are the biggest problems to solve.
Humans don’t abide by or even consider such rules. They just talk, and assume the human or bot on the other side is keeping up. Think back to product-market fit as a metaphorical truth—without changing the foundations of communication, humans will talk as they do, and voice agents must keep up.
Orchestration platforms serve to abstract the technical burden of replicating communicational norms. They handle the back-and-forth between human and agent, calling LLMs, synthesizing voices, and running external tools. Vapi and Vogent fall into this category. They are everything Jeeves lacked, implicitly answering the following critical questions:
When do I stop listening to the user, and start generating a response to send back? (End-of-turn detection)
Was the user trying to say something, or was I merely hearing background noise? (Interrupt detection)
Should I wait for a response from this external tool before responding to the user, or should I break the silence?
Should I switch to a tiny model to improve latency, given the user is simply making small talk?
Businesses—and developers, for that matter—shouldn't have to answer these questions when building a voice agent, just as they don't worry about server load balancing when using AWS.
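To make those questions concrete, here is a sketch of the naive heuristics a DIY build ends up hand-rolling when no platform answers them: a fixed silence timeout for end-of-turn and an energy threshold for barge-in. The thresholds are illustrative guesses; real orchestration platforms replace them with purpose-built models.

```typescript
// Naive turn-taking heuristics a DIY voice agent ends up hand-rolling.
// The thresholds are illustrative guesses; orchestration platforms replace
// these with purpose-built end-of-turn and interrupt models.

const END_OF_TURN_SILENCE_MS = 800;    // respond after this much silence
const SPEECH_ENERGY_THRESHOLD = 0.02;  // RMS energy above this counts as speech

let lastSpeechAt = Date.now();

// Crude voice-activity check on a frame of PCM samples in [-1, 1].
function frameHasSpeech(frame: Float32Array): boolean {
  let sumSquares = 0;
  for (const sample of frame) sumSquares += sample * sample;
  return Math.sqrt(sumSquares / frame.length) > SPEECH_ENERGY_THRESHOLD;
}

// End-of-turn detection: assume the user is done after a fixed silence window.
// Fails on thoughtful pauses, slow speakers, and background noise.
function userFinishedTurn(): boolean {
  return Date.now() - lastSpeechAt > END_OF_TURN_SILENCE_MS;
}

// Interrupt detection: if the user makes noise while the agent is talking, cut the agent off.
// Fails on coughs, "mm-hm"s, and a door slamming in the background.
function shouldInterruptAgent(frame: Float32Array, agentIsSpeaking: boolean): boolean {
  const speaking = frameHasSpeech(frame);
  if (speaking) lastSpeechAt = Date.now();
  return agentIsSpeaking && speaking;
}
```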

Vapi’s voice AI workflow builder
Orchestration platforms empower users to select models, craft prompts, and integrate their own tools. The most prolific platforms, like Vapi, remain provider-agnostic—much like a conductor who skillfully leads an orchestra regardless of the instruments’ makers. Vapi doesn’t care if you like OpenAI or Anthropic, just that you bring an API key.
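To ground that, here is a generic, hypothetical assistant config showing the provider-agnostic shape these platforms converge on. It is not Vapi's actual schema; every field and value below is illustrative.

```typescript
// A hypothetical, provider-agnostic assistant config. Illustrative only;
// not any specific platform's real schema.
interface AssistantConfig {
  transcriber: { provider: string; model: string };
  llm: { provider: string; model: string; apiKey: string }; // bring your own key
  voice: { provider: string; voiceId: string };
  systemPrompt: string;
  tools: Array<{ name: string; description: string; url: string }>; // external tools the agent may call
}

const receptionist: AssistantConfig = {
  transcriber: { provider: "deepgram", model: "nova-2" },
  llm: { provider: "openai", model: "gpt-4o-mini", apiKey: process.env.OPENAI_API_KEY ?? "" },
  voice: { provider: "rime", voiceId: "friendly-concierge" },
  systemPrompt: "You are a warm, efficient receptionist for a dental clinic.",
  tools: [{ name: "bookAppointment", description: "Book an open slot", url: "https://example.com/book" }],
};
```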
In my book, what separates orchestration platforms is how well they solve end-of-turn detection, interrupt detection, and latency. Of course, interoperability and ease of integration are important, but in these early stages of the voice rollout, my biggest KPI is performance. A little later on, I’ll share a market opportunity born of these challenges.
Emotional intuition
A basic voice agent orchestration looks something like this.

The user speaks, the speech is transcribed, the transcription is given to an agent, the agent responds, the response is turned into vocal audio bytes, and the final audio is played for the user.
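In code, a naive version of that loop looks something like the sketch below. The transcription, synthesis, and playback helpers are placeholders for whichever providers you plug in, not a real SDK's API.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Placeholders for whichever STT/TTS/telephony providers you plug in: not real SDK calls.
declare function transcribe(audio: Buffer): Promise<string>;
declare function synthesize(text: string): Promise<Buffer>;
declare function playToCaller(audio: Buffer): Promise<void>;

// The naive pipeline from the diagram: each stage waits on the previous one,
// so the caller hears nothing until every step has fully finished.
async function handleUtterance(userAudio: Buffer): Promise<void> {
  const transcript = await transcribe(userAudio);            // speech -> text
  const completion = await openai.chat.completions.create({  // text -> agent response
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: transcript }],
  });
  const replyText = completion.choices[0].message.content ?? "";
  const replyAudio = await synthesize(replyText);            // text -> audio bytes
  await playToCaller(replyAudio);                            // audio -> user
}
```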
Two problems come to mind:
This sounds slow. How’s the latency?
The responding agent (red) has no understanding of the user’s emotions.
To my point about the intimacy of VX, milliseconds matter. Latency has improved across the board thanks to increased LLM throughput, faster transcription models, and real-time/streaming voice synthesizers. The most sophisticated orchestration platforms stream response tokens from the agent (red) into the synthesizer (blue), and simultaneously stream the resulting audio bytes from the synthesizer back to the user. Pretty slick!
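Here is a sketch of that streaming overlap, with the synthesizer again stubbed out as a hypothetical streaming helper.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical streaming TTS stub: takes text fragments, yields audio chunks.
declare function synthesizeStream(textChunks: AsyncIterable<string>): AsyncIterable<Buffer>;
declare function playChunkToCaller(chunk: Buffer): Promise<void>;

async function respondStreaming(transcript: string): Promise<void> {
  // Stream tokens out of the LLM instead of waiting for the full completion.
  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: transcript }],
    stream: true,
  });

  // Adapt the token stream into plain text fragments for the synthesizer.
  async function* tokenText() {
    for await (const chunk of stream) {
      const delta = chunk.choices[0]?.delta?.content;
      if (delta) yield delta;
    }
  }

  // Audio starts playing while the model is still generating the rest of the reply.
  for await (const audioChunk of synthesizeStream(tokenText())) {
    await playChunkToCaller(audioChunk);
  }
}
```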
Emotional intuition, however, is a much trickier problem to solve. Latency is an improvable KPI; emotional intuition is definitionally impossible in the current setup without a middleman. I’ve seen people try layering an emotion detection model on top of the transcriber to pass emotion keywords to the agent as metadata. This doesn’t work very well.
Emotional intuition is important. When your friend tells you they’re sad and broken-hearted, you’ll respond differently than if they’re sobbing in front of you. You’ll feel differently. Again, metaphorical truths—an “empathetic” voice agent needs to convince you it feels differently too.
This is where Hume AI shines.

Hume’s homepage
Hume was founded on the thesis of emotional intelligence in AI. Their models are deeply aware of how a user feels as they speak, often grasping the undertones and connotations of speech. It’s very, very impressive.
I asked both Jordan and Alan, CEOs of Vapi and Hume respectively, what they think about the emotional intelligence problem. Their starkly different answers spoke volumes about their differing approaches. Alan, a seasoned computational emotion scientist, was adamant about the need for emotional intuition in a productive VX. Humans rely on each other’s body language and inflection to actively converse. Once again, metaphorical truths—agents don’t get a free pass, and need to communicate naturally with humans.
Jordan, while acknowledging the value of emotional intuition, pointed out that many business use cases don't require a human touch. Think of transactional phone calls, like confirming appointment times. In these settings, a strong transcription model piping into a strong language model is adequate.
Vapi is incredible at walking a phone caller to conversational resolution. Alan’s eyes, meanwhile, are on the widespread deployment of integrated voice experiences, be it on landing pages, in UI dashboards, in mobile apps, or in your car. Think of a small bubble in the corner of your screen waiting to talk to you.
I’d be remiss not to mention Sesame, an under-the-radar prospect who considers voice AI to be the ideal human companion. They’re building both software and hardware to productize their belief. They have a preview, but it’s not much on which to base an opinion. I’m keeping an eye on them, though.
Speech-to-speech models
There is a way to circumvent the definitional problems of a user → transcript → agent → synthesizer → user setup: speech-to-speech (STS) models. These are foundationally trained on audio bytes, turning sound directly back into sound. ChatGPT’s Advanced Voice mode is built on STS.
These models are much harder to fine-tune and control behaviorally, but are slowly emerging as a viable alternative. Hume is already on it.
A question I’ve had about STS models is their ability to call tools, as agentic tool-calling is done with structured output (e.g., an LLM outputs {"tool_name": "Search", "tool_input": "Weather in NYC"}). How would an STS model operate like an agent if it’s spitting audio back?
I asked Jordan this question, and he shared an exciting insight: he’s confident STS models will eventually be able to output both audio and text, allowing them to maintain emotional intuition while operating in an "agent loop," and that they’ll fit snugly into the existing orchestration stack—a positive signal of these platforms’ staying power as architectures evolve.
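To make that concrete, here is a purely speculative sketch of such a loop, assuming a hypothetical STS model that emits both audio chunks and structured tool-call events. None of these types or functions belong to a real provider's API.

```typescript
// Speculative sketch: a hypothetical STS model that emits both audio chunks and
// structured tool-call events, so it can speak and act in the same loop.
// None of these types or functions are a real provider's API.

type StsEvent =
  | { kind: "audio"; chunk: Buffer }                                      // speech to play to the user
  | { kind: "tool_call"; name: string; input: Record<string, unknown> };  // structured side channel

declare function stsRespond(userAudio: Buffer, toolResults: unknown[]): AsyncIterable<StsEvent>;
declare function runTool(name: string, input: Record<string, unknown>): Promise<unknown>;
declare function playChunkToCaller(chunk: Buffer): Promise<void>;

async function agentLoop(userAudio: Buffer): Promise<void> {
  const toolResults: unknown[] = [];
  for await (const event of stsRespond(userAudio, toolResults)) {
    if (event.kind === "audio") {
      await playChunkToCaller(event.chunk); // emotional nuance stays in the audio path
    } else {
      // e.g. { name: "Search", input: { query: "Weather in NYC" } }
      toolResults.push(await runTool(event.name, event.input));
    }
  }
}
```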
This is an area I’m very excited about, as agentically-capable STS models are a powerful alternative to the traditional voice agent setup, implicitly adding EQ by always carrying emotion as a data point. Of course, that isn’t to diminish the gargantuan amount of research needed to smarten STS models, which leads to a second venture opportunity I’ll mention later on.
Voice models
I recently came across Rime, run by RTF superfans who reached out after the last edition. :)

Rime’s homepage
The company makes personality-focused voice models. They don’t deal with orchestration, end-of-turn detection, or anything other than voices. But, they do voice incredibly well, and plug directly into orchestration tools like Vapi, Pipecat, and LiveKit.
Cartesia is another company building voice models, with a focus on speed and realism. Their models’ pronunciation is strong, and like Rime, they integrate with popular orchestration platforms.
Last year, I spoke with Zach Koch, CEO of Fixie AI (now Ultravox), as part of a collaborative research project on AI agent infrastructure. Here’s what he said:
We're interested in making the interactions with the AI real-time. In the real world, humans get together in groups and discuss ideas rapidly. How do we build models + a stack that enables the AI to participate in this natural, free-flowing world of conversations (while still having access to tools).
Quite powerful. His vision boils down to latency, which model providers are addressing by improving throughput. Orchestration platforms are the other main latency bottleneck… unless we find alternative deployment vectors.
Cartesia, Ultravox, and Ponder are innovating on WebRTC, an open real-time communication standard for the web. It allows websites to open a two-way audio/video stream with an external service, much like a WebSocket. In this way, a voice model may be hosted in Ultravox’s cloud, but stream back and forth with your landing page in real time.
The advancement of WebRTC for AI applications is important to watch, as it enables the vision of lightweight, real-time VX deployments.
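For the curious, the browser side of such a stream is only a handful of lines. A minimal sketch, assuming a hypothetical signaling endpoint on the voice provider's cloud.

```typescript
// Minimal browser-side WebRTC setup: capture the mic, stream it to a hosted
// voice model, and play whatever audio streams back.
// The /webrtc-offer signaling endpoint is a hypothetical placeholder.
async function connectVoiceAgent(): Promise<void> {
  const pc = new RTCPeerConnection();

  // Send the user's microphone upstream.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  for (const track of mic.getTracks()) pc.addTrack(track, mic);

  // Play the agent's synthesized audio as it arrives.
  pc.ontrack = (event) => {
    const audio = new Audio();
    audio.srcObject = event.streams[0];
    audio.play();
  };

  // Standard offer/answer signaling with the provider's cloud.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const response = await fetch("https://voice.example.com/webrtc-offer", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(pc.localDescription),
  });
  await pc.setRemoteDescription(await response.json());
}
```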
Honorable mentions
Last month, LiveKit raised a $45M Series B. They’ve built a full-stack, unified, open-source VX. I love that they’re not only an orchestration platform, but an active researcher in voice AI’s problems (see LiveKit on using transformers to improve end-of-turn detection).

LiveKit’s homepage
LiveKit tells the story of what many voice platforms are becoming—unified, holistic, and integrated. ElevenLabs started as a voices company, and now has a Conversational AI platform. Pipecat started as an open-source SDK for orchestrating agents, and is now a full-fledged voice-cloud platform from Daily.co (a typical, Docker-esque open-source monetization strategy).
I love that LiveKit is constantly addressing the key problems of voice AI.
Vapi, Hume, ElevenLabs, LiveKit, Pipecat, Cartesia, and Ultravox were all founded to build voice AI. But not all voice players are AI-native. Aircall, a popular outbound dialer for sales/customer support teams, backed into voice agents by building orchestration on top of their widely-used manual platform.
I find Aircall interesting as a case study for incumbents’ go-to-market. How expensive is the agentic platform for Aircall, and how much does it improve customer retention? How much of a selling point is it for potential customers evaluating dialer software? To what degree is Aircall, with voice agents as a feature, in competition with voice-native systems like Vapi?
It’s too early to tell. Distribution is king, so let’s not discount incumbents amid our excitement over emerging companies.

Aircall’s homepage
Future venture opportunities
Speech-to-speech models
Every advancement in tech needs infrastructure, and STS models are no exception.
Abstractions/implementations of low-latency web communication protocols like WebRTC (SDKs, hosted proxies, inference clouds)
Fine-tuning tooling to reliably instill behaviors in STS models (what OpenPipe is for LLMs)
Logging and observability tools built for the “odd” nature of purely audio-based data streams, enabling querying and abnormality alerts
Voice agent evals
Enterprises are hyperaware that consumers have been burned by terrible legacy VXs. To re-inspire consumer confidence, enterprises must trust their voice AIs to behave the way they train their human agents to. The necessity of AI → human connection is metaphorically true, requiring an enormous amount of data and training.
I’m excited to see evals take shape in the voice AI world, starting with a chatbot arena equivalent for VX. Systems will be tested on their reactions to conversational edge cases, including end-of-turn and interrupt detection. They’ll be tested on how their responses change when a user’s statements are emotionally charged rather than monotone.
Finally, just as I suggested STS-native logging and observability, I’m excited for an end-to-end monitoring platform for voice agents. Think of Google’s cloud console, allowing DevOps teams to manage errors, roll back deployments, and test changes in real time.
I’m thinking of integration tests but for voice agents, allowing builders to test changes to their models, prompts, tools, context, and setup before hitting production. Users can define these tests as expected outcomes from various scenarios.
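Here is one hedged sketch of what such a scenario definition could look like. It is entirely hypothetical; no product I know of exposes exactly this shape.

```typescript
// Hypothetical scenario-based eval for a voice agent. Illustrative only.
// Each scenario pairs a simulated caller with expected outcomes, so builders can
// re-run the suite before shipping a prompt, model, or tool change.
interface VoiceScenario {
  name: string;
  callerScript: string[];          // what the simulated caller says, turn by turn
  callerBehavior: { interruptsAgent?: boolean; pausesMidSentence?: boolean; tone?: "calm" | "frustrated" };
  expectations: {
    resolvesWithinTurns: number;   // caller's goal met within this many turns
    mustCallTools?: string[];      // tools the agent is expected to invoke
    maxResponseLatencyMs?: number; // end-of-turn to first audio byte
    forbiddenPhrases?: string[];
  };
}

const rescheduleAppointment: VoiceScenario = {
  name: "Frustrated caller reschedules, interrupting twice",
  callerScript: ["Hi, I need to move my appointment", "No, not Tuesday", "Thursday afternoon works"],
  callerBehavior: { interruptsAgent: true, tone: "frustrated" },
  expectations: {
    resolvesWithinTurns: 6,
    mustCallTools: ["rescheduleAppointment"],
    maxResponseLatencyMs: 800,
    forbiddenPhrases: ["do you mind if I put you on a brief hold"],
  },
};
```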
Perhaps orchestration software will natively embed such a platform. If done right, it would be a monstrous moat expander, retention builder, and magnet enhancer. I know you’re reading this, Jordan… :)
If you’re an RTF fan and building anything listed above, or if you know anyone who is, please hit reply. Thanks for taking this intellectual journey with me, especially given this edition was more technical than the last. It truly means the world. I’ll see you in the next edition.
If you’re a Startup Navigator, keep reading.
WELCOME TO THE CLUB
You made it, and I’m so happy to see you here. So much amazing content planned. For now, let’s kick off with some stealth picks from Harmonic, then dive deeper into my takes on emerging voice AI venture bets.

Subscribe to Startup Navigator to read the rest.
Be the most informed investor at your fund, for the price of one NYC coffee a month. (PS: that $9 can be expensed...)
A subscription gets you:
- Exclusive access to frequent & shorter rants on tomorrow’s paradigm shifts—today. Stay ahead of other VCs
- Insider interviews & research behind regular RTF editions, including hot startups poised to raise
- Warm intros to select startup prospects
- Discounts on my favorite software products
- Grandfathered low price (increasing in a couple months)