
Why LLMs Are Bad at Real-Time Conversation And How We Fixed It

Sellerity

Summary

The architectural gap between text-based LLMs and the nuances of human speech creates a "latency chasm" that breaks immersion in sales training; bridging this requires a shift from sequential processing to a parallel, streaming architecture optimized for WebRTC.


The "uncanny valley" of AI has shifted. It is no longer about how the AI looks, but how it waits.

In a standard text-based interaction with a Large Language Model (LLM), a three-second delay is acceptable. In a high-stakes sales role-play, a three-second delay is an eternity. It is the difference between a fluid objection-handling exercise and a frustrating technical glitch. Humans are biologically wired for rapid-fire turn-taking. Research published in Nature indicates that the typical gap between speakers in human conversation is roughly 200 milliseconds.

When an AI takes 2,000 milliseconds to respond, the human brain stops engaging with the persona and starts troubleshooting the software. For a B2B SaaS sales rep trying to practice a discovery call, this lag destroys the "flow state" necessary for effective learning.

At Sellerity, we encountered this wall early on. To build an AI sales role-playing platform that actually works, we couldn't just wrap a UI around an LLM. We had to fundamentally re-engineer how a text-based brain communicates over a real-time voice stream.

The Three Pillars of the Latency Problem

To understand the fix, we must first dissect why the standard LLM stack fails at real-time conversation. Most AI voice applications use a linear pipeline:

  1. Speech-to-Text (STT): The user speaks, and the audio is transcribed into text.
  2. LLM Processing: The text is sent to the model, which generates a text response.
  3. Text-to-Speech (TTS): The response is converted back into audio and played to the user.

This "sequential" approach is the primary culprit. Each step introduces its own delay, and they stack on top of each other.
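To make the stacking concrete, here is a toy latency model comparing the sequential pipeline with a streaming approach that overlaps the stages. All numbers are illustrative placeholders, not benchmarks:

```python
# Toy model of how stage delays stack. All numbers are illustrative
# placeholders, not benchmarks.

def sequential_latency_ms(stt=500, llm_full=1200, tts=800):
    """Sequential pipeline: each stage waits for the previous to finish."""
    return stt + llm_full + tts

def streaming_latency_ms(stt_partial=100, llm_ttft=300, tts_first_chunk=150):
    """Streaming pipeline: stages overlap, so perceived latency is roughly
    the sum of each stage's *first-output* delay, not its total runtime."""
    return stt_partial + llm_ttft + tts_first_chunk

print(f"sequential: {sequential_latency_ms()} ms")  # sequential: 2500 ms
print(f"streaming:  {streaming_latency_ms()} ms")   # streaming:  550 ms
```

The point of the model: in the streaming case the slow tails of each stage run concurrently, so only the first-output delays are user-visible.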

1. The Transcription Lag (STT)

Most STT models wait for a "silence" marker to decide a sentence is finished before sending the text to the LLM. If you set that silence threshold to 500ms (to ensure the user is actually done), you have already added half a second of latency before the AI even begins to "think."

2. The Inference Bottleneck (LLM)

LLMs generate text one token at a time. While modern models are fast, the "Time to First Token" (TTFT)—the time it takes for the model to start responding—can still range from 200ms to 1,000ms depending on the model size and server load. If the system waits for the entire paragraph to be generated before moving to the next step, the delay becomes insurmountable.
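TTFT is easy to measure against any streaming source. A minimal sketch, using a simulated generator in place of a real streaming LLM API:

```python
import time

def time_to_first_token(token_stream):
    """Measure Time to First Token over any iterable of tokens.
    Returns (ttft_seconds, all_tokens)."""
    start = time.monotonic()
    ttft = None
    tokens = []
    for tok in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token has arrived
        tokens.append(tok)
    return ttft, tokens

def fake_stream():
    """Stand-in for a streaming LLM response (network + prefill delay)."""
    time.sleep(0.05)
    yield "Hello"
    yield ","
    yield " world"

ttft, tokens = time_to_first_token(fake_stream())
```

Tracking TTFT separately from total generation time is what justifies the streaming architecture described below.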

3. The Synthesis Delay (TTS)

Traditional TTS engines need a full sentence to understand context and prosody (the rhythm and intonation of speech). If the TTS doesn't start until the LLM is finished, you’ve added another 500ms to 1,500ms of "generation time" before the first phoneme hits the user's speakers.

The Problem of Turn-Taking and Interruptions

Beyond raw speed, there is the social complexity of conversation. Human dialogue is not a series of rigid blocks; it is a collaborative dance. We use "backchanneling" (saying "mm-hmm" or "right" while the other person speaks) and we interrupt when we anticipate the end of a thought.

Standard LLM architectures struggle with two specific conversational behaviors:

  • The Interruption: If a sales rep interrupts a rambling AI customer to get the call back on track, a standard pipeline will keep playing the AI's pre-generated audio until the buffer is empty. This makes the AI seem deaf and robotic.
  • The False Start: Humans often pause mid-sentence to think. A naive AI will see a 300ms pause and immediately jump in, cutting the user off.

How We Fixed It: The Sellerity Architecture

To solve these issues, we moved away from the "Chain of Tools" model and toward a "Streaming Orchestration" model. Here is the framework for making LLMs feel human in real-time.

1. Predictive VAD and Streaming STT

Instead of waiting for a "hard silence" to transcribe audio, we utilize Voice Activity Detection (VAD) that works in tandem with streaming transcription. We use models that provide "partial" transcripts in real-time.

By analyzing the partials, our orchestration layer can often predict when a user is finishing a thought versus just taking a breath. If a sales rep says, "That makes sense, but what about the pricing—", the system recognizes the trailing thought and holds its response. If they say, "Does that make sense?" the system triggers the LLM immediately.
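The hold-or-respond decision can be sketched as a heuristic over the partial transcript. The connector list and the silence threshold below are illustrative values, not our production tuning:

```python
# Heuristic end-of-turn detector over streaming partial transcripts.
# Connector list and silence threshold are illustrative values.

TRAILING_CONNECTORS = {"but", "and", "so", "because", "or"}

def likely_end_of_turn(partial: str, silence_ms: int) -> bool:
    text = partial.strip()
    if not text:
        return False
    if text.endswith("?"):              # direct question: respond immediately
        return True
    if text.endswith(("-", "\u2014")):  # trailing dash: thought still open
        return False
    words = text.rstrip(".!").split()
    if words and words[-1].lower().strip(",") in TRAILING_CONNECTORS:
        return False                    # "...but" / "...and": hold the turn
    return silence_ms >= 700            # otherwise wait for a longer silence
```

A production system would combine signals like these with acoustic VAD confidence rather than relying on text alone.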

2. Token-to-Speech Streaming

This is the "secret sauce" of low-latency AI. We do not wait for the LLM to finish its sentence. As soon as the first 5-10 tokens are generated, they are fed into a streaming TTS engine.

By using a "look-ahead" buffer, the TTS can begin synthesizing the beginning of the sentence while the LLM is still "thinking" about the end of it. This reduces the perceived latency to the Time to First Token (TTFT) plus a small synthesis buffer, often bringing the total response time under 600ms—well within the range of a natural, albeit slightly slow, human response.
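A minimal version of this look-ahead chunking, with simplified boundary rules and illustrative tokens:

```python
# Look-ahead buffering: group LLM tokens into synthesizable chunks so TTS
# can start before the LLM finishes. Boundary rules are simplified.

BOUNDARY_CHARS = (".", ",", ";", ":", "!", "?")

def tts_chunks(token_stream, min_tokens=5):
    """Yield a chunk once we have enough tokens AND a phrase boundary."""
    buffer = []
    for tok in token_stream:
        buffer.append(tok)
        joined = "".join(buffer)
        if len(buffer) >= min_tokens and joined.rstrip().endswith(BOUNDARY_CHARS):
            yield joined
            buffer = []
    if buffer:                       # flush whatever trails the last boundary
        yield "".join(buffer)

tokens = ["Well", ",", " that", " depends", " on", " your",
          " budget", ".", " What", " range", "?"]
chunks = list(tts_chunks(tokens))
# chunks[0] == "Well, that depends on your budget."
```

The `min_tokens` floor prevents the TTS from being fed fragments too short to carry prosody, while the boundary check keeps chunk edges at natural pauses.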

3. The WebRTC Edge Advantage

Standard HTTP requests are too heavy for real-time voice. They involve handshakes and headers that add unnecessary milliseconds. We utilize WebRTC (Web Real-Time Communication) to create a direct, low-latency "pipe" between the user's browser and our inference servers.

According to Cloudflare’s technical documentation on WebRTC, the protocol's ability to use UDP (User Datagram Protocol) allows for data transmission without the "stop-and-wait" overhead of traditional TCP. This is critical for maintaining audio quality in the face of network jitter. If a packet is lost in a role-play, it's better to have a tiny audio glitch than to pause the entire conversation to wait for a re-transmission.

4. Handling Interruptions with "Barge-In" Logic

To make the AI interruptible, we maintain a constant duplex connection. The moment the user's microphone detects speech while the AI is talking, the system triggers a "kill switch" on the TTS buffer.

However, we don't just stop the audio. We send the "interrupted" state back to the LLM so it knows it was cut off. This allows the AI to respond naturally, saying something like, "Oh, sorry, go ahead," or "Good point, let's address that pricing question now." This level of reactivity is what makes Sellerity feel like a real customer rather than a talking script.
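A stripped-down sketch of the kill switch plus interrupted-state handoff; the class and event fields are illustrative, not our actual API:

```python
# Barge-in sketch: stop playback mid-utterance and hand the LLM an
# "interrupted" event. Class and field names are illustrative.

class PlaybackSession:
    def __init__(self, full_text):
        self.full_text = full_text
        self.spoken_chars = 0
        self.playing = True

    def play_some(self, n_chars):
        """Advance playback (stand-in for streaming audio to the user)."""
        if self.playing:
            self.spoken_chars = min(self.spoken_chars + n_chars,
                                    len(self.full_text))

    def barge_in(self):
        """Kill switch: halt the TTS buffer and return the event that the
        orchestrator appends to the LLM context, so the model knows
        exactly how much of its reply was actually heard."""
        self.playing = False
        return {"role": "system", "event": "interrupted",
                "spoken_so_far": self.full_text[:self.spoken_chars]}

session = PlaybackSession("Our pricing starts at five hundred per seat.")
session.play_some(19)
event = session.barge_in()   # playback halted; event goes to the LLM
```

Recording *how much* of the reply was heard is what lets the model respond with "Oh, sorry, go ahead" instead of resuming mid-sentence.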

The Role of "Small" Models in Real-Time

One of the counter-intuitive findings in our development was that the biggest, most powerful models (like GPT-4 or Claude 3.5 Sonnet) are often worse for real-time conversation due to their higher latency.

For the "brain" of a sales bot, we often use a hybrid approach. We use a faster, smaller model (like a fine-tuned Llama 3 or GPT-4o-mini) to handle the immediate conversational flow, while a larger "reasoning" model works in the background to analyze the sales rep's performance and provide feedback after the call.

In sales role-play, the vibe of the conversation matters as much as the logic. A fast, snappy "No, that's too expensive" from a smaller model is more realistic than a 3-second-delayed, perfectly reasoned 500-word essay on budget constraints from a larger one.
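In code, this hybrid split can be as simple as a routing function. The model names and task labels here are placeholders:

```python
# Hybrid routing sketch: latency-sensitive tasks go to a fast model,
# latency-tolerant analysis to a larger one. Names are placeholders.

FAST_MODEL = "small-conversational-model"
DEEP_MODEL = "large-reasoning-model"

LIVE_TASKS = {"turn_response", "backchannel", "objection"}

def pick_model(task):
    """Route by latency budget, not by raw capability."""
    return FAST_MODEL if task in LIVE_TASKS else DEEP_MODEL
```

The key design choice is routing by latency budget rather than capability: anything the user is waiting on goes to the fast model.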

The Prosody Problem: Adding Emotion

Even if the response is fast, a monotone AI is a "tell" that ruins the simulation. We integrated "emotion-aware" TTS that can interpret the context of the LLM's text. If the LLM generates a response that includes "frustration" in the metadata, the TTS engine shifts its pitch and speed to match.
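A sketch of how emotion metadata might map to synthesis controls; the tags and the pitch/rate fields are illustrative, since real TTS engines expose different knobs (e.g. SSML prosody attributes):

```python
# Mapping emotion metadata to synthesis controls. The emotion tags and
# pitch/rate fields are illustrative; real TTS engines expose different
# knobs (e.g. SSML prosody attributes).

EMOTION_PRESETS = {
    "neutral":    {"pitch_shift": 0.00, "rate": 1.00},
    "frustrated": {"pitch_shift": -0.10, "rate": 1.15},  # tighter, faster
    "interested": {"pitch_shift": 0.10, "rate": 1.05},
    "skeptical":  {"pitch_shift": -0.05, "rate": 0.90},  # flatter, slower
}

def tts_settings(llm_response):
    """Pick prosody controls from the emotion tag, defaulting to neutral."""
    emotion = llm_response.get("metadata", {}).get("emotion", "neutral")
    return EMOTION_PRESETS.get(emotion, EMOTION_PRESETS["neutral"])

settings = tts_settings({"text": "That's too expensive.",
                         "metadata": {"emotion": "frustrated"}})
```

Defaulting to neutral keeps the pipeline safe when the LLM omits the metadata entirely.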

In a Sellerity role-play, if a rep pushes too hard on a closed-ended question, the AI bot doesn't just say it's annoyed—it sounds annoyed. This creates a physiological response in the sales rep, teaching them to read vocal cues just as they would on a real discovery call.

Why This Matters for Sales Leaders

You might ask: is getting from two seconds down to 500 milliseconds really worth all this engineering?

The answer lies in the concept of "Cognitive Load." When a sales rep is practicing, they are already juggling a dozen things: their talk track, their discovery questions, the prospect's objections, and their own body language.

If the AI tool they are using is clunky or slow, their brain has to dedicate a portion of its processing power to "managing the tool." This reduces the efficacy of the training. High-fidelity, low-latency role-play allows the rep to stay in the "Simulated Reality." They aren't "using a tool"; they are "talking to a prospect."

This is why we built Sellerity with a "performance-first" mindset. Our conversation intelligence suite doesn't just analyze what was said; it analyzes the timing of the conversation. Did the rep interrupt too much? Did they leave too much dead air? You can't train those skills on a platform that has inherent 2-second delays.

Actionable Guidance for Implementing Real-Time AI

If you are a technical leader looking to implement real-time voice AI, here are three frameworks to follow:

  1. Optimize for TTFT, not Total Throughput: In conversation, the speed of the first word is more important than the speed of the last word. Choose models and providers that prioritize low-latency streaming. OpenAI’s streaming API is a good starting point, but you must optimize your client-side buffering to match.
  2. Move Logic to the Edge: Use edge functions to handle VAD and initial audio processing. The closer the "ears" of your AI are to the user's mouth, the less network latency you will encounter.
  3. Implement a "Coordinator" Layer: Don't let your STT, LLM, and TTS talk to each other directly. Build a central coordinator (often using WebSockets or WebRTC data channels) that can cancel tasks, manage state, and handle "barge-ins" gracefully.
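The coordinator pattern in point 3 can be sketched with asyncio task cancellation standing in for the real barge-in path; the stage functions are stand-ins for streaming STT/LLM/TTS calls:

```python
import asyncio

# Coordinator sketch: one cancellable task owns a whole turn, so a
# barge-in can kill LLM + TTS together. Stage functions are stand-ins.

async def speak_reply(reply, spoken):
    """Stand-in for streaming LLM tokens into TTS and out to the user."""
    for word in reply.split():
        spoken.append(word)
        await asyncio.sleep(0.01)   # stand-in for per-chunk synthesis time

async def run_turn_with_barge_in():
    spoken = []
    turn = asyncio.create_task(speak_reply(
        "Honestly the budget for this quarter is already committed", spoken))
    await asyncio.sleep(0.035)      # ...then the user starts talking
    turn.cancel()                   # kill switch for the entire turn
    try:
        await turn
    except asyncio.CancelledError:
        pass                        # expected: the turn was interrupted
    return spoken

spoken = asyncio.run(run_turn_with_barge_in())  # only a few words got out
```

Owning the whole turn in one task is the point: cancelling it tears down every in-flight stage at once, instead of each component needing its own stop signal.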

The Future: Multi-modal Native Models

We are currently moving toward a world where models are "multi-modal native." Instead of converting audio to text and back again, models like GPT-4o are beginning to process audio tokens directly. This will eventually eliminate the STT and TTS steps entirely, allowing the model to "hear" the tone of the user's voice and "speak" with its own vocal cords, so to speak.

However, even with these advancements, the orchestration of the real-time stream—handling the "plumbing" of WebRTC and the nuances of human turn-taking—will remain the differentiator between a "cool demo" and a "professional training tool."

Conclusion

Building AI that can hold a conversation is easy. Building AI that can participate in a conversation is incredibly hard. It requires a move away from the "request-response" architecture of the web and toward the "continuous-stream" processing that the human brain expects from a conversation partner.

At Sellerity, we believe that the best sales training happens when the technology disappears. By solving the latency chasm, we allow reps to focus on what matters: building rapport, uncovering pain, and closing deals. If you're looking for a solution that provides truly lifelike sales practice, Sellerity can help you bridge that gap.

The future of sales training isn't just about what the AI says—it's about how it listens, how it reacts, and how it waits. Through architectural rigor and a focus on the "human" mechanics of conversation, that is exactly what we set out to build.
