Your LangChain Agent Can Talk Now. It Took 30 Minutes.

Adding voice to your AI agent isn't as hard as you think, but only if you stop trying to solve it alone.

@tigranbs
12 min read
Technical · voice-ai · langchain · pydanticai · integrations · sayna-ai

You've built something real. Your AI agent works. It understands context, handles tools, processes meaning. Your LangChain pipeline is solid. Your PydanticAI models are validated and type-safe. Everything flows beautifully through text.

Then someone says the words you knew were coming: "What if users could just talk to it?"

And something shifts inside you. Because now you're thinking about all of it. WebRTC connections, audio codecs, streaming protocols, voice activity detection. Browser incompatibilities. Network latency. Real-time audio processing. The mental list grows, and your two-week estimate quietly becomes three months of infrastructure work.

Here's what I want to tell you, and I mean this carefully: you don't have to build this yourself.

The Thing Nobody Wants to Admit

I understand the instinct. You're a builder. You see a problem, and you want to solve it. That's not arrogance. It's actually how you've learned to think. But voice infrastructure is different. It's one of those domains that looks simple until you're inside it.

Think about it like this: you can understand how a bridge is built without building one. You can know how aviation works without designing planes. Voice infrastructure is the same. The gap between understanding and implementing is where the trouble lives.

Here's what I've seen happen, again and again:

Week 1: It's just audio streaming, how hard could it be?

Week 2: Wait, why does Safari handle this completely differently?

Week 3: What even is a STUN server? Do I need TURN? How many servers? This architecture diagram is getting complicated.

Week 4: The audio keeps cutting out, especially on mobile.

Week 5: Now we need to support phone calls too?

Week 6: Someone updates their resume, and the whole project gets quietly archived.

And meanwhile, your agent is just sitting there. Waiting. Connected to nothing but text.

What You Actually Want (Let's Be Honest)

You're not trying to become an audio expert. You don't wake up dreaming about codec optimization. What you want is simple:

  1. User speaks. Your agent understands it as text.
  2. Agent responds. User hears it as voice.
  3. Everything in the middle just works.

That's the whole requirement. That's it. Everything else is details that shouldn't be your problem.

graph LR
    A[User speaks] --> B[Magic happens]
    B --> C[Agent gets text]
    C --> D[Agent thinks]
    D --> E[Magic happens]
    E --> F[User hears response]
    
    style A fill:#f6f8fa,stroke:#586069,stroke-width:2px
    style B fill:#e1e4e8,stroke:#586069,stroke-width:2px
    style C fill:#ffd33d,stroke:#586069,stroke-width:2px
    style D fill:#ffd33d,stroke:#586069,stroke-width:2px
    style E fill:#e1e4e8,stroke:#586069,stroke-width:2px
    style F fill:#f6f8fa,stroke:#586069,stroke-width:2px

Notice something? Your actual work, steps C and D, that's where your value is. You've already built that. It's good. It works.

The "magic happens" parts? That's infrastructure. And infrastructure doesn't belong in the same codebase as your agent logic. They're solving different problems with different lifecycles and different concerns.

The 30-Minute Integration (And I'm Not Exaggerating)

What does this actually look like when you stop trying to rebuild the foundation and just build what matters?

Step 1: Your Agent (This Part's Done)

You have this already. Maybe it's LangChain:

from langchain.agents import create_react_agent
from langchain_openai import ChatOpenAI

agent = create_react_agent(
    llm=ChatOpenAI(model="gpt-4"),
    tools=your_tools,
    prompt=your_prompt
)

Or PydanticAI:

from pydantic_ai import Agent

agent = Agent(
    model="openai:gpt-4",
    system_prompt="You are a helpful assistant",
    tools=your_tools
)

Or something else entirely. Your agent takes text and returns text. That's the contract. It's beautiful in its simplicity. Leave it alone.
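That contract is small enough to write down. A minimal sketch, with `TextAgent` and `EchoAgent` as illustrative names (they aren't part of LangChain, PydanticAI, or any SDK):

```python
import asyncio
from typing import Protocol


class TextAgent(Protocol):
    """The entire contract a voice layer needs: text in, text out."""

    async def run(self, text: str) -> str: ...


# Any object with this shape can be voice-enabled, regardless of
# which framework built it. A trivial stand-in for demonstration:
class EchoAgent:
    async def run(self, text: str) -> str:
        return f"You said: {text}"


agent: TextAgent = EchoAgent()
print(asyncio.run(agent.run("hello")))  # You said: hello
```

Anything satisfying that shape — a LangChain runnable behind a thin wrapper, a PydanticAI agent, a hand-rolled function — plugs into the same voice layer unchanged.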

Step 2: Add a Voice Layer (This Should Be Easy)

This is where people stumble. They start researching WebRTC libraries, comparing audio codecs, trying to understand latency optimization...

Stop. Instead, connect to a voice layer that already solved this:

from sayna_client import SaynaClient, STTConfig, TTSConfig
import asyncio

# Initialize voice session
client = SaynaClient(
    api_key="your_key",
    stt_config=STTConfig(provider="deepgram"),  # or google, or whoever works best
    tts_config=TTSConfig(provider="cartesia")   # same idea
)

# Connect it to your agent
async def handle_speech(transcript):
    # User said something, your agent responds
    response = await agent.run(transcript.text)
    await client.speak(response.output)

client.register_on_stt_result(lambda t: asyncio.create_task(handle_speech(t)))

# That's it. Seriously.
asyncio.run(client.connect())

Look at what you're not doing:

  • No negotiating WebRTC connections
  • No managing audio buffers and state
  • No building voice activity detection
  • No worrying about provider switching
  • No handling TTS streaming in chunks
  • No debugging browser compatibility

You're just connecting your agent to voice. Like you'd wire it to a database or REST API. Infrastructure stays separate. Your code stays clean.
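The glue itself is one small function: transcript in, agent run, reply spoken. A self-contained sketch with stubs standing in for the real agent and voice client (all names here are illustrative):

```python
import asyncio
from typing import Awaitable, Callable


# The entire integration surface: take a transcript, run the agent,
# speak the reply. Everything else lives behind these two callables.
def make_handler(
    run_agent: Callable[[str], Awaitable[str]],
    speak: Callable[[str], Awaitable[None]],
) -> Callable[[str], Awaitable[None]]:
    async def on_transcript(text: str) -> None:
        reply = await run_agent(text)
        await speak(reply)
    return on_transcript


# Stubs standing in for the real agent and voice client:
async def fake_agent(text: str) -> str:
    return text.upper()

spoken: list[str] = []

async def fake_speak(text: str) -> None:
    spoken.append(text)

asyncio.run(make_handler(fake_agent, fake_speak)("ship it"))
print(spoken)  # ['SHIP IT']
```

Because the handler only sees two async callables, you can unit-test your integration with stubs like these and never touch real audio.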

Step 3: You're Done

That's it. Your agent now:

  • Listens to users speaking
  • Processes their words
  • Responds with synthesized voice
  • Handles all the streaming complexity invisibly

graph TB
    A[User's Browser] -->|WebRTC Stream| B[Sayna Voice Layer]
    B -->|Transcript| C[Your Agent<br/>LangChain/PydanticAI/Whatever]
    C -->|Response Text| B
    B -->|Audio Stream| A
    
    D[STT Provider<br/>Deepgram/Google] -.->|Managed by Sayna| B
    E[TTS Provider<br/>ElevenLabs/Google] -.->|Managed by Sayna| B
    
    style A fill:#f6f8fa,stroke:#586069,stroke-width:2px
    style B fill:#79b8ff,stroke:#586069,stroke-width:3px
    style C fill:#ffd33d,stroke:#586069,stroke-width:2px
    style D fill:#e1e4e8,stroke:#586069,stroke-width:1px,stroke-dasharray: 5 5
    style E fill:#e1e4e8,stroke:#586069,stroke-width:1px,stroke-dasharray: 5 5

This Isn't About Being Lazy

I want to address something directly, because I know what you're thinking: "Shouldn't I understand the technology I'm using?"

Yes. Understand it. Learn it. Read the WebRTC specification. Understand how TURN servers work. Study audio codecs. This is all good knowledge.

But understanding something and building it yourself are different paths, and they lead to different places.

You understand how HTTP works without implementing TCP/IP from scratch. You understand relational databases without building your own. You understand how compilers work without building one every time you write Python.

Here's the thing: when you understand infrastructure well, you know when to use it and when to build it. And this isn't a "when to build" situation for 99% of teams.

Your competitive advantage isn't "we have a slightly better WebRTC implementation." It never was. Your advantage is the intelligence you put into your agent, the domain knowledge you bring, the human problems you solve. That's where it matters.

The Real Benefit: You're Not Locked In

This might sound backwards, but the best part of using a voice layer is that you're actually less locked in than if you built it yourself.

When you build voice directly into your agent, you own all of it. Every bug. Every optimization. Every breaking change in a browser update. You're locked into your own implementation forever.

With a separate voice layer, you have a clean boundary. Your agent doesn't know or care about the voice infrastructure. The voice infrastructure doesn't know or care about your agent logic. If you ever want to swap it out (whether for performance reasons, cost reasons, or just because you want something different), you can. The interface stays the same. Your agent stays unchanged.

This is actual freedom.

Built together (stuck):
┌──────────────────────────────┐
│  Your Agent + ElevenLabs     │
│  (Everything coupled)        │
└──────────────────────────────┘

Properly separated (flexible):
┌──────────────────────┐
│  Your Agent          │
└──────────┬───────────┘
           │
    ┌──────▼────────┐
    │ Voice Layer   │
    │ (swap anytime)│
    └───────────────┘
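That boundary can be made explicit in code by depending on an interface instead of a vendor class. A hedged sketch using `typing.Protocol` — the method names mirror the client calls shown earlier, but `VoiceLayer` and `FakeVoiceLayer` are illustrative, not an SDK API:

```python
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class VoiceLayer(Protocol):
    """What your code actually needs from any voice vendor."""

    async def speak(self, text: str) -> None: ...
    def register_on_stt_result(self, handler: Callable) -> None: ...
    async def connect(self) -> None: ...


# A fake that satisfies the same interface, handy for tests:
class FakeVoiceLayer:
    def __init__(self) -> None:
        self.said: list[str] = []

    async def speak(self, text: str) -> None:
        self.said.append(text)

    def register_on_stt_result(self, handler: Callable) -> None:
        self._handler = handler

    async def connect(self) -> None:
        pass


print(isinstance(FakeVoiceLayer(), VoiceLayer))  # True
```

Swap in a different vendor tomorrow; as long as it satisfies the same three methods, nothing downstream changes.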

Changing voice providers becomes a one-line config change:

# Change this
client = SaynaClient(
    tts_config=TTSConfig(provider="elevenlabs"),
    stt_config=STTConfig(provider="google"),
    # ...
)

# To this
client = SaynaClient(
    tts_config=TTSConfig(provider="google"),
    stt_config=STTConfig(provider="deepgram"),
    # ...
)

# Your agent code? Completely untouched.
# Your tests? Still passing.
# Your users? They don't notice anything.

The Honest Conversation About Resources

Let's talk about what actually happens when you build voice infrastructure in-house, because this matters:

DIY approach:

  • 2 engineers × 3 months = 6 engineering months
  • Ongoing maintenance = 0.5 engineers, indefinitely
  • Third-party infrastructure (TURN servers, etc.) = $500+/month
  • Opportunity cost = Whatever you didn't build instead

versus

Using voice infrastructure:

  • Integration time = 30 minutes to a day
  • Monthly cost = Scales with usage, and predictably so
  • Engineering time = Still yours, but for your actual product
  • Opportunity cost = Zero, because you shipped
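To put rough numbers on the DIY column, here's the arithmetic using the figures above plus an assumed $15k fully-loaded monthly engineer cost (that rate is an assumption, not a quoted figure):

```python
# Assumption: $15k/month fully-loaded engineer cost (illustrative only).
ENG_MONTH_COST = 15_000

# DIY: 2 engineers x 3 months to build, then 0.5 engineers plus
# ~$500/month of TURN/relay infrastructure to keep it running.
diy_build = 2 * 3 * ENG_MONTH_COST
diy_first_year_run = int(0.5 * 12 * ENG_MONTH_COST) + 500 * 12

print(diy_build)           # 90000
print(diy_first_year_run)  # 96000
```

Roughly $90k before the first call connects, and another ~$96k per year to keep it alive — before counting what those engineers could have shipped instead.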

And here's what matters most: even if you have the budget to build it yourself, even if you have the engineering talent, why would you want to?

Your business doesn't make money because of your WebRTC implementation. Your users don't care about your codec selection. They care about whether your agent understands them, whether it helps them solve their problem, whether the experience feels natural and responsive.

That's where your time should go. Not infrastructure theater.

The Objections (And What I'd Say Back)

"But shouldn't I understand the technology?"

Understanding and building are different things. You understand how a car engine works without rebuilding it every time you drive. Same principle applies here.

"What if the service goes down?"

Fair question. But ask yourself: what happens if your self-built WebRTC infrastructure goes down? You're on-call at 3am debugging audio streaming. With a dedicated service, that's their job. That's literally what they're there for.

"We have unique requirements!"

Maybe. But I'd push back gently: is your unique requirement about how audio packets move through the network? Or is it about what your agent does with those words? My guess is the latter. That's where you should build. The voice part is solved.

"This feels like vendor lock-in!"

Actually, it's the opposite. When you build voice directly into your agent, you're locked into your own implementation. When you use a separate layer with a clean interface, you can swap it if needed. You have more freedom, not less.

What Good Architecture Actually Looks Like

Here's the full picture:

graph TB
    subgraph "Your Application"
        A[Agent Logic<br/>LangChain, PydanticAI, etc.]
        B[Business Rules]
        C[Tools & Integrations]
    end
    
    subgraph "Voice Infrastructure (Sayna)"
        D[WebRTC Management]
        E[Provider Abstraction<br/>STT/TTS]
        F[Streaming Pipeline]
        G[Voice Analytics]
    end
    
    subgraph "External Services"
        H[LLM APIs]
        I[Your Database]
        J[Other Services]
    end
    
    A -->|Simple API| D
    A --> H
    A --> I
    A --> J
    B --> A
    C --> A
    E --> F
    F --> D
    
    style A fill:#ffd33d,stroke:#586069,stroke-width:3px
    style D fill:#79b8ff,stroke:#586069,stroke-width:2px
    style E fill:#79b8ff,stroke:#586069,stroke-width:2px
    style F fill:#79b8ff,stroke:#586069,stroke-width:2px
    style G fill:#79b8ff,stroke:#586069,stroke-width:2px

See how clean that is? Your agent talks to many things: LLMs, databases, voice infrastructure, services. Voice is just another API. Your core business logic doesn't know it exists. Your voice infrastructure doesn't know about your business logic.

This is what good architecture feels like. Each piece doing what it does best. Nothing brittle. Nothing that breaks because you changed something somewhere else.

The Real Timeline

You're convinced. You're going to do this the right way. Here's what it actually looks like:

Hour 0–0.5: Setup

  • Install the SDK
  • Get API credentials
  • Basic configuration

Hour 0.5–1: Integration

  • Connect the voice session to your agent
  • Wire up the transcript handler
  • Test with a simple conversation

Hour 1–2: Refinement

  • Configure your preferred providers
  • Add proper error handling
  • Test some edge cases
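"Proper error handling" in a voice loop mostly means one thing: an agent failure should become a spoken fallback, not dead air. A sketch under that assumption, with `safe_handler` and the stubs as illustrative names:

```python
import asyncio
import logging
from typing import Awaitable, Callable

log = logging.getLogger("voice")


def safe_handler(
    run_agent: Callable[[str], Awaitable[str]],
    speak: Callable[[str], Awaitable[None]],
    fallback: str = "Sorry, I hit a snag. Could you say that again?",
) -> Callable[[str], Awaitable[None]]:
    async def on_transcript(text: str) -> None:
        try:
            reply = await run_agent(text)
        except Exception:
            # Log the failure, but keep the conversation alive.
            log.exception("agent failed on transcript: %r", text)
            reply = fallback
        await speak(reply)
    return on_transcript


# Stub demonstration: an agent that always raises.
async def broken_agent(text: str) -> str:
    raise RuntimeError("LLM timeout")

said: list[str] = []

async def fake_speak(text: str) -> None:
    said.append(text)

asyncio.run(safe_handler(broken_agent, fake_speak)("hello"))
print(said[0])  # Sorry, I hit a snag. Could you say that again?
```

Add a timeout around `run_agent` in the same place if your agent can hang rather than fail fast.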

Hour 2–4: Production-Ready

  • Add logging and monitoring
  • Set up authentication properly
  • Test in staging, then deploy

Total: Half a day, if you're being thorough. Less if you just want it to work.

Compare that to the three-month marathon of building WebRTC infrastructure from scratch. The difference isn't small. It's the difference between "we shipped this week" and "we're still debugging codec compatibility in three months."

The Honest Parts (Where This Doesn't Fit)

I'm not going to pretend this is perfect for every situation. There are actual cases where building custom voice infrastructure makes sense:

  • You're building a WebRTC service itself
  • You're in a regulated industry where third-party services genuinely aren't allowed
  • You're operating in regions where no service has coverage
  • You have someone on the team who's a WebRTC expert and genuinely enjoys maintaining this infrastructure

For literally everyone else (which is the vast majority of teams building voice-enabled agents), just use infrastructure. Don't reinvent it.

The Shift We're All Making

Here's what I actually think is happening: we're finally learning the lessons from the microservices era, and now we're applying them to AI applications.

You don't build your own database anymore. You don't build your own authentication system. You don't build your own email service. Why? Because those are solved problems. You compose them. You integrate them. You move on to what actually matters.

Voice infrastructure is the same thing. It's solved. Use the solution. Spend your time building agents that are actually useful.

Think about how you spend your engineering time:

Wrong allocation:
70% Building voice infrastructure
20% Integrating voice with your agent
10% Actually making your agent good

Right allocation:
5% Connecting to voice infrastructure
15% Integration and testing
80% Making your agent genuinely useful

The future belongs to teams that build the most intelligent agents and connect them to voice in an afternoon, not teams that spend six months getting codec selection exactly right.

Just Start

The biggest barrier isn't actually technical. It's not cost. It's not vendor lock-in fears.

It's a mindset shift. From "I need to build everything" to "I should use what exists and build what matters."

Your agent is already good. Your business logic is already sound. You don't need to rebuild any of that to add voice. You just need to stop thinking you have to solve everything yourself.

# This is genuinely all it takes
from sayna_client import SaynaClient, STTConfig, TTSConfig
import asyncio
from your_existing_agent import agent

client = SaynaClient(
    api_key="your_key",
    stt_config=STTConfig(provider="deepgram"),
    tts_config=TTSConfig(provider="cartesia")
)

async def handle(transcript):
    response = await agent.run(transcript.text)
    await client.speak(response.output)

client.register_on_stt_result(lambda t: asyncio.create_task(handle(t)))

asyncio.run(client.connect())

Your agent can talk now. It took 30 minutes.

Now go build something that matters.