Adding Voice to Your PydanticAI Agent in Under an Hour

PydanticAI is fantastic for building type-safe AI agents, but what if you want your agent to speak? Let's add real-time voice capabilities to your existing PydanticAI agent using the Sayna Python SDK.

@tigranbs
7 min read
Tags: Technical, python, pydanticai, voice-ai, sayna, ai-agent, voice-agent

If you have been building AI agents with PydanticAI, you probably enjoy the type safety and the "FastAPI feel" it brings to agent development. But here's the thing: most AI agents are still text-only, while users increasingly expect natural voice conversations.

At Sayna.ai, we've solved exactly this problem: how do you add voice capabilities to an existing AI agent without re-architecting everything? The answer is our Python SDK, and I'm going to show you how to integrate it with your PydanticAI agent in under an hour.

The whole idea is simple: your PydanticAI agent handles the intelligence, Sayna handles the voice layer. You don't need to worry about STT/TTS providers, audio streaming, or voice activity detection.

Why Voice Matters for AI Agents

Text-based chatbots are great for many use cases, but when you're building customer service agents, voice assistants, or telephony applications, users expect to talk naturally. The problem is that adding voice to an existing agent usually means dealing with:

  • Multiple STT (speech-to-text) providers
  • Multiple TTS (text-to-speech) providers
  • WebSocket connections for real-time streaming
  • Voice Activity Detection (VAD)
  • Audio buffer management

This is exactly what Sayna removes - you just connect the text output of your agent to our voice layer, and everything else is handled for you.

Setting up your environment

If you already have a PydanticAI project, you just need to add the Sayna client:

pip install pydantic-ai sayna-client

Make sure you have your API keys ready. For this example we'll use Deepgram for STT and ElevenLabs for TTS, but Sayna also supports Google Cloud and Azure:

export SAYNA_API_URL="https://api.sayna.ai"
export SAYNA_API_KEY="your_sayna_api_key"
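
If you want to fail fast when a key is missing, here's a minimal sketch for loading these values in Python (plain os.environ, nothing Sayna-specific):

import os

# Read the connection settings exported above; raise a clear error at
# startup instead of failing mid-call if a key is missing.
SAYNA_API_URL = os.environ.get("SAYNA_API_URL", "https://api.sayna.ai")
SAYNA_API_KEY = os.environ["SAYNA_API_KEY"]  # a KeyError here is intentional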

Creating a basic PydanticAI agent

Let's start with a simple PydanticAI agent. If you already have one, you can skip this part, but I want to show the complete picture:

from pydantic_ai import Agent

# Define your PydanticAI agent
support_agent = Agent(
    'anthropic:claude-sonnet-4-0',
    instructions='''You are a helpful customer support agent. 
    Keep responses concise and natural for voice conversation.
    Aim for 2-3 sentences maximum.'''
)

# This is your standard agent - nothing special yet
async def get_agent_response(user_message: str) -> str:
    result = await support_agent.run(user_message)
    return result.output

Notice that I've added a hint about response brevity in the instructions. This matters for voice applications: nobody wants to listen to a wall of text!
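
Instructions alone aren't a hard guarantee, though. If you want a belt-and-suspenders guard, a small post-processing helper can clip runaway responses before they're spoken (clip_to_sentences is a hypothetical helper of my own, not part of either SDK):

import re

def clip_to_sentences(text: str, max_sentences: int = 3) -> str:
    """Keep at most max_sentences sentences so spoken replies stay short."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return ' '.join(sentences[:max_sentences])

# Example: clip the agent output before handing it to the voice layer
# spoken = clip_to_sentences(await get_agent_response(user_text))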

Connecting Sayna Voice Layer

Now here's where the magic happens: we're going to create a voice session that routes the user's speech to your agent and speaks the agent's replies back:

import asyncio
import os

from sayna import SaynaClient, VoiceConfig

# Initialize the Sayna client from the environment variables exported earlier
client = SaynaClient(
    api_url=os.environ["SAYNA_API_URL"],
    api_key=os.environ["SAYNA_API_KEY"]
)

# Configure voice settings
voice_config = VoiceConfig(
    stt_provider="deepgram",
    tts_provider="elevenlabs",
    elevenlabs_voice_id="your_voice_id",  # Pick a voice from ElevenLabs
    deepgram_model="nova-2"
)

async def handle_voice_session():
    async with client.connect(config=voice_config) as session:
        # Listen for transcribed speech
        async for event in session.events():
            if event.type == "transcript":
                # User said something - send to your agent
                user_text = event.text
                print(f"User: {user_text}")
                
                # Get agent response
                agent_response = await get_agent_response(user_text)
                print(f"Agent: {agent_response}")
                
                # Send response back as speech
                await session.speak(agent_response)

# Run the voice session
asyncio.run(handle_voice_session())

That's it! You now have a voice-enabled PydanticAI agent. The flow is:

  1. User speaks → Sayna transcribes (STT)
  2. Transcript → your PydanticAI agent
  3. Agent response → Sayna synthesizes (TTS)
  4. Audio → back to the user

Handling interruptions and turn detection

One thing that makes voice conversations feel natural is proper turn-taking. Sayna includes built-in voice activity detection (VAD) that handles this automatically, but you can also customize it:

voice_config = VoiceConfig(
    stt_provider="deepgram",
    tts_provider="elevenlabs",
    elevenlabs_voice_id="your_voice_id",
    # Enable interruption handling
    enable_interruption=True,
    # VAD settings
    vad_threshold=0.5,
    silence_duration_ms=800
)

async def handle_voice_session():
    async with client.connect(config=voice_config) as session:
        async for event in session.events():
            if event.type == "transcript":
                agent_response = await get_agent_response(event.text)
                await session.speak(agent_response)
            
            elif event.type == "interruption":
                # User interrupted - stop current speech
                await session.stop_speaking()
                print("User interrupted, stopping...")

The enable_interruption flag tells Sayna to detect when users start speaking while the agent is talking. This is crucial for natural conversations where users may want to cut in.
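
If your agent calls are slow, you may also want to cancel an in-flight response when the user barges in. Here's a sketch of the same loop using plain asyncio task cancellation (the session calls are the ones shown above; the task bookkeeping is my own addition, not a Sayna feature):

async def handle_voice_session():
    pending: asyncio.Task | None = None

    async def respond_and_speak(session, text: str):
        agent_response = await get_agent_response(text)
        await session.speak(agent_response)

    async with client.connect(config=voice_config) as session:
        async for event in session.events():
            if event.type == "transcript":
                # Run the agent in a background task so the event loop
                # stays free to receive interruption events
                pending = asyncio.create_task(respond_and_speak(session, event.text))
            elif event.type == "interruption":
                await session.stop_speaking()
                if pending and not pending.done():
                    pending.cancel()  # drop the now-stale response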

Adding context and memory

PydanticAI supports conversation history out of the box, and you can leverage it for multi-turn voice conversations:

from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage

support_agent = Agent(
    'anthropic:claude-sonnet-4-0',
    instructions='''You are a helpful customer support agent.
    Keep responses concise for voice. Remember context from previous exchanges.'''
)

class VoiceConversation:
    def __init__(self):
        self.history: list[ModelMessage] = []
    
    async def process_speech(self, user_text: str) -> str:
        result = await support_agent.run(
            user_text,
            message_history=self.history
        )
        # Update history for next turn
        self.history = result.all_messages()
        return result.output

# Use in your voice session
conversation = VoiceConversation()

async def handle_voice_session():
    async with client.connect(config=voice_config) as session:
        async for event in session.events():
            if event.type == "transcript":
                response = await conversation.process_speech(event.text)
                await session.speak(response)

Now the voice agent remembers what was said earlier in the conversation, which is essential for anything beyond simple Q&A.
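
One caveat: the history grows with every turn, and a long history adds latency and token cost on each run. A minimal sketch is to cap what you pass to the model (MAX_HISTORY_MESSAGES is my own heuristic, not a PydanticAI setting) by replacing process_speech above with:

MAX_HISTORY_MESSAGES = 20  # assumption: tune for your model and latency budget

async def process_speech(self, user_text: str) -> str:
    result = await support_agent.run(
        user_text,
        # Only pass the most recent messages instead of the full transcript
        message_history=self.history[-MAX_HISTORY_MESSAGES:]
    )
    self.history = result.all_messages()
    return result.output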

Putting It All Together

Here's a complete example you can run right now:

import asyncio
import os
from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage
from sayna import SaynaClient, VoiceConfig

# PydanticAI Agent
support_agent = Agent(
    'anthropic:claude-sonnet-4-0',
    instructions='''You are a friendly customer support agent for a SaaS product.
    Keep responses concise and conversational - aim for 2-3 sentences.
    Be helpful and empathetic.'''
)

# Sayna Client
client = SaynaClient(
    api_url=os.environ["SAYNA_API_URL"],
    api_key=os.environ["SAYNA_API_KEY"]
)

voice_config = VoiceConfig(
    stt_provider="deepgram",
    tts_provider="elevenlabs",
    elevenlabs_voice_id="your_voice_id",
    deepgram_model="nova-2",
    enable_interruption=True
)

class VoiceAgent:
    def __init__(self):
        self.history: list[ModelMessage] = []
    
    async def respond(self, user_text: str) -> str:
        result = await support_agent.run(
            user_text,
            message_history=self.history
        )
        self.history = result.all_messages()
        return result.output
    
    async def run(self):
        async with client.connect(config=voice_config) as session:
            print("Voice agent ready! Start speaking...")
            
            async for event in session.events():
                if event.type == "transcript":
                    print(f"User: {event.text}")
                    
                    response = await self.respond(event.text)
                    print(f"Agent: {response}")
                    
                    await session.speak(response)
                
                elif event.type == "interruption":
                    await session.stop_speaking()
                
                elif event.type == "error":
                    print(f"Error: {event.message}")

if __name__ == "__main__":
    agent = VoiceAgent()
    asyncio.run(agent.run())

What About Telephony?

If you want your agent to answer phone calls, Sayna supports SIP integration via LiveKit. The setup is slightly different, but the core concept remains the same: your PydanticAI agent handles the intelligence, Sayna handles the voice and telephony layer.

Check out our [SIP configuration docs](https://docs.sayna.ai/guides/sip) for details on connecting Twilio or other telephony providers.

Performance Considerations

A few things I learned while building voice agents:

Keep Agent Responses Short. Anything over 3-4 sentences feels unnatural in voice. You can always ask if the user wants more details.

Use Streaming. Sayna supports TTS streaming, which means users start hearing the response before the full text has been generated - this dramatically reduces perceived latency.
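
You can pair this with PydanticAI's own streaming API on the agent side. Here's a sketch that speaks each complete sentence as soon as it has been generated (run_stream and stream_text are PydanticAI APIs; the sentence-splitting heuristic and the respond_streaming helper are my own):

import re

async def respond_streaming(session, user_text: str) -> None:
    buffer = ""
    async with support_agent.run_stream(user_text) as result:
        async for delta in result.stream_text(delta=True):
            buffer += delta
            # Speak each finished sentence as soon as it is available
            while (match := re.search(r'[.!?]\s', buffer)):
                sentence, buffer = buffer[:match.end()], buffer[match.end():]
                await session.speak(sentence.strip())
    if buffer.strip():
        await session.speak(buffer.strip())  # flush whatever is left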

Handle Errors Gracefully. Network issues happen: wrap your agent calls in try/except and have fallback responses ready:

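# Add this method to the VoiceAgent class from the complete example: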
async def respond_safely(self, user_text: str) -> str:
    try:
        return await self.respond(user_text)
    except Exception as e:
        print(f"Agent error: {e}")
        return "I'm sorry, I had trouble processing that. Could you repeat?"

Conclusion

Adding voice to your PydanticAI agent doesn't have to be complicated: with Sayna's Python SDK you essentially add a voice layer on top of your existing agent logic. The best part? You can switch between STT/TTS providers without changing your agent code - just update the config.
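
For example, switching providers is just a config change (a sketch - I'm assuming "google" and "azure" are the provider identifiers, so check the Sayna docs for the exact values):

# Same agent code, different voice stack
voice_config = VoiceConfig(
    stt_provider="google",
    tts_provider="azure"
)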

If you are building voice-enabled AI agents and want to keep using PydanticAI's type-safe approach, check out [Sayna](https://sayna.ai) - we've handled the hard work of managing audio streams, provider integrations, and real-time WebSocket communication so you can focus on building great agent experiences.

Have questions or want to share what you've built? Find us on [GitHub](https://github.com/SaynaAI/sayna) or reach us directly!

Don't forget to share this article!