Real-time voice agents with Stream Vision Agents and Amazon Nova 2 Sonic
Stream's Vision Agents framework can now be combined with Amazon Bedrock and Amazon Nova 2 Sonic to create real-time voice agents quickly. The integration supports advanced features like function calling, automatic reconnection, and multilingual voice. This enables developers to build production-ready voice agents in minutes.
Server-side extracted preview paragraphs from the original source.
Original article excerpt
In this post, you learn how to combine Stream's Vision Agents open-source framework with Amazon Bedrock and Amazon Nova 2 Sonic to build real-time voice agents that can be production-ready in minutes. You'll learn how the integration works under the hood, walk through code examples, and explore advanced capabilities like function calling, automatic reconnection, and multilingual voice support.
This post was co-authored with Neevash Ramdial, Technical Marketing leader at Stream
Building production-grade voice agents that feel natural and responsive is a complex engineering challenge. You must orchestrate speech-to-speech models, manage low-latency audio streaming, and handle connection lifecycle. You also need to deliver consistent experiences across web, mobile, and desktop applications.
In this post, you learn how to combine Stream’s Vision Agents open-source framework with Amazon Bedrock and Amazon Nova 2 Sonic to build real-time voice agents that can be production-ready in minutes. You’ll learn how the integration works under the hood, walk through code examples, and explore advanced capabilities like function calling, automatic reconnection, and multilingual voice support.
Building voice-enabled AI applications requires orchestrating multiple complex systems that must work together reliably. You face the challenge of managing real-time audio streaming infrastructure while simultaneously integrating speech recognition, language models, and text-to-speech services. Each of these has its own latency characteristics and failure modes. A typical voice interaction involves capturing audio from the user’s microphone, streaming it to a speech-to-text service, processing the transcript through a language model, generating a response, converting that response back to speech, and delivering it to the user. All of this must happen within a window of a few hundred milliseconds to feel natural. Delays in this pipeline can break the conversational flow and frustrate users.Beyond the core AI pipeline, production voice applications must handle the messy realities of real-world deployment: unreliable network connections, browser compatibility issues, session timeouts, and graceful degradation when services become unavailable. You often spend more time building reconnection logic, managing WebRTC connections, and handling edge cases than on the actual AI capabilities. This infrastructure burden means teams either invest months building custom solutions or settle for limited off-the-shelf products that don’t meet their specific needs. Vision Agents abstracts the infrastructure complexity while providing the flexibility to customize the AI experience.
Together, these components create a complete stack: Stream handles the real-time media transport and client-side experience, Amazon Nova 2 Sonic provides the AI intelligence, and Vision Agents provides the glue code that ties them together.
The system is designed around a clean separation of concerns: Stream’s infrastructure handles the real-time media transport and client connectivity, while Amazon Nova Sonic runs in the customer’s own AWS account and provides the AI intelligence. This separation helps keep sensitive data and business logic remain within the customer’s control, while Stream’s globally distributed edge network delivers the low-latency media experience users expect.Stream’s edge network acts as the media broker between end-user devices and the Vision Agent worker processes. When a user speaks, audio is captured, encrypted, and transmitted as RTP over UDP to the nearest Stream SFU (Selective Forwarding Unit). The SFU terminates the WebRTC connection, handles NAT traversal and bandwidth estimation, and forwards audio tracks to the Vision Agent worker as if it were another call participant. This means the agent integrates naturally into the call model. The agent is another peer, receiving and sending audio through the same infrastructure used by human participants.