Add voice to your agent - Cloudflare Blog Summary

Adding Voice to Cloudflare Agents

Cloudflare has released an experimental voice pipeline for the Agents SDK that lets developers add real-time voice to an existing agent architecture. The @cloudflare/voice package provides the APIs for building voice-enabled agents, covering both speech-to-text (STT) and text-to-speech (TTS).

Key Features

withVoice(Agent) for full conversational voice agents, and withVoiceInput(Agent) for speech-to-text-only use cases
useVoiceAgent and useVoiceInput hooks for React apps
VoiceClient for framework-agnostic clients
Built-in Workers AI providers: Deepgram Flux and Deepgram Nova 3 for STT, and Deepgram Aura for TTS

How it Works

The voice pipeline extends the existing Agents SDK model rather than replacing it: it reuses the same Durable Object, WebSocket connection, and application logic. The flow has four stages:

Audio transport: The browser captures microphone audio and streams it over the same WebSocket connection.
STT session setup: The agent creates a continuous transcriber session when the call starts.
STT input: Incoming audio is fed into that session and transcribed by the STT provider.
TTS output: The agent's reply is synthesized to speech and sent back to the client.
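
The four stages above can be pictured with a small, self-contained TypeScript sketch. It is not the @cloudflare/voice implementation: every name in it (VoiceCallSketch, FakeTranscriber, FakeSynthesizer, completeTurn) is a hypothetical stand-in, the "audio" is just ASCII bytes, and a plain callback stands in for the WebSocket. It only shows the order of operations: create a session at call start, feed audio into it, then turn the reply into audio and send it back.

```typescript
// All names below are hypothetical stand-ins, not the @cloudflare/voice API.
type AudioChunk = Uint8Array;

interface TranscriberSession {
  feed(chunk: AudioChunk): void; // STT input: audio streams into the session
  finish(): string; // transcript accumulated for the current turn
}

interface Synthesizer {
  synthesize(text: string): AudioChunk; // TTS: text in, audio out
}

// Fake STT: reads each byte as a character code, so no model is needed.
class FakeTranscriber implements TranscriberSession {
  private text = "";

  feed(chunk: AudioChunk): void {
    chunk.forEach((byte) => {
      this.text += String.fromCharCode(byte);
    });
  }

  finish(): string {
    const transcript = this.text;
    this.text = ""; // reset so the session can serve the next turn
    return transcript;
  }
}

// Fake TTS: writes each character code out as one byte (ASCII only).
class FakeSynthesizer implements Synthesizer {
  synthesize(text: string): AudioChunk {
    const audio = new Uint8Array(text.length);
    for (let i = 0; i < text.length; i++) audio[i] = text.charCodeAt(i);
    return audio;
  }
}

class VoiceCallSketch {
  private session: TranscriberSession | null = null;
  private readonly createSession: () => TranscriberSession;
  private readonly tts: Synthesizer;
  private readonly send: (audio: AudioChunk) => void;

  constructor(
    createSession: () => TranscriberSession,
    tts: Synthesizer,
    send: (audio: AudioChunk) => void, // stands in for the WebSocket
  ) {
    this.createSession = createSession;
    this.tts = tts;
    this.send = send;
  }

  // STT session setup: one continuous session per call.
  startCall(): void {
    this.session = this.createSession();
  }

  // Audio transport + STT input: each received chunk goes to the session.
  onAudio(chunk: AudioChunk): void {
    if (!this.session) throw new Error("call not started");
    this.session.feed(chunk);
  }

  // TTS output: build a reply from the transcript and send it as audio.
  completeTurn(respond: (transcript: string) => string): string {
    if (!this.session) throw new Error("call not started");
    const transcript = this.session.finish();
    this.send(this.tts.synthesize(respond(transcript)));
    return transcript;
  }
}
```

In this sketch the caller decides when a turn ends by calling completeTurn; how the real pipeline detects the end of a turn is not covered here.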

Getting Started

Cloudflare provides a minimal pattern for building a voice-enabled agent. On the server, developers import the @cloudflare/voice package and extend the Agent class with the withVoice function. On the client, the useVoiceAgent hook connects to the agent and starts a voice conversation.
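
The shape of withVoice(Agent), a function that takes a class and returns an extended class, is the standard TypeScript mixin pattern. The sketch below illustrates that pattern only. AgentSketch, withVoiceSketch, respondTo, and handleTranscript are hypothetical names invented for the example; the real Agent class and withVoice function come from the Agents SDK and @cloudflare/voice, and their actual method names and signatures should be taken from Cloudflare's documentation.

```typescript
// Hypothetical stand-ins illustrating the mixin pattern behind a call
// like withVoice(Agent). None of these names are the real SDK API.
type Constructor<T = {}> = new (...args: any[]) => T;

// Stand-in for the SDK's Agent base class.
class AgentSketch {
  readonly name: string;

  constructor(name: string) {
    this.name = name;
  }
}

// Takes any class derived from AgentSketch and returns a subclass
// with voice-related behavior layered on top.
function withVoiceSketch<TBase extends Constructor<AgentSketch>>(Base: TBase) {
  return class extends Base {
    // Subclasses override this to decide what the agent says for a turn.
    respondTo(_transcript: string): string {
      return "";
    }

    // Imagined entry point the pipeline would call once STT has produced
    // a transcript; it normalizes the text and delegates to respondTo.
    handleTranscript(transcript: string): string {
      return this.respondTo(transcript.trim());
    }
  };
}

// Same two-step shape as in the text: wrap the base class, then extend it.
const VoiceAgentSketch = withVoiceSketch(AgentSketch);

class EchoAgent extends VoiceAgentSketch {
  override respondTo(transcript: string): string {
    return `${this.name} heard: ${transcript}`;
  }
}
```

Because the mixin returns an ordinary subclass, an EchoAgent instance is still an AgentSketch, which mirrors the point that the voice pipeline builds on the existing agent model rather than replacing it.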