Voxagent

Voice & Communication

WebRTC-based real-time voice communication in Voxagent

Voxagent uses WebRTC for real-time voice communication between users and AI agents.

How It Works

  1. A LiveKit room is created when an agent is dispatched
  2. The AI agent (Python worker) joins the room and starts listening
  3. The user connects via browser (widget or client) or phone
  4. Audio streams in both directions in real time
  5. The session is recorded automatically (audio egress)
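The steps above can be sketched as a tiny orchestration flow. This is an illustrative stub, not Voxagent's actual API: the `Session` class, function names, and status strings are assumptions made for the example (the statuses mirror the lifecycle table later on this page).

```python
# Hypothetical sketch of the dispatch flow; names are illustrative,
# not Voxagent's real interfaces.
from dataclasses import dataclass, field

@dataclass
class Session:
    room_name: str
    participants: list = field(default_factory=list)
    status: str = "Pending"

def dispatch_agent(session: Session) -> None:
    # 1. A LiveKit room is created (stubbed here).
    session.status = "Dispatching"
    # 2. The Python agent worker joins the room and starts listening.
    session.participants.append("agent")
    session.status = "Active"

def user_joins(session: Session, identity: str) -> None:
    # 3. The user connects via browser widget, client, or phone.
    session.participants.append(identity)

session = Session(room_name="support-1234")
dispatch_agent(session)
user_joins(session, "caller")
```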

Voice Pipeline

User speaks → WebRTC audio → STT (speech-to-text) →
LLM processes text → generates response →
TTS (text-to-speech) → WebRTC audio → User hears
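The pipeline above can be expressed as three composed stages. The provider calls below are stand-in stubs (real STT/LLM/TTS providers are configurable per agent, as described in the following sections); only the shape of the data flow is the point.

```python
# Minimal sketch of the STT -> LLM -> TTS turn pipeline.
# Each provider is stubbed; in production these are network calls
# to the configured STT, LLM, and TTS vendors.
def stt(audio: bytes) -> str:
    return "what are your hours?"      # stand-in transcription

def llm(prompt: str) -> str:
    return f"You asked: {prompt}"      # stand-in model response

def tts(text: str) -> bytes:
    return text.encode("utf-8")        # stand-in audio synthesis

def handle_turn(user_audio: bytes) -> bytes:
    transcript = stt(user_audio)       # speech-to-text
    reply = llm(transcript)            # LLM processes text, generates response
    return tts(reply)                  # text-to-speech back over WebRTC

audio_out = handle_turn(b"\x00\x01")
```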

Speech-to-Text (STT)

Converts user speech to text for the LLM. STT provider is configurable per agent.

Large Language Model (LLM)

Processes the conversation and generates responses. LLM provider and model are configurable per agent (direct vendor API or aggregator).

Text-to-Speech (TTS)

Converts agent responses to speech. TTS provider is configurable per agent.

Voice Activity Detection (VAD)

Detects when the user starts and stops speaking. Controls turn-taking behavior — when the agent should start or stop talking.
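As a rough illustration of turn-taking, here is a toy energy-threshold VAD: a frame counts as speech when its RMS energy exceeds a threshold, and the user's turn ends after a run of consecutive silent frames. This is a sketch only; production VAD typically uses a trained model rather than a fixed energy threshold.

```python
# Toy energy-based VAD for turn-taking (illustrative, not the real model).
import math

def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_turn_end(frames, threshold=0.1, silence_frames=3):
    """Index of the frame where the user's turn ends, or None."""
    silent = 0
    speaking = False
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            speaking = True            # user started (or kept) speaking
            silent = 0
        elif speaking:
            silent += 1                # trailing silence after speech
            if silent >= silence_frames:
                return i               # enough silence: turn is over
    return None
```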

Audio Recording

Every conversation is automatically recorded via LiveKit Egress. Recordings include:

  • Full audio of the session (composite recording)
  • Individual participant tracks

Recordings are stored in S3-compatible object storage.
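For context, a session's recordings might be laid out in object storage like this. The key scheme and file format below are assumptions for illustration, not Voxagent's documented bucket layout.

```python
# Hypothetical S3 object-key layout for a session's recordings.
def recording_keys(session_id: str, participants: list[str]) -> list[str]:
    keys = [f"recordings/{session_id}/composite.ogg"]    # full-session mix
    keys += [f"recordings/{session_id}/tracks/{p}.ogg"   # per-participant track
             for p in participants]
    return keys
```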

Real-Time Notifications

The platform uses SignalR to notify the frontend about session events:

  • Agent Ready — agent has joined the room
  • Session Completed — conversation ended, audio available
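The two events above might carry payloads along these lines. The field names and handler are assumptions for illustration; the actual SignalR contract is not specified here, and a real frontend would subscribe through a SignalR hub connection rather than dispatch locally.

```python
# Illustrative payloads for the two session events (assumed field names).
from dataclasses import dataclass

@dataclass
class AgentReady:
    session_id: str
    room_name: str               # agent has joined this LiveKit room

@dataclass
class SessionCompleted:
    session_id: str
    recording_url: str           # audio becomes available when this fires

def handle_event(event) -> str:
    # Local stand-in for a SignalR subscription callback.
    if isinstance(event, AgentReady):
        return f"agent ready in {event.room_name}"
    return f"session {event.session_id} done, audio at {event.recording_url}"
```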

Session Lifecycle

  Status       Description
  Pending      Session created, room not yet ready
  Dispatching  LiveKit room created, waiting for agent to join
  Active       Agent joined, conversation in progress
  Completed    Conversation ended normally
  Failed       Error occurred during session
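The lifecycle can be modeled as a guarded state machine. The transition map below mirrors the statuses above, but which moves are allowed (e.g. that any state may fail, and that Completed and Failed are terminal) is an assumption for the sketch.

```python
# Sketch of the session lifecycle as a state machine (assumed transitions).
TRANSITIONS = {
    "Pending":     {"Dispatching", "Failed"},
    "Dispatching": {"Active", "Failed"},
    "Active":      {"Completed", "Failed"},
    "Completed":   set(),   # terminal
    "Failed":      set(),   # terminal
}

def advance(status: str, new_status: str) -> str:
    """Move a session to new_status, rejecting illegal transitions."""
    if new_status not in TRANSITIONS[status]:
        raise ValueError(f"illegal transition {status} -> {new_status}")
    return new_status
```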
