Voxagent

Architecture

Technical architecture and service interaction overview of Voxagent

Voxagent is built as a set of loosely coupled services communicating over REST, WebSockets, WebRTC and a Kafka event bus. This page describes every service in the system and how they interact.

Services marked [external] are third-party SaaS products that the platform integrates with. All other services are deployed as part of the platform (self-hosted or managed alongside the rest of the stack).

Core Backend Services

ASP.NET Backend

Main management API built with ASP.NET Core 9.0 following Clean Architecture and the CQRS pattern (MediatR + FluentValidation). Backed by PostgreSQL, it owns all domain data: users, spaces, agents, tools, phone numbers, billing state and webhooks. It relies on Keycloak to issue JWTs and validates them on incoming requests, exposes REST endpoints for the Angular clients, and publishes/consumes events on Kafka.

ASP.NET Webhook Receiver

Dedicated .NET API that receives webhooks from telephony and media providers (LiveKit, Twilio, VoxImplant) and forwards them to Kafka topics for asynchronous processing. Keeps webhook ingress decoupled from the main backend for reliability and isolation.
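A minimal sketch of the forwarding step described above, written in Python for illustration only (the real receiver is a .NET service). The topic names, the envelope fields and the `to_kafka_record` helper are assumptions, not the platform's actual layout. The idea is that the receiver does as little as possible: it wraps the raw body in an envelope and chooses a topic, leaving parsing to downstream consumers.

```python
import json
import time
import uuid

# Hypothetical topic names; the real topic layout is not documented here.
TOPIC_BY_PROVIDER = {
    "livekit": "webhooks.livekit",
    "twilio": "webhooks.twilio",
    "voximplant": "webhooks.voximplant",
}


def to_kafka_record(provider: str, raw_body: bytes, headers: dict) -> tuple:
    """Wrap an inbound webhook in an envelope and pick its Kafka topic.

    The body is forwarded unparsed so that provider-specific handling
    can happen asynchronously, outside the ingress path.
    """
    key = provider.lower()
    topic = TOPIC_BY_PROVIDER.get(key)
    if topic is None:
        raise ValueError(f"unknown webhook provider: {provider!r}")
    envelope = {
        "id": str(uuid.uuid4()),          # lets consumers de-duplicate
        "provider": key,
        "received_at": time.time(),
        "headers": headers,
        "body": raw_body.decode("utf-8", errors="replace"),
    }
    return topic, json.dumps(envelope).encode("utf-8")
```

The returned pair would be handed to whatever Kafka producer the service uses; the producer itself is left out of the sketch.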

Python Backend

AI agent worker built on the LiveKit Agents SDK (Python 3.11). Registered as a LiveKit worker, it receives dispatch requests from LiveKit Server, joins rooms and runs the conversation loop: calls LLM / STT / TTS providers, invokes HTTP tools and handles function calling. Emits OpenTelemetry traces directly to Langfuse for every agent turn and also exports them via OTLP to the collector for Kafka-based analytics.

WebSocket Media Bridge

Go service that bridges providers without native SIP trunks (primarily VoxImplant, also Twilio Media Streams) to LiveKit. Converts base64 PCM16 audio frames from WebSocket connections into WebRTC tracks inside a LiveKit room, so AI agents can handle those calls with the same pipeline as native SIP calls.
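The decoding half of that conversion can be sketched as follows. This is an illustration in Python rather than the bridge's Go code, and it assumes mono, little-endian PCM16; the function names are hypothetical. Publishing the samples as a WebRTC track is not shown.

```python
import base64
import struct


def decode_pcm16_frame(payload_b64: str) -> list:
    """Decode a base64 string of little-endian PCM16 audio into int samples."""
    raw = base64.b64decode(payload_b64)
    if len(raw) % 2 != 0:
        raise ValueError("PCM16 payload must contain an even number of bytes")
    count = len(raw) // 2
    # "<h" is a little-endian signed 16-bit integer; unpack all samples at once.
    return list(struct.unpack(f"<{count}h", raw))


def frame_duration_ms(sample_count: int, sample_rate_hz: int) -> float:
    """Duration of a mono frame, useful for pacing frames into the room."""
    return 1000.0 * sample_count / sample_rate_hz
```

For example, a frame of 320 samples at an assumed 16 kHz sample rate covers 20 ms of audio.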

Models Hosted

Self-hosted ML model runtime. Currently ships NVIDIA Parakeet for STT and can host additional LLM/TTS models behind gRPC/REST endpoints, used by the Python backend when a customer opts out of third-party providers.

Messaging, Telemetry & Storage

Kafka

Kafka cluster — the backbone of the event bus. Carries webhook events, agent telemetry spans (OTLP Protobuf) and cross-service domain events.

OpenTelemetry Collector

Receives OTLP telemetry from the Python backend's agent workers, batches it, and exports it to the agent.telemetry.spans Kafka topic, which is then consumed by ClickHouse and Langfuse.
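The batching behaviour can be illustrated with a small conceptual sketch. It is not the collector's implementation: the `SpanBatcher` class, its thresholds and the `export` callback are hypothetical stand-ins for the collector's batch stage and its Kafka exporter.

```python
import time
from typing import Callable, Optional


class SpanBatcher:
    """Buffer spans and export them when the batch is full or too old."""

    def __init__(
        self,
        export: Callable[[list], None],
        max_batch_size: int = 512,
        max_delay_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._export = export
        self._max_batch_size = max_batch_size
        self._max_delay_s = max_delay_s
        self._clock = clock
        self._buffer: list = []
        self._oldest_at: Optional[float] = None

    def add(self, span) -> None:
        if self._oldest_at is None:
            self._oldest_at = self._clock()
        self._buffer.append(span)
        too_big = len(self._buffer) >= self._max_batch_size
        too_old = self._clock() - self._oldest_at >= self._max_delay_s
        # The age check only runs when a span arrives; a real batcher
        # would also flush from a background timer.
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._oldest_at = None
        self._export(batch)
```

Batching trades a small delay for fewer, larger writes to the topic, which is usually the right trade for telemetry.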

ClickHouse

Analytical database that stores agent telemetry, call records and performance metrics consumed from Kafka. Powers the analytics dashboards in the Angular client.

MinIO

S3-compatible object storage used for call recordings, exports and model artifacts. The ASP.NET backend talks to it through its ObjectStorage integration.

Redis

In-memory store used for caching, rate limiting and short-lived session data by the ASP.NET backend.

Realtime Media & Identity

LiveKit Server

Self-hosted WebRTC SFU that hosts all live agent conversations. Every call — web widget, phone or SIP — materializes as a LiveKit room into which a Python agent worker is dispatched. LiveKit also emits webhooks (room started, participant joined, egress finished) that flow through the webhook receiver into Kafka.

Keycloak

OpenID Connect identity provider. Issues the JWT tokens used by every backend and client in the platform, and hosts an admin UI with a custom theme for tenant administrators.

Business Services

Lago (GetLago)

Open-source billing and metering platform, deployed via Docker Compose. The ASP.NET backend reports usage events (minutes, characters, tool calls) to Lago, which handles plans, subscriptions and invoices.
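A hedged sketch of what one usage event might look like when reported to Lago. The envelope fields follow Lago's events API as commonly documented and should be checked against the deployed Lago version; the metric code `call_minutes`, the `minutes` property and the helper name are hypothetical.

```python
import time
import uuid
from typing import Optional


def build_usage_event(
    subscription_id: str,
    metric_code: str,
    properties: dict,
    timestamp: Optional[int] = None,
) -> dict:
    """Build the JSON body for a single usage event.

    transaction_id is unique per event so that a retried request
    is not counted twice.
    """
    if not metric_code:
        raise ValueError("metric_code must not be empty")
    return {
        "event": {
            "transaction_id": str(uuid.uuid4()),
            "external_subscription_id": subscription_id,
            "code": metric_code,
            "timestamp": int(time.time()) if timestamp is None else timestamp,
            "properties": properties,
        }
    }


# Example: report 2.5 minutes of call time (metric code is an assumption).
event = build_usage_event("sub_123", "call_minutes", {"minutes": 2.5})
```

The resulting dictionary would be serialized to JSON and posted to the Lago events endpoint with the API key configured for the deployment.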

Langfuse

LLM observability and tracing dashboard. Consumes agent telemetry to provide per-conversation traces, latency breakdowns and cost attribution across LLM providers.

Frontend & User-Facing

Angular Client

Main Single-Page Application (Angular 20) — the admin dashboard for managing agents, spaces, tools, phone numbers, campaigns, analytics and billing. Uses a Swagger-generated REST client against the ASP.NET backend.

Angular Widget

Embeddable web component (<speaknode-agent>) that customers drop into their websites. Loads agent configuration from the backend and joins a LiveKit room directly from the browser for voice or text conversations.

Angular Webhook Receiver Client

Companion Angular UI for the Webhook Receiver — used to manage inbound webhook paths, routing rules and authentication.

External Providers

These are third-party APIs consumed by the ASP.NET backend (and, for LLM/STT/TTS, by the Python backend). The platform does not deploy them; it only integrates with them.

Speech & Voice

  • STT provider [external] — any supported speech-to-text service the admin configures on a per-agent basis.
  • TTS provider [external] — any supported text-to-speech service the admin configures on a per-agent basis.

LLM Providers

  • LLM provider [external] — any supported LLM (direct or via an aggregator) the admin configures on a per-agent basis.

Telephony

  • Twilio [external] — SMS and voice provider; inbound/outbound calls via SIP trunk or Media Streams.
  • VoxImplant [external] — Russian VoIP provider. Has no direct SIP trunk, so calls are bridged through the WebSocket Media Bridge.

Messaging

  • SMTP / Email [external] — External SMTP provider for transactional email.

User-configured integrations

  • Webhook tools [external] — arbitrary HTTP endpoints called by the Python agent worker as tools during a conversation. URL, method, headers and auth are configured per agent by the admin.
  • Outbound webhooks [external] — customer-provided HTTP endpoints that the ASP.NET backend calls when specific platform events occur (call started/ended, subscription changed, usage updated, etc.). URL and event subscriptions are configured by the admin.
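The two user-configured integrations above can be sketched in a few lines. Everything here is illustrative: the `WebhookTool` fields mirror the listed settings (URL, method, headers, auth), bearer-token auth is just one possible scheme, and the event names are hypothetical. No request is sent; the code only builds it.

```python
import json
import urllib.parse
import urllib.request
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WebhookTool:
    name: str
    url: str
    method: str = "POST"
    headers: dict = field(default_factory=dict)
    bearer_token: Optional[str] = None  # assumption: one possible auth scheme


def build_tool_request(tool: WebhookTool, arguments: dict) -> urllib.request.Request:
    """Turn a tool call from the agent into an HTTP request object."""
    method = tool.method.upper()
    headers = dict(tool.headers)
    if tool.bearer_token:
        headers["Authorization"] = f"Bearer {tool.bearer_token}"
    if method == "GET":
        # GET tools carry their arguments in the query string.
        query = urllib.parse.urlencode(arguments)
        separator = "&" if "?" in tool.url else "?"
        url = f"{tool.url}{separator}{query}" if query else tool.url
        return urllib.request.Request(url, headers=headers, method="GET")
    headers.setdefault("Content-Type", "application/json")
    body = json.dumps(arguments).encode("utf-8")
    return urllib.request.Request(tool.url, data=body, headers=headers, method=method)


def subscribers_for(event_type: str, subscriptions: dict) -> list:
    """Return the outbound webhook URLs subscribed to an event type."""
    return sorted(url for url, events in subscriptions.items() if event_type in events)
```

Keeping request construction separate from sending makes the per-agent configuration easy to validate before a conversation starts.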

Entry Points

An agent conversation can be started through one of three entry points. In every case the conversation materializes as a LiveKit room into which a Python agent worker is dispatched — what differs is how audio and metadata get into that room.

1. Testing page on the website

The admin dashboard (Angular Client) has a Testing section that lets operators talk to an agent directly from the browser. The flow:

  1. Admin opens the testing page in the Angular Client and picks an agent.
  2. Angular Client calls the ASP.NET backend to create a LiveKit room and obtain a join token.
  3. The browser joins the room via WebRTC directly against LiveKit Server.
  4. ASP.NET backend asks the Python backend to dispatch a worker into the same room.
  5. The worker runs the agent loop, calling LLM / STT / TTS providers.

Primarily used for manual QA, prompt tuning and smoke tests before going live.
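Step 2 of this flow, obtaining a join token, can be sketched with the standard library alone. LiveKit access tokens are HS256-signed JWTs carrying a `video` grant; in practice the backend would more likely use a LiveKit server SDK, and the function name, default TTL and identities below are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url without padding, as required by the JWT format."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_join_token(
    api_key: str,
    api_secret: str,
    room: str,
    identity: str,
    ttl_s: int = 600,
    now: Optional[int] = None,
) -> str:
    """Create a short-lived token that lets `identity` join `room`."""
    issued = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,            # LiveKit API key
        "sub": identity,           # participant identity
        "nbf": issued,
        "exp": issued + ttl_s,
        "video": {"room": room, "roomJoin": True},
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(
        api_secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

The browser never sees the API secret: it receives only the signed token and presents it to LiveKit Server when joining the room.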

2. Embeddable widget on a customer site

End users talk to an agent through the Angular Widget embedded into a third-party website. The flow is the same as the testing page, but initiated by an anonymous end user rather than an authenticated admin:

  1. Visitor opens a page that hosts <speaknode-agent>.
  2. Widget loads its config from the ASP.NET backend and requests a LiveKit token.
  3. Browser joins the LiveKit room over WebRTC.
  4. Python backend dispatches an agent worker into the room.

This is the main entry point for web-based conversational deployments.

3. Phone call (Twilio or VoxImplant)

End users reach the agent by dialing a phone number connected to the platform.

  • Twilio — supports both direct SIP trunk into LiveKit and Media Streams over WebSocket. For SIP, LiveKit's built-in SIP component answers the call and places it into a room. For Media Streams, audio goes through the WebSocket Media Bridge, which then pushes it into LiveKit as WebRTC.
  • VoxImplant — has no usable SIP trunk for our case, so calls always go through the WebSocket Media Bridge, which converts PCM16 audio frames into LiveKit WebRTC tracks.

In both cases:

  1. The provider fires an inbound webhook into the Webhook Receiver, which publishes it to Kafka.
  2. ASP.NET backend consumes the event, identifies the agent linked to the phone number, creates the LiveKit room and asks the Python backend to dispatch a worker.
  3. The audio path (SIP or bridge) lands in the same LiveKit room, and the agent conversation runs identically to web-based entry points.
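The routing decisions in this section can be summarized in a short sketch. The event fields (`provider`, `to_number`, `call_id`, `uses_sip`), the room naming scheme and the in-memory number-to-agent mapping are all hypothetical; the real backend resolves agents from its database and consumes the event from Kafka.

```python
from dataclasses import dataclass
from enum import Enum


class AudioPath(Enum):
    SIP = "sip"                    # LiveKit's SIP component answers the call
    MEDIA_BRIDGE = "media_bridge"  # audio arrives via the WebSocket Media Bridge


def select_audio_path(provider: str, uses_sip: bool = True) -> AudioPath:
    """Pick how call audio reaches the LiveKit room."""
    name = provider.lower()
    if name == "voximplant":
        return AudioPath.MEDIA_BRIDGE  # always bridged
    if name == "twilio":
        return AudioPath.SIP if uses_sip else AudioPath.MEDIA_BRIDGE
    raise ValueError(f"unsupported telephony provider: {provider!r}")


@dataclass(frozen=True)
class DispatchPlan:
    room_name: str
    agent_id: str
    audio_path: AudioPath


def plan_inbound_call(event: dict, agent_by_number: dict) -> DispatchPlan:
    """Map an inbound-call event to the room and agent that will handle it."""
    agent_id = agent_by_number.get(event["to_number"])
    if agent_id is None:
        raise LookupError(f"no agent linked to {event['to_number']}")
    return DispatchPlan(
        room_name=f"call-{event['call_id']}",
        agent_id=agent_id,
        audio_path=select_audio_path(event["provider"], event.get("uses_sip", True)),
    )
```

Whichever audio path is chosen, the plan ends in the same place: one room, one dispatched agent worker.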

Service Interaction Diagram

The service interaction diagram shows how a typical request flows through the platform: from an end user on a phone or website, through realtime media, into the AI agent worker, and back out through telemetry and billing. [Diagram not reproduced in this text version.]

See also:

  • Deployment — how to install and run the platform
  • Features — end-user capabilities built on top of this architecture
