Architecture Overview

This page maps how all core concepts relate to each other. Use it as a mental model for understanding how data and control flow through the framework.

The Big Picture

Every component lives inside VoiceSession. Two WebSocket connections bridge the client and LLM provider, with the framework orchestrating everything in between. The LLMTransport interface abstracts provider differences — Gemini Live and OpenAI Realtime are both supported.
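A minimal sketch of that bridge, with hypothetical names (the real framework's internals may differ): the session sits between the client's WebSocket and whichever provider transport is configured, forwarding audio across.

```typescript
// Hypothetical sketch: VoiceSession bridges two connections. The client
// WebSocket is stubbed out; a fake transport stands in for Gemini/OpenAI.
interface LLMTransport {
  sendAudio(chunk: Uint8Array): void;
  close(): void;
}

class FakeProviderTransport implements LLMTransport {
  sent: Uint8Array[] = [];
  sendAudio(chunk: Uint8Array) { this.sent.push(chunk); }
  close() {}
}

class VoiceSessionSketch {
  constructor(private llm: LLMTransport) {}
  // Client audio arrives over one WebSocket; the session forwards it
  // to the provider over the other.
  onClientAudio(chunk: Uint8Array) { this.llm.sendAudio(chunk); }
}
```

Because providers are hidden behind the LLMTransport interface, the rest of the session never branches on which provider is in use.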

Component Ownership

VoiceSession creates and manages every other component, forming a single ownership tree.
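A sketch of what that ownership implies in practice (component names beyond those on this page are assumptions): the session constructs its collaborators and disposes of them on close, so nothing outlives the session.

```typescript
// Hypothetical sketch of the ownership tree. A real component would have
// its own API; here each just tracks whether it has been closed.
class OwnedComponent {
  closed = false;
  close() { this.closed = true; }
}

class VoiceSessionOwner {
  readonly eventBus = new OwnedComponent();
  readonly sessionManager = new OwnedComponent();
  readonly llmTransport = new OwnedComponent();
  readonly sttProvider = new OwnedComponent();

  close() {
    // Teardown cascades through everything the session owns.
    for (const c of [this.sttProvider, this.llmTransport, this.sessionManager, this.eventBus]) {
      c.close();
    }
  }
}
```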

How Agents, Tools, and the LLM Interact

Each agent provides its system instructions and tool set to the LLM. When the model calls a tool, the tool's execution mode determines how the call is handled.
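A sketch of that dispatch, under assumed names: the mode values ("blocking" | "background") and the shapes below are illustrative, not the framework's actual API.

```typescript
// Hypothetical sketch: an agent bundles instructions with a tool set, and
// the execution mode decides whether the session awaits a tool inline.
type ExecutionMode = "blocking" | "background";

interface ToolDef {
  name: string;
  executionMode: ExecutionMode;
  run(args: Record<string, unknown>): Promise<string>;
}

interface AgentDef {
  name: string;
  systemInstructions: string;
  tools: ToolDef[];
}

async function dispatchToolCall(
  agent: AgentDef,
  name: string,
  args: Record<string, unknown>,
): Promise<string | undefined> {
  const tool = agent.tools.find(t => t.name === name);
  if (!tool) throw new Error(`unknown tool: ${name}`);
  if (tool.executionMode === "blocking") {
    return await tool.run(args); // model waits for the result
  }
  void tool.run(args); // fire-and-forget; the result surfaces later
  return undefined;
}
```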

Data Flow: A Single Voice Turn

This is what happens when a user speaks and gets a response. Audio is forked to both the LLM and the STT provider simultaneously.
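The fork itself can be sketched as a one-liner (sink names are hypothetical): every incoming chunk goes to both consumers, so transcription never waits on the model.

```typescript
// Sketch of the audio fork: one producer, many sinks, no buffering.
type AudioSink = (chunk: Uint8Array) => void;

function forkAudio(sinks: AudioSink[]): AudioSink {
  return chunk => {
    for (const sink of sinks) sink(chunk); // same chunk to LLM and STT
  };
}
```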

Agent Transfer Flow

When the model calls transferToAgent, the framework handles the transition. For Gemini, this requires a reconnect; for OpenAI, it uses an in-place session.update.
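The provider branch can be sketched as follows; the function and type names are assumptions, but the two strategies are the ones this page describes.

```typescript
// Hypothetical sketch of the transfer decision. Per this page, Gemini
// transfers require tearing down and reopening the connection, while
// OpenAI Realtime accepts a session.update on the live session.
type Provider = "gemini" | "openai";

function transferStrategy(provider: Provider): "reconnect" | "session.update" {
  return provider === "gemini" ? "reconnect" : "session.update";
}
```

This is also why ClientTransport buffers audio during a Gemini transfer but only pauses briefly for OpenAI: a reconnect leaves a window with no open LLM connection.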

Memory Extraction Pipeline

The memory system runs alongside the conversation, extracting durable facts about the user.
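A sketch of the extraction loop: in the framework an LLM call would decide what counts as a durable fact; here the extractor is a pluggable stub, and all names are illustrative rather than the real API.

```typescript
// Hypothetical sketch: turns flow through an extractor, and the facts it
// yields are merged into a keyed store (newer facts overwrite older ones).
interface MemoryFact { key: string; value: string; }

type FactExtractor = (turn: string) => MemoryFact[];

function extractMemories(turns: string[], extract: FactExtractor): Map<string, string> {
  const store = new Map<string, string>();
  for (const turn of turns) {
    for (const fact of extract(turn)) store.set(fact.key, fact.value);
  }
  return store;
}
```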

Transcription Pipeline

User speech is transcribed through a dual-layer system: Chrome STT provides instant visual feedback on the client, while a server-side STTProvider produces the authoritative transcript stored in conversation history.

Key behaviors:

  • Chrome STT shows what the user is saying in real-time (interim text, opacity 60%)
  • When the server sends its authoritative transcript, it replaces the Chrome STT text in-place
  • Orphaned Chrome STT interims (e.g., from assistant echo) are automatically removed on turn boundaries
  • Exactly one server-side STT path is active: either transport built-in or an external STTProvider
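The replace-in-place and orphan-cleanup rules above can be sketched like this (the data shapes are assumptions for illustration):

```typescript
// Hypothetical sketch: Chrome STT interims are placeholders that the
// server's authoritative transcript overwrites; any interim still present
// at a turn boundary (e.g. assistant echo) is an orphan and is dropped.
interface TranscriptEntry { text: string; source: "chrome-interim" | "server"; }

function applyServerTranscript(entries: TranscriptEntry[], finalText: string): TranscriptEntry[] {
  const i = entries.findIndex(e => e.source === "chrome-interim");
  if (i >= 0) {
    const next = entries.slice();
    next[i] = { text: finalText, source: "server" }; // replace in place
    return next;
  }
  return [...entries, { text: finalText, source: "server" }];
}

function onTurnBoundary(entries: TranscriptEntry[]): TranscriptEntry[] {
  // Orphaned interims never received a server replacement; remove them.
  return entries.filter(e => e.source !== "chrome-interim");
}
```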

EventBus Wiring

All framework components communicate through the EventBus. Hooks expose a curated subset of these events.
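A minimal publish/subscribe sketch of that wiring (the event name used below is illustrative, not one of the framework's actual events):

```typescript
// Hypothetical EventBus sketch. Hooks would subscribe through the same
// mechanism but only to a curated list of event names.
type Handler = (payload: unknown) => void;

class EventBusLite {
  private handlers = new Map<string, Handler[]>();

  on(event: string, handler: Handler) {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  emit(event: string, payload: unknown) {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}
```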

Transport Layer

The LLMTransport interface abstracts provider differences. An optional STTProvider handles user speech transcription independently of the LLM.
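One plausible shape for the two interfaces, with assumed method names (the page names only the interfaces themselves): note that the STT provider never touches the model connection.

```typescript
// Hypothetical interface shapes for the transport layer.
interface STTProviderSketch {
  transcribe(chunk: Uint8Array): void;
  onFinal(cb: (text: string) => void): void; // authoritative transcript
}

interface LLMTransportSketch {
  connect(): Promise<void>;
  sendAudio(chunk: Uint8Array): void;
  updateSession(config: { instructions: string }): Promise<void>; // OpenAI-style in-place update
  close(): Promise<void>;
}

// Trivial in-memory STT fake to show the callback flow.
class FakeSTT implements STTProviderSketch {
  private cb: ((t: string) => void) | null = null;
  transcribe(_chunk: Uint8Array) { this.cb?.("hello"); }
  onFinal(cb: (text: string) => void) { this.cb = cb; }
}
```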

Session State Machine

The SessionManager tracks the connection lifecycle:

| State | ClientTransport | LLMTransport |
| --- | --- | --- |
| CREATED | Not started | Not connected |
| CONNECTING | Listening | Connecting |
| ACTIVE | Forwarding audio | Streaming |
| TRANSFERRING | Buffering audio (Gemini) / Brief pause (OpenAI) | Reconnecting / session.update |
| RECONNECTING | Buffering audio | Reconnecting |
| CLOSED | Stopped | Disconnected |
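The lifecycle can be sketched as a transition table. The allowed transitions below are an assumption inferred from the states' meanings; the real SessionManager may permit more (e.g. error paths).

```typescript
// Hypothetical sketch of the session lifecycle as a state machine.
type SessionState =
  | "CREATED" | "CONNECTING" | "ACTIVE"
  | "TRANSFERRING" | "RECONNECTING" | "CLOSED";

const transitions: Record<SessionState, SessionState[]> = {
  CREATED: ["CONNECTING"],
  CONNECTING: ["ACTIVE", "CLOSED"],
  ACTIVE: ["TRANSFERRING", "RECONNECTING", "CLOSED"],
  TRANSFERRING: ["ACTIVE", "CLOSED"],
  RECONNECTING: ["ACTIVE", "CLOSED"],
  CLOSED: [],
};

function canTransition(from: SessionState, to: SessionState): boolean {
  return transitions[from].includes(to);
}
```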

How Concepts Connect

Agents → Tools → Subagents

Agents → Memory → Agents (cross-session)

Reading Order

If you're new to the framework, read the docs in this order:

  1. VoiceSession — The entry point. Understand how everything is wired.
  2. Agents — Define personalities and route conversations.
  3. Tools — Give agents the ability to take actions.
  4. Memory — Remember users across sessions.
  5. Events & Hooks — Observe and react to everything happening.
  6. Transport — Understand the audio and message plumbing.
  7. Subagent Patterns — Background execution for complex tasks.
