Architecture Overview
This page maps how all core concepts relate to each other. Use it as a mental model for understanding how data and control flow through the framework.
The Big Picture
Every component lives inside VoiceSession. Two WebSocket connections bridge the client and LLM provider, with the framework orchestrating everything in between. The LLMTransport interface abstracts provider differences — Gemini Live and OpenAI Realtime are both supported.
Component Ownership
VoiceSession creates and manages every other component, forming a single ownership tree.
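A minimal sketch of that ownership, assuming constructor-based wiring. Only the names VoiceSession, EventBus, and SessionManager come from this page; the fields and construction order shown are hypothetical, not the real API.

```typescript
// Hypothetical sketch: VoiceSession as the root owner of every component.

class SessionManager {
  state = "CREATED"; // initial state, per the Session State Machine table
}

class EventBus {
  private handlers: Array<(event: string, payload: unknown) => void> = [];
  subscribe(fn: (event: string, payload: unknown) => void): void {
    this.handlers.push(fn);
  }
  emit(event: string, payload: unknown): void {
    this.handlers.forEach(fn => fn(event, payload));
  }
}

// VoiceSession constructs and owns everything else.
class VoiceSession {
  readonly events = new EventBus();
  readonly session = new SessionManager();
  // ...agents, LLMTransport, STTProvider, and memory would be owned here too
}

const vs = new VoiceSession();
console.log(vs.session.state); // a fresh session starts in CREATED
```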
How Agents, Tools, and the LLM Interact
Each agent provides its system instructions and tool set to the LLM. When the model calls a tool, the tool's configured execution mode determines how the call is run.
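One way to picture the dispatch is a sketch like the following. The mode names (`"blocking"` and `"background"`) and the function shape are illustrative assumptions, not the framework's real tool API:

```typescript
// Hypothetical execution-mode dispatch for a model-initiated tool call.
type ExecutionMode = "blocking" | "background";

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

async function runTool(
  call: ToolCall,
  mode: ExecutionMode,
  impl: (args: Record<string, unknown>) => Promise<string>,
): Promise<string | undefined> {
  if (mode === "blocking") {
    // The turn waits: the result goes straight back to the model.
    return impl(call.args);
  }
  // Background: fire and forget; the result would surface later
  // (e.g. via an EventBus event) rather than blocking the turn.
  void impl(call.args).then(r => console.log(`[bg] ${call.name} -> ${r}`));
  return undefined;
}
```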
Data Flow: A Single Voice Turn
This is what happens when a user speaks and gets a response. Note that audio is forked to both the LLM and the STT provider simultaneously.
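The fork itself can be sketched as a tiny fan-out over audio sinks. The `AudioSink` interface here is an assumption for illustration; the real plumbing lives in the transport layer:

```typescript
// Hypothetical fan-out: every inbound audio chunk reaches every sink.
interface AudioSink {
  write(chunk: Uint8Array): void;
}

function forkAudio(sinks: AudioSink[]): AudioSink {
  return { write: chunk => sinks.forEach(s => s.write(chunk)) };
}

// Toy sinks standing in for the LLM transport and the STT provider.
const received: string[] = [];
const llm: AudioSink = { write: () => received.push("llm") };
const stt: AudioSink = { write: () => received.push("stt") };

forkAudio([llm, stt]).write(new Uint8Array([0, 1, 2]));
console.log(received); // both sinks saw the same chunk
```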
Agent Transfer Flow
When the model calls transferToAgent, the framework handles the transition. For Gemini, this requires a reconnect; for OpenAI, it uses an in-place session.update.
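The provider split can be summarized as a small strategy function. This is a sketch of the decision only; the function and provider-string names are assumptions:

```typescript
// Hypothetical strategy selection for an agent transfer.
type Provider = "gemini" | "openai";

function transferStrategy(provider: Provider): "reconnect" | "session.update" {
  // Gemini Live requires tearing down and reconnecting with the new agent's
  // config; OpenAI Realtime can swap instructions in place via session.update.
  return provider === "gemini" ? "reconnect" : "session.update";
}

console.log(transferStrategy("gemini")); // "reconnect"
console.log(transferStrategy("openai")); // "session.update"
```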
Memory Extraction Pipeline
The memory system runs alongside the conversation, extracting durable facts about the user.
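A sketch of what such an extraction pass might look like. The `Memory` shape and the idea of passing in a durability predicate are assumptions; the real pipeline presumably uses an LLM, not a predicate:

```typescript
// Hypothetical extraction pass: keep only durable facts, drop chit-chat.
interface Memory {
  fact: string;
  turn: number;
}

function extractMemories(
  transcript: string[],
  isDurable: (line: string) => boolean,
): Memory[] {
  return transcript
    .map((line, turn) => ({ fact: line, turn }))
    .filter(m => isDurable(m.fact));
}

const memories = extractMemories(
  ["My name is Ada", "What's the weather?"],
  line => line.startsWith("My name is"), // toy stand-in for the real extractor
);
console.log(memories.length); // 1 — only the durable fact survives
```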
Transcription Pipeline
User speech is transcribed through a dual-layer system: Chrome STT provides instant visual feedback on the client, while a server-side STTProvider produces the authoritative transcript stored in conversation history.
Key behaviors:
- Chrome STT shows what the user is saying in real-time (interim text, opacity 60%)
- When the server sends its authoritative transcript, it replaces the Chrome STT text in-place
- Orphaned Chrome STT interims (e.g., from assistant echo) are automatically removed on turn boundaries
- Exactly one server-side STT path is active at a time: either the transport's built-in transcription or an external STTProvider
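The replace-in-place and orphan-pruning behaviors above can be sketched as follows. The `TranscriptEntry` shape and function names are assumptions for illustration:

```typescript
// Hypothetical dual-layer transcript handling.
interface TranscriptEntry {
  id: string;
  text: string;
  interim: boolean; // true while only Chrome STT text is available
}

function applyServerTranscript(
  entries: TranscriptEntry[],
  id: string,
  text: string,
): void {
  const entry = entries.find(e => e.id === id);
  if (entry) {
    entry.text = text;     // authoritative text replaces the interim in place
    entry.interim = false; // rendered at full opacity from here on
  }
}

function pruneOrphans(entries: TranscriptEntry[]): TranscriptEntry[] {
  // On a turn boundary, interims that never received a server transcript
  // (e.g. assistant echo picked up by Chrome STT) are dropped.
  return entries.filter(e => !e.interim);
}

const entries: TranscriptEntry[] = [
  { id: "t1", text: "hel", interim: true },
  { id: "t2", text: "echo?", interim: true },
];
applyServerTranscript(entries, "t1", "hello world");
const finalEntries = pruneOrphans(entries); // t2 is orphaned and removed
```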
EventBus Wiring
All framework components communicate through the EventBus; hooks expose a curated subset of its events.
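One way this curation could work is an allowlist over bus topics. The topic names and `registerHook` function are hypothetical, not the framework's actual hook API:

```typescript
// Hypothetical EventBus with an allowlisted hook surface.
type Handler = (payload: unknown) => void;

class EventBus {
  private topics = new Map<string, Handler[]>();
  on(topic: string, fn: Handler): void {
    const list = this.topics.get(topic) ?? [];
    list.push(fn);
    this.topics.set(topic, list);
  }
  emit(topic: string, payload: unknown): void {
    (this.topics.get(topic) ?? []).forEach(fn => fn(payload));
  }
}

// Hooks may only subscribe to a curated subset of topics.
const HOOK_TOPICS: readonly string[] = ["turn.start", "turn.end", "tool.call"];

function registerHook(bus: EventBus, topic: string, fn: Handler): void {
  if (!HOOK_TOPICS.includes(topic)) {
    throw new Error(`"${topic}" is internal and not exposed to hooks`);
  }
  bus.on(topic, fn);
}

const bus = new EventBus();
const seen: unknown[] = [];
registerHook(bus, "tool.call", p => seen.push(p));
bus.emit("tool.call", { name: "search" });
```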
Transport Layer
The LLMTransport interface abstracts provider differences. An optional STTProvider handles user speech transcription independently from the LLM.
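A sketch of what these two abstractions might look like. The method names here are assumptions chosen to match this page's descriptions, not the real interface definitions; `FakeSTT` is a toy stand-in for testing:

```typescript
// Hypothetical transport-layer interfaces.
interface LLMTransport {
  connect(agentConfig: { instructions: string }): Promise<void>;
  sendAudio(chunk: Uint8Array): void;
  close(): Promise<void>;
}

interface STTProvider {
  transcribe(chunk: Uint8Array): void;
  onTranscript(fn: (text: string, final: boolean) => void): void;
}

// In-memory stand-in: echoes a fixed final transcript for any audio.
class FakeSTT implements STTProvider {
  private handler: ((text: string, final: boolean) => void) | null = null;
  transcribe(_chunk: Uint8Array): void {
    this.handler?.("hello", true);
  }
  onTranscript(fn: (text: string, final: boolean) => void): void {
    this.handler = fn;
  }
}

const sttProvider = new FakeSTT();
let heard = "";
sttProvider.onTranscript((text, final) => { if (final) heard = text; });
sttProvider.transcribe(new Uint8Array());
console.log(heard); // "hello"
```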
Session State Machine
The SessionManager tracks the connection lifecycle:
| State | ClientTransport | LLMTransport |
|---|---|---|
| CREATED | Not started | Not connected |
| CONNECTING | Listening | Connecting |
| ACTIVE | Forwarding audio | Streaming |
| TRANSFERRING | Buffering audio (Gemini) / Brief pause (OpenAI) | Reconnecting / session.update |
| RECONNECTING | Buffering audio | Reconnecting |
| CLOSED | Stopped | Disconnected |
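The table above can be mirrored as a transition map. The set of allowed transitions below is an assumption consistent with the listed states, not the SessionManager's verified rules:

```typescript
// Hypothetical transition map for the session lifecycle.
type SessionState =
  | "CREATED" | "CONNECTING" | "ACTIVE"
  | "TRANSFERRING" | "RECONNECTING" | "CLOSED";

const TRANSITIONS: Record<SessionState, SessionState[]> = {
  CREATED:      ["CONNECTING", "CLOSED"],
  CONNECTING:   ["ACTIVE", "CLOSED"],
  ACTIVE:       ["TRANSFERRING", "RECONNECTING", "CLOSED"],
  TRANSFERRING: ["ACTIVE", "CLOSED"],   // agent transfer completes or fails
  RECONNECTING: ["ACTIVE", "CLOSED"],   // buffered audio is flushed on success
  CLOSED:       [],                     // terminal
};

function canTransition(from: SessionState, to: SessionState): boolean {
  return TRANSITIONS[from].includes(to);
}
```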
How Concepts Connect
- Agents → Tools → Subagents: agents expose tools, and a tool can hand long-running work off to a subagent.
- Agents → Memory → Agents (cross-session): facts extracted during one session are surfaced to agents in later sessions.
Reading Order
If you're new to the framework, read the docs in this order:
- VoiceSession — The entry point. Understand how everything is wired.
- Agents — Define personalities and route conversations.
- Tools — Give agents the ability to take actions.
- Memory — Remember users across sessions.
- Events & Hooks — Observe and react to everything happening.
- Transport — Understand the audio and message plumbing.
- Subagent Patterns — Background execution for complex tasks.