
Transport

The transport layer handles the bidirectional communication between your client app, the framework, and the LLM provider. A provider-agnostic LLMTransport interface abstracts the differences between Gemini and OpenAI, so your agent code works with either.

Architecture

┌──────────┐        ┌──────────────────┐        ┌───────────────┐
│  Client  │◄──ws──►│    Framework     │◄──ws──►│ LLM Provider  │
│   App    │        │                  │        │ (Gemini Live  │
└──────────┘        │  ClientTransport │        │  or OpenAI    │
                    │  LLMTransport    │        │  Realtime)    │
                    └──────────────────┘        └───────────────┘
  • ClientTransport — WebSocket server that your client app connects to
  • LLMTransport — Provider-agnostic interface implemented by GeminiLiveTransport and OpenAIRealtimeTransport

Audio flows directly between these two transports, bypassing the EventBus for minimal latency. Everything else (tool calls, transfers, transcripts, GUI events) goes through the control plane.

LLMTransport Interface

The LLMTransport interface decouples the framework from any specific LLM provider. VoiceSession, AgentRouter, and ToolCallRouter interact only with this interface — never with provider-specific classes.

Capabilities

Each transport advertises its capabilities as static booleans. The orchestrator branches on these — never on provider names:

| Capability | Gemini | OpenAI | What it means |
| --- | --- | --- | --- |
| messageTruncation | No | Yes | Can truncate a server-side message at the audio playback position |
| turnDetection | Yes | Yes | Server-side VAD / end-of-turn detection |
| userTranscription | Yes | Yes | Provides transcriptions of user audio input |
| inPlaceSessionUpdate | No | Yes | Supports in-place session updates without reconnection |
| sessionResumption | Yes | No | Supports session resumption on disconnect |
| contextCompression | Yes | No | Supports server-side context compression |
| groundingMetadata | Yes | No | Provides grounding metadata with search citations |
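For example, orchestration code picks an agent-transfer strategy by inspecting the advertised capability, never the provider name. A minimal sketch (types narrowed for illustration):

```typescript
// Sketch: branch on capabilities, not provider names.
interface TransportCapabilities {
  inPlaceSessionUpdate: boolean;
}

function transferStrategy(caps: TransportCapabilities): 'update' | 'reconnect' {
  // true  → send a session update in place (the OpenAI path)
  // false → disconnect, reconnect with new config, replay history (the Gemini path)
  return caps.inPlaceSessionUpdate ? 'update' : 'reconnect';
}
```

A new provider that supports in-place updates gets the fast path automatically, with no orchestrator changes.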

Key Methods

typescript
interface LLMTransport {
  readonly capabilities: TransportCapabilities;
  readonly audioFormat: AudioFormatSpec;

  connect(config?: LLMTransportConfig): Promise<void>;
  disconnect(): Promise<void>;
  reconnect(state?: ReconnectState): Promise<void>;

  sendAudio(base64Data: string): void;
  sendContent(turns: ContentTurn[], turnComplete?: boolean): void;
  sendFile(base64Data: string, mimeType: string): void;
  sendToolResult(result: TransportToolResult): void;
  triggerGeneration(instructions?: string): void;

  updateSession(config: SessionUpdate): void;
  transferSession(config: SessionUpdate, state?: ReconnectState): Promise<void>;

  // Callbacks
  onAudioOutput?: (base64Data: string) => void;
  onToolCall?: (calls: TransportToolCall[]) => void;
  onTurnComplete?: () => void;
  onInterrupted?: () => void;
  // ... and more
}

ClientTransport

The client-facing WebSocket server. It multiplexes two message types on a single connection:

| Frame Type | Content | Direction |
| --- | --- | --- |
| Binary | Raw PCM audio (16-bit, mono) | Both ways |
| Text | JSON messages (GUI events, commands) | Both ways |
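The demultiplexing rule is simple; a sketch with assumed handler wiring (the ws library reports frame type via an isBinary flag):

```typescript
// Sketch: route each WebSocket frame by type.
function routeFrame(
  data: Buffer,
  isBinary: boolean,
  audio: (base64Pcm: string) => void,   // audio path → LLM transport
  control: (msg: unknown) => void,      // control plane → EventBus
): void {
  if (isBinary) {
    audio(data.toString('base64'));     // raw PCM, forwarded as base64
  } else {
    control(JSON.parse(data.toString('utf8'))); // GUI event or command
  }
}
```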

Audio Flow

typescript
// Binary frames carry raw PCM audio
// Client → Server: user's microphone audio
// Server → Client: LLM's voice response

The client sends raw PCM audio as binary WebSocket frames. The framework forwards it to the LLM transport and sends the LLM's audio response back the same way.

JSON Messages

Text frames carry JSON messages for GUI events and commands:

typescript
// Server → Client
{ "type": "session.config", "audioFormat": { "inputSampleRate": 16000, "outputSampleRate": 24000, ... } }
{ "type": "gui.update", "payload": { "sessionId": "...", "data": {...} } }
{ "type": "gui.notification", "payload": { "sessionId": "...", "message": "..." } }
{ "type": "ui.payload", "payload": { /* UIPayload from subagent */ } }

// Client → Server
{ "type": "ui.response", "payload": { /* UIResponse */ } }

Audio Buffering

During agent transfers and reconnections, client audio is buffered so nothing is lost:

Normal:       Client audio → LLM (real-time)
Transfer:     Client audio → Buffer → Replay to new session
Reconnect:    Client audio → Buffer → Replay after reconnection

Buffering only affects binary (audio) frames. Text (JSON) frames are always delivered immediately.
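The buffering behaviour above can be sketched as follows (class name and shape assumed, not the framework's actual implementation):

```typescript
// Sketch: buffer client audio during transfers/reconnects, replay in order.
class AudioRelay {
  private buffering = false;
  private buffer: string[] = [];

  constructor(private send: (chunk: string) => void) {}

  onClientAudio(chunk: string): void {
    if (this.buffering) this.buffer.push(chunk); // transfer/reconnect in progress
    else this.send(chunk);                       // normal: straight to the LLM
  }

  beginBuffering(): void {
    this.buffering = true;
  }

  // After the new session is connected, replay buffered audio, then resume.
  flush(): void {
    this.buffering = false;
    for (const chunk of this.buffer) this.send(chunk);
    this.buffer = [];
  }
}
```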

Connecting a Client

Any WebSocket client can connect. Here's a minimal browser example:

javascript
const ws = new WebSocket('ws://localhost:9900');
ws.binaryType = 'arraybuffer'; // so binary frames arrive as ArrayBuffer, not Blob

// Send microphone audio as binary frames
const audioContext = new AudioContext();
audioContext.audioWorklet.addModule('pcm-processor.js').then(async () => {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioContext.createMediaStreamSource(stream);
  const processor = new AudioWorkletNode(audioContext, 'pcm-processor');
  source.connect(processor);

  processor.port.onmessage = (e) => {
    ws.send(e.data); // Binary frame: raw PCM
  };
});

// Receive audio and JSON from server
ws.onmessage = (event) => {
  if (event.data instanceof ArrayBuffer) {
    // Binary frame: play audio
    playAudio(event.data);
  } else {
    // Text frame: GUI event or config
    const message = JSON.parse(event.data);
    handleMessage(message);
  }
};
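playAudio is left undefined above; its core is converting 16-bit PCM into the Float32 samples Web Audio expects. A sketch of that conversion (scaling only; copying into an AudioBuffer and scheduling it on the AudioContext is omitted):

```typescript
// Sketch: 16-bit signed PCM → Float32 samples in [-1, 1).
function pcm16ToFloat32(buffer: ArrayBuffer): Float32Array {
  const int16 = new Int16Array(buffer);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    float32[i] = int16[i] / 32768; // scale [-32768, 32767] → [-1, 1)
  }
  return float32;
}
```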

GeminiLiveTransport

WebSocket client for the Gemini Live API. Wraps the @google/genai SDK.

What It Handles

  • Connection setup — Sends system instruction, tool declarations, voice config, and compression settings
  • Audio streaming — Sends base64-encoded PCM to Gemini, receives audio output
  • Tool routing — Receives tool call requests, sends back tool results
  • Session resumption — Tracks resumption handles for reconnecting after GoAway signals
  • Transcription — Receives both input (user speech) and output (model speech) transcripts
  • Google Search grounding — Passes search citations from Gemini responses

Configuration

typescript
interface GeminiTransportConfig {
  apiKey: string;                    // Google API key
  model?: string;                    // Default: 'gemini-live-2.5-flash-preview'
  systemInstruction?: string;        // Agent's system prompt
  tools?: ToolDefinition[];          // Tools converted to Gemini function declarations
  resumptionHandle?: string;         // For resuming a previous session
  speechConfig?: { voiceName?: string };  // Voice preset (e.g. 'Puck')
  compressionConfig?: {              // Context window management
    triggerTokens: number;
    targetTokens: number;
  };
  googleSearch?: boolean;            // Enable Google Search grounding
  inputAudioTranscription?: boolean; // Transcribe user speech (default: true)
}

Session Resumption

The Gemini Live API sends periodic resumption handles and GoAway signals. The framework handles these automatically:

Gemini sends GoAway (server shutting down)
  → Framework saves resumption handle
  → Starts buffering client audio
  → Disconnects
  → Reconnects with resumption handle
  → Replays buffered audio
  → Session continues seamlessly
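The GoAway flow reduces to a small state transition; a sketch with assumed names:

```typescript
// Sketch: resumption state tracked across a GoAway-triggered reconnect.
type ResumptionState = { handle?: string; buffering: boolean };

// Gemini sent GoAway: save the latest resumption handle and start
// buffering client audio before disconnecting.
function onGoAway(state: ResumptionState, handle: string): ResumptionState {
  return { handle, buffering: true };
}

// Reconnected with the saved handle: buffered audio is replayed,
// then real-time streaming resumes.
function onReconnected(state: ResumptionState): ResumptionState {
  return { ...state, buffering: false };
}
```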

Agent Transfers

Gemini does not support in-place session updates (inPlaceSessionUpdate: false). Agent transfers require a full reconnect: disconnect the current session, connect a new one with the new agent's instructions/tools, and replay conversation history.

OpenAIRealtimeTransport

WebSocket client for the OpenAI Realtime API. Wraps the openai SDK's OpenAIRealtimeWS.

What It Handles

  • Connection setup — Creates WebSocket, sends session configuration with tools, instructions, and voice
  • Audio streaming — Sends base64-encoded PCM to OpenAI, receives audio output deltas
  • Tool call accumulation — OpenAI streams function call arguments incrementally; the transport accumulates and dispatches complete calls
  • Interruption handling — Truncates audio items at the user's speech point, suppresses queued audio deltas
  • when_idle scheduling — Buffers background tool results while the model is generating, flushes on response.done
  • Transcription — Input transcription via configurable model, output transcription from response events
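The tool-call accumulation step can be sketched like this (simplified; OpenAI's actual event names and payload shapes are assumed):

```typescript
// Sketch: accumulate streamed function-call argument deltas per call id,
// then parse the full JSON once the call is complete.
class ToolCallAccumulator {
  private args = new Map<string, string>();

  // Called for each argument-delta event.
  addDelta(callId: string, delta: string): void {
    this.args.set(callId, (this.args.get(callId) ?? '') + delta);
  }

  // Called when the provider signals the call's arguments are done.
  complete(callId: string): unknown {
    const json = this.args.get(callId) ?? '{}';
    this.args.delete(callId);
    return JSON.parse(json);
  }
}
```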

Configuration

typescript
interface OpenAIRealtimeConfig {
  apiKey: string;                      // OpenAI API key
  model?: string;                      // Default: 'gpt-realtime'
  voice?: string;                      // Default: 'coral'
  transcriptionModel?: string | null;  // Default: 'gpt-4o-mini-transcribe', null to disable
  turnDetection?: Record<string, unknown>;   // Default: semantic_vad
  noiseReduction?: Record<string, unknown>;  // Optional noise reduction config
}

Agent Transfers

OpenAI supports in-place session updates (inPlaceSessionUpdate: true). Agent transfers send a session.update event with the new instructions and tools — no reconnect or history replay needed. This makes transfers faster than Gemini's reconnect-based approach.

Key Differences from Gemini

| Aspect | Gemini | OpenAI |
| --- | --- | --- |
| Agent transfer | Reconnect + replay history | In-place session.update |
| Tool call delivery | Complete calls in one event | Streamed argument deltas, accumulated |
| Tool result generation | Automatic after sendToolResponse | Explicit response.create required |
| Interruption | Server fires interrupted event | Client must conversation.item.truncate |
| Audio rate | 16 kHz input, 24 kHz output | 24 kHz input, 24 kHz output |

Speech-to-Text (STT) Providers

The framework decouples user speech transcription from the LLM transport. An optional STTProvider receives the same audio the LLM gets and produces transcripts independently.

How It Works

Audio is forked to both the LLM transport and the STT provider simultaneously. The LLM uses the audio for voice understanding and response generation; the STT provider produces human-readable transcripts for display and conversation history.

STTProvider Interface

typescript
interface STTProvider {
  configure(audio: STTAudioConfig): void;  // Set sample rate, bit depth, channels
  start(): Promise<void>;                  // Open connection (if needed)
  stop(): Promise<void>;                   // Close connection

  feedAudio(base64Pcm: string): void;      // Stream audio chunks
  commit(turnId: number): void;            // Trigger transcription for this turn
  handleInterrupted(): void;               // User interrupted — preserve buffer
  handleTurnComplete(): void;              // Turn done — clear buffer

  onTranscript?: (text: string, turnId: number | undefined) => void;
  onPartialTranscript?: (text: string) => void;
}

Built-in Providers

Two providers ship with the framework:

|  | GeminiBatchSTTProvider | ElevenLabsSTTProvider |
| --- | --- | --- |
| Protocol | HTTP (generateContent) | WebSocket (persistent) |
| Latency | Higher (batch after silence) | Lower (streaming partials) |
| Partial results | No | Yes (onPartialTranscript) |
| Sample rates | Any (via WAV header) | 8-48 kHz native PCM |
| Dependencies | @google/genai | ws |
| Cost | Uses Gemini API quota | Uses ElevenLabs API quota |

Configuration

typescript
import {
  GeminiBatchSTTProvider,
  ElevenLabsSTTProvider,
} from '@bodhi_agent/realtime-agent-framework';

// Option A: Gemini batch transcription (default in demo)
const session = new VoiceSession({
  // ...
  sttProvider: new GeminiBatchSTTProvider({
    apiKey: process.env.GEMINI_API_KEY!,
    model: 'gemini-3-flash-preview',
  }),
});

// Option B: ElevenLabs streaming transcription
const session = new VoiceSession({
  // ...
  sttProvider: new ElevenLabsSTTProvider({
    apiKey: process.env.ELEVENLABS_API_KEY!,
    model: 'scribe_v2',         // default
    languageCode: 'en',         // BCP-47 code, default
  }),
});

// Option C: No external STT — use transport's built-in transcription
const session = new VoiceSession({
  // ... (omit sttProvider)
  inputAudioTranscription: true,  // default
});

When sttProvider is set, the transport's built-in input transcription is automatically disabled to avoid duplicates.

Dual-Display: Chrome STT + Server STT

The web client uses a two-layer transcription strategy for the best user experience:

  • Chrome STT provides instant visual feedback using the browser's SpeechRecognition API
  • Server STT provides the authoritative, higher-quality transcript
  • Both write to the same DOM element — server text replaces Chrome text seamlessly
  • Orphaned Chrome STT interims (from assistant echo) are cleaned up on turn boundaries
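The replace-on-arrival rule amounts to a tiny selector; a sketch (shape assumed, the real client writes the result into the shared DOM element):

```typescript
// Sketch: per-turn transcript state for the two-layer strategy.
interface TurnTranscript {
  chromeInterim: string;  // instant, lower-quality browser STT
  serverFinal?: string;   // authoritative server STT, arrives later
}

// The server transcript wins once it exists; until then, show the interim.
function displayText(t: TurnTranscript): string {
  return t.serverFinal ?? t.chromeInterim;
}
```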

Custom STT Provider

Implement the STTProvider interface for any transcription service:

typescript
import type { STTProvider, STTAudioConfig } from '@bodhi_agent/realtime-agent-framework';

class MySTTProvider implements STTProvider {
  onTranscript?: (text: string, turnId: number | undefined) => void;
  onPartialTranscript?: (text: string) => void;

  configure(audio: STTAudioConfig): void { /* store audio format */ }
  async start(): Promise<void> { /* connect to your service */ }
  async stop(): Promise<void> { /* disconnect */ }

  feedAudio(base64Pcm: string): void { /* send audio to your service */ }
  commit(turnId: number): void { /* trigger transcription */ }
  handleInterrupted(): void { /* preserve or clear buffer */ }
  handleTurnComplete(): void { /* clear buffer */ }
}

VoiceSession automatically:

  • Calls configure() with the LLM transport's audio format
  • Calls start()/stop() on session lifecycle
  • Forks every audio chunk to feedAudio()
  • Calls commit(turnId) when the model starts responding
  • Protects against stale results (drops transcripts from 2+ turns ago)
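That staleness guard can be sketched as follows (threshold from the text above; function name assumed):

```typescript
// Sketch: drop transcripts that arrive 2+ turns after their audio.
function isStale(
  transcriptTurnId: number | undefined,
  currentTurnId: number,
): boolean {
  if (transcriptTurnId === undefined) return false; // untagged transcripts pass
  return currentTurnId - transcriptTurnId >= 2;     // 2+ turns behind → stale
}
```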

Audio Format

Each transport advertises its native audio format via transport.audioFormat. Input and output sample rates may differ:

| Provider | Input Rate | Output Rate | Bit Depth | Channels |
| --- | --- | --- | --- | --- |
| Gemini | 16,000 Hz | 24,000 Hz | 16-bit | Mono |
| OpenAI | 24,000 Hz | 24,000 Hz | 16-bit | Mono |

The AudioFormatSpec type models this asymmetry:

typescript
interface AudioFormatSpec {
  inputSampleRate: number;   // Rate for mic capture / sending to LLM
  outputSampleRate: number;  // Rate for LLM audio output / playback
  channels: number;
  bitDepth: number;
  encoding: 'pcm';
}

Audio Format Negotiation

On client connect, VoiceSession sends a session.config message with the active transport's audio format. The web client reads both rates and configures mic capture and audio playback independently:

javascript
// Web client receives session.config on connect
if (msg.type === 'session.config' && msg.audioFormat) {
  INPUT_RATE  = msg.audioFormat.inputSampleRate;   // mic downsampling target
  OUTPUT_RATE = msg.audioFormat.outputSampleRate;   // AudioContext playback rate
}

This means the same web client works with both Gemini and OpenAI without code changes — the server tells it what rates to use.

TIP

The framework is a pure byte relay — no server-side resampling. The web client handles resampling from the browser's native mic rate down to the provider's input rate.
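Client-side resampling can be as simple as decimation. A naive sketch (nearest-earlier-sample only; real clients typically interpolate or resample inside an AudioWorklet for better quality):

```typescript
// Sketch: downsample Float32 audio from the browser's native rate
// (e.g. 48 kHz) to the provider's input rate (e.g. 16 kHz).
function downsample(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  const ratio = fromRate / toRate;
  const out = new Float32Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    out[i] = input[Math.floor(i * ratio)]; // pick the nearest earlier sample
  }
  return out;
}
```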

Using a Pre-Configured Transport

For OpenAI (or any custom transport), you can inject a pre-constructed LLMTransport into VoiceSession:

typescript
import { OpenAIRealtimeTransport } from '@bodhi_agent/realtime-agent-framework';

const transport = new OpenAIRealtimeTransport({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o-realtime-preview',
  voice: 'coral',
});

const session = new VoiceSession({
  // ...required config...
  transport,  // Inject pre-configured transport
});

When you inject a transport, VoiceSession automatically syncs the agent's tools and instructions to it via updateSession() before connecting.

Voice Configuration

Voice configuration depends on the provider:

typescript
// Gemini — voice presets
const session = new VoiceSession({
  speechConfig: { voiceName: 'Puck' },
  // Available: Puck, Charon, Kore, Fenrir, Aoede, etc.
});

// OpenAI — voice names set in transport config
const transport = new OpenAIRealtimeTransport({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'coral',
  // Available: alloy, ash, ballad, coral, echo, sage, shimmer, verse
});

Zod-to-JSON Schema Conversion

Tool parameters defined with Zod are automatically converted to JSON Schema in the provider's function declaration format. Each transport handles the conversion internally: Gemini expects uppercase type names (e.g. OBJECT, STRING), while OpenAI uses standard lowercase JSON Schema:

typescript
// Your tool definition (same for both providers):
parameters: z.object({
  city: z.string().describe('City name'),
  units: z.enum(['celsius', 'fahrenheit']),
})

// Converted automatically by the transport to the provider's format

Context Window Compression

Context compression is a Gemini-specific capability (contextCompression: true). For long conversations, Gemini automatically compresses when the token count exceeds the configured threshold:

typescript
// Configured via GeminiLiveTransport internally
// Gemini compresses when token count exceeds triggerTokens,
// targeting targetTokens after compression.

When compression occurs, the context.compact event is published:

typescript
session.eventBus.subscribe('context.compact', (payload) => {
  console.log(`Compressed: removed ${payload.removedItems} items`);
});

OpenAI Realtime does not currently support server-side context compression.
