Skip to content

Transport

Transport abstracts provider-specific realtime APIs behind a common LLMTransport interface. VoiceSession owns agent/tool orchestration and delegates provider wire protocol details to the active transport.

Realtime LLM Providers

TransportUse WhenNotes
GeminiLiveTransportYou want Gemini Live native audio, Google Search grounding, session resumption, or Gemini input/output transcription.Created automatically when VoiceSession receives an API key and no custom transport.
OpenAIRealtimeTransportYou want OpenAI Realtime native audio and OpenAI session semantics.Pass a constructed transport through VoiceSessionConfig.transport.

Both transports expose:

  • audioFormat so clients and STT/TTS providers can match the provider's PCM rate
  • capabilities so orchestration branches on features instead of provider names
  • tool-call and tool-result mapping
  • turn completion, interruption, transcript, error, close, and usage callbacks
  • transferSession() so agent transfers can update provider instructions and tools

Using OpenAI Realtime

typescript
import { google } from '@ai-sdk/google';
import {
  OpenAIRealtimeTransport,
  VoiceSession,
} from 'bodhi-realtime-agent';

const transport = new OpenAIRealtimeTransport({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-realtime',
  voice: 'alloy',
});

const session = new VoiceSession({
  sessionId: 'session_1',
  userId: 'user_1',
  apiKey: process.env.GEMINI_API_KEY!, // still used by background subagents
  agents: [mainAgent],
  initialAgent: 'main',
  model: google('gemini-2.5-flash'),
  transport,
  port: 9900,
});

STT Providers

VoiceSessionConfig.sttProvider attaches an external transcription provider. The provider receives the transport's actual input format through configure() before it starts.

typescript
import {
  GeminiBatchSTTProvider,
  VoiceSession,
} from 'bodhi-realtime-agent';

const session = new VoiceSession({
  // ...
  sttProvider: new GeminiBatchSTTProvider({
    apiKey: process.env.GEMINI_API_KEY!,
    model: 'gemini-3-flash-preview',
  }),
});

When an external STT provider is set, VoiceSession feeds every client audio chunk into the provider and commits the provider at model-turn boundaries. Gemini built-in transcription can still be used as a post-hoc correction path when available.

Deepgram Nova-3 live streaming can be used when you want lower-latency partial user transcripts:

typescript
import {
  DeepgramSTTProvider,
  VoiceSession,
} from 'bodhi-realtime-agent';

const session = new VoiceSession({
  // ...
  sttProvider: new DeepgramSTTProvider({
    apiKey: process.env.DEEPGRAM_API_KEY!,
    model: 'nova-3',
    language: 'en-US',
  }),
  inputAudioTranscription: false, // Optional: make Deepgram the only input transcript source.
});

Deepgram receives the active transport's PCM input format through configure(). With Gemini Live that is 16 kHz mono PCM16; with OpenAI Realtime that is 24 kHz mono PCM16.

TTS Providers

VoiceSessionConfig.ttsProvider switches the realtime LLM to text-mode responses and routes text deltas into a streaming TTSProvider.

Built-in providers:

  • CartesiaTTSProvider
  • ElevenLabsTTSProvider

See External TTS for details.

Client Transports

There are two client-connection modes:

ModeConfigUse Case
Local session socketport / hostSimple examples and local development. VoiceSession creates a ClientTransport.
Server-owned socketclientSender plus feedAudioFromClient() / feedJsonFromClient()Multi-user servers that route many WebSocket connections to many sessions.

MultiClientTransport is the server-owned socket helper. It accepts many WebSocket clients, assigns each connection a ConnectionContext, and lets application code bind that connection to a VoiceSession.

typescript
const transport = new MultiClientTransport(9900, {
  onAudioFromClient: (_ws, data, context) => {
    const session = sessions.getSession(context.sessionId!);
    session?.feedAudioFromClient(data);
  },
  onJsonFromClient: (_ws, message, context) => {
    const session = sessions.getSession(context.sessionId!);
    session?.feedJsonFromClient(message);
  },
});

await transport.start();

Audio Formats

The framework normalizes around PCM L16 mono, but sample rates are provider-specific:

Provider PathInputOutput
Gemini Live16 kHz PCM24 kHz PCM
OpenAI Realtime24 kHz PCM24 kHz PCM
Twilio Media Streamsmulaw 8 kHzmulaw 8 kHz
Framework telephony bridge16 kHz PCM16 kHz PCM

STT providers receive the transport input format. TTS providers receive the preferred output format and can return a different supported PCM rate; VoiceSession resamples to the client format when needed.

Built with VitePress