## Problem

Celeste has TTS (`celeste.audio.speak()`) but no support for real-time bidirectional voice conversations. Projects using Gemini Live, the OpenAI Realtime API, or xAI Grok voice still need raw provider SDKs.
## Provider Landscape (verified)

| Provider | Protocol | Audio Format | Function Calling | Notes |
|---|---|---|---|---|
| OpenAI Realtime | WebSocket (`wss://api.openai.com/v1/realtime`) | PCM 16-bit 24kHz mono | Yes | Multimodal input (audio + images + text) |
| Google Gemini Live | WebSocket (`wss://generativelanguage.googleapis.com/ws/...`) | PCM 16kHz in / 24kHz out | Yes | Video support (1 FPS, 768x768), 24 languages |
| xAI Grok | WebSocket (`wss://api.x.ai/v1/realtime`) | Base64 audio | Yes | OpenAI Realtime API-compatible protocol |
| ElevenLabs | WebSocket (`wss://api.elevenlabs.io/v1/text-to-speech/...`) | MP3/PCM/WAV/μ-law | No | Unidirectional TTS streaming only |
Note: Anthropic Claude audio capabilities are HTTP/SSE-based, not WebSocket. No verified WebSocket voice endpoint exists in official docs as of Feb 2026.
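As a concrete reference point, the OpenAI Realtime protocol above is JSON-over-WebSocket, with raw PCM carried as base64 inside events. A minimal sketch of building the two core client events follows; the event names (`session.update`, `input_audio_buffer.append`) are from OpenAI's published protocol, but the exact field set should be checked against the current docs before relying on it. No network I/O is involved here.

```python
import base64
import json
import struct


def session_update(voice: str, instructions: str) -> str:
    # Configure the session right after connecting.
    return json.dumps({
        "type": "session.update",
        "session": {"voice": voice, "instructions": instructions},
    })


def audio_append(pcm16: bytes) -> str:
    # Raw PCM 16-bit 24kHz mono is base64-encoded into the JSON event.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16).decode("ascii"),
    })


# 10 ms of silence at 24kHz mono, 16-bit: 240 samples * 2 bytes = 480 bytes.
silence = struct.pack("<240h", *([0] * 240))
event = json.loads(audio_append(silence))
```

The key takeaway for Celeste: audio bytes never travel as raw WebSocket binary frames on this protocol; they ride inside JSON text frames, so even "text-only" connection plumbing has to handle base64 encode/decode on the hot path.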
## What this requires

Real-time voice is fundamentally different from Celeste's current request-response model:

- True bidirectional I/O: sending audio chunks *while* simultaneously receiving audio chunks
- Event multiplexing: audio deltas, text transcripts, tool calls, and session updates all arrive on the same connection
- Binary frame handling: the current `WebSocketConnection.recv()` decodes bytes to UTF-8, which is lossy for PCM audio
- Stateful sessions: configure, update mid-session, keep alive, then close (not request → response → done)
- Concurrent send/receive: `asyncio.TaskGroup` or a similar pattern, not sequential iteration
Celeste already has `src/celeste/websocket.py` (`WebSocketClient` + `WebSocketConnection` + registry), and Gradium TTS uses it, but that is unidirectional "send-then-receive". The infrastructure would need significant extension.

The current `Stream[Out, Params, Chunk]` base class is sequential and single-event-type. Real-time voice needs parallel event processing across multiple event types.
## Architecture: needs deeper design

This doesn't fit cleanly as "just another operation" like `speak()` or `transcribe()`. The entire streaming pipeline assumes request-response. Adding `CONVERSE` needs deeper architectural thinking about:

- Where does the bidirectional session abstraction live? (A new protocol? Extended streaming? A session-object pattern?)
- How does `DOMAIN_OPERATION_TO_MODALITY` handle audio-in + audio-out?
- How do we multiplex different event types (audio, text, tool calls) on one connection?
- Can the existing `WebSocketConnection` be extended, or do we need a new `RealtimeSession` abstraction?
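For the last question, one possible shape for a `RealtimeSession` is sketched below: a stateful object that is both a sink (`send_audio`) and an async iterator of events, with `close()` ending iteration via a sentinel. Every name and method here is hypothetical (not existing Celeste API), and the "server" side is faked by echoing locally so the lifecycle can be shown without a network.

```python
import asyncio
from typing import AsyncIterator


class RealtimeSession:
    """Sketch of a session-object pattern: stateful, duplex, iterable."""

    def __init__(self) -> None:
        self._events: asyncio.Queue[object | None] = asyncio.Queue()
        self._closed = False

    async def send_audio(self, pcm_chunk: bytes) -> None:
        # Real impl: write a frame to the WebSocket. Here: fake a local echo.
        await self._events.put({"echo_bytes": len(pcm_chunk)})

    async def close(self) -> None:
        self._closed = True
        await self._events.put(None)  # sentinel ends iteration

    def __aiter__(self) -> AsyncIterator[object]:
        return self._iter()

    async def _iter(self) -> AsyncIterator[object]:
        while (event := await self._events.get()) is not None:
            yield event


async def demo() -> list[object]:
    session = RealtimeSession()
    await session.send_audio(b"\x00" * 480)
    await session.close()
    return [event async for event in session]


seen = asyncio.run(demo())
```

Whether this lives next to `WebSocketConnection` or replaces it for the `CONVERSE` path is exactly the open design question.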
## Desired user-facing API (aspirational)

```python
import celeste

session = await celeste.audio.converse(
    model="gpt-4o-realtime-preview",
    voice="alloy",
    instructions="You are a helpful assistant.",
    tools=[...],
)

await session.send_audio(pcm_chunk)
await session.send_text("Hello")

async for event in session:
    match event:
        case AudioDelta(data=data):
            play(data)
        case TextDelta(content=content):
            print(content)
        case ToolCall(name=name, arguments=arguments):
            result = execute(event)
            await session.send_tool_result(result)

await session.close()
```
## Not in scope
- Video input (Gemini Live supports it — defer to future)
- Phone/telephony integration (Twilio etc.)
- Anthropic voice (no verified WebSocket API yet)