Migrated from tailored-agentic-units/tau-agent#1
Summary
Add a WhisperProvider implementation for speech-to-text transcription through the agent subsystem's provider abstraction. This issue also addresses provider-level architectural changes discovered during the audio protocol implementation.
Scope
Provider Architecture Changes
- Header ownership: Move Content-Type ownership from the Request interface to the Provider. Remove Headers() from request types. Provider sets all protocol headers in PrepareRequest().
- Multipart encoding: AudioData.Input changes from string to []byte. BaseProvider.marshalAudio() produces multipart/form-data with boundary. Provider stores content-type for PrepareRequest().
- Protocol capability validation: Providers should declare supported protocols upfront. Clean error at config/creation time if a protocol is unsupported.
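A minimal sketch of how header ownership and multipart encoding might fit together. The `baseProvider` type, its `contentType` field, and the method signatures are assumptions made for illustration and do not reflect the actual BaseProvider API; the `model` value is only an example.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// baseProvider is a hypothetical stand-in for BaseProvider.
type baseProvider struct {
	contentType string // set by marshalAudio, consumed by prepareRequest
}

// marshalAudio encodes raw audio bytes plus optional text fields as
// multipart/form-data and remembers the Content-Type, which carries
// the generated boundary.
func (p *baseProvider) marshalAudio(field string, input []byte, fields map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)

	part, err := w.CreateFormFile(field, "audio")
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(input); err != nil {
		return nil, err
	}
	for name, value := range fields {
		if err := w.WriteField(name, value); err != nil {
			return nil, err
		}
	}
	// Close writes the terminating boundary; the body is incomplete without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	p.contentType = w.FormDataContentType()
	return buf.Bytes(), nil
}

// prepareRequest sets the protocol headers on the provider side, so the
// request types no longer need a Headers() method.
func (p *baseProvider) prepareRequest(url string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", p.contentType)
	return req, nil
}

func main() {
	p := &baseProvider{}
	body, err := p.marshalAudio("file", []byte("fake audio"), map[string]string{"model": "whisper-1"})
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	req, err := p.prepareRequest("http://localhost/v1/audio/transcriptions", body)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	fmt.Println(req.Header.Get("Content-Type"))
}
```

Storing the content type on the provider, as in this sketch, is not safe for concurrent requests; returning it alongside the body may be a better fit if a provider instance is shared.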
Whisper Provider Implementation
Important: onerahmet/openai-whisper-asr-webservice is NOT OpenAI-compatible. It requires a dedicated provider:
| Aspect | OpenAI/Azure Standard | Whisper Container |
| --- | --- | --- |
| Endpoint | /v1/audio/transcriptions | /asr |
| Field name | file | audio_file |
| Options | Request body fields | Query parameters |
| Model field | Required | Not used |
| Authentication | Bearer token / API key | None |
- Create agent/providers/whisper.go embedding BaseProvider
- Implement Provider interface with dedicated endpoint, field mapping, and query parameter handling
- Register via Register("whisper", NewWhisper) in agent/providers/registry.go
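The differences in the table above could translate into request construction roughly as follows. This is a sketch under assumptions: `buildWhisperRequest` is a hypothetical helper, not part of the Provider interface, and the option key used in `main` is illustrative and should be checked against the container's documentation.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
	"net/url"
	"strings"
)

// buildWhisperRequest is a hypothetical helper showing the container's
// conventions: POST to /asr, audio in the "audio_file" form field,
// options as query parameters, no model field, no auth header.
func buildWhisperRequest(baseURL string, audio []byte, options map[string]string) (*http.Request, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, fmt.Errorf("parse base URL: %w", err)
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/asr"

	// Options travel in the query string rather than the request body.
	q := u.Query()
	for name, value := range options {
		q.Set(name, value)
	}
	u.RawQuery = q.Encode()

	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	part, err := w.CreateFormFile("audio_file", "audio")
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, u.String(), &buf)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	// Port 9999 matches the host port in the Docker mapping described in this issue.
	req, err := buildWhisperRequest("http://localhost:9999", []byte("fake audio"),
		map[string]string{"output": "json"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```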
Azure Whisper
- Azure uses standard OpenAI multipart format at /deployments/whisper/audio/transcriptions
- Azure's existing provider should work once multipart encoding is implemented in BaseProvider
Docker
- Add whisper service to docker-compose.yml using onerahmet/openai-whisper-asr-webservice:latest on port 9999:9000 with GPU support
CLI
- Extend cmd/prompt-agent/ with -input flag and executeAudio function
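One possible shape for the CLI change, sketched with the standard flag package. The `parseInput` helper, the `executeAudio` signature, and the `transcribe` callback are assumptions for illustration; the real command would call the agent rather than a stub.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseInput reads the -input flag from the given arguments.
func parseInput(args []string) (string, error) {
	fs := flag.NewFlagSet("prompt-agent", flag.ContinueOnError)
	input := fs.String("input", "", "path to an audio file to transcribe")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *input, nil
}

// executeAudio (hypothetical signature) loads the audio file as raw bytes,
// matching the []byte form of AudioData.Input, and hands it to transcribe.
func executeAudio(path string, transcribe func([]byte) (string, error)) (string, error) {
	audio, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("read audio input: %w", err)
	}
	return transcribe(audio)
}

func main() {
	path, err := parseInput(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if path == "" {
		fmt.Println("no -input given")
		return
	}
	// Stub transcriber; a real implementation would go through the provider.
	text, err := executeAudio(path, func(b []byte) (string, error) {
		return fmt.Sprintf("(stub) received %d bytes", len(b)), nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(text)
}
```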
Configuration
- Add whisper configuration for local whisper container
Testing
- Black-box tests co-located with provider implementation
- Constructor, endpoint routing, request preparation, and response processing tests
- HTTP mocking with httptest.NewServer()
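The mocking approach might look like the sketch below, written as a standalone program so it runs without the provider package; in the repository it would live in a _test.go file using testing.T. `newFakeWhisper` and `postAudio` are hypothetical helpers, and the response text is invented for the mock rather than taken from the real container.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
)

// newFakeWhisper starts a mock server that accepts uploads on /asr and
// rejects requests that do not carry an "audio_file" form field.
func newFakeWhisper() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/asr" {
			http.NotFound(w, r)
			return
		}
		file, _, err := r.FormFile("audio_file")
		if err != nil {
			http.Error(w, "missing audio_file", http.StatusBadRequest)
			return
		}
		defer file.Close()
		data, _ := io.ReadAll(file)
		fmt.Fprintf(w, "received %d bytes", len(data))
	}))
}

// postAudio uploads audio under the given form field and returns the
// response status code and body.
func postAudio(baseURL, field string, audio []byte) (int, string, error) {
	var buf bytes.Buffer
	mw := multipart.NewWriter(&buf)
	part, err := mw.CreateFormFile(field, "clip.wav")
	if err != nil {
		return 0, "", err
	}
	if _, err := part.Write(audio); err != nil {
		return 0, "", err
	}
	if err := mw.Close(); err != nil {
		return 0, "", err
	}
	resp, err := http.Post(baseURL+"/asr", mw.FormDataContentType(), &buf)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	srv := newFakeWhisper()
	defer srv.Close()

	status, body, err := postAudio(srv.URL, "audio_file", []byte("fake audio"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(status, body)
}
```

Posting with the OpenAI-style field name `file` against this mock returns a 400, which is a cheap way to assert that the field mapping is applied.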
Notes
- Multipart form data handling is non-standard compared to existing JSON-based providers
- Follow existing Ollama/Azure provider patterns for interface implementation