Add Whisper provider for audio transcription #10

@JaimeStill

Migrated from tailored-agentic-units/tau-agent#1

Summary

Add a WhisperProvider implementation for speech-to-text transcription through the agent subsystem's provider abstraction. This issue also addresses provider-level architectural changes discovered during the audio protocol implementation.

Scope

Provider Architecture Changes

  • Header ownership: Move Content-Type ownership from the Request interface to the Provider. Remove Headers() from request types. Provider sets all protocol headers in PrepareRequest().
  • Multipart encoding: AudioData.Input changes from string to []byte. BaseProvider.marshalAudio() produces multipart/form-data with boundary. Provider stores content-type for PrepareRequest().
  • Protocol capability validation: Providers should declare supported protocols upfront, so that an unsupported protocol produces a clear error at configuration or creation time.
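The header-ownership and multipart changes above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the `Filename` field, the free-function form of `marshalAudio`, and the `prepareRequest` helper are assumptions made for the example. The idea is that the encoder returns the Content-Type (including the boundary) alongside the body, and the provider is the only place that sets the header.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// AudioData mirrors the proposed change: Input carries raw bytes, not a string.
// Filename is a hypothetical field added for this sketch.
type AudioData struct {
	Input    []byte
	Filename string
}

// marshalAudio encodes the audio bytes as multipart/form-data under the given
// field name. It returns the body and the Content-Type, including the boundary.
func marshalAudio(field string, audio AudioData) ([]byte, string, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)

	part, err := w.CreateFormFile(field, audio.Filename)
	if err != nil {
		return nil, "", fmt.Errorf("create form file: %w", err)
	}
	if _, err := part.Write(audio.Input); err != nil {
		return nil, "", fmt.Errorf("write audio: %w", err)
	}
	// Close writes the trailing boundary; the body is incomplete without it.
	if err := w.Close(); err != nil {
		return nil, "", fmt.Errorf("close multipart writer: %w", err)
	}
	return buf.Bytes(), w.FormDataContentType(), nil
}

// prepareRequest shows the provider, rather than the request type, setting
// the Content-Type header that marshalAudio produced.
func prepareRequest(url string, body []byte, contentType string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", contentType)
	return req, nil
}
```

Because the boundary is generated per body, the Content-Type cannot be a constant on the request type, which is one reason to keep it with the provider.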

Whisper Provider Implementation

Important: onerahmet/openai-whisper-asr-webservice is NOT OpenAI-compatible. It requires a dedicated provider:

| Aspect         | OpenAI/Azure Standard    | Whisper Container |
|----------------|--------------------------|-------------------|
| Endpoint       | /v1/audio/transcriptions | /asr              |
| Field name     | file                     | audio_file        |
| Options        | Request body fields      | Query parameters  |
| Model field    | Required                 | Not used          |
| Authentication | Bearer token / API key   | None              |

  • Create agent/providers/whisper.go embedding BaseProvider
  • Implement Provider interface with dedicated endpoint, field mapping, and query parameter handling
  • Register via Register("whisper", NewWhisper) in agent/providers/registry.go
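To make the differences in the table concrete, here is a hedged sketch of how a request to the container might be assembled. `buildASRRequest` and its parameters are hypothetical names for illustration, and the option keys are passed through without validation because the container's accepted query parameters are not listed in this issue. The sketch uploads under `audio_file`, targets `/asr`, sends options as query parameters, and sets no Authorization header.

```go
package main

import (
	"bytes"
	"mime/multipart"
	"net/http"
	"net/url"
	"strings"
)

// buildASRRequest builds a POST to <baseURL>/asr with the audio uploaded under
// the "audio_file" field and the options encoded as query parameters.
func buildASRRequest(baseURL string, audio []byte, filename string, options map[string]string) (*http.Request, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/asr"

	// Options travel in the query string, not in the multipart body.
	q := u.Query()
	for k, v := range options {
		q.Set(k, v)
	}
	u.RawQuery = q.Encode()

	var body bytes.Buffer
	mw := multipart.NewWriter(&body)
	part, err := mw.CreateFormFile("audio_file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}
	if err := mw.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, u.String(), &body)
	if err != nil {
		return nil, err
	}
	// No model field and no Authorization header: the container uses neither.
	req.Header.Set("Content-Type", mw.FormDataContentType())
	return req, nil
}
```

A call such as `buildASRRequest("http://localhost:9999", data, "clip.wav", map[string]string{"language": "en"})` would yield a request for `/asr?language=en`; the `language` key here is only an example.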

Azure Whisper

  • Azure uses standard OpenAI multipart format at /deployments/whisper/audio/transcriptions
  • Azure's existing provider should work once multipart encoding is implemented in BaseProvider

Docker

  • Add whisper service to docker-compose.yml using onerahmet/openai-whisper-asr-webservice:latest, mapping host port 9999 to container port 9000, with GPU support

CLI

  • Extend cmd/prompt-agent/ with -input flag and executeAudio function
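One possible shape for the `-input` handling is sketched below. The helper names `parseInput` and `loadAudio` are hypothetical, and the real `executeAudio` function would hand the bytes to the agent rather than stop at reading them. The file is read into a `[]byte` so it lines up with the proposed `AudioData.Input` type.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseInput parses the -input flag from args and returns the audio file path.
func parseInput(args []string) (string, error) {
	fs := flag.NewFlagSet("prompt-agent", flag.ContinueOnError)
	input := fs.String("input", "", "path to an audio file to transcribe")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *input == "" {
		return "", fmt.Errorf("-input is required for audio transcription")
	}
	return *input, nil
}

// loadAudio reads the whole file into memory as raw bytes.
func loadAudio(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read audio %q: %w", path, err)
	}
	return data, nil
}
```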

Configuration

  • Add whisper configuration for local whisper container

Testing

  • Black-box tests co-located with provider implementation
  • Constructor, endpoint routing, request preparation, and response processing tests
  • HTTP mocking with httptest.NewServer()
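The mocking approach could look roughly like this. It is a sketch under assumptions: `newMockASR` and `transcribe` are hypothetical helpers, and the `{"text": ...}` JSON reply is an assumed response shape rather than something this issue specifies. The mock rejects anything that is not a POST to `/asr` carrying an `audio_file` part, so a test exercises endpoint routing, request preparation, and response processing in one round trip.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
)

// newMockASR returns a test server that imitates the container's /asr endpoint.
// The JSON reply shape is an assumption made for this sketch.
func newMockASR(reply string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/asr" {
			http.NotFound(w, r)
			return
		}
		f, _, err := r.FormFile("audio_file")
		if err != nil {
			http.Error(w, "missing audio_file", http.StatusBadRequest)
			return
		}
		defer f.Close()
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]string{"text": reply})
	}))
}

// transcribe posts the audio as multipart/form-data and decodes the reply.
func transcribe(baseURL string, audio []byte) (string, error) {
	var buf bytes.Buffer
	mw := multipart.NewWriter(&buf)
	part, err := mw.CreateFormFile("audio_file", "audio.wav")
	if err != nil {
		return "", err
	}
	if _, err := part.Write(audio); err != nil {
		return "", err
	}
	if err := mw.Close(); err != nil {
		return "", err
	}

	resp, err := http.Post(baseURL+"/asr", mw.FormDataContentType(), &buf)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		msg, _ := io.ReadAll(resp.Body)
		return "", fmt.Errorf("asr returned %d: %s", resp.StatusCode, msg)
	}

	var out struct {
		Text string `json:"text"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Text, nil
}
```

In a real test file the same server setup would sit inside a `func TestXxx(t *testing.T)` with `defer srv.Close()`, and the provider under test would be pointed at `srv.URL`.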

Notes

  • Multipart form-data handling departs from the pattern used by the existing JSON-based providers
  • Follow existing Ollama/Azure provider patterns for interface implementation

Metadata

  • Assignees: No one assigned
  • Labels: agent (agent subsystem), feature (New functionality)
  • Status: Todo
  • Milestone: No milestone
  • Relationships: None yet
  • Development: No branches or pull requests