Audio (STT & TTS)

hal0 ships two audio endpoints: speech-to-text on the stt slot and text-to-speech on the tts slot. Both follow the request and response shapes of the OpenAI Audio API, so any client written for that API works here once it is pointed at hal0's base URL.

Speech-to-text: Moonshine

The stt slot defaults to Moonshine, a small, fast ASR model built for real-time use on edge devices. The toolbox image is hal0-toolbox-moonshine.

curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@hello.wav \
  -F model=stt

Response (OpenAI-shape):

{
  "text": "Hello, world."
}
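The same request can be made from Python with only the standard library. The following is a minimal sketch, not an official hal0 client: it assumes the base URL from the curl example, and the function names (build_transcription_request, parse_transcription, transcribe) are hypothetical helpers introduced for illustration.

```python
import json
import os
import uuid
import urllib.request

# Assumption: same host and port as the curl example; adjust for your deployment.
BASE_URL = "http://localhost:8080/v1"


def build_transcription_request(filename, audio_bytes, model="stt", base_url=BASE_URL):
    """Build a multipart/form-data POST with the same fields as the curl call."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; '
        f'filename="{os.path.basename(filename)}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def parse_transcription(payload):
    """Pull the transcript out of the OpenAI-shaped JSON response body."""
    return json.loads(payload)["text"]


def transcribe(path, model="stt"):
    """Send a local audio file to the stt slot and return the transcript."""
    with open(path, "rb") as f:
        request = build_transcription_request(path, f.read(), model=model)
    with urllib.request.urlopen(request) as response:
        return parse_transcription(response.read())
```

With the stt slot running, transcribe("hello.wav") returns the value of the "text" field from the response.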

Alternates

The stt slot can host any ASR-compatible model the Moonshine provider supports. For higher accuracy, use whisper-large-v3-turbo (~1.6 GB) if you have the headroom, or Canary-Qwen-2.5B (the Open ASR Leaderboard leader at 5.63% WER) for state-of-the-art accuracy. Swap with:

hal0 slot swap stt --model whisper-large-v3-turbo

See Recommended loadouts → Voice mode for the picks per tier.

Text-to-speech: Kokoro

The tts slot defaults to Kokoro-82M v1.0, a small open TTS model with 54 voices across 8 languages. The toolbox image is hal0-toolbox-kokoro.

curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts",
    "input": "Hello from hal0.",
    "voice": "af_bella"
  }' --output speech.wav
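A Python equivalent of the curl call, again as a minimal standard-library sketch rather than an official client. The base URL is taken from the curl example, and build_speech_request and synthesize are hypothetical helper names. The sketch writes whatever bytes the endpoint returns and makes no assumption about the audio format beyond the speech.wav filename used above.

```python
import json
import urllib.request

# Assumption: same host and port as the curl example; adjust for your deployment.
BASE_URL = "http://localhost:8080/v1"


def build_speech_request(text, voice="af_bella", model="tts", base_url=BASE_URL):
    """Build the JSON POST with the same fields as the curl call."""
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav", voice="af_bella"):
    """Request speech from the tts slot and save the returned audio bytes."""
    request = build_speech_request(text, voice=voice)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
    return out_path
```

With the tts slot running, synthesize("Hello from hal0.") saves the audio to speech.wav in the current directory.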

Alternates

For voice cloning, the Kokoro provider also supports F5-TTS. Swap with:

hal0 slot swap tts --model f5-tts
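If you script model swaps, for example to compare two models against the same input, the CLI call can be wrapped from Python. This is a sketch that only shells out to the hal0 slot swap command shown on this page; swap_command and swap_slot are hypothetical helper names, and restricting the slot to stt or tts is an assumption made here because those are the only slots this page covers.

```python
import subprocess

# Assumption: only the two audio slots documented on this page are accepted.
AUDIO_SLOTS = ("stt", "tts")


def swap_command(slot, model):
    """Return the argv list for `hal0 slot swap <slot> --model <model>`."""
    if slot not in AUDIO_SLOTS:
        raise ValueError(f"unknown audio slot: {slot!r}")
    return ["hal0", "slot", "swap", slot, "--model", model]


def swap_slot(slot, model):
    """Run the swap and raise CalledProcessError if the CLI exits non-zero."""
    subprocess.run(swap_command(slot, model), check=True)
```

For example, swap_slot("tts", "f5-tts") runs the same command as the line above, and swap_slot("stt", "whisper-large-v3-turbo") runs the speech-to-text swap.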

Status today

Moonshine and Kokoro are first-class providers as of v0.1.0-alpha. Both have working code paths, slot lifecycle integration, and published toolbox container images on ghcr.io/hal0ai/. The stt and tts slots are configurable from the dashboard and start cleanly.

Coming soon

Real-time streaming TTS (chunked PCM output).
Speaker diarization for transcription.
Voice cloning UX in the dashboard.
WebSocket transport for full-duplex voice mode.