Audio (STT & TTS)

hal0 ships two audio endpoints: speech-to-text on the stt slot, and text-to-speech on the tts slot. Both follow the request and response shape of the OpenAI Audio API, so any client written for OpenAI’s audio API works here.

The stt slot defaults to Moonshine, a small, fast ASR model built for real-time use on edge devices. The toolbox image is hal0-toolbox-moonshine.

curl http://localhost:8080/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F file=@hello.wav \
  -F model=stt

Response (OpenAI-shape):

{
  "text": "Hello, world."
}
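For clients that cannot shell out to curl, the same transcription call can be sketched with the Python standard library alone. This is a minimal sketch, not part of hal0: the function names `build_transcription_request` and `transcribe` are hypothetical, the base URL is taken from the curl example, and the `application/octet-stream` part type is an assumption about what the server accepts.

```python
import json
import os
import uuid
import urllib.request

# Assumption: hal0 listens on localhost:8080, as in the curl example.
BASE_URL = "http://localhost:8080"


def build_transcription_request(audio_bytes, filename="hello.wav",
                                model="stt", base_url=BASE_URL):
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    # Two parts: the plain "model" field, then the "file" field header.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    # Closing delimiter that ends the multipart body.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path, **kwargs):
    """Send an audio file and return the "text" field of the JSON response."""
    with open(path, "rb") as f:
        req = build_transcription_request(
            f.read(), filename=os.path.basename(path), **kwargs
        )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]
```

With a hal0 instance running, `transcribe("hello.wav")` would return the transcript string from the response shown above. Building the request is kept separate from sending it so the request can be inspected without a live server.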

The stt slot can host any ASR model the Moonshine provider supports. For higher accuracy, use whisper-large-v3-turbo (~1.6 GB) if you have the headroom, or Canary-Qwen-2.5B (Open ASR Leaderboard leader, 5.63% WER) for state-of-the-art accuracy. Swap with:

hal0 slot swap stt --model whisper-large-v3-turbo

See Recommended loadouts → Voice mode for the picks per tier.

The tts slot defaults to Kokoro-82M v1.0, a small open TTS model with 54 voices across 8 languages. The toolbox image is hal0-toolbox-kokoro.

curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts",
    "input": "Hello from hal0.",
    "voice": "af_bella"
  }' --output speech.wav
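The speech request above can also be sketched in Python with only the standard library. The function names `build_speech_request` and `synthesize` are hypothetical, and the base URL and default voice are copied from the curl example. The response bytes are written to disk unchanged, so the audio container is whatever the server returns.

```python
import json
import shutil
import urllib.request

# Assumption: hal0 listens on localhost:8080, as in the curl example.
BASE_URL = "http://localhost:8080"


def build_speech_request(text, voice="af_bella", model="tts",
                         base_url=BASE_URL):
    """Build the JSON POST for /v1/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav", **kwargs):
    """Request speech for `text` and stream the audio bytes to `out_path`."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        # Copy in chunks so long clips are not held in memory at once.
        shutil.copyfileobj(resp, out)
    return out_path
```

With a hal0 instance running, `synthesize("Hello from hal0.")` would write the returned audio to speech.wav, mirroring the `--output` flag in the curl call.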

For voice cloning, the Kokoro provider also supports F5-TTS. Swap with:

hal0 slot swap tts --model f5-tts

Moonshine and Kokoro are first-class providers as of v0.1.0-alpha. Both have working code paths, slot lifecycle integration, and published toolbox container images on ghcr.io/hal0ai/. The stt and tts slots are configurable from the dashboard and start cleanly.

  • Real-time streaming TTS (chunked PCM output).
  • Speaker diarization for transcription.
  • Voice cloning UX in the dashboard.
  • WebSocket transport for full duplex voice mode.