CPU-only
hal0’s CPU-only path is the fallback tier. It exists so the installer works on a fresh VM, so CI can smoke-test slot lifecycle on boxes without GPUs, and so anyone trying hal0 for the first time can do so before deciding whether to commit hardware to it.
It is not the headline experience. The streaming chat feel that makes local inference worth using needs at least an iGPU.
What works
Section titled “What works”The hardware probe detects “no GPU” and writes that to
/etc/hal0/hardware.json. The installer picks the Vulkan-CPU path:
llama.cpp compiled with the Vulkan backend running against
lavapipe (Mesa’s
software Vulkan implementation). This is the same path hal0’s CI uses
to smoke-test the slot lifecycle with the Qwen 0.5B model.
All five built-in slots can theoretically run:
primary: small Q4 chat (4B and under)embed: embeddings and rerankstt: Moonshine (CPU-capable but latency-sensitive)tts: Kokoro (CPU-capable but latency-sensitive)img: ComfyUI (CPU-only image generation is glacial; not realistic)
The OpenAI-compatible /v1/* API, the slot lifecycle state machine,
the dispatcher, and OpenWebUI all behave identically to a GPU box.
What to expect
Section titled “What to expect”The honest answer:
- Chat: a few tokens per second on a 4B Q4 model with a modest context. Fine for occasional Q&A, painful for long conversations.
- Streaming voice: not realistic. Moonshine STT and Kokoro TTS run on CPU, but the round-trip latency for streaming audio isn’t what the slots were designed for. You can sanity-check the path; you can’t run voice mode comfortably.
- Embeddings: fine.
nomic-embed-text-v2-moe-Q4_K_Mat 140 MB runs at usable speeds on any modern CPU, and the embed slot doesn’t need streaming.
Recommended loadout (CPU-only, 32–64 GB RAM, no GPU)
Section titled “Recommended loadout (CPU-only, 32–64 GB RAM, no GPU)”primary:gemma-3-1b-it-Q4_K_M(~0.7 GB) orQwen3-4B-Instruct-2507-Q4_K_M(~2.5 GB) for a snappier feel. (fallback:Phi-3-mini-4k-instruct-q4.gguf~2.4 GB, the curated default.)embed:nomic-embed-text-v2-moe-Q4_K_M(~140 MB). Runs fine on CPU.- No
stt/ttsslots. Leave them in theofflinestate.
This is also the smallest viable hal0 install. The whole runtime, with a model loaded, fits comfortably under 3 GB of RSS.
Use cases that make sense
Section titled “Use cases that make sense”- Smoke-testing the install in an LXC or VM before committing hardware to it.
- Development: running the API and dashboard against a tiny model
while you build something against
/v1/*. - CI: hal0’s own integration tier uses Vulkan-CPU + Qwen 0.5B, the same path you’d hit here.
- A box that’s already running for some other reason, where you’d like an occasional local Q&A endpoint behind your Traefik or Caddy.
Use cases that don’t
Section titled “Use cases that don’t”- A daily-driver chat box. Get an iGPU.
- Anything voice-mode. Get an iGPU.
- Any model larger than Q4 4B unless you are deeply patient.
- Anything where the model is wider than memory bandwidth supports. Above 8B Q4 on CPU you start hitting wall-clock limits that no amount of patience fixes.
Installation notes
Section titled “Installation notes”The standard installer from the install page detects no-GPU correctly and picks Vulkan-CPU:
curl -fsSL https://hal0.dev/install.sh | bashA few things to check:
- The Vulkan loader must be installed
(
apt install libvulkan1/pacman -S vulkan-icd-loader). - Mesa’s lavapipe (
mesa-vulkan-drivers/vulkan-swrast) provides the software Vulkan implementation. vulkaninfo --summaryshould showllvmpipeas a device.
CPU-only is the simplest deployment to put in an unprivileged LXC:
no /dev/dri, no /dev/kfd, no cgroup allows beyond the defaults.
It’s a useful way to validate the slot lifecycle, the dispatcher, and
the /v1/* API surface before you go cut a privileged container for
real GPU passthrough.
Troubleshooting
Section titled “Troubleshooting”vulkaninfo shows no devices on a no-GPU box. Install Mesa’s
software rasterizer Vulkan driver:
- Debian/Ubuntu:
apt install mesa-vulkan-drivers - Arch/CachyOS:
pacman -S vulkan-swrast - Fedora:
dnf install vulkan-loader mesa-vulkan-drivers
Slot starts but inference is extremely slow. Expected on CPU. Confirm the model is the size you think it is (a Q8 14B is dramatically slower than a Q4 4B), and shorten context windows where possible.
OOM on slot start. A Q4 model needs roughly its file size in RAM plus headroom for KV cache. A 7 GB model on a 8 GB box won’t fit; swap matters more here than it does on GPU paths.
When to graduate from CPU-only
Section titled “When to graduate from CPU-only”If you’re doing anything more than smoke-testing, the cheapest meaningful upgrade is any modern AMD APU with RDNA-class graphics. Even a 780M-class iGPU is dramatically faster than CPU-only Vulkan on chat workloads. The full Strix Halo experience is the top end; the floor is “any iGPU at all.”