CPU-only

hal0’s CPU-only path is the fallback tier. It exists so the installer works on a fresh VM, so CI can smoke-test slot lifecycle on boxes without GPUs, and so anyone trying hal0 for the first time can do so before deciding whether to commit hardware to it.

It is not the headline experience. The streaming chat feel that makes local inference worth using needs at least an iGPU.

What works

The hardware probe detects “no GPU” and writes that to /etc/hal0/hardware.json. The installer picks the Vulkan-CPU path: llama.cpp compiled with the Vulkan backend running against lavapipe (Mesa’s software Vulkan implementation). This is the same path hal0’s CI uses to smoke-test the slot lifecycle with the Qwen 0.5B model.

All five built-in slots can theoretically run:

primary: small Q4 chat (4B and under)
embed: embeddings and rerank
stt: Moonshine (CPU-capable but latency-sensitive)
tts: Kokoro (CPU-capable but latency-sensitive)
img: ComfyUI (CPU-only image generation is glacial; not realistic)

The OpenAI-compatible /v1/* API, the slot lifecycle state machine, the dispatcher, and OpenWebUI all behave identically to a GPU box.

What to expect

The honest answer:

Chat: a few tokens per second on a 4B Q4 model with a modest context. Fine for occasional Q&A, painful for long conversations.
Streaming voice: not realistic. Moonshine STT and Kokoro TTS run on CPU, but the round-trip latency for streaming audio isn’t what the slots were designed for. You can sanity-check the path; you can’t run voice mode comfortably.
Embeddings: fine. nomic-embed-text-v2-moe-Q4_K_M at 140 MB runs at usable speeds on any modern CPU, and the embed slot doesn’t need streaming.

Recommended loadout (CPU-only, 32–64 GB RAM, no GPU)

primary: gemma-3-1b-it-Q4_K_M (~0.7 GB) or Qwen3-4B-Instruct-2507-Q4_K_M (~2.5 GB) for a snappier feel. (fallback: Phi-3-mini-4k-instruct-q4.gguf ~2.4 GB, the curated default.)
embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB). Runs fine on CPU.
No stt / tts slots. Leave them in the offline state.

This is also the smallest viable hal0 install. The whole runtime, with a model loaded, fits comfortably under 3 GB of RSS.

Use cases that make sense

Smoke-testing the install in an LXC or VM before committing hardware to it.
Development: running the API and dashboard against a tiny model while you build something against /v1/*.
CI: hal0’s own integration tier uses Vulkan-CPU + Qwen 0.5B, the same path you’d hit here.
A box that’s already running for some other reason, where you’d like an occasional local Q&A endpoint behind your Traefik or Caddy.

Use cases that don’t

A daily-driver chat box. Get an iGPU.
Anything voice-mode. Get an iGPU.
Any model larger than Q4 4B unless you are deeply patient.
Anything where the model is wider than memory bandwidth supports. Above 8B Q4 on CPU you start hitting wall-clock limits that no amount of patience fixes.

Installation notes

The standard installer from the install page detects no-GPU correctly and picks Vulkan-CPU:

curl -fsSL https://hal0.dev/install.sh | bash

A few things to check:

The Vulkan loader must be installed (apt install libvulkan1 / pacman -S vulkan-icd-loader).
Mesa’s lavapipe (mesa-vulkan-drivers / vulkan-swrast) provides the software Vulkan implementation.
vulkaninfo --summary should show llvmpipe as a device.

CPU-only is the simplest deployment to put in an unprivileged LXC: no /dev/dri, no /dev/kfd, no cgroup allows beyond the defaults. It’s a useful way to validate the slot lifecycle, the dispatcher, and the /v1/* API surface before you go cut a privileged container for real GPU passthrough.

Troubleshooting

vulkaninfo shows no devices on a no-GPU box. Install Mesa’s software rasterizer Vulkan driver:

Debian/Ubuntu: apt install mesa-vulkan-drivers
Arch/CachyOS: pacman -S vulkan-swrast
Fedora: dnf install vulkan-loader mesa-vulkan-drivers

Slot starts but inference is extremely slow. Expected on CPU. Confirm the model is the size you think it is (a Q8 14B is dramatically slower than a Q4 4B), and shorten context windows where possible.

OOM on slot start. A Q4 model needs roughly its file size in RAM plus headroom for KV cache. A 7 GB model on a 8 GB box won’t fit; swap matters more here than it does on GPU paths.

When to graduate from CPU-only

If you’re doing anything more than smoke-testing, the cheapest meaningful upgrade is any modern AMD APU with RDNA-class graphics. Even a 780M-class iGPU is dramatically faster than CPU-only Vulkan on chat workloads. The full Strix Halo experience is the top end; the floor is “any iGPU at all.”