NVIDIA discrete GPU

hal0 supports NVIDIA discrete GPUs through a CUDA-backed llama.cpp toolbox. NVIDIA is on the v1 supported list and the path works today, with one caveat: the dedicated CUDA toolbox image hasn’t been published to ghcr.io/hal0ai/ yet, so most users in v1 fall back to the Vulkan path on NVIDIA hardware.

NVIDIA discrete is a supported target, not the reference platform. For dedicated VRAM with mature drivers, NVIDIA is the obvious choice. For big-model headroom in one box, Strix Halo remains the headline target.

  • RTX 5090 (32 GB GDDR7)
  • RTX 4090 / 3090 (24 GB GDDR6X)
  • RTX 4080 / 4080 Super (16 GB GDDR6X)
  • RTX 3080 / 3080 Ti (10–12 GB GDDR6X)
  • Older 20-series and below: technically supported by llama.cpp; not a v1 focus.

Cards below 10 GB run very small models only (Q4 4B and under).

Out of the box in v1, NVIDIA users typically run on the Vulkan toolbox, the same image Strix Halo and AMD discrete use. It works, and it’s the path the installer picks when it detects an NVIDIA GPU without a published CUDA toolbox.

The CUDA toolbox is on the build list. When it lands, expect a non-trivial throughput improvement on chat workloads and substantially better large-context handling. Before-and-after numbers will go on this page once the CUDA toolbox ships.

The per-card loadouts below mirror the discrete-card section of the Strix Halo loadouts.

RTX 5090 (32 GB)

  • primary: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (~18.6 GB) or any Q4 ~30B chat, comfortable with a 16–32k context.
  • embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB) co-resident.
  • Q4 70B (Hermes-4-70B / Llama-3.3-70B) is feasible but tight with partial CPU offload; expect lower tok/s than VRAM-resident inference.
  • Trade vs Strix Halo: no headroom for a hot STT/TTS slot alongside a 30B primary.

RTX 4090 / 3090 (24 GB)

  • primary: Qwen3-30B-A3B-Instruct-2507-Q4_K_M (~18.6 GB) fits with shorter context, or gemma-3-12b-it-Q4_K_M (~6.6 GB) for a longer window.
  • embed: small Q4 embed only (nomic-embed-text-v2-moe ~140 MB).
  • Q4 70B requires partial CPU offload. It works, but drops well below VRAM-resident speeds.
  • Trade vs 5090: tighter context budgets at the same model size.

RTX 4080 / 4080 Super (16 GB)

  • primary: gemma-3-12b-it-Q4_K_M (~6.6 GB) or Hermes-4-14B-Q4_K_M (~9 GB).
  • embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB) leaves several GB for a ~16k context.
  • Q4 32B class (Qwen3-30B-A3B) is offload-only here. Workable occasionally, not as a daily driver.
  • Trade vs 24 GB cards: keep the primary at ~13B class for a smooth experience.

RTX 3080 / 3080 Ti (10–12 GB)

  • primary: a 4–14B Q4. Hermes-4-14B-Q4_K_M, gemma-3-12b-it-Q4_K_M, or Qwen3-4B-Instruct-2507-Q4_K_M (~2.5 GB) for low-latency.
  • embed: skip on 10–12 GB cards.
  • One slot at a time is the norm.
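
The fit reasoning behind these loadouts is simple subtraction, and a small helper makes it easy to repeat for other model combinations. This is a rough sketch, not something hal0 ships: `vram_headroom` is a hypothetical name, the sizes are the approximate file sizes quoted above, and the result ignores the KV cache, compute buffers, and whatever the desktop already holds in VRAM, all of which must fit in the remaining headroom.

```shell
# vram_headroom VRAM_GB SIZE_GB [SIZE_GB...]
# Hypothetical helper (not part of hal0): prints the VRAM left, in GB,
# after subtracting the listed model file sizes from the card's VRAM.
vram_headroom() {
  vram_gb=$1
  shift
  # awk does the decimal arithmetic that shell integer math cannot.
  printf '%s\n' "$@" | awk -v vram="$vram_gb" '
    { used += $1 }
    END { printf "%.1f\n", vram - used }'
}

# 24 GB card with an 18.6 GB primary and a 0.14 GB embed model.
vram_headroom 24 18.6 0.14

# 16 GB card with the same 18.6 GB primary: a negative result means
# the model cannot be fully VRAM-resident and needs CPU offload.
vram_headroom 16 18.6
```

A positive result is what remains for context; a negative one corresponds to the "offload-only" entries above.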

The standard one-liner from the install page handles the Vulkan path on NVIDIA:

curl -fsSL https://hal0.dev/install.sh | bash

You’ll want:

  • Recent NVIDIA proprietary drivers installed via your distro (575+ series recommended for newest cards).
  • Vulkan runtime present (vulkaninfo --summary returns devices).
  • The service user has access to /dev/nvidia*; the driver install usually handles this.
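
The three prerequisites above can be checked before running the installer. The functions below are an illustrative sketch, not part of hal0: the names `driver_major`, `driver_recent`, and `preflight` are made up for this example, and the 575 threshold is simply the recommendation from the list above. It assumes `nvidia-smi` and `vulkaninfo` are on the PATH when installed.

```shell
# driver_major VERSION: prints the major component, e.g. 575 for 575.64.03.
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# driver_recent VERSION [MIN_MAJOR]: succeeds if the major version is at
# least MIN_MAJOR (default 575, the series recommended for newest cards).
driver_recent() {
  [ "$(driver_major "$1")" -ge "${2:-575}" ]
}

# preflight: runs the three checks and returns non-zero if any hard
# requirement is missing. Run it as the service user to test that
# user's device access.
preflight() {
  status=0

  version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
  if [ -z "$version" ]; then
    echo "missing: nvidia-smi reported no driver"; status=1
  elif driver_recent "$version"; then
    echo "ok: driver $version"
  else
    echo "note: driver $version is older than the 575 series"
  fi

  # vulkaninfo --summary prints a deviceName line per detected device.
  if vulkaninfo --summary 2>/dev/null | grep -q deviceName; then
    echo "ok: Vulkan runtime lists at least one device"
  else
    echo "missing: Vulkan runtime with a visible device"; status=1
  fi

  for node in /dev/nvidia*; do
    if [ -r "$node" ] && [ -w "$node" ]; then
      echo "ok: $node"
    else
      echo "no access: $node"; status=1
    fi
  done

  return "$status"
}
```

Calling `preflight` prints one line per check. An older driver is reported as a note rather than a failure, since the 575+ series is a recommendation, not a stated requirement.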

The hardware probe detects the GPU and writes VRAM size to /etc/hal0/hardware.json so slot fit warnings are accurate.

  • Bare-metal Linux. The path with the fewest moving parts. Install the proprietary driver, run the hal0 installer.
  • VM with PCIe passthrough. The most common homelab pattern for NVIDIA on Proxmox. Bind the card to vfio-pci on the host, pass it into a Linux VM, install the driver inside the VM, then hal0. CPU pinning matters more than people expect; IOMMU groups matter even more.
  • LXC with the NVIDIA Container Toolkit. Workable but pickier than AMD. Install the toolkit on the Proxmox host, expose /dev/nvidia* and /dev/nvidiactl to the container with matching cgroup allows, and keep host + container driver versions identical.
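
For the passthrough option, it helps to see how the host has grouped its PCI devices before binding anything to vfio-pci. The sketch below walks the kernel's standard sysfs tree at /sys/kernel/iommu_groups. The function name `list_iommu_groups` and its optional root argument are illustrative choices for this example, not a hal0 or Proxmox tool.

```shell
# list_iommu_groups [ROOT]
# Prints "group<TAB>pci-address" for every device, sorted by group number.
# ROOT defaults to the kernel's sysfs location; the argument exists only so
# the function can be pointed at a different tree.
list_iommu_groups() {
  root=${1:-/sys/kernel/iommu_groups}
  for dev in "$root"/*/devices/*; do
    # An unmatched glob means no groups: IOMMU is disabled or not exposed.
    [ -e "$dev" ] || continue
    group=${dev#"$root"/}   # strip the root prefix
    group=${group%%/*}      # keep only the group number
    printf '%s\t%s\n' "$group" "${dev##*/}"
  done | sort -n
}

# Run on the host that owns the GPU.
list_iommu_groups
```

Empty output usually means the IOMMU is not enabled in firmware or on the kernel command line. If the GPU shares a group with unrelated devices, those devices generally have to be passed through together with it, which is why group layout can make or break this setup.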

Probe doesn’t list the GPU. Run nvidia-smi to confirm the driver sees the card. If it lists no devices or exits with an error, fix the host driver install first.

Slot fails to start with CUDA / library errors. v1 doesn’t bundle a CUDA toolbox. Make sure the slot is on the Vulkan provider:

hal0 slot list --json | grep provider

If a slot is set to a CUDA provider, swap it back to llama-cpp until the CUDA toolbox publishes:

hal0 slot swap primary --provider llama-cpp

Lower than expected throughput. Confirm the card isn’t running under a reduced power limit or a downgraded PCIe link (a lower generation or fewer lanes than the slot supports). NVIDIA’s Vulkan implementation is solid, but it won’t match the throughput of a native CUDA llama.cpp build until the dedicated toolbox ships.
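
One way to check both limits is to query nvidia-smi and compare the current values against their reference values. This is a sketch under assumptions: `gpu_limits` and `flag_limits` are hypothetical helper names, and the comparison treats a power limit below the card's default as "power-limited", which is one reasonable reading rather than a hal0 definition. Fields the driver reports as unavailable are not handled specially.

```shell
# gpu_limits: one CSV line per GPU with power and PCIe link fields.
gpu_limits() {
  nvidia-smi \
    --query-gpu=name,power.limit,power.default_limit,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
    --format=csv,noheader,nounits
}

# flag_limits: reads those CSV lines on stdin and prints a note for each
# value that sits below its reference value. Prints nothing if all match.
flag_limits() {
  awk -F', *' '
    $2 + 0 < $3 + 0 { printf "%s: power limit %s W below default %s W\n", $1, $2, $3 }
    $4 + 0 < $5 + 0 { printf "%s: PCIe gen %s of %s\n", $1, $4, $5 }
    $6 + 0 < $7 + 0 { printf "%s: PCIe width x%s of x%s\n", $1, $6, $7 }'
}
```

Run `gpu_limits | flag_limits` while a model is generating. Many cards drop the PCIe link generation at idle to save power, so a reading taken on an idle GPU can look limited when it is not.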