Streaming
hal0 streams /v1/chat/completions and /v1/completions responses as
Server-Sent Events, exactly matching the OpenAI streaming
protocol. Any OpenAI SDK that handles streaming today works against
hal0 unmodified.
Enable streaming
Add "stream": true to the request body:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Count to five."}
    ]
  }'

Wire format
Each chunk is a data: … line, JSON-encoded, terminated by a blank
line. The stream ends with data: [DONE]. Same shape OpenAI ships.
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"One"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":", two"}}]}
data: [DONE]

Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

stream = client.chat.completions.create(
    model="primary",
    stream=True,
    messages=[{"role": "user", "content": "Count to five."}],
)

for chunk in stream:
    # delta.content is None on chunks that carry no text, hence the `or ""`.
    print(chunk.choices[0].delta.content or "", end="", flush=True)

What hal0 adds on top
Streaming flows through the same dispatcher that handles non-streaming requests, so you get:
- Single-flight prefetch. If two clients open identical streams on a cold slot, the slot fires one upstream call and fans the token stream to both.
- Adaptive cold-boot. The first request after a slot reaches the ready state keeps the connection open while the model finishes warming; you don’t get a 503 on a request that’s about to work.
- Structured errors mid-stream. If the slot transitions to the error state part-way through, the stream emits one final SSE event with the structured error envelope before closing.
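Because an error can arrive after some tokens have already been delivered, a client may want to keep the partial text and surface the error separately. The sketch below shows one way to do that. It assumes the error envelope is a JSON object with a top-level "error" key, as in OpenAI-style error responses; this page does not spell out the envelope's fields, so treat that key and the sample field names as hypothetical and adjust them to what your hal0 version emits.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(payloads: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Accumulate streamed text and return (text, error).

    `payloads` are the strings that follow "data: " in each SSE event.
    `error` is None when the stream ends normally with [DONE].
    """
    parts = []
    for payload in payloads:
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        # ASSUMPTION: a mid-stream error arrives as {"error": {...}}.
        if "error" in event:
            return "".join(parts), event["error"]
        for choice in event.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts), None
```

Returning the partial text alongside the error lets the caller decide whether to show what was generated so far, retry, or discard it.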
The same SSE wire format works across the LAN. Point an OpenWebUI on
your laptop or an MCP server on another box at
http://hal0.lan:8080/v1 and streaming behaves identically through
your Traefik vhost.
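For clients that cannot use an SDK, the wire format described above is simple enough to read by hand. The following is a minimal sketch using only the Python standard library. The function names iter_content_deltas and stream_chat are illustrative and not part of hal0; the base URL is whichever of the addresses on this page applies to your setup.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by a chat.completion.chunk SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(base_url: str, prompt: str, model: str = "primary") -> Iterator[str]:
    """POST a streaming chat request and yield text as it arrives."""
    body = json.dumps({
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # The response object iterates line by line as bytes.
        yield from iter_content_deltas(line.decode("utf-8") for line in response)
```

Calling stream_chat("http://hal0.lan:8080/v1", "Count to five.") from another machine and printing each fragment should behave the same as calling it with http://localhost:8080/v1 on the host itself.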