# API Server
OpenAI-compatible server endpoints, streaming, and voice setup.
Trillim exposes a small FastAPI server for local HTTP access.
If you want the deeper operational semantics behind the server surface, see Advanced SDK and Server Notes.
There are two ways to run it:
- `trillim serve Trillim/<name>` for the built-in demo server
- `Server(...)` in Python when you need more control
## CLI Server vs Python Server
The distinction matters:
| Need | trillim serve | Server(...) |
|---|---|---|
| quick local server | yes | yes |
| fixed 127.0.0.1:8000 bind | yes | configurable |
| custom host/port | no | yes |
| `/v1/models/swap` | no | yes, with `allow_hot_swap=True` |
| custom search setup | no | yes |
### CLI

```bash
trillim serve Trillim/BitNet-TRNQ
```
With voice routes:
```bash
trillim serve Trillim/BitNet-TRNQ --voice
```
### Python

```python
from trillim import LLM, Server

server = Server(
    LLM("Trillim/BitNet-TRNQ"),
    allow_hot_swap=True,
)
server.run(host="127.0.0.1", port=8000)
```
## Endpoints
| Route | Method | Purpose |
|---|---|---|
| `/healthz` | GET | readiness and component health |
| `/v1/models` | GET | active model metadata |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/v1/models/swap` | POST | optional hot-swap route |
| `/v1/audio/transcriptions` | POST | optional STT route |
| `/v1/audio/speech` | POST | optional TTS route |
| `/v1/voices` | GET | optional voice list |
| `/v1/voices` | POST | optional custom voice upload |
| `/v1/voices/{voice_name}` | DELETE | optional custom voice deletion |
There is no `/v1/completions` route in this implementation.
### GET /healthz

Returns 200 when all composed components are healthy:

```json
{"status": "ok"}
```

If an LLM component is not in the running state, the server returns 503 and includes the component state:

```json
{
  "status": "degraded",
  "components": {
    "llm": {
      "state": "swapping"
    }
  }
}
```
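A client can gate traffic on this payload. The sketch below shows one way to interpret it; `is_ready` and `degraded_components` are illustrative helpers based on the documented response shapes, not part of Trillim:

```python
# Interpret a /healthz response body.
# Healthy:  HTTP 200 with {"status": "ok"}
# Degraded: HTTP 503 with {"status": "degraded", "components": {...}}

def is_ready(payload: dict) -> bool:
    """Return True only when the health payload reports full health."""
    return payload.get("status") == "ok"


def degraded_components(payload: dict) -> dict:
    """Map each reported component to its state, e.g. {"llm": "swapping"}."""
    return {
        name: info.get("state")
        for name, info in payload.get("components", {}).items()
    }
```

For the degraded example above, `is_ready` returns `False` and `degraded_components` returns `{"llm": "swapping"}`.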
### GET /v1/models

Returns truthful metadata for the active runtime:

```json
{
  "object": "list",
  "state": "running",
  "data": [
    {
      "id": "BitNet-TRNQ",
      "object": "model",
      "path": "/Users/you/.trillim/models/Trillim/BitNet-TRNQ",
      "max_context_tokens": 4096,
      "trust_remote_code": false,
      "adapter_path": null,
      "init_config": {
        "num_threads": 0,
        "lora_quant": null,
        "unembed_quant": null
      }
    }
  ]
}
```
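A client that wants to populate the `model` field of chat requests can read the active id from this payload first. `active_model_id` below is an illustrative helper, not part of Trillim:

```python
def active_model_id(models_payload: dict) -> str:
    """Return the id of the active model from a GET /v1/models payload."""
    data = models_payload.get("data", [])
    if not data:
        raise ValueError("no active model reported")
    return data[0]["id"]
```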
### POST /v1/chat/completions
This is the OpenAI-compatible chat route.
Minimal request:
```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Give me one sentence about local inference."}
    ]
  }'
```
Minimal Python client example:
```python
import json
import urllib.request

body = json.dumps(
    {
        "model": "BitNet-TRNQ",
        "messages": [{"role": "user", "content": "Say hello."}],
    }
).encode("utf-8")

request = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=body,
    headers={"content-type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(request, timeout=60) as response:
    payload = json.loads(response.read().decode("utf-8"))

print(payload["choices"][0]["message"]["content"])
```
Supported request fields:
| Field | Type | Meaning |
|---|---|---|
| `messages` | array | required message list |
| `model` | string | optional, but if present it must match the active model name |
| `stream` | bool | enable SSE streaming |
| `temperature` | float | 0.0 to 2.0 |
| `top_k` | int | 1 to 200 |
| `top_p` | float | > 0.0 and <= 1.0 |
| `repetition_penalty` | float | > 0.0 and <= 2.0 |
| `rep_penalty_lookback` | int | >= 0 |
| `max_tokens` | int | 1 to 8192 |
Notes:
- Typical clients should send only `system`, `user`, and `assistant` roles.
- When the LLM is using the search harness, the OpenAI route still streams assistant text only. Internal search progress is not exposed on this endpoint.
- Requests larger than the JSON body cap are rejected before processing.
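The ranges in the table above can be enforced client-side before a request leaves the process. The sketch below mirrors the documented bounds; the function name and structure are illustrative, not part of Trillim:

```python
# (low, high, inclusive_low) bounds taken from the supported-fields table.
_BOUNDS = {
    "temperature": (0.0, 2.0, True),          # 0.0 to 2.0
    "top_k": (1, 200, True),                  # 1 to 200
    "top_p": (0.0, 1.0, False),               # > 0.0 and <= 1.0
    "repetition_penalty": (0.0, 2.0, False),  # > 0.0 and <= 2.0
    "max_tokens": (1, 8192, True),            # 1 to 8192
}


def validate_sampling(params: dict) -> None:
    """Raise ValueError for any field outside its documented range."""
    for name, value in params.items():
        if name not in _BOUNDS:
            continue
        low, high, inclusive_low = _BOUNDS[name]
        low_ok = value >= low if inclusive_low else value > low
        if not (low_ok and value <= high):
            raise ValueError(f"{name}={value} is outside the allowed range")
    if params.get("rep_penalty_lookback", 0) < 0:
        raise ValueError("rep_penalty_lookback must be >= 0")
```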
### Streaming

Set `"stream": true` to receive server-sent events:
```bash
curl -N http://127.0.0.1:8000/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hi."}],
    "stream": true
  }'
```
The stream follows OpenAI-style chat chunks and ends with:
```
data: [DONE]
```
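On the client side, the stream can be consumed line by line: take `data:` payloads, stop at `[DONE]`, and concatenate the delta fragments. A minimal parser, assuming standard OpenAI-style chunk shapes (`collect_stream` is an illustrative helper):

```python
import json


def collect_stream(sse_lines) -> str:
    """Accumulate assistant text from OpenAI-style SSE chat chunks."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # terminal sentinel, no JSON to parse
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))
    return "".join(parts)
```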
### POST /v1/models/swap

This route exists only when the server was created with `allow_hot_swap=True`.
Example:
```bash
curl http://127.0.0.1:8000/v1/models/swap \
  -H "content-type: application/json" \
  -d '{
    "model_dir": "Trillim/BitNet-TRNQ",
    "lora_dir": "Trillim/BitNet-GenZ-LoRA-TRNQ"
  }'
```
Supported request fields:
| Field | Type | Meaning |
|---|---|---|
| `model_dir` | string | required store ID for the next base model |
| `num_threads` | int | optional runtime worker thread count |
| `lora_dir` | string | optional adapter store ID |
| `lora_quant` | string | optional LoRA runtime quantization |
| `unembed_quant` | string | optional unembedding quantization |
| `harness_name` | string | `default` or `search` |
| `search_provider` | string | `ddgs` or `brave` |
| `search_token_budget` | int | requested search-context budget |
Important behavior:
- Omitted init-time fields reset to Trillim defaults. They do not inherit the previous runtime’s values.
- The effective search token budget is clamped to one quarter of the active model context window.
- Existing chat sessions become stale once swap handoff begins.
- `search_provider: "brave"` requires `SEARCH_API_KEY` in the server environment.
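The clamping rule above is simple to state precisely: the server never grants more than a quarter of the context window, however large the request. A sketch of that behavior (`effective_search_budget` is a hypothetical helper mirroring the documented rule):

```python
def effective_search_budget(requested: int, max_context_tokens: int) -> int:
    """Clamp the requested search-context budget to 1/4 of the context window."""
    return min(requested, max_context_tokens // 4)
```

For a 4096-token context window, a request for 2000 search tokens is clamped to 1024, while a request for 500 passes through unchanged.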
## Voice Routes

Voice routes exist only when the server includes `STT()` and `TTS()`. The CLI adds them with `--voice`.
Install the extra first:
```bash
uv add "trillim[voice]"
```
### POST /v1/audio/transcriptions

This is a raw-body route. Send audio bytes directly and set `content-type` to `audio/*` or `application/octet-stream`.
```bash
curl "http://127.0.0.1:8000/v1/audio/transcriptions?language=en" \
  -H "content-type: audio/wav" \
  --data-binary @recording.wav
```
Response:
```json
{"text": "transcribed text"}
```
Key facts:
- max upload size: 64 MiB
- `language` is optional
- only one STT request is processed at a time
### POST /v1/audio/speech
This is also a raw-body route. Send UTF-8 text as the body.
```bash
curl -N http://127.0.0.1:8000/v1/audio/speech \
  -H "voice: alba" \
  -H "speed: 1.0" \
  --data-binary "Hello from Trillim."
```
Response format:
- `event: audio` with base64-encoded PCM chunks
- `event: done` when synthesis is complete
- `event: error` if the session fails after streaming has started
Important facts:
- the HTTP response is SSE, not WAV
- PCM is 24 kHz, mono, 16-bit
- max text body size: 6 MiB
- only one live TTS session is allowed at a time
If you want a ready-to-write WAV payload in Python, prefer `await tts.synthesize_wav(...)` from the SDK.
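If you do consume the SSE route directly, the decoded PCM can be wrapped in a WAV container with the standard library. A sketch using the documented format (24 kHz, mono, 16-bit); `pcm_chunks_to_wav` is an illustrative helper, not part of Trillim:

```python
import base64
import io
import wave


def pcm_chunks_to_wav(b64_chunks) -> bytes:
    """Decode base64 PCM chunks from `event: audio` events into WAV bytes."""
    pcm = b"".join(base64.b64decode(chunk) for chunk in b64_chunks)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(24_000)  # 24 kHz
        wav.writeframes(pcm)
    return buffer.getvalue()
```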
### GET /v1/voices

Lists built-in and custom voice names:

```bash
curl http://127.0.0.1:8000/v1/voices
```
### POST /v1/voices

Register a custom voice by sending raw audio bytes and a `name` header:

```bash
curl http://127.0.0.1:8000/v1/voices \
  -H "name: myvoice" \
  --data-binary @voice-sample.wav
```
Response:
```json
{"name": "myvoice", "status": "created"}
```
Important facts:
- max upload size:
10 MiB - max serialized custom voice state:
64 MiB - custom voice names and
voiceselectors accept only ASCII letters and digits - custom voice storage lives under
~/.trillim/voices - voice cloning support requires accepting the
kyutai/pocket-ttsterms and authenticating with Hugging Face - if a reference sample exceeds the serialized voice-state cap, Trillim rejects it and you should retry with a shorter sample
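The naming rule is strict enough to check before uploading. A sketch (`is_valid_voice_name` is a hypothetical helper mirroring the documented ASCII-letters-and-digits rule):

```python
def is_valid_voice_name(name: str) -> bool:
    """Accept only non-empty strings of ASCII letters and digits."""
    return bool(name) and name.isascii() and name.isalnum()
```

Names like `myvoice` or `Voice2` pass; `my-voice`, an empty string, or names with non-ASCII characters do not.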
One-time setup for custom voice cloning:
```bash
hf auth login
```
If `hf` is not installed globally, `uvx hf auth login` works as well.
### DELETE /v1/voices/{voice_name}

Delete a custom voice:

```bash
curl -X DELETE http://127.0.0.1:8000/v1/voices/myvoice
```
## Error Codes
The server maps the public failure modes consistently:
| Status | Meaning |
|---|---|
| 400 | invalid input, bad JSON, model mismatch, or validation failure |
| 409 | session conflict such as closed, stale, or exhausted chat state |
| 429 | component busy or not admitting more work |
| 503 | startup failure, worker failure, or other service-side error |
| 504 | progress timeout |
Use the SDK if you want the native Python exception types instead of HTTP status codes.
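A thin HTTP client can still recover the failure category from the status code alone. The sketch below maps the table above; the category strings are illustrative labels, not Trillim exception types:

```python
# Status-to-category mapping taken from the error table above.
_ERROR_CATEGORIES = {
    400: "invalid input",
    409: "session conflict",
    429: "busy",
    503: "service failure",
    504: "progress timeout",
}


def failure_category(status: int) -> str:
    """Translate a server status code into its documented failure mode."""
    if 200 <= status < 300:
        return "ok"
    return _ERROR_CATEGORIES.get(status, "unexpected status")
```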