Trillim's Tokens
Trillim Adds Support For Bonsai
April 9, 2026
Trillim v0.8.0 adds support for PrismML's Bonsai models, bringing DarkNet's CPU-first runtime to a new class of 1-bit Qwen3-based models.
On March 31, 2026, PrismML released Bonsai, a new family of Qwen3-based 1-bit language models. That immediately caught our attention at Trillim.
The Bonsai models are interesting for the same reason BitNet was interesting: they push useful language models much closer to the hardware people actually have. In Bonsai, weights and embeddings are binary. In BitNet, the core idea is ternary weights. Both families reduce dependence on expensive floating-point matrix multiplies and fit naturally with CPU-first inference.
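To make the hardware argument concrete, here is a toy Python sketch of the binary-weight case: with weights constrained to {-1, +1}, a matrix-vector product reduces to signed sums of activations, with no floating-point multiplies against the weights. This is illustrative only, not how DarkNet's kernels are actually written.

```python
import numpy as np

# Toy illustration: binary {-1, +1} weights remove floating-point
# multiplies from a matrix-vector product. Not DarkNet's real kernels.
rng = np.random.default_rng(0)
W = rng.choice([-1, 1], size=(4, 8)).astype(np.int8)  # binary weight matrix
x = rng.standard_normal(8).astype(np.float32)         # activations

# Reference: an ordinary floating-point matmul.
ref = W.astype(np.float32) @ x

# With binary weights, each output row is just a signed sum of the
# activations: add x[j] where W[i, j] == +1, subtract it where it is -1.
out = np.where(W == 1, x, -x).sum(axis=1)

assert np.allclose(ref, out)
```

Real kernels go further by bit-packing the weights and using SIMD, but the core saving is the same: the weight side of the multiply disappears.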
PrismML also published a whitepaper for Bonsai 8B, and the reported benchmark results were strong enough that we wanted to move quickly.
That led directly to v0.8.0: Trillim now supports Bonsai through the same managed bundle flow as the rest of the product. You can quantize compatible checkpoints into `Local/...-TRNQ`, load them through the CLI, use them from the Python SDK, or serve them over the local OpenAI-compatible server. If you are new to Trillim, start with the install docs and then head to the CLI reference.
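As a quick taste of the serving surface, here is a minimal sketch of talking to the local OpenAI-compatible server with the standard `openai` Python client. The port and model name below are placeholders, not Trillim defaults; see the CLI reference for the actual serve command and bundle names.

```python
# Minimal sketch: query Trillim's local OpenAI-compatible server with the
# standard `openai` client. Port and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local serve address
    api_key="unused",                     # local servers typically ignore this
)

resp = client.chat.completions.create(
    model="bonsai-1.7b-trnq",  # hypothetical bundle name
    messages=[{"role": "user", "content": "Why do 1-bit models suit CPUs?"}],
)
print(resp.choices[0].message.content)
```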
Why we cared about Bonsai
Trillim started because we were not satisfied with the state of local BitNet inference.
Microsoft’s bitnet.cpp was an important research project and it demonstrated that carefully designed low-bit kernels could work well in practice. But it also had real product limitations for the workflow we cared about: it supported only a narrow subset of models, it was not designed as a general local AI surface, and at the time we started Trillim it did not cover some features we considered important for real use, including LoRA adapters and broader unembedding quantization support.
So we built our own runtime. That runtime became DarkNet: the CPU inference engine behind Trillim. It gave us control over kernels, model support, adapters, serving surfaces, and the full path from quantization to local deployment. We have already shown in our earlier DarkNet vs bitnet.cpp benchmark write-up that this approach lets us move faster and ship stronger CPU inference performance.
That same logic applies to Bonsai. If a new 1-bit model family is genuinely useful on consumer hardware, we want it inside the same runtime and the same product surface instead of forcing users onto a separate toolchain.
What ships in v0.8.0
This release adds Bonsai support directly to DarkNet and Trillim’s bundle workflow.
- compatible Bonsai checkpoints can be quantized into Trillim-managed bundles
- those bundles load through the same `chat`, `serve`, and SDK flows as existing Trillim models
- DarkNet now includes dedicated CPU kernels for Bonsai on AVX2 and Arm NEON
The practical result is simple: if you are running on CPU, Trillim now gives you a fast path for Bonsai without asking you to learn a new local stack.
There is one important caveat. These results are about CPU inference. Right now our focus is AVX2 and Arm NEON. If you are relying on Apple GPU cores, AVX512, or AVX-VNNI, this release does not yet represent our end state. We are actively working on AVX512, AVX-VNNI, and Metal support next.
Benchmark framing
We benchmarked DarkNet against PrismML’s Bonsai runtime in two configurations:
- the base `bonsai.cpp` path from PrismML's fork of `llama.cpp`
- a manually patched version with unmerged pull requests applied, including AVX2 kernels where relevant
For these tables:
- `pp 512` means prefill throughput with a 512-token prompt
- `tg 256` means decode throughput over 256 output tokens
AVX2 Results
These AVX2 runs were collected on a consumer Intel laptop. That means there is some unavoidable variance from thermal behavior, boost behavior, and scheduler noise. We followed the same general methodology as in our earlier DarkNet vs bitnet.cpp benchmark post, so while the numbers are not perfectly noiseless, the comparison is still fair.
The base bonsai.cpp path was not useful here because it does not have a meaningful AVX2 fast path. In practice it falls back to a generic implementation that is too slow to make the results worth reporting.
bonsai.cpp with unmerged PRs
| Model | pp 512 t/s | tg 256 t/s |
|---|---|---|
| Bonsai 1.7B | 93.61 | 44.29 |
| Bonsai 4B | 36.70 | 16.11 |
| Bonsai 8B | 17.99 | 10.00 |
DarkNet
| Model | pp 512 t/s | tg 256 t/s |
|---|---|---|
| Bonsai 1.7B | 126.25 | 43.76 |
| Bonsai 4B | 49.48 | 16.43 |
| Bonsai 8B | 25.59 | 8.65 |
The AVX2 story is straightforward. DarkNet is clearly faster on prefill across all three sizes. Decode is roughly even on 1.7B and 4B once you account for normal laptop noise, with DarkNet trailing somewhat on 8B. The biggest practical win is that Trillim ships a real AVX2 path instead of falling back to something too slow to matter.
Arm Results
These Arm runs were collected on an Apple Mac Studio with an M3 Ultra. The machine is stable under sustained load, so we used a continuous benchmark process: for each category we ran the 1.7B, 4B, and 8B models while sweeping thread counts, and we repeated each run five times to smooth out variation.
For fairness, both bonsai.cpp variants were compiled without Metal support because DarkNet is currently CPU-only. When we ship Metal support, we will publish that comparison separately.
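For reference, the sweep looks roughly like the sketch below, driving a llama.cpp-style `llama-bench` binary. The flag names follow upstream llama.cpp's `llama-bench` and the model paths are placeholders; the bonsai.cpp fork's binary may differ, and DarkNet runs through its own harness.

```python
# Sketch of the thread sweep described above, assuming a llama.cpp-style
# `llama-bench` binary. Binary name and model paths are placeholders.
import subprocess

MODELS = ["bonsai-1.7b.gguf", "bonsai-4b.gguf", "bonsai-8b.gguf"]
THREADS = [1, 4, 8, 10, 20]

for model in MODELS:
    for t in THREADS:
        subprocess.run(
            [
                "./llama-bench",
                "-m", model,
                "-p", "512",   # pp 512: 512-token prefill
                "-n", "256",   # tg 256: 256 decoded tokens
                "-t", str(t),  # thread count for this run
                "-r", "5",     # repeat each run 5x to smooth variance
            ],
            check=True,
        )
```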
Base bonsai.cpp
Bonsai-1.7B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 21.57 | 16.67 |
| 4 | 77.82 | 54.05 |
| 8 | 154.05 | 95.01 |
| 10 | 189.51 | 109.75 |
| 20 | 358.81 | 145.44 |
Bonsai-4B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 8.38 | 7.14 |
| 4 | 30.23 | 23.78 |
| 8 | 59.82 | 43.40 |
| 10 | 73.58 | 50.73 |
| 20 | 141.05 | 71.74 |
Bonsai-8B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 4.44 | 3.80 |
| 4 | 16.27 | 13.06 |
| 8 | 32.33 | 24.56 |
| 10 | 40.06 | 28.80 |
| 20 | 77.98 | 46.04 |
bonsai.cpp with unmerged PRs
Bonsai-1.7B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 21.50 | 17.81 |
| 4 | 77.86 | 60.13 |
| 8 | 152.93 | 102.06 |
| 10 | 188.19 | 117.21 |
| 20 | 358.59 | 141.08 |
Bonsai-4B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 8.26 | 7.73 |
| 4 | 30.21 | 26.47 |
| 8 | 59.46 | 47.02 |
| 10 | 73.43 | 54.62 |
| 20 | 140.96 | 75.08 |
Bonsai-8B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 4.43 | 4.18 |
| 4 | 16.30 | 14.58 |
| 8 | 32.32 | 26.90 |
| 10 | 40.03 | 31.27 |
| 20 | 77.94 | 48.69 |
DarkNet
Bonsai-1.7B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 68.47 | 19.08 |
| 4 | 243.49 | 64.41 |
| 8 | 467.64 | 111.72 |
| 10 | 529.39 | 124.38 |
| 20 | 851.07 | 152.11 |
Bonsai-4B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 26.64 | 8.10 |
| 4 | 96.66 | 28.46 |
| 8 | 188.43 | 50.82 |
| 10 | 218.25 | 58.50 |
| 20 | 387.01 | 82.68 |
Bonsai-8B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 15.44 | 4.67 |
| 4 | 56.55 | 16.75 |
| 8 | 112.28 | 30.86 |
| 10 | 133.34 | 36.51 |
| 20 | 241.28 | 55.08 |
On Apple Silicon CPU-only runs, DarkNet pulls ahead very clearly on prefill and still wins on decode, though the decode gains are much smaller. That pattern holds across the thread sweep and becomes more pronounced as thread counts rise.
Summary at 20 Threads
| Category | DarkNet vs base bonsai.cpp | DarkNet vs PR-patched bonsai.cpp |
|---|---|---|
| Bonsai-1.7B pp 512 | 2.37x | 2.37x |
| Bonsai-1.7B tg 256 | 1.05x | 1.08x |
| Bonsai-4B pp 512 | 2.74x | 2.75x |
| Bonsai-4B tg 256 | 1.15x | 1.10x |
| Bonsai-8B pp 512 | 3.09x | 3.10x |
| Bonsai-8B tg 256 | 1.20x | 1.13x |
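Each cell is DarkNet's throughput divided by the corresponding bonsai.cpp throughput at 20 threads; for example, 851.07 / 358.81 ≈ 2.37x for Bonsai-1.7B prefill over the base runtime.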
That is the main takeaway from this release. On CPU, especially on Arm, DarkNet is not just compatible with Bonsai; it is materially faster where prefill matters most, and it still preserves a decode advantage in the aggregate.
Where this leaves Trillim
Trillim was never meant to be only a benchmark project. The point was to build a local AI stack that makes high-performance CPU inference practical to use.
That means two things have to be true at the same time:
- the runtime needs to be fast
- the product surface needs to be easy to install, easy to load, and easy to integrate
That is why Bonsai support matters to us. It is not only about adding another model family. It is about folding a new class of efficient models into the same local workflow: quantize, load, chat, serve, and embed.
If you want to try it, start with the install docs, then use the CLI reference or the Python Components guide depending on how you prefer to work.
Enjoy v0.8.0.