Trillim's Tokens
Trillim Adds Support For Bonsai
April 9, 2026
Trillim v0.8.0 adds support for PrismML's Bonsai models, bringing DarkNet's CPU-first runtime to a new class of 1-bit Qwen3-based models.
On March 31, 2026, PrismML released Bonsai, a new family of Qwen3-based 1-bit language models. That immediately caught our attention at Trillim.
The Bonsai models are interesting for the same reason BitNet was interesting: they push useful language models much closer to the hardware people actually have. In Bonsai, weights and embeddings are binary. In BitNet, the core idea is ternary weights. Both families reduce dependence on expensive floating-point matrix multiplies and fit naturally with CPU-first inference.
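To make the hardware argument concrete, here is a toy Python sketch of the binary-weight case: with weights constrained to {-1, +1}, a matrix-vector product reduces to signed sums of activations, with no floating-point multiplies against the weights. This is illustrative only, not how DarkNet's kernels are actually written.

```python
import numpy as np

# Toy illustration: binary {-1, +1} weights remove floating-point
# multiplies from a matrix-vector product. Not DarkNet's real kernels.
rng = np.random.default_rng(0)
W = rng.choice([-1, 1], size=(4, 8)).astype(np.int8)  # binary weight matrix
x = rng.standard_normal(8).astype(np.float32)         # activations

# Reference: an ordinary floating-point matmul.
ref = W.astype(np.float32) @ x

# With binary weights, each output row is just a signed sum of the
# activations: add x[j] where W[i, j] == +1, subtract it where it is -1.
out = np.where(W == 1, x, -x).sum(axis=1)

assert np.allclose(ref, out)
```

Real kernels go further by bit-packing the weights and using SIMD, but the core saving is the same: the weight side of the multiply disappears.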
PrismML also published a whitepaper for Bonsai 8B, and the reported benchmark results were strong enough that we wanted to move quickly.
That led directly to v0.8.0: Trillim now supports Bonsai through the same managed bundle flow as the rest of the product. You can quantize compatible checkpoints into `Local/...-TRNQ`, load them through the CLI, use them from the Python SDK, or serve them over the local OpenAI-compatible server. If you are new to Trillim, start with the install docs and then head to the CLI reference.
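As a quick taste of the serving surface, here is a minimal sketch of talking to the local OpenAI-compatible server with the standard `openai` Python client. The port and model name below are placeholders, not Trillim defaults; see the CLI reference for the actual serve command and bundle names.

```python
# Minimal sketch: query Trillim's local OpenAI-compatible server with the
# standard `openai` client. Port and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local serve address
    api_key="unused",                     # local servers typically ignore this
)

resp = client.chat.completions.create(
    model="bonsai-1.7b-trnq",  # hypothetical bundle name
    messages=[{"role": "user", "content": "Why do 1-bit models suit CPUs?"}],
)
print(resp.choices[0].message.content)
```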
Why we cared about Bonsai
Trillim started because we were not satisfied with the state of local BitNet inference.
Microsoft’s bitnet.cpp was an important research project and it demonstrated that carefully designed low-bit kernels could work well in practice. But it also had real product limitations for the workflow we cared about: it supported only a narrow subset of models, it was not designed as a general local AI surface, and at the time we started Trillim it did not cover some features we considered important for real use, including LoRA adapters and broader unembedding quantization support.
So we built our own runtime. That runtime became DarkNet: the CPU inference engine behind Trillim. It gave us control over kernels, model support, adapters, serving surfaces, and the full path from quantization to local deployment. We have already shown in our earlier DarkNet vs bitnet.cpp benchmark write-up that this approach lets us move faster and ship stronger CPU inference performance.
That same logic applies to Bonsai. If a new 1-bit model family is genuinely useful on consumer hardware, we want it inside the same runtime and the same product surface instead of forcing users onto a separate toolchain.
What ships in v0.8.0
This release adds Bonsai support directly to DarkNet and Trillim’s bundle workflow.
- compatible Bonsai checkpoints can be quantized into Trillim-managed bundles
- those bundles load through the same `chat`, `serve`, and SDK flows as existing Trillim models
- DarkNet now includes dedicated CPU kernels for Bonsai on AVX2 and Arm NEON
The practical result is simple: if you are running on CPU, Trillim now gives you a fast path for Bonsai without asking you to learn a new local stack.
There is one important caveat. These results are about CPU inference. Right now our focus is AVX2 and Arm NEON. If you are relying on Apple GPU cores, AVX512, or AVX-VNNI, this release does not yet represent our end state. We are actively working on AVX512, AVX-VNNI, and Metal support next.
Benchmark framing
We benchmarked DarkNet against PrismML’s Bonsai runtime in two configurations:
- the base `bonsai.cpp` path from PrismML's fork of `llama.cpp`
- a manually patched version with unmerged pull requests applied, including AVX2 kernels where relevant
For these tables:
- `pp 512` means prefill throughput with a 512-token prompt
- `tg 256` means decode throughput over 256 output tokens
AVX2 Results
These AVX2 runs were collected on a consumer Intel laptop. That means there is some unavoidable variance from thermal behavior, boost behavior, and scheduler noise. We followed the same general methodology as in our earlier DarkNet vs bitnet.cpp benchmark post, so while the numbers are not perfectly noiseless, the comparison is still fair.
The base bonsai.cpp path was not useful here because it does not have a meaningful AVX2 fast path. In practice it falls back to a generic implementation that is too slow to make the results worth reporting.
bonsai.cpp with unmerged PRs
| Model | pp 512 t/s | tg 256 t/s |
|---|---|---|
| Bonsai 1.7B | 93.61 | 44.29 |
| Bonsai 4B | 36.70 | 16.11 |
| Bonsai 8B | 17.99 | 10.00 |
DarkNet
| Model | pp 512 t/s | tg 256 t/s |
|---|---|---|
| Bonsai 1.7B | 126.25 | 43.76 |
| Bonsai 4B | 49.48 | 16.43 |
| Bonsai 8B | 25.59 | 8.65 |
The AVX2 story is straightforward. DarkNet is clearly faster on prefill across all three sizes. Decode is roughly even on 1.7B and 4B once you account for normal laptop noise, with DarkNet trailing somewhat on 8B. The biggest practical win is that Trillim ships a real AVX2 path instead of falling back to something too slow to matter.
Arm Results
These Arm runs were collected on an Apple Mac Studio with an M3 Ultra. The machine is stable under sustained load, so we used a continuous benchmark process: for each category we ran the 1.7B, 4B, and 8B models while sweeping thread counts, and we repeated each run five times to smooth out variation.
For fairness, both bonsai.cpp variants were compiled without Metal support because DarkNet is currently CPU-only. When we ship Metal support, we will publish that comparison separately.
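For reference, the sweep looks roughly like the sketch below, driving a llama.cpp-style `llama-bench` binary. The flag names follow upstream llama.cpp's `llama-bench` and the model paths are placeholders; the bonsai.cpp fork's binary may differ, and DarkNet runs through its own harness.

```python
# Sketch of the thread sweep described above, assuming a llama.cpp-style
# `llama-bench` binary. Binary name and model paths are placeholders.
import subprocess

MODELS = ["bonsai-1.7b.gguf", "bonsai-4b.gguf", "bonsai-8b.gguf"]
THREADS = [1, 4, 8, 10, 20]

for model in MODELS:
    for t in THREADS:
        subprocess.run(
            [
                "./llama-bench",
                "-m", model,
                "-p", "512",   # pp 512: 512-token prefill
                "-n", "256",   # tg 256: 256 decoded tokens
                "-t", str(t),  # thread count for this run
                "-r", "5",     # repeat each run 5x to smooth variance
            ],
            check=True,
        )
```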
Base bonsai.cpp
Bonsai-1.7B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 21.57 | 16.67 |
| 4 | 77.82 | 54.05 |
| 8 | 154.05 | 95.01 |
| 10 | 189.51 | 109.75 |
| 20 | 358.81 | 145.44 |
Bonsai-4B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 8.38 | 7.14 |
| 4 | 30.23 | 23.78 |
| 8 | 59.82 | 43.40 |
| 10 | 73.58 | 50.73 |
| 20 | 141.05 | 71.74 |
Bonsai-8B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 4.44 | 3.80 |
| 4 | 16.27 | 13.06 |
| 8 | 32.33 | 24.56 |
| 10 | 40.06 | 28.80 |
| 20 | 77.98 | 46.04 |
bonsai.cpp with unmerged PRs
Bonsai-1.7B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 21.50 | 17.81 |
| 4 | 77.86 | 60.13 |
| 8 | 152.93 | 102.06 |
| 10 | 188.19 | 117.21 |
| 20 | 358.59 | 141.08 |
Bonsai-4B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 8.26 | 7.73 |
| 4 | 30.21 | 26.47 |
| 8 | 59.46 | 47.02 |
| 10 | 73.43 | 54.62 |
| 20 | 140.96 | 75.08 |
Bonsai-8B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 4.43 | 4.18 |
| 4 | 16.30 | 14.58 |
| 8 | 32.32 | 26.90 |
| 10 | 40.03 | 31.27 |
| 20 | 77.94 | 48.69 |
DarkNet
Bonsai-1.7B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 68.47 | 19.08 |
| 4 | 243.49 | 64.41 |
| 8 | 467.64 | 111.72 |
| 10 | 529.39 | 124.38 |
| 20 | 851.07 | 152.11 |
Bonsai-4B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 26.64 | 8.10 |
| 4 | 96.66 | 28.46 |
| 8 | 188.43 | 50.82 |
| 10 | 218.25 | 58.50 |
| 20 | 387.01 | 82.68 |
Bonsai-8B
| threads | pp 512 t/s | tg 256 t/s |
|---|---|---|
| 1 | 15.44 | 4.67 |
| 4 | 56.55 | 16.75 |
| 8 | 112.28 | 30.86 |
| 10 | 133.34 | 36.51 |
| 20 | 241.28 | 55.08 |
On Apple Silicon CPU-only runs, DarkNet pulls ahead very clearly on prefill and still wins on decode, though the decode gains are much smaller. That pattern holds across the thread sweep and becomes more pronounced as thread counts rise.
Summary at 20 Threads
| Category | DarkNet vs base bonsai.cpp | DarkNet vs PR-patched bonsai.cpp |
|---|---|---|
| Bonsai-1.7B pp 512 | 2.37x | 2.37x |
| Bonsai-1.7B tg 256 | 1.05x | 1.08x |
| Bonsai-4B pp 512 | 2.74x | 2.75x |
| Bonsai-4B tg 256 | 1.15x | 1.10x |
| Bonsai-8B pp 512 | 3.09x | 3.10x |
| Bonsai-8B tg 256 | 1.20x | 1.13x |
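Each cell is DarkNet's throughput divided by the corresponding bonsai.cpp throughput at 20 threads; for example, 851.07 / 358.81 ≈ 2.37x for Bonsai-1.7B prefill over the base runtime.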
That is the main takeaway from this release. On CPU, especially on Arm, DarkNet is not just compatible with Bonsai; it is materially faster where prefill matters most, and it still preserves a decode advantage in the aggregate.
Where this leaves Trillim
Trillim was never meant to be only a benchmark project. The point was to build a local AI stack that makes high-performance CPU inference practical to use.
That means two things have to be true at the same time:
- the runtime needs to be fast
- the product surface needs to be easy to install, easy to load, and easy to integrate
That is why Bonsai support matters to us. It is not only about adding another model family. It is about folding a new class of efficient models into the same local workflow: quantize, load, chat, serve, and embed.
If you want to try it, start with the install docs, then use the CLI reference or the Python Components guide depending on how you prefer to work.
Enjoy v0.8.0.