Lab notebook · 2026-06-04
Does this model
even exist?
A friend drops a HuggingFace link: gemma-4-26B 'TurboQuant'. “Fancy a few perf tests?” By the time we settle what's real, we land on a different discovery entirely.
The request that looked simple
“Can we run a few perf tests on the home box with this model?” — plus a link to a HuggingFace repo called gemma-4-26B-A4B-TurboQuant.
The box is Aemie: a home Ryzen AI 9 with an AMD Radeon 890M iGPU that runs my private Claude assistant and a whole inference stack. Not a throwaway sandbox.
So the first reflex wasn't to download — it was a question: is it actually what it looks like?
The name smelled off
Several small tells at once: majentik is an unknown uploader; “TurboQuant” framed as a finished product; and the promise that you install a pip package plus a llama.cpp fork to use it.
The golden rule: before you download or run anything on the box that runs your assistant, verify the source. An unknown pip package runs arbitrary code — on your machine, with your privileges.
Verification
What's real, and what's noise
The base model is real. google/gemma-4-26B-A4B — genuinely released by Google (2026-03), ~130,000 downloads, real weights.
The technique is real. TurboQuant is published Google research (ICLR 2026): random rotation + scalar quantization on the KV cache.
But the repo is empty. 'majentik/...-TurboQuant' holds zero weight files — just a README. 0 bytes stored, 0 downloads. A doc repo, not a model.
The framing overstates. “Bit-identical at 4-bit” — the paper actually claims ~3.5 bits/channel quality-neutrality. All 18 listed 'variants' are empty.
Supply-chain pause
Read it before you run it
I didn't install the turboquant pip package — I downloaded and read it statically. 926 lines, plain Python/torch.
Clean
- · No network calls
- · No subprocess / shell / eval
- · No install-time hooks
- · A faithful impl of the paper
But
unknown publisher (back2matching), brand-new package, no homepage. Clean source ≠ standing trust — it just means I read this version.
The build is tidy: random orthogonal rotation, per-coordinate optimal quantizer, a KIVI-style residual window for recent tokens.
The hardware wall: no CUDA
Aemie is an AMD iGPU on Vulkan. But the turboquant package is CUDA-default and transformers-based: here it would run on CPU only, in bf16 (~52 GB) — which doesn't even fit.
And the planar3 KV cache (TurboQuant's llama.cpp path) isn't upstream — it needs an unverified fork built from source. So the one path that would actually exercise the iGPU is also the riskiest.
Decision: take the trustworthy path. Benchmark the real base model with a reputable GGUF quant (ggml-org) on stock Vulkan llama.cpp — and see what upstream KV-quant gives, the thing TurboQuant would have to beat.
The twist
gemma-4 is mostly sliding-window attention
From the GGUF metadata: of 30 layers, only 5 are full-attention; the other 25 are SWA (1024-token window). However long the context, SWA layers cache only 1024 tokens.
Layer pattern (S = sliding, F = full)
Consequence: the KV cache — most models' memory pain point at long context — barely grows here. And the KV cache is exactly what TurboQuant would help. The twist: the thing we set out to test makes itself unnecessary.
KV memory, computed
Accounting for SWA, at 64k context, for three KV formats. (The 25 SWA layers contribute only ~0.2 GiB throughout.)
| context | KV f16 | KV q8_0 | KV q4_0 |
|---|---|---|---|
| 16k | 1.45 GiB | 0.77 GiB | 0.41 GiB |
| 65k | 5.20 GiB | 2.76 GiB | 1.46 GiB |
Even in f16 at 65k it's just 5.2 GiB — nothing on a box with 45 GB free. And stock upstream q4_0 takes that to 1.46 GiB with no fork. There's simply no memory problem to solve.
Speed 1 — prefill
gemma-4-26B-A4B-it Q4_K_M (15.63 GiB), Radeon 890M / Vulkan, prompt processing (t/s). The interesting bit: KV-quant doesn't help — at long context it actively hurts.
| KV | pp512 | pp16384 | pp65536 |
|---|---|---|---|
| f16 ★ | 467 | 372 | 240 |
| q8_0 | 449 | 312 | 172 |
| q4_0 | 433 | 333 | 185 |
In prefill, f16 wins everywhere. KV-quant is compute overhead that costs up to 28% at 65k for q8_0 (240 → 172).
Speed 2 — decode at depth
Token generation (t/s) as a function of context depth. Here it inverts: in decode, q4_0 is fastest, and its lead grows with depth.
| KV | @ 0 | @ 16k | @ 65k |
|---|---|---|---|
| f16 | 24.6 | 20.7 | 16.4 |
| q8_0 | 24.3 | 21.4 | 17.6 |
| q4_0 ★ | 25.5 | 21.9 | 18.5 |
Why the flip? Deep-context decode is memory-bandwidth-bound on KV reads: the smaller (q4_0) cache moves fewer bytes → faster. +13% at 65k (16.4 → 18.5). And upstream already delivers it — no fork.
Context
How does it stack up to our current house model?
Important: our house Qwen is not the factory 35B — in an earlier experiment we pruned it to K=187 (256→187 experts, 21.3→15.9 GiB), and that's been live since May 2026. So we're comparing gemma-4 against our already-trimmed house model — which happens to be almost exactly the same size.
| Qwen3.6-A3B · aemie K=187 | gemma-4-26B-A4B | |
|---|---|---|
| Active / total | 3B / ~26.6B | 4B / 26B |
| GGUF (Q4) | 15.9 GiB | 15.6 GiB |
| Decode | ~21.9 t/s | ~24.6 t/s |
At nearly identical size (15.6 vs 15.9 GiB), gemma-4 decodes ~12% faster — and that's against our already-pruned house model. What took us a night of expert-surgery (the size), gemma delivers out of the box. A serious candidate for the house stack.
Decode is pruning-independent (only 8 experts active per token), so K=187 ≈ K=192 speed. The factory MTP variant reaches 25.8 t/s with speculative decoding, but the deployed pruned edition isn't MTP. Decode figures come from llama-bench tg runs → indicative.
Verdict
The TurboQuant fork isn't worth it here
SWA keeps the KV small to begin with. Just 5.2 GiB at 65k in f16. No memory problem to solve.
Upstream q4_0 already covers the benefit. No fork: 1.46 GiB KV @65k and the fastest deep-context decode (+13%).
Our earlier Qwen benchmark said the same. “No tuning knob beats the upstream default by >1%.” The wall is memory bandwidth (LPDDR5x ~120 GB/s), not the KV format.
Takeaway
The good answer wasn't the one we were after
The original question was: “does TurboQuant speed this model up?” The real answer: the question is mis-framed — on this architecture (SWA) and this hardware (bandwidth-bound), there's nothing for it to speed up.
And the two real wins didn't come from the benchmark numbers: (1) vet a suspicious repo before you let anything onto your box — and (2) a model's architecture (here: sliding-window attention) tells you more about its behaviour than the name on the marketing page.
Oh, and gemma-4 is fast and small out of the box. It might be the next house model.