01 / 13

Lab notebook · 2026-06-04

Does this model
even exist?

A friend drops a HuggingFace link: gemma-4-26B 'TurboQuant'. “Fancy a few perf tests?” By the time we settle what's real, we land on a different discovery entirely.

Scroll, or press ↓
For everyone02 / 13

The request that looked simple

“Can we run a few perf tests on the home box with this model?” — plus a link to a HuggingFace repo called gemma-4-26B-A4B-TurboQuant.

The box is Aemie: a home Ryzen AI 9 with an AMD Radeon 890M iGPU that runs my private Claude assistant and a whole inference stack. Not a throwaway sandbox.

So the first reflex wasn't to download — it was a question: is it actually what it looks like?

For everyone03 / 13

The name smelled off

Several small tells at once: majentik is an unknown uploader; “TurboQuant” framed as a finished product; and the promise that you install a pip package plus a llama.cpp fork to use it.

The golden rule: before you download or run anything on the box that runs your assistant, verify the source. An unknown pip package runs arbitrary code — on your machine, with your privileges.

For everyone04 / 13

Verification

What's real, and what's noise

The base model is real. google/gemma-4-26B-A4B — genuinely released by Google (2026-03), ~130,000 downloads, real weights.

The technique is real. TurboQuant is published Google research (ICLR 2026): random rotation + scalar quantization on the KV cache.

But the repo is empty. 'majentik/...-TurboQuant' holds zero weight files — just a README. 0 bytes stored, 0 downloads. A doc repo, not a model.

!

The framing overstates. “Bit-identical at 4-bit” — the paper actually claims ~3.5 bits/channel quality-neutrality. All 18 listed 'variants' are empty.

Deep tech05 / 13

Supply-chain pause

Read it before you run it

I didn't install the turboquant pip package — I downloaded and read it statically. 926 lines, plain Python/torch.

Clean

  • · No network calls
  • · No subprocess / shell / eval
  • · No install-time hooks
  • · A faithful impl of the paper

But

unknown publisher (back2matching), brand-new package, no homepage. Clean source ≠ standing trust — it just means I read this version.

The build is tidy: random orthogonal rotation, per-coordinate optimal quantizer, a KIVI-style residual window for recent tokens.

Deep tech06 / 13

The hardware wall: no CUDA

Aemie is an AMD iGPU on Vulkan. But the turboquant package is CUDA-default and transformers-based: here it would run on CPU only, in bf16 (~52 GB) — which doesn't even fit.

And the planar3 KV cache (TurboQuant's llama.cpp path) isn't upstream — it needs an unverified fork built from source. So the one path that would actually exercise the iGPU is also the riskiest.

Decision: take the trustworthy path. Benchmark the real base model with a reputable GGUF quant (ggml-org) on stock Vulkan llama.cpp — and see what upstream KV-quant gives, the thing TurboQuant would have to beat.

Deep tech07 / 13

The twist

gemma-4 is mostly sliding-window attention

From the GGUF metadata: of 30 layers, only 5 are full-attention; the other 25 are SWA (1024-token window). However long the context, SWA layers cache only 1024 tokens.

Layer pattern (S = sliding, F = full)

SSSSSFSSSSSFSSSSSFSSSSSFSSSSSF

Consequence: the KV cache — most models' memory pain point at long context — barely grows here. And the KV cache is exactly what TurboQuant would help. The twist: the thing we set out to test makes itself unnecessary.

Deep tech08 / 13

KV memory, computed

Accounting for SWA, at 64k context, for three KV formats. (The 25 SWA layers contribute only ~0.2 GiB throughout.)

contextKV f16KV q8_0KV q4_0
16k1.45 GiB0.77 GiB0.41 GiB
65k5.20 GiB2.76 GiB1.46 GiB

Even in f16 at 65k it's just 5.2 GiB — nothing on a box with 45 GB free. And stock upstream q4_0 takes that to 1.46 GiB with no fork. There's simply no memory problem to solve.

Deep tech09 / 13

Speed 1 — prefill

gemma-4-26B-A4B-it Q4_K_M (15.63 GiB), Radeon 890M / Vulkan, prompt processing (t/s). The interesting bit: KV-quant doesn't help — at long context it actively hurts.

KVpp512pp16384pp65536
f16 ★467372240
q8_0449312172
q4_0433333185

In prefill, f16 wins everywhere. KV-quant is compute overhead that costs up to 28% at 65k for q8_0 (240 → 172).

Deep tech10 / 13

Speed 2 — decode at depth

Token generation (t/s) as a function of context depth. Here it inverts: in decode, q4_0 is fastest, and its lead grows with depth.

KV@ 0@ 16k@ 65k
f1624.620.716.4
q8_024.321.417.6
q4_0 ★25.521.918.5

Why the flip? Deep-context decode is memory-bandwidth-bound on KV reads: the smaller (q4_0) cache moves fewer bytes → faster. +13% at 65k (16.4 → 18.5). And upstream already delivers it — no fork.

11 / 13

Context

How does it stack up to our current house model?

Important: our house Qwen is not the factory 35B — in an earlier experiment we pruned it to K=187 (256→187 experts, 21.3→15.9 GiB), and that's been live since May 2026. So we're comparing gemma-4 against our already-trimmed house model — which happens to be almost exactly the same size.

Qwen3.6-A3B · aemie K=187gemma-4-26B-A4B
Active / total3B / ~26.6B4B / 26B
GGUF (Q4)15.9 GiB15.6 GiB
Decode~21.9 t/s~24.6 t/s

At nearly identical size (15.6 vs 15.9 GiB), gemma-4 decodes ~12% faster — and that's against our already-pruned house model. What took us a night of expert-surgery (the size), gemma delivers out of the box. A serious candidate for the house stack.

Decode is pruning-independent (only 8 experts active per token), so K=187 ≈ K=192 speed. The factory MTP variant reaches 25.8 t/s with speculative decoding, but the deployed pruned edition isn't MTP. Decode figures come from llama-bench tg runs → indicative.

12 / 13

Verdict

The TurboQuant fork isn't worth it here

1

SWA keeps the KV small to begin with. Just 5.2 GiB at 65k in f16. No memory problem to solve.

2

Upstream q4_0 already covers the benefit. No fork: 1.46 GiB KV @65k and the fastest deep-context decode (+13%).

3

Our earlier Qwen benchmark said the same. “No tuning knob beats the upstream default by >1%.” The wall is memory bandwidth (LPDDR5x ~120 GB/s), not the KV format.

13 / 13

Takeaway

The good answer wasn't the one we were after

The original question was: “does TurboQuant speed this model up?” The real answer: the question is mis-framed — on this architecture (SWA) and this hardware (bandwidth-bound), there's nothing for it to speed up.

And the two real wins didn't come from the benchmark numbers: (1) vet a suspicious repo before you let anything onto your box — and (2) a model's architecture (here: sliding-window attention) tells you more about its behaviour than the name on the marketing page.

Oh, and gemma-4 is fast and small out of the box. It might be the next house model.

gemma-4-26B-A4BTurboQuantsliding-window attentiongfx1150 (Radeon 890M)Vulkanllama.cpp