Lab notebook · 2026-05-28
Dropping the experts I never use
Trimming a 35-billion-parameter MoE model to fit my own home workload. One night. Two reboots. One bug that — had we missed it — would have ended the whole experiment with a false conclusion.
Two simple questions about a complicated model
Can we bake in the system prompt?
Every call sends the same long instruction. Can we bake it into the model's weights so we never have to send it?
Can we drop what we don't use?
If I always do similar things with it (coding, project management, family vacation), is there any part of the model I almost never wake up?
The post answers both, but goes deeper on the second one (question B, the prune).
What is a 'Mixture of Experts'?
Imagine a conference room with 256 specialists. Each is good at something: debugging, Hungarian grammar, Python syntax, and so on.
When the model has to pick the next word, a 'router' picks the 8 most relevant ones, and only they speak. The other 248 sit idle for that word.
This trick lets a 35-billion-parameter model run as fast as a 3-billion one — because only about 3 billion parameters are 'active' per word.
So the question is: of those 256 specialists, are there any I — in my typical work — almost never call?
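The routing step described above can be sketched in a few lines of NumPy. This is a toy illustration only: the function name `route_top_k`, the random scores, and the choice to softmax over just the selected experts are assumptions for the example, not details taken from the model.

```python
import numpy as np

def route_top_k(router_logits: np.ndarray, k: int = 8):
    """Pick the k highest-scoring experts per token and normalise their weights.

    router_logits: shape (n_tokens, n_experts), one score per expert per token.
    Returns (expert_ids, weights), both of shape (n_tokens, k).
    """
    # Experts sorted by descending score, truncated to the top k.
    expert_ids = np.argsort(-router_logits, axis=-1)[:, :k]
    top_logits = np.take_along_axis(router_logits, expert_ids, axis=-1)
    # Softmax over the selected experts only (one common convention, assumed here).
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return expert_ids, weights

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 256))   # 4 tokens, 256 specialists
ids, w = route_top_k(logits, k=8)
print(ids.shape)                     # 8 expert ids per token; the other 248 stay idle
```

Counting how often each id shows up in `ids` over a real corpus is exactly the statistic the rest of the post is built on.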
The night's program
Bake — burn in the system prompt
A short LoRA fine-tune: we show the model how to respond as Aemie while leaving the system prompt out of the input. As training proceeds, the style moves into the weights.
Prune — cut out the cold experts
Measure which specialists I call how often, then surgically cut the rarest-used ones out of the model file. Smaller file, faster load, same quality (we hope).
Spoiler: The bake needs an NVIDIA GPU — at home I only have an inference-only iGPU — so we just prepared it. The prune actually ran. The rest of the slides are about that.
The headline
A quarter smaller, barely a dent in quality
| Variant | Size | Quality (PPL) | Prefill |
|---|---|---|---|
| Original (256 experts) | 22 GB | 8.82 | 353 t/s |
| K=224 (12.5% cut) | 19 GB (−14%) | 8.83 (+0.12%) | 371 (+5%) |
| K=192 ★ sweet spot | 17 GB (−23%) | 8.92 (+1.13%) | 387 (+10%) |
| K=128 (half cut) | 12 GB (−45%) | 9.33 (+5.83%) | 464 (+32%) |
K=192 is the sweet spot: 23% smaller, with effectively unchanged quality (a 1.13% bump in perplexity, the 'how uncertain about the next word' measure). That's free money.
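For readers who want the perplexity number made concrete, here is a minimal sketch of the usual definition (the exponential of the mean negative log-likelihood per token) and of the relative-change arithmetic. The helper names are made up for this example, and the toy log-probabilities are not measurements.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def relative_change_pct(new, base):
    """Percentage change of `new` relative to `base`."""
    return 100.0 * (new - base) / base

# Toy case: a model that gives every correct token probability 1/8
# is "choosing between 8 equally likely words" at each step.
toy = [math.log(1 / 8)] * 100
print(perplexity(toy))

# The K=192 row of the table, recomputed from the rounded PPL figures.
print(round(relative_change_pct(8.92, 8.82), 2))
```

Recomputing from the rounded table values will not always reproduce the listed percentages to the last digit; the table was presumably computed from unrounded perplexities.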
What does usage look like on one layer?
256 experts of a representative layer, sorted by use. A few stars on the left do most of the work; the cold tail trails off to the right.
The mildly surprising bit: there's no expert that NEVER fires. The training deliberately spreads the work. But the tail is thin enough to cut.
The profiler: a llama.cpp eval-callback
A new example in the llama.cpp tree that taps the router decisions on every forward pass.
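// build_moe_ffn() in llama-graph.cpp:
//   cb(selected_experts->src[0], "ffn_moe_argsort", il); // [256, n_tokens] I32, contiguous
//   cb(selected_experts,         "ffn_moe_topk",    il); // [8, n_tokens], a VIEW (gotcha!)
// counts and n_used are profiler state kept outside this excerpt (e.g. reached through ud).
static bool moe_cb_eval(ggml_tensor *t, bool ask, void *ud) {
    // Only the contiguous argsort parent is of interest, never the top-k view.
    const bool is_argsort = strncmp(t->name, "ffn_moe_argsort", 15) == 0;
    if (ask) return is_argsort;                      // ask phase: do we want this tensor's data?
    if (!is_argsort) return true;
    const int il = atoi(strrchr(t->name, '-') + 1);  // layer index from the "name-<il>" suffix
    const int64_t n_expert = t->ne[0], n_tokens = t->ne[1];
    std::vector<int32_t> ids(n_expert * n_tokens);
    ggml_backend_tensor_get(t, ids.data(), 0, ggml_nbytes(t)); // copy the contiguous parent to host
    for (int64_t tok = 0; tok < n_tokens; ++tok)
        for (int rank = 0; rank < n_used; ++rank)    // top-8 per token
            counts[il][ ids[tok * n_expert + rank] ]++;
    return true;
}
Inputs: existing GGUF + corpus. Output: a JSON with per-layer × per-expert selection counts.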
Gotcha — almost killed the experiment
When perfect equality IS the bug
FIRST RUN · BROKEN
Every expert in every layer got exactly 1065–1068 selections. std/mean = 0.001.
Statistically impossible across 34k tokens → must be a bug.
FIXED RUN · REAL
min=0, max=92, std/mean=2.66 in deeper layers. Strongly skewed, real routing.
This shape is what made the rest of the pruning possible.
Diagnosis & fix
The ffn_moe_topk tensor is a non-contiguous view over the full argsort. ggml_backend_tensor_get copies raw bytes and ignores the view's strides, so it read whole 0..255 permutations instead of the top-8 slice of each token: every expert appeared exactly once per permutation, hence the perfectly even counts. Fix: read the contiguous parent ffn_moe_argsort ([256, n_tokens]) and take the first 8 entries of each token's 256-long ranking. Those are the real top-8.
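The failure mode can be reproduced in miniature with NumPy. This is a sketch of the mechanism under simplifying assumptions (toy scores, a row-major array standing in for the ggml buffer), not a trace of the actual profiler run.

```python
import numpy as np

n_expert, n_used, n_tokens = 256, 8, 64
rng = np.random.default_rng(0)

# Toy router scores, biased so that low-index experts are favoured.
scores = rng.normal(size=(n_tokens, n_expert)) - 0.02 * np.arange(n_expert)
# Full ranking per token, one token per row: the stand-in for ffn_moe_argsort.
argsort = np.argsort(-scores, axis=1).astype(np.int32)

# Broken read: pretend the top-k view is a dense buffer and copy
# n_used * n_tokens ints from the start of the parent's memory.
flat = argsort.ravel()[: n_used * n_tokens]
broken = np.bincount(flat, minlength=n_expert)

# Fixed read: walk the parent with its real stride, first n_used entries per token.
fixed = np.bincount(argsort[:, :n_used].ravel(), minlength=n_expert)

print("broken min/max:", broken.min(), broken.max())
print("fixed  min/max:", fixed.min(), fixed.max())
```

The broken read covers whole permutations of 0..255, so every expert gets the same count regardless of the scores; the fixed read recovers the skew that was put into the toy data.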
The real routing mass
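How much routing mass the top-K experts cover per layer, on the combined dev+pm+family corpus. The worst-layer column is the critical one: the GGUF carries a single global expert_count, so every layer has to keep the same K, and the layer that loses the most mass sets the limit for uniform pruning.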
| K | mass(mean) | mass(worst layer) | % experts |
|---|---|---|---|
| 256 | 100.00% | 100.00% | 100.0% |
| 224 | 99.31% | 95.13% | 87.5% |
| 192 ★ | 97.65% | 88.65% | 75.0% |
| 160 | 94.95% | 81.04% | 62.5% |
| 128 | 90.80% | 72.43% | 50.0% |
| 96 | 84.02% | 62.38% | 37.5% |
- per-layer max/min expert-usage ratio: median 2810 (uniform = 1.0)
- deeper layers are more skewed: layer 0 std/mean=1.75, layer 39 = 3.19
- 0 completely dead experts (intended side-effect of load-balancing loss)
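The statistics in the table and bullets above can be derived from a per-layer × per-expert count matrix roughly as follows. This is a sketch on synthetic counts; `topk_mass` and `skew` are names invented for the example, and the layout of the profiler's JSON is not assumed here.

```python
import numpy as np

def topk_mass(counts: np.ndarray, k: int) -> tuple[float, float]:
    """Fraction of router selections covered by each layer's k busiest experts.

    counts: (n_layers, n_experts) selection counts.
    Returns (mean over layers, worst layer).
    """
    busiest_first = np.sort(counts, axis=1)[:, ::-1]
    mass = busiest_first[:, :k].sum(axis=1) / counts.sum(axis=1)
    return float(mass.mean()), float(mass.min())

def skew(counts: np.ndarray) -> np.ndarray:
    """Per-layer std/mean of the counts (0 for perfectly uniform routing)."""
    return counts.std(axis=1) / counts.mean(axis=1)

# Synthetic counts: layer 0 perfectly uniform, layer 1 with a heavy head and thin tail.
n_experts = 256
uniform = np.full(n_experts, 100)
skewed = np.round(10_000 / np.arange(1, n_experts + 1)).astype(int)
counts = np.stack([uniform, skewed])

for k in (256, 192, 128):
    mean_mass, worst_mass = topk_mass(counts, k)
    print(f"K={k}: mean={mean_mass:.2%} worst={worst_mass:.2%}")
print("std/mean per layer:", skew(counts))
```

In this toy case the uniform layer is the worst one, because cutting half of a flat distribution always loses half the mass. The more skewed a layer is, the more cheaply its tail can be cut.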
Surgery: raw-slab gather, no requant
The trick that lets us skip dequant: in the 4 expert-bearing tensors the expert index is the outermost ggml dim. gguf-py's numpy byte-shape exposes it as axis 0 → data[keep_indices] is a fancy-index and DONE. The Q4_K/Q5_K/Q6_K block-quantization is preserved byte-for-byte.
# prune_gguf.py: the core loop
for t in reader.tensors:
    data = t.data
    if expert_tensor_re.match(t.name):      # ffn_{gate,up,down}_exps
        il = layer_from_name(t.name)
        data = np.ascontiguousarray(data[keepset(il)])   # axis 0 = expert
    elif router_re.match(t.name):           # ffn_gate_inp
        il = layer_from_name(t.name)        # same layer's keep set, so router row i still scores expert i
        data = np.ascontiguousarray(data[keepset(il)])
    writer.add_tensor_info(t.name, data.shape, data.dtype, data.nbytes, t.tensor_type)
# *_shexp (shared expert) -> copied verbatim, never pruned
- 4 affected tensors / MoE layer: gate_exps + up_exps + down_exps + gate_inp (router)
- *_shexp (always-on shared expert) copied verbatim
- global metadata: expert_count = K, expert_used_count stays at 8
- 23 GB → pruned GGUF: ~12 seconds including disk write
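To see why the gather needs no dequantization, here is a self-contained toy version of the slab selection on plain byte arrays. It does not use gguf-py; the array shapes, the 144-byte row (chosen to resemble one Q4_K block) and the helper `keepset_from_counts` are illustrative assumptions.

```python
import numpy as np

def keepset_from_counts(layer_counts: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most-selected experts, returned in ascending index order."""
    top = np.argsort(-layer_counts, kind="stable")[:k]
    return np.sort(top)

n_expert, k = 8, 6
rng = np.random.default_rng(0)

# Stand-in for one quantized expert tensor viewed as raw bytes: (expert, rows, bytes_per_row).
exps = rng.integers(0, 256, size=(n_expert, 4, 144), dtype=np.uint8)
# Stand-in for the router weights: (expert, n_embd).
router = rng.normal(size=(n_expert, 16)).astype(np.float32)
# Made-up selection counts for this layer.
counts = np.array([50, 3, 40, 1, 30, 20, 10, 5])

keep = keepset_from_counts(counts, k)
pruned_exps = np.ascontiguousarray(exps[keep])      # whole expert slabs, bytes untouched
pruned_router = np.ascontiguousarray(router[keep])  # same indices, same order

print(keep, pruned_exps.shape, pruned_router.shape)
```

Because selection happens on the outermost axis, each kept expert's bytes move as one block and the quantized payload is never decoded. Using the identical index list for the router keeps row i of the router aligned with slab i of the experts.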
Speed: prefill scales, generation doesn't
llama-bench on gfx1150 Vulkan, ngl=99. Interesting pattern: prefill (prompt processing) gets faster as K shrinks, but generation is essentially flat, because only 8 experts are active per token regardless of how many remain in the file.
| model | size | params | pp256 t/s | tg64 t/s |
|---|---|---|---|---|
| baseline (256) | 21.3 GiB | 35.5 B | 352.8 ± 7.6 | 22.85 |
| 224 | 18.9 GiB | 31.4 B | 371.0 +5% | 22.80 |
| 192 ★ | 16.6 GiB | 27.2 B | 386.8 +10% | 21.86 |
| 128 | 11.9 GiB | 19.0 B | 464.3 +32% | 23.43 |
By the way: K=128 generates at the same speed (23 t/s) as the baseline 35 B — but at half the size. For certain use cases (fast MTP drafts, mobile), that's interesting.
One leftover mystery
K=128 + llama-perplexity + Vulkan = GPU freeze
Loading the K=128 file into the perplexity tool got stuck in uninterruptible kernel state during amdgpu buffer allocation. kill -9 didn't take. Two amdgpu_gpu_recover calls didn't free it. A reboot was required.
Name: llama-perplexit
State: D (disk sleep)
VmRSS: 12057500 kB
wchan: drm_suballoc_new
But interesting:
llama-bench loaded and ran the same K=128 file on Vulkan with NO issue (32% faster prefill, 23 t/s gen). So K=128 isn't corrupt; some specific tool × Vulkan × this-shape interaction triggers an amdgpu buffer deadlock. For PPL on K=128 the CPU path (-ngl 0) is safe, and that is where the 9.33 figure came from.
Two reboots in one night — and the stack came back both times by itself
The first reboot I triggered for the K=128 GPU freeze. The second — an hour later — was a classic power outage. (Really.)
After both boots, the full stack came back on its own: the LLM server (lemond), 7 user services, 13 cron timers, 11 docker containers. No manual restore script needed.
The systemd enabled + linger + docker restart-unless-stopped trifecta is a quiet defense-in-depth win. (Also: time to buy a UPS.)
Takeaway
The wrong question and the right question
My original question was: "Is there any expert in this model I NEVER use?" The answer, it turns out, is no — in a load-balancing-loss-trained model, every expert fires somewhere.
The right question is: "How much quality can we trade for X% size?" That has a precise answer: at K=192, 23% size cut, ~1% PPL cost, ~same generation speed. At K=128, half the size at ~6% PPL — a different category of use case (fast draft, mobile).
The bake (Track A) is still ahead, and since it has not run yet the following is a hypothesis rather than a result: bake and prune should be synergistic, because a style-fine-tuned model is likely to have more concentrated routing → more prunable experts.