01 / 14

Lab notebook · 2026-05-28

Dropping the experts I never use

Trimming a 35-billion-parameter MoE model to fit my own home workload. One night. Two reboots. One bug that — had we missed it — would have ended the whole experiment with a false conclusion.


For everyone · 02 / 14

Two simple questions about a complicated model

Can we bake in the system prompt?

Every call sends the same long instruction. Can we bake it into the model's weights so we never have to send it?

Can we drop what we don't use?

If I always do similar things with it (coding, project management, family vacation), is there any part of the model I almost never wake up?

The post answers both, but we went deeper on the second question: pruning.

For everyone · 03 / 14

What is a 'Mixture of Experts'?

Imagine a conference room with 256 specialists. Each is good at something: debugging, Hungarian grammar, Python syntax, and so on.

When the model has to pick the next word, a 'router' picks the 8 most relevant ones, and only they speak. The other 248 sit idle for that word.

This trick lets a 35-billion-parameter model run as fast as a 3-billion one — because only about 3 billion parameters are 'active' per word.
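To make the "router picks 8 of 256" step concrete, here is a minimal sketch of top-k routing for a single token. The random scores stand in for a real router's output, and renormalising the chosen weights so they sum to 1 is an assumption (models differ on that detail); the 256 and 8 come from the slide.

```python
import numpy as np

N_EXPERTS = 256   # specialists in the room (from the slide)
N_USED = 8        # how many the router lets speak per token

def route(router_logits, n_used=N_USED):
    """Return (expert_ids, weights) for one token: the n_used highest-scoring experts."""
    # Softmax over all experts turns raw scores into probabilities.
    shifted = router_logits - router_logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # Indices of the n_used largest probabilities, best first.
    top = np.argsort(-probs)[:n_used]
    # Renormalise so the chosen experts' weights sum to 1 (assumed; models differ).
    weights = probs[top] / probs[top].sum()
    return top, weights

rng = np.random.default_rng(0)
ids, w = route(rng.normal(size=N_EXPERTS))   # random scores stand in for a real router
print("experts that speak:", ids)
print("experts idle for this token:", N_EXPERTS - len(ids))
```

Only the experts in `ids` run for this token, which is why the compute per word tracks the active parameters rather than the full 35 billion.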

So the question is: of those 256 specialists, are there any I — in my typical work — almost never call?

For everyone · 04 / 14

The night's program

Bake — burn in the system prompt

A short finetune (called LoRA): we show the model how to respond as Aemie while leaving the system prompt out of the input. Over time the style moves into the weights.

Prune — cut out the cold experts

Measure how often I call each specialist, then surgically cut the least-used ones out of the model file. Smaller file, faster load, same quality (we hope).

Spoiler: The bake needs an NVIDIA GPU, and at home I only have an inference-only iGPU, so that track was only prepared, not run. The prune actually ran. The rest of the slides are about that.

For everyone · 05 / 14

The headline

A quarter smaller, barely a dent in quality

Variant	Size	Quality (PPL)	Prefill (t/s)
Original (256 experts)	22 GB	8.82	353
K=224 (12.5% cut)	19 GB (−14%)	8.83 (+0.12%)	371 (+5%)
K=192 ★ sweet spot	17 GB (−23%)	8.92 (+1.13%)	387 (+10%)
K=128 (half cut)	12 GB (−45%)	9.33 (+5.83%)	464 (+32%)

K=192 is the sweet spot: 23% smaller, with effectively unchanged quality (a 1.13% rise in perplexity, the measure of how uncertain the model is about the next word). That's free money.
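For readers who want the perplexity number unpacked: it is the exponential of the average negative log-probability the model assigns to each actual next token. The sketch below shows that definition and recomputes the relative PPL cost from the rounded values in the table; because the inputs are rounded, the last digit may differ slightly from the table's percentages, which were presumably computed from unrounded measurements.

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-probability (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pct_change(new, base):
    """Relative change of new versus base, in percent."""
    return 100.0 * (new - base) / base

# Intuition check: a model that always spreads its bet evenly over 8 candidate
# words has a perplexity of about 8.
uniform = [math.log(1 / 8)] * 100
print("uniform-over-8 perplexity:", perplexity(uniform))

# Relative PPL cost of each pruned variant, from the rounded table values.
base_ppl = 8.82
for k, ppl in [(224, 8.83), (192, 8.92), (128, 9.33)]:
    print(f"K={k}: {pct_change(ppl, base_ppl):+.2f}%")
```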

For everyone · 06 / 14

What does usage look like on one layer?

256 experts of a representative layer, sorted by use. A few stars on the left do most of the work; the cold tail trails off to the right.

[Chart: per-expert selection counts for one layer, from expert #1 (most used) on the left to expert #256 on the right.]

The mildly surprising bit: there's no expert that NEVER fires. The training deliberately spreads the work. But the tail is thin enough to cut.

Deep tech · 07 / 14

The profiler: a llama.cpp eval-callback

A new example in the llama.cpp tree that taps the router decisions on every forward pass.

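// build_moe_ffn() in llama-graph.cpp:
//   cb(selected_experts->src[0], "ffn_moe_argsort", il);  // [256, n_tokens] I32, contiguous
//   cb(selected_experts,         "ffn_moe_topk",    il);  // [8, n_tokens], a VIEW (gotcha!)

#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>
#include "ggml-backend.h"

struct moe_stats { int n_used = 8; std::vector<std::vector<int64_t>> counts; };  // illustrative; counts[layer][expert], pre-sized by the caller

static bool moe_cb_eval(ggml_tensor * t, bool ask, void * ud) {
    const bool is_argsort = strncmp(t->name, "ffn_moe_argsort", 15) == 0;
    if (ask) return is_argsort;                            // ask phase: request only the argsort tensor
    if (!is_argsort) return true;                          // data phase: ignore everything else
    auto * st = static_cast<moe_stats *>(ud);
    const int il = atoi(strrchr(t->name, '-') + 1);        // layer index from the "-<il>" name suffix
    const int64_t n_expert = t->ne[0], n_tokens = t->ne[1];
    std::vector<int32_t> ids(n_expert * n_tokens);
    ggml_backend_tensor_get(t, ids.data(), 0, ggml_nbytes(t));   // contiguous parent, safe to copy flat
    for (int64_t tok = 0; tok < n_tokens; ++tok)
        for (int rank = 0; rank < st->n_used; ++rank)      // top-8 per token
            st->counts[il][ ids[tok * n_expert + rank] ]++;
    return true;
}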

Inputs: existing GGUF + corpus. Output: a JSON with per-layer × per-expert selection counts.

Deep tech · 08 / 14

Gotcha — almost killed the experiment

When perfect equality IS the bug

FIRST RUN · BROKEN

Every expert in every layer got between 1065 and 1068 selections. std/mean = 0.001.

Real routing over 34k tokens cannot be that uniform → must be a bug.

FIXED RUN · REAL

min=0, max=92, std/mean=2.66 in deeper layers. Strongly skewed, real routing.

This shape is what made the rest of the pruning possible.

Diagnosis & fix

The ffn_moe_topk tensor is a non-contiguous view over the full argsort. Reading it with ggml_backend_tensor_get, which copies a flat run of bytes, pulled in whole 0..255 permutations instead of just the top 8 per token, so every expert was counted almost exactly equally often. Fix: read the contiguous parent ffn_moe_argsort ([256, n_tokens]) and take the first 8 entries for each token; those are the real top-8.
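The failure mode is easy to reproduce outside llama.cpp. The NumPy sketch below uses a synthetic, deliberately skewed router and compares two reads of the same argsort buffer. How exactly the broken read walked memory is an assumption here (it takes the first n_used × n_tokens values of the flat buffer); the sizes and the random router are illustrative, not the real model.

```python
import numpy as np

n_tokens, n_expert, n_used = 3200, 256, 8
rng = np.random.default_rng(0)

# Synthetic router: some experts are systematically preferred (skewed bias).
bias = rng.normal(0.0, 2.0, n_expert)
logits = bias + rng.normal(0.0, 1.0, (n_tokens, n_expert))

# Full argsort per token, best expert first; each token's 256 ids are contiguous.
argsort = np.argsort(-logits, axis=1).astype(np.int32)

def cv(counts):
    """std/mean of the per-expert selection counts."""
    return counts.std() / counts.mean()

# Buggy read (assumed failure mode): treat the top-k view as contiguous and
# take the first n_used * n_tokens values of the flat buffer.
buggy_ids = argsort.reshape(-1)[: n_used * n_tokens]
buggy = np.bincount(buggy_ids, minlength=n_expert)

# Fixed read: the first n_used entries of every token's argsort row.
fixed_ids = argsort[:, :n_used].reshape(-1)
fixed = np.bincount(fixed_ids, minlength=n_expert)

print("buggy std/mean:", cv(buggy), "min:", buggy.min(), "max:", buggy.max())
print("fixed std/mean:", cv(fixed), "min:", fixed.min(), "max:", fixed.max())
```

The buggy read only ever sees complete permutations of 0..255, so its counts come out flat no matter how skewed the router is. That is the signature to watch for: a suspiciously perfect histogram.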

Deep tech · 09 / 14

The real routing mass

How much routing mass the top-K experts cover per layer, on the combined dev+pm+family corpus. The worst layer column is critical: it limits uniform GGUF pruning.

K	mass(mean)	mass(worst layer)	% experts
256	100.00%	100.00%	100.0%
224	99.31%	95.13%	87.5%
192 ★	97.65%	88.65%	75.0%
160	94.95%	81.04%	62.5%
128	90.80%	72.43%	50.0%
96	84.02%	62.38%	37.5%

· per-layer max/min expert-usage ratio: median 2810 (uniform = 1.0)
· deeper layers are more skewed: layer 0 std/mean=1.75, layer 39 = 3.19
· 0 completely dead experts (intended side-effect of load-balancing loss)
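The mass columns in the table can be computed from the profiler's count matrix in a few lines. This is a sketch of one way to do it, not the exact script used: the function names are made up for illustration, and the toy 4 × 16 count matrix stands in for the real per-layer × per-expert JSON.

```python
import numpy as np

def topk_mass(counts, k):
    """Per-layer fraction of selections covered by the k most-used experts.

    counts has shape [n_layer, n_expert].
    """
    sorted_desc = -np.sort(-counts, axis=1)          # each layer, most-used first
    return sorted_desc[:, :k].sum(axis=1) / counts.sum(axis=1)

def keep_indices(counts, k):
    """Per-layer indices of the k most-used experts, ascending. Shape [n_layer, k]."""
    top = np.argsort(-counts, axis=1, kind="stable")[:, :k]
    return np.sort(top, axis=1)

# Toy stand-in for the profiler output: 4 layers x 16 experts, skewed usage.
rng = np.random.default_rng(0)
counts = rng.gamma(0.3, 100.0, size=(4, 16)).astype(np.int64) + 1

for k in (16, 12, 8):
    mass = topk_mass(counts, k)
    print(f"K={k:2d}  mean={mass.mean():.4f}  worst layer={mass.min():.4f}")

print("keep-set for layer 0 at K=8:", keep_indices(counts, 8)[0])
```

Because a uniform prune uses the same K in every layer, the minimum over layers (the worst-layer column) is the number that bounds how much routing mass any single layer loses.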

Deep tech · 10 / 14

Surgery: raw-slab gather, no requant

The trick that lets us skip dequant: in the 4 expert-bearing tensors the expert index is the outermost ggml dim. gguf-py's numpy byte-shape exposes it as axis 0 → data[keep_indices] is a fancy-index and DONE. The Q4_K/Q5_K/Q6_K block-quantization is preserved byte-for-byte.

# prune_gguf.py: the core
for t in reader.tensors:
    data = t.data
    is_expert = expert_tensor_re.match(t.name)   # ffn_{gate,up,down}_exps
    is_router = router_re.match(t.name)          # ffn_gate_inp
    if is_expert or is_router:
        il = layer_from_name(t.name)             # resolve the layer for both cases, never reuse a stale il
        # Same keep-set for experts and router, so surviving experts keep matching router rows.
        data = np.ascontiguousarray(data[keepset(il)])   # axis 0 = expert
    writer.add_tensor_info(t.name, data.shape, data.dtype, data.nbytes, t.tensor_type)
    # *_shexp (shared expert) -> copied verbatim, never pruned

· 4 affected tensors / MoE layer: gate_exps + up_exps + down_exps + gate_inp (router)
· *_shexp (always-on shared expert) copied verbatim
· global metadata: expert_count = K, expert_used_count stays at 8
· 23 GB → pruned GGUF: ~12 seconds including disk write
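Why the gather is safe without requantizing can be checked on toy data. The sketch below uses random bytes as a stand-in for a quantized expert tensor and a small float matrix as a stand-in for the router; all sizes and the keep-set are made up for illustration. It checks that each surviving expert's slab is byte-identical and that the router scores of the survivors are unchanged, only renumbered.

```python
import numpy as np

rng = np.random.default_rng(0)
n_expert, n_rows, row_bytes, n_embd = 16, 4, 144, 32   # toy sizes, not the real model's

# Stand-in for a quantized expert tensor viewed as raw bytes: [expert, row, bytes].
expert_bytes = rng.integers(0, 256, size=(n_expert, n_rows, row_bytes), dtype=np.uint8)
# Stand-in for the router weight: one row of n_embd values per expert.
router = rng.normal(size=(n_expert, n_embd))

keep = np.array([0, 2, 3, 7, 8, 11, 13, 15])           # hypothetical keep-set for one layer

# The gather: fancy-index axis 0, then make the result contiguous for writing.
pruned_bytes = np.ascontiguousarray(expert_bytes[keep])
pruned_router = np.ascontiguousarray(router[keep])

# Each surviving expert's slab is byte-for-byte identical to the original.
slabs_identical = all(
    pruned_bytes[new_id].tobytes() == expert_bytes[old_id].tobytes()
    for new_id, old_id in enumerate(keep)
)

# Router scores for the survivors are unchanged; they are just renumbered 0..K-1.
x = rng.normal(size=n_embd)
old_scores = router @ x
new_scores = pruned_router @ x
scores_match = np.allclose(new_scores, old_scores[keep])

print("slabs identical:", slabs_identical, "| router scores match:", scores_match)
```

The one thing that must stay in lockstep is the keep-set: the expert tensors and the router rows of a layer have to be gathered with the same indices in the same order, otherwise the router would score one expert and dispatch to another.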

Deep tech · 11 / 14

Speed: prefill scales, generation doesn't

llama-bench on gfx1150 Vulkan, ngl=99. Interesting pattern: prefill (prompt processing) scales with K — but generation is essentially flat, because only 8 experts are active per token regardless.

model	size	params	pp256 t/s	tg64 t/s
baseline (256)	21.3 GiB	35.5 B	352.8 ± 7.6	22.85
224	18.9 GiB	31.4 B	371.0 +5%	22.80
192 ★	16.6 GiB	27.2 B	386.8 +10%	21.86
128	11.9 GiB	19.0 B	464.3 +32%	23.43

By the way: K=128 generates at the same speed (23 t/s) as the baseline 35 B — but at half the size. For certain use cases (fast MTP drafts, mobile), that's interesting.
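The params column also explains the flat generation speed. A back-of-the-envelope fit, using only the rounded numbers from the table, splits the model into a fixed part and a per-expert part; the split is an estimate from rounded values, not a measured breakdown of the architecture.

```python
# Parameter counts from the llama-bench table, in billions.
params = {256: 35.5, 224: 31.4, 192: 27.2, 128: 19.0}

# Slope between the two extremes: parameters per expert index, summed over all layers.
per_expert = (params[256] - params[128]) / (256 - 128)
# Everything that is not a routed expert (attention, shared expert, embeddings, ...).
fixed = params[256] - 256 * per_expert

def predicted(k):
    """Estimated total parameters (billions) with k experts kept per layer."""
    return fixed + k * per_expert

for k in (224, 192):
    print(f"K={k}: predicted {predicted(k):.2f} B, table {params[k]} B")

# Parameters touched per token do not depend on K: 8 experts plus the fixed part.
active = fixed + 8 * per_expert
print(f"estimated active per token: {active:.2f} B")
```

The intermediate rows land within rounding of the linear fit, and the estimated active count is independent of K. That matches the table: pruning shrinks what has to be stored and loaded, not what each generated token computes.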

Deep tech · 12 / 14

One leftover mystery

K=128 + llama-perplexity + Vulkan = GPU freeze

Loading the K=128 file into the perplexity tool got stuck in uninterruptible kernel state during amdgpu buffer allocation. kill -9 didn't take. Two amdgpu_gpu_recover calls didn't free it. A reboot was required.

Name: llama-perplexit

State: D (disk sleep)

VmRSS: 12057500 kB

wchan: drm_suballoc_new

But interesting:

llama-bench loaded and ran the same K=128 file on Vulkan with no issue (32% faster prefill, 23 t/s generation). So the K=128 file isn't corrupt; some specific interaction between llama-perplexity, Vulkan, and this tensor shape triggers an amdgpu buffer deadlock. For PPL on K=128 the CPU path (-ngl 0) is safe, and that's where the 9.33 figure came from.

For everyone · 13 / 14

Two reboots in one night — and the stack came back both times by itself

The first reboot I triggered for the K=128 GPU freeze. The second — an hour later — was a classic power outage. (Really.)

After both boots, the full stack came back on its own: the LLM server (lemond), 7 user services, 13 cron timers, 11 docker containers. No manual restore script needed.

The trifecta of enabled systemd units, user linger, and Docker's unless-stopped restart policy is a quiet defense-in-depth win. (Also: time to buy a UPS.)

14 / 14

Takeaway

The wrong question and the right question

My original question was: "Is there any expert in this model I NEVER use?" The answer, it turns out, is no — in a load-balancing-loss-trained model, every expert fires somewhere.

The right question is: "How much quality can we trade for X% size?" That has a precise answer: at K=192, 23% size cut, ~1% PPL cost, ~same generation speed. At K=128, half the size at ~6% PPL — a different category of use case (fast draft, mobile).

The bake is still ahead, but the working hypothesis is that bake and prune are synergistic: a style-fine-tuned model should have more concentrated routing → more prunable experts.

Qwen3.6-35B-A3B · qwen35moe · gfx1150 (Radeon 890M) · Vulkan · llama.cpp
