01 / 14

Lab notebook · 2026-05-28

Dropping the experts I never use

Trimming a 35-billion-parameter MoE model to fit my own home workload. One night. Two reboots. One bug that — had we missed it — would have ended the whole experiment with a false conclusion.

For everyone · 02 / 14

Two simple questions about a complicated model

A

Can we bake in the system prompt?

Every call sends the same long instruction. Can we bake it into the model's weights so we never have to send it?

B

Can we drop what we don't use?

If I always do similar things with it (coding, project management, family vacation), is there any part of the model I almost never wake up?

The post answers both — but we went deeper on question B.

For everyone · 03 / 14

What is a 'Mixture of Experts'?

Imagine a conference room with 256 specialists. Each is good at something: debugging, Hungarian grammar, Python syntax, and so on.

When the model has to pick the next word, a 'router' picks the 8 most relevant ones, and only they speak. The other 248 sit idle for that word.

This trick lets a 35-billion-parameter model run roughly as fast as a 3-billion one, because only about 3 billion parameters are 'active' per word. The whole model still has to sit in memory, though, which is why its file size matters.

So the question is: of those 256 specialists, are there any I — in my typical work — almost never call?
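For readers who prefer code to conference rooms, here is a minimal sketch of what the router does for a single word. The softmax followed by renormalising the chosen weights is an assumption about this model family, and `route_top_k` is a hypothetical helper, not something taken from llama.cpp.

```python
import numpy as np

def route_top_k(router_logits: np.ndarray, k: int = 8):
    """Return (expert_ids, weights) for one token: the k best-scoring experts."""
    # Softmax over all experts (shifted by the max for numerical stability).
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    # Indices of the k largest probabilities, highest first.
    top = np.argsort(-probs)[:k]
    # Renormalise so the chosen experts' weights sum to 1 (assumed behaviour).
    weights = probs[top] / probs[top].sum()
    return top, weights

rng = np.random.default_rng(0)
logits = rng.normal(size=256)      # one token, 256 specialists
ids, w = route_top_k(logits)
print(ids, w.round(3))
```

Only the 8 experts in `ids` do any work for that word; counting how often each id shows up across a corpus is the whole idea behind the pruning experiment.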

For everyone · 04 / 14

The night's program

A

Bake — burn in the system prompt

A short fine-tune with LoRA (small low-rank adapter weights trained on top of the frozen model): we show the model how to respond as Aemie while leaving the system prompt out of the input. Over the course of training, the style moves into the weights.

B

Prune — cut out the cold experts

Measure how often I call each specialist, then surgically cut the least-used ones out of the model file. Smaller file, faster load, same quality (we hope).

Spoiler: The bake needs an NVIDIA GPU — at home I only have an inference-only iGPU — so we just prepared it. The prune actually ran. The rest of the slides are about that.

For everyone · 05 / 14

The headline

A quarter smaller, barely a dent in quality

| Variant | Size | Quality (PPL) | Prefill |
|---|---|---|---|
| Original (256 experts) | 22 GB | 8.82 | 353 t/s |
| K=224 (12.5% cut) | 19 GB (−14%) | 8.83 (+0.12%) | 371 t/s (+5%) |
| K=192 (sweet spot) | 17 GB (−23%) | 8.92 (+1.13%) | 387 t/s (+10%) |
| K=128 (half cut) | 12 GB (−45%) | 9.33 (+5.83%) | 464 t/s (+32%) |

K=192 is the sweet spot: 23% smaller, with nearly unchanged quality (a 1.13% bump in perplexity, the 'how uncertain is the model about the next word' measure; lower is better). That's free money.
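To make the perplexity number less abstract, here is a small sketch of the standard definition (the exponential of the mean negative log-likelihood per token) plus the percentage arithmetic behind the table. The function names are hypothetical, and the delta is recomputed from the rounded table values, so it can differ from the slide in the last digit.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def rel_change_pct(new, base):
    """Relative change of `new` against `base`, in percent."""
    return 100.0 * (new - base) / base

# A model that gives every token probability 1/8 is "choosing among 8 options".
uniform = [math.log(1 / 8)] * 100
ppl = perplexity(uniform)

# K=192 against the original, from the rounded table values.
delta_192 = rel_change_pct(8.92, 8.82)
print(round(ppl, 2), round(delta_192, 2))
```

Read this way, a perplexity of 8.82 means the model is on average about as unsure as if it were picking between roughly nine equally likely next words.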

For everyone · 06 / 14

What does usage look like on one layer?

256 experts of a representative layer, sorted by use. A few stars on the left do most of the work; the cold tail trails off to the right.

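[Chart: selection counts for the 256 experts of one layer, sorted from expert #1 (most used) on the left to #256 on the right.]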

The mildly surprising bit: there's no expert that NEVER fires. The training deliberately spreads the work. But the tail is thin enough to cut.

Deep tech · 07 / 14

The profiler: a llama.cpp eval-callback

A new example in the llama.cpp tree that taps the router decisions on every forward pass.

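// build_moe_ffn() in llama-graph.cpp:
//   cb(selected_experts->src[0], "ffn_moe_argsort", il);  // [256, n_tokens] I32, contiguous
//   cb(selected_experts,         "ffn_moe_topk",    il);  // [8, n_tokens], a VIEW (gotcha!)

#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// Assumed file-level state: counts[layer][expert], sized before the first eval.
static std::vector<std::vector<uint64_t>> counts;
static const int n_used = 8;                              // experts selected per token

static bool moe_cb_eval(ggml_tensor * t, bool ask, void * /*ud*/) {
    const bool is_argsort = strncmp(t->name, "ffn_moe_argsort", 15) == 0;
    if (ask) return is_argsort;                           // ask phase: request only argsort tensors
    if (!is_argsort) return true;
    const int il = atoi(strrchr(t->name, '-') + 1);       // name is "ffn_moe_argsort-<layer>"
    const int64_t n_expert = t->ne[0], n_tokens = t->ne[1];
    std::vector<int32_t> ids(n_expert * n_tokens);
    ggml_backend_tensor_get(t, ids.data(), 0, ggml_nbytes(t));   // contiguous parent
    for (int64_t tok = 0; tok < n_tokens; ++tok)
        for (int rank = 0; rank < n_used; ++rank)         // top-8 per token
            counts[il][ ids[tok * n_expert + rank] ]++;
    return true;
}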

Inputs: existing GGUF + corpus. Output: a JSON with per-layer × per-expert selection counts.
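The step from that JSON to a pruning decision can be sketched as below. The JSON layout (a top-level "counts" list with one row per layer) and the helper name `keep_sets` are assumptions for illustration; the real profiler output may be shaped differently.

```python
import json
import numpy as np

def keep_sets(counts, k):
    """For each layer, return the sorted ids of the k most-selected experts."""
    counts = np.asarray(counts)                   # shape: (n_layers, n_experts)
    # Descending by count; the stable sort breaks ties in favour of the lower id.
    order = np.argsort(-counts, axis=1, kind="stable")
    return [sorted(row[:k].tolist()) for row in order]

# Hypothetical profiler output: 2 layers x 6 experts.
profile = json.loads('{"counts": [[5, 0, 9, 1, 7, 3], [2, 8, 0, 6, 1, 4]]}')
keep = keep_sets(profile["counts"], k=3)
print(keep)   # [[0, 2, 4], [1, 3, 5]]
```

Returning the ids in ascending order keeps the surviving experts in their original relative order, which makes the pruned file easier to compare against the original.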

Deep tech · 08 / 14

Gotcha — almost killed the experiment

When perfect equality IS the bug

FIRST RUN · BROKEN

Every expert in every layer got between 1065 and 1068 selections. std/mean = 0.001.

Statistically impossible across 34k tokens → must be a bug.

FIXED RUN · REAL

min=0, max=92, std/mean=2.66 in deeper layers. Strongly skewed, real routing.

This shape is what made the rest of the pruning possible.

Diagnosis & fix

The ffn_moe_topk tensor is a non-contiguous view over the full argsort. Reading it with ggml_backend_tensor_get as if it were contiguous returned whole 0..255 permutations instead of each token's top 8, so every expert showed up equally often. Fix: read the contiguous parent ffn_moe_argsort ([256, n_tokens]) and take the first 8 entries per token; those are the real top-8.
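The failure mode is easy to reproduce outside ggml. The sketch below is a numpy stand-in, not the actual backend code: it builds one full permutation per token from deliberately skewed synthetic scores, then counts experts the broken way (a contiguous read of 8 × n_tokens ints) and the fixed way (first 8 entries of each token's row). All sizes and the bias are made up for illustration.

```python
import numpy as np

n_expert, n_used, n_tokens = 256, 8, 4096
rng = np.random.default_rng(0)

# Skewed router scores: low-numbered experts are strongly favoured.
bias = -0.05 * np.arange(n_expert)
scores = rng.normal(size=(n_tokens, n_expert)) + bias
argsort = np.argsort(-scores, axis=1).astype(np.int32)   # one full permutation per token

flat = argsort.ravel()            # the contiguous parent buffer, token after token

# BROKEN: treat the top-k view as contiguous and read n_used * n_tokens ints.
broken = np.bincount(flat[: n_used * n_tokens], minlength=n_expert)

# FIXED: stride by n_expert and take the first n_used entries of each token.
fixed = np.bincount(argsort[:, :n_used].ravel(), minlength=n_expert)

print(broken.min(), broken.max())     # 128 128
print(fixed.std() / fixed.mean())     # well above zero: the skew is visible
```

The broken read covers exactly 128 complete permutations, so every expert is counted 128 times no matter how skewed the routing is. In the real run the batch sizes did not divide as evenly, which is presumably why the counts wobbled between 1065 and 1068 instead of being identical.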

Deep tech · 09 / 14

The real routing mass

How much routing mass the top-K experts cover per layer, on the combined dev+pm+family corpus. The worst-layer column is the critical one: the GGUF carries a single global expert_count, so every layer is cut to the same K and the weakest layer sets the limit.

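| K | mass (mean) | mass (worst layer) | % experts |
|---|---|---|---|
| 256 | 100.00% | 100.00% | 100.0% |
| 224 | 99.31% | 95.13% | 87.5% |
| 192 ★ | 97.65% | 88.65% | 75.0% |
| 160 | 94.95% | 81.04% | 62.5% |
| 128 | 90.80% | 72.43% | 50.0% |
| 96 | 84.02% | 62.38% | 37.5% |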
  • per-layer max/min expert-usage ratio: median 2810 (uniform = 1.0)
  • deeper layers are more skewed: layer 0 std/mean=1.75, layer 39 = 3.19
  • 0 completely dead experts (intended side-effect of load-balancing loss)
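
The table columns and the std/mean figures above can be computed from the per-layer counts roughly as follows. This is a sketch under the assumption that "mass" means the share of selection counts (not router probability); `routing_mass` and `skew` are hypothetical helper names.

```python
import numpy as np

def routing_mass(counts, k):
    """Share of all selections captured by each layer's k most-used experts.

    Returns (mean over layers, worst layer)."""
    counts = np.asarray(counts, dtype=np.float64)     # (n_layers, n_experts)
    top_k = -np.sort(-counts, axis=1)[:, :k]          # k largest counts per layer
    per_layer = top_k.sum(axis=1) / counts.sum(axis=1)
    return per_layer.mean(), per_layer.min()

def skew(counts):
    """Per-layer std/mean of expert counts (0 means perfectly uniform)."""
    counts = np.asarray(counts, dtype=np.float64)
    return counts.std(axis=1) / counts.mean(axis=1)

# Toy data: one skewed layer, one perfectly uniform layer.
toy = [[50, 30, 15, 5], [25, 25, 25, 25]]
mean_mass, worst_mass = routing_mass(toy, k=2)
print(mean_mass, worst_mass, skew(toy))
```

In the toy data the skewed layer keeps 80% of its mass with half the experts, while the uniform layer keeps only 50%; the uniform layer is the "worst layer" that would cap how far a single global K can go.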

Deep tech · 10 / 14

Surgery: raw-slab gather, no requant

The trick that lets us skip dequantization: in the 4 expert-bearing tensors the expert index is the outermost ggml dim. gguf-py's numpy byte-shape exposes it as axis 0, so data[keep_indices] is a single fancy-index and that is the whole operation. The Q4_K/Q5_K/Q6_K block quantization is preserved byte-for-byte.

# prune_gguf.py, the core
for t in reader.tensors:
    data = t.data
    if expert_tensor_re.match(t.name):          # ffn_{gate,up,down}_exps
        il = layer_from_name(t.name)
        data = np.ascontiguousarray(data[keepset(il)])   # axis 0 = expert
    elif router_re.match(t.name):                # ffn_gate_inp
        il = layer_from_name(t.name)             # router rows use this layer's own keep set
        data = np.ascontiguousarray(data[keepset(il)])
    # *_shexp (shared expert) matches neither branch: copied verbatim, never pruned
    writer.add_tensor_info(t.name, data.shape, data.dtype, data.nbytes, t.tensor_type)
  • 4 affected tensors / MoE layer: gate_exps + up_exps + down_exps + gate_inp (router)
  • *_shexp (always-on shared expert) copied verbatim
  • global metadata: expert_count = K, expert_used_count stays at 8
  • 23 GB → pruned GGUF: ~12 seconds including disk write
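
Why the gather is safe on quantized data can be shown without gguf-py at all. The sketch below uses a random uint8 array as a stand-in for the byte view of one expert tensor; the shapes and the keep set are invented for illustration, and only the 144-byte Q4_K block size is a real constant.

```python
import numpy as np

n_expert, rows, row_bytes = 8, 4, 144      # toy sizes; 144 bytes is one Q4_K block
rng = np.random.default_rng(0)

# Byte view of a quantized expert tensor: axis 0 is the expert index.
slab = rng.integers(0, 256, size=(n_expert, rows, row_bytes), dtype=np.uint8)

keep = np.array([0, 2, 3, 6])              # hypothetical keep set, ascending
pruned = np.ascontiguousarray(slab[keep])  # fancy-index on axis 0 copies whole experts

# The router is cut with the same indices so that row i still scores expert i.
router = rng.normal(size=(n_expert, 16)).astype(np.float32)   # (n_expert, n_embd)
router_pruned = np.ascontiguousarray(router[keep])

print(pruned.shape, router_pruned.shape)   # (4, 4, 144) (4, 16)
```

Because each expert occupies one contiguous slab, the quantization blocks inside it are never split or reinterpreted; the only thing that changes is how many slabs there are. Using the same keep set for the expert tensors and the router is what keeps expert ids and router rows aligned.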

Deep tech · 11 / 14

Speed: prefill scales, generation doesn't

llama-bench on gfx1150 Vulkan, ngl=99. Interesting pattern: prefill (prompt processing) gets faster as K shrinks, but generation is essentially flat, because only 8 experts are active per token regardless of K.

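| model | size | params | pp256 (t/s) | tg64 (t/s) |
|---|---|---|---|---|
| baseline (256) | 21.3 GiB | 35.5 B | 352.8 ± 7.6 | 22.85 |
| 224 | 18.9 GiB | 31.4 B | 371.0 (+5%) | 22.80 |
| 192 ★ | 16.6 GiB | 27.2 B | 386.8 (+10%) | 21.86 |
| 128 | 11.9 GiB | 19.0 B | 464.3 (+32%) | 23.43 |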

By the way: K=128 generates at the same speed (23 t/s) as the baseline 35 B — but at half the size. For certain use cases (fast MTP drafts, mobile), that's interesting.
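The flat generation speed can be sanity-checked from the params column alone. The back-of-envelope below uses only the rounded table values, so treat the results as estimates; the split into "per expert" and "everything else" is my own simplification (it ignores the small shrink of the router).

```python
# Rounded values from the llama-bench table: experts kept -> total params (billions).
params_b = {256: 35.5, 224: 31.4, 192: 27.2, 128: 19.0}

# Parameters removed per dropped expert index (summed over all layers).
per_expert_b = (params_b[256] - params_b[128]) / (256 - 128)

# What remains once all 256 routed experts are subtracted:
# attention, embeddings, shared expert, router, norms.
non_expert_b = params_b[256] - 256 * per_expert_b

# Per-token active parameters with 8 routed experts, which does not depend on K.
active_b = non_expert_b + 8 * per_expert_b
print(f"{per_expert_b:.3f} B/expert, {non_expert_b:.1f} B fixed, {active_b:.2f} B active")
# 0.129 B/expert, 2.5 B fixed, 3.53 B active
```

The estimate of about 3.5 B active parameters is in line with the "A3B" in the model name, and K never enters the last formula, which matches the flat tg64 column.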

Deep tech · 12 / 14

One leftover mystery

K=128 + llama-perplexity + Vulkan = GPU freeze

Loading the K=128 file into the perplexity tool got stuck in uninterruptible kernel state during amdgpu buffer allocation. kill -9 didn't take. Two amdgpu_gpu_recover calls didn't free it. A reboot was required.

Name: llama-perplexit

State: D (disk sleep)

VmRSS: 12057500 kB

wchan: drm_suballoc_new

But interesting:

llama-bench loaded and ran the same K=128 file on Vulkan with NO issue (32% faster prefill, 23 t/s gen). So K=128 isn't corrupt; some specific tool × Vulkan × this-shape interaction triggers an amdgpu buffer deadlock. For PPL on K=128 the CPU path (-ngl 0) is safe, and that's where the 9.33 figure came from.
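For next time, spotting the stuck state can be scripted. The sketch below parses "Key: value" text in the style of /proc/<pid>/status; the helper names are hypothetical, and the sample is just the three status fields from the slide (wchan lives in a separate /proc file and is not parsed here).

```python
def parse_proc_status(text):
    """Parse 'Key: value' lines (as in /proc/<pid>/status) into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def is_uninterruptible(fields):
    """True if the State field starts with 'D' (uninterruptible sleep)."""
    return fields.get("State", "").startswith("D")

sample = """Name: llama-perplexit
State: D (disk sleep)
VmRSS: 12057500 kB"""
info = parse_proc_status(sample)
print(info["Name"], is_uninterruptible(info))   # llama-perplexit True
```

Two details in that output are expected rather than suspicious: the name is cut short because the kernel's command-name field holds at most 15 characters, and a process in state D does not act on signals until its kernel wait finishes, which is why kill -9 had no effect.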

For everyone · 13 / 14

Two reboots in one night — and the stack came back both times by itself

The first reboot I triggered for the K=128 GPU freeze. The second — an hour later — was a classic power outage. (Really.)

After both boots, the full stack came back on its own: the LLM server (lemond), 7 user services, 13 cron timers, 11 docker containers. No manual restore script needed.

The trifecta of enabled systemd units, user lingering, and Docker's unless-stopped restart policy is a quiet defense-in-depth win. (Also: time to buy a UPS.)

14 / 14

Takeaway

The wrong question and the right question

My original question was: "Is there any expert in this model I NEVER use?" The answer, it turns out, is no — in a load-balancing-loss-trained model, every expert fires somewhere.

The right question is: "How much quality can we trade for X% size?" That has a precise answer: at K=192, 23% size cut, ~1% PPL cost, ~same generation speed. At K=128, half the size at ~6% PPL — a different category of use case (fast draft, mobile).

The bake (Track A) is still ahead. The working hypothesis is that bake and prune are synergistic: a style-fine-tuned model should have more concentrated routing → more prunable experts. That still has to be measured.

Qwen3.6-35B-A3B · qwen35moe · gfx1150 (Radeon 890M) · Vulkan · llama.cpp