MTP Speculative Decoding Actually Works on MoE: 144 t/s on a 16GB GPU
TL;DR: MTP speculative decoding is the first technique that actually speeds up the Qwen 3.6 MoE on my RTX 5060 Ti. The 35B-A3B jumps from 98 to 144 t/s (1.47x). Combined with TurboQuant turbo3 KV cache, the sweet spot is 125 t/s at 98K context, still 2.4x faster than the baseline at 80K fill. The dense 27B gets 42% slower. Copy-paste server configs and benchmark data below.
Update (May 22, 2026): MTP has merged to mainline llama.cpp. The TurboQuant fork also pulled in upstream with MTP on its feature/turboquant-kv-cache branch. The custom mtp-turboquant fork I built for this post is no longer needed. Performance is unchanged on the newer build (b9418 vs b9320): 127 t/s burst, same 70% MTP acceptance, 98K context still fits on 16 GB. One note: the new TurboQuant build auto-upgrades K cache from turbo3 to q8_0 for GQA models — set TURBO_AUTO_ASYMMETRIC=0 to keep turbo3 and preserve your VRAM headroom.
Using multiple AI coding tools and losing track of sessions? I built VibeCockpit, one dashboard to search and resume sessions across Claude Code, Copilot, Codex, and more.
In my previous speculative decoding post, every approach either failed or made things slower. Draft models couldn’t handle the hybrid DeltaNet architecture. MTP changes the equation: instead of a separate draft model, Qwen 3.6 ships with a built-in prediction head that predicts 2-3 tokens ahead using the model’s own hidden states. No architecture mismatch, no alignment problems.
The Benchmark
Same hardware as all previous posts: RTX 5060 Ti 16 GB, GMKtec mini PC, Proxmox VM. MTP branch built from am17an/llama.cpp mtp-clean (PR #22673). MTP model variants from Unsloth.
MoE 35B-A3B (IQ3_S, 14 GB)
| Config | Context | Peak t/s | Speedup | Acceptance |
|---|---|---|---|---|
| Baseline q4_0 KV (no MTP) | 262K | 98 | 1.0x | — |
| MTP draft-2 + q4_0 KV | 65K | 144 | 1.47x | 81% |
| MTP draft-2 + turbo3 KV | 98K | 125 | 1.28x | 69% |
All configs use quantized KV cache. MTP’s prediction head adds ~2.8 GB VRAM overhead, so context drops from 262K to 65K (q4_0) or 98K (turbo3). Both are practical sweet spots.
How Speed Degrades With Context
Generation speed drops as context fills up. Here’s how each config holds up under load:
| Prompt Fill | Baseline q4_0 (262K) | Baseline turbo3 (262K) | MTP + q4_0 (65K) | MTP + turbo3 (98K) |
|---|---|---|---|---|
| Burst | 94 t/s | 89 t/s | 144 t/s | 125 t/s |
| ~10K tokens | 84 t/s | 68 t/s | 123 t/s | 112 t/s |
| ~40K tokens | 69 t/s | 41 t/s | 94 t/s | 82 t/s |
| ~55K tokens | 62 t/s | 34 t/s | 82 t/s | 70 t/s |
| ~80K+ tokens | 46 t/s | 25 t/s | — | 59 t/s |
MTP is faster at every context depth. The q4_0 baseline holds up better than turbo3, partly because turbo3’s Walsh-Hadamard dequantization costs more compute per attention op, and partly because the TurboQuant fork is based on a slightly older llama.cpp (April 18 vs April 25). The MTP + turbo3 build doesn’t suffer as much since it’s based on the newer MTP branch (May 14) with TurboQuant patched on top.
Dense 27B: MTP Makes It Worse
| Config | Peak t/s | Speedup |
|---|---|---|
| Baseline (no MTP) | 28.5 | 1.0x |
| MTP draft-2 | 16.4 | 0.58x |
42% slower. Both models are bandwidth-bound at batch size 1, but they consume very different amounts of bandwidth. The MoE activates only ~3B of 35B params per token, roughly 1.2 GB of weight reads at ~26% of the 5060 Ti’s 448 GB/s bandwidth. Plenty of headroom for MTP’s overhead. The dense model reads all 12 GB every token. At 28.5 t/s that’s ~342 GB/s, or ~76% of available bandwidth. MTP’s extra weight reads tip it into contention.
Server Configs
MTP-only (MTP branch, q4_0 KV, 65K context, 121 t/s):
~/llama-mtp/build/bin/llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-IQ3_S.gguf \
-ngl 99 -c 65536 \
-ctk q4_0 -ctv q4_0 \
--spec-type draft-mtp --spec-draft-n-max 2 \
-np 1 --host 0.0.0.0 --port 11433
MTP + TurboQuant (merged fork, turbo3 KV, 98K context, 125 t/s):
TURBO_AUTO_ASYMMETRIC=0 ~/llama-mtp/build/bin/llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-IQ3_S.gguf \
-ngl 99 -c 98304 \
-ctk turbo3 -ctv turbo3 \
--spec-type draft-mtp --spec-draft-n-max 2 \
-np 1 --host 0.0.0.0 --port 11433
TURBO_AUTO_ASYMMETRIC=0 prevents the TurboQuant build from auto-upgrading K cache from turbo3 to q8_0 on GQA models, which would consume too much VRAM for the 98K context window to fit.
Key flags: --spec-type draft-mtp (not mtp), --spec-draft-n-max 2 (draft-3 drops acceptance from 81% to 70% with no speed gain), -np 1 (required, no parallel slots with MTP yet). First request after startup is slow due to CUDA graph compilation; send a throwaway prompt to warm up.
Building It
MTP support hasn’t merged to mainline yet. Update: MTP merged to mainline in May 2026. The build instructions below still work, but you can now use mainline llama.cpp directly. For MTP-only with q4_0 KV:
git clone https://github.com/am17an/llama.cpp.git ~/llama-mtp
cd ~/llama-mtp && git checkout mtp-clean
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90 -DBUILD_SHARED_LIBS=OFF
cmake --build build -j$(nproc)
For MTP + TurboQuant turbo3 KV (98K context), I merged both forks into NJannasch/llama.cpp mtp-turboquant:
git clone -b mtp-turboquant https://github.com/NJannasch/llama.cpp.git ~/llama-mtp
cd ~/llama-mtp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90 -DBUILD_SHARED_LIBS=OFF
cmake --build build -j$(nproc)
Fair warning: this merge was done with the help of an LLM (Claude Code resolving CUDA kernel conflicts). It works on my machine: RTX 5060 Ti, CUDA, Linux. The diff is ~24K lines, but ~15K of that is Metal/Vulkan/HIP backends I never touch. Use at your own risk.
You also need the -MTP model variant from Unsloth. The standard GGUF won’t work with --spec-type draft-mtp. On Blackwell GPUs, use -DCMAKE_CUDA_ARCHITECTURES=90. Forward compatibility handles the rest.
The Takeaway
| Config | Context | Burst t/s | At ~40K fill | At ~80K+ fill | Requires |
|---|---|---|---|---|---|
| MTP + turbo3 | 98K | 125 | 82 | 59 | Merged fork |
| MTP + q4_0 KV | 65K | 144 | 94 | — | MTP branch |
| Baseline q4_0 KV | 262K | 94 | 69 | 46 | Mainline llama.cpp |
| Baseline turbo3 | 262K | 89 | 41 | 25 | TurboQuant fork |
MTP only helps models with spare bandwidth headroom. Need 100K+ context? Skip MTP and use the baseline at 262K. For everything else, MTP at 65-98K context is the new default. When the MTP PR merges to mainline, this becomes a one-flag upgrade. Update: MTP is now in mainline — it’s a one-flag upgrade: --spec-type draft-mtp.
This is post 10 in the Fast AI, Real Risks series. Previous: code generation showdown, Gemma 4 vs Qwen 3.6, same prompt, four rounds.
Using this setup for more than benchmarks? I run the same Qwen 3.6 MTP server as a 24/7 autonomous agent with Hermes, including kernel-level sandboxing for the cron jobs.
The views and opinions expressed here are my own and do not reflect those of my employer.