Does MTP speculative decoding work with Qwen 3.6 on llama.cpp?

Yes, but only on MoE models. Qwen 3.6-35B-A3B goes from 98 to 144 t/s (1.47x) with MTP draft-2 on an RTX 5060 Ti. The dense 27B model gets 42% slower because it uses ~76% of available bandwidth. MTP's extra weight reads create contention instead of savings.

What is MTP speculative decoding in llama.cpp?

Multi-Token Prediction (MTP) uses a prediction head built into the model itself, no separate draft model needed. The model predicts 2-3 tokens ahead, then verifies them in one pass. It works via --spec-type draft-mtp in the llama.cpp MTP branch.

What is the maximum context for MTP on a 16GB GPU?

With Qwen 3.6-35B-A3B MoE at IQ3_S quantization and q4_0 KV cache, the maximum stable context with MTP is 65K on an RTX 5060 Ti. Combined with TurboQuant turbo3 KV cache, MTP can reach 98K context at 125 t/s. The MTP prediction head adds ~2.8 GB VRAM overhead, reducing available context compared to the 262K baseline.

MTP Speculative Decoding Actually Works on MoE: 144 t/s on a 16GB GPU

TL;DR: MTP speculative decoding is the first technique that actually speeds up the Qwen 3.6 MoE on my RTX 5060 Ti. The 35B-A3B jumps from 98 to 144 t/s (1.47x). Combined with TurboQuant turbo3 KV cache, the sweet spot is 125 t/s at 98K context, still 2.4x faster than the baseline at 80K fill. The dense 27B gets 42% slower. Copy-paste server configs and benchmark data below.

Update (May 22, 2026): MTP has merged to mainline llama.cpp. The TurboQuant fork also pulled in upstream with MTP on its feature/turboquant-kv-cache branch. The custom mtp-turboquant fork I built for this post is no longer needed. Performance is unchanged on the newer build (b9418 vs b9320): 127 t/s burst, same 70% MTP acceptance, 98K context still fits on 16 GB. One note: the new TurboQuant build auto-upgrades K cache from turbo3 to q8_0 for GQA models — set TURBO_AUTO_ASYMMETRIC=0 to keep turbo3 and preserve your VRAM headroom.

Using multiple AI coding tools and losing track of sessions? I built VibeCockpit, one dashboard to search and resume sessions across Claude Code, Copilot, Codex, and more.

In my previous speculative decoding post, every approach either failed or made things slower. Draft models couldn’t handle the hybrid DeltaNet architecture. MTP changes the equation: instead of a separate draft model, Qwen 3.6 ships with a built-in prediction head that predicts 2-3 tokens ahead using the model’s own hidden states. No architecture mismatch, no alignment problems.

The Benchmark

Same hardware as all previous posts: RTX 5060 Ti 16 GB, GMKtec mini PC, Proxmox VM. MTP branch built from am17an/llama.cpp mtp-clean (PR #22673). MTP model variants from Unsloth.

MoE 35B-A3B (IQ3_S, 14 GB)

Config	Context	Peak t/s	Speedup	Acceptance
Baseline q4_0 KV (no MTP)	262K	98	1.0x	—
MTP draft-2 + q4_0 KV	65K	144	1.47x	81%
MTP draft-2 + turbo3 KV	98K	125	1.28x	69%

All configs use quantized KV cache. MTP’s prediction head adds ~2.8 GB VRAM overhead, so context drops from 262K to 65K (q4_0) or 98K (turbo3). Both are practical sweet spots.

How Speed Degrades With Context

Generation speed drops as context fills up. Here’s how each config holds up under load:

Prompt Fill	Baseline q4_0 (262K)	Baseline turbo3 (262K)	MTP + q4_0 (65K)	MTP + turbo3 (98K)
Burst	94 t/s	89 t/s	144 t/s	125 t/s
~10K tokens	84 t/s	68 t/s	123 t/s	112 t/s
~40K tokens	69 t/s	41 t/s	94 t/s	82 t/s
~55K tokens	62 t/s	34 t/s	82 t/s	70 t/s
~80K+ tokens	46 t/s	25 t/s	—	59 t/s

MTP is faster at every context depth. The q4_0 baseline holds up better than turbo3, partly because turbo3’s Walsh-Hadamard dequantization costs more compute per attention op, and partly because the TurboQuant fork is based on a slightly older llama.cpp (April 18 vs April 25). The MTP + turbo3 build doesn’t suffer as much since it’s based on the newer MTP branch (May 14) with TurboQuant patched on top.

Dense 27B: MTP Makes It Worse

Config	Peak t/s	Speedup
Baseline (no MTP)	28.5	1.0x
MTP draft-2	16.4	0.58x

42% slower. Both models are bandwidth-bound at batch size 1, but they consume very different amounts of bandwidth. The MoE activates only ~3B of 35B params per token, roughly 1.2 GB of weight reads at ~26% of the 5060 Ti’s 448 GB/s bandwidth. Plenty of headroom for MTP’s overhead. The dense model reads all 12 GB every token. At 28.5 t/s that’s ~342 GB/s, or ~76% of available bandwidth. MTP’s extra weight reads tip it into contention.

Server Configs

MTP-only (MTP branch, q4_0 KV, 65K context, 121 t/s):

~/llama-mtp/build/bin/llama-server \
  -m Qwen3.6-35B-A3B-MTP-UD-IQ3_S.gguf \
  -ngl 99 -c 65536 \
  -ctk q4_0 -ctv q4_0 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -np 1 --host 0.0.0.0 --port 11433

MTP + TurboQuant (merged fork, turbo3 KV, 98K context, 125 t/s):

TURBO_AUTO_ASYMMETRIC=0 ~/llama-mtp/build/bin/llama-server \
  -m Qwen3.6-35B-A3B-MTP-UD-IQ3_S.gguf \
  -ngl 99 -c 98304 \
  -ctk turbo3 -ctv turbo3 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -np 1 --host 0.0.0.0 --port 11433

TURBO_AUTO_ASYMMETRIC=0 prevents the TurboQuant build from auto-upgrading K cache from turbo3 to q8_0 on GQA models, which would consume too much VRAM for the 98K context window to fit.

Key flags: --spec-type draft-mtp (not mtp), --spec-draft-n-max 2 (draft-3 drops acceptance from 81% to 70% with no speed gain), -np 1 (required, no parallel slots with MTP yet). First request after startup is slow due to CUDA graph compilation; send a throwaway prompt to warm up.

Building It

~~MTP support hasn’t merged to mainline yet.~~ Update: MTP merged to mainline in May 2026. The build instructions below still work, but you can now use mainline llama.cpp directly. For MTP-only with q4_0 KV:

git clone https://github.com/am17an/llama.cpp.git ~/llama-mtp
cd ~/llama-mtp && git checkout mtp-clean

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90 -DBUILD_SHARED_LIBS=OFF
cmake --build build -j$(nproc)

For MTP + TurboQuant turbo3 KV (98K context), I merged both forks into NJannasch/llama.cpp mtp-turboquant:

git clone -b mtp-turboquant https://github.com/NJannasch/llama.cpp.git ~/llama-mtp
cd ~/llama-mtp

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90 -DBUILD_SHARED_LIBS=OFF
cmake --build build -j$(nproc)

Fair warning: this merge was done with the help of an LLM (Claude Code resolving CUDA kernel conflicts). It works on my machine: RTX 5060 Ti, CUDA, Linux. The diff is ~24K lines, but ~15K of that is Metal/Vulkan/HIP backends I never touch. Use at your own risk.

You also need the -MTP model variant from Unsloth. The standard GGUF won’t work with --spec-type draft-mtp. On Blackwell GPUs, use -DCMAKE_CUDA_ARCHITECTURES=90. Forward compatibility handles the rest.

The Takeaway

Config	Context	Burst t/s	At ~40K fill	At ~80K+ fill	Requires
MTP + turbo3	98K	125	82	59	Merged fork
MTP + q4_0 KV	65K	144	94	—	MTP branch
Baseline q4_0 KV	262K	94	69	46	Mainline llama.cpp
Baseline turbo3	262K	89	41	25	TurboQuant fork

MTP only helps models with spare bandwidth headroom. Need 100K+ context? Skip MTP and use the baseline at 262K. For everything else, MTP at 65-98K context is the new default. ~~When the MTP PR merges to mainline, this becomes a one-flag upgrade.~~ Update: MTP is now in mainline — it’s a one-flag upgrade: --spec-type draft-mtp.

This is post 10 in the Fast AI, Real Risks series. Previous: code generation showdown, Gemma 4 vs Qwen 3.6, same prompt, four rounds.

Using this setup for more than benchmarks? I run the same Qwen 3.6 MTP server as a 24/7 autonomous agent with Hermes, including kernel-level sandboxing for the cron jobs.

The views and opinions expressed here are my own and do not reflect those of my employer.