Does speculative decoding work with Qwen 3.6 on llama.cpp?

Draft-model speculative decoding still doesn't work due to the hybrid DeltaNet architecture. However, MTP (Multi-Token Prediction) speculative decoding now works via PR #22673 — the MoE model jumps from 98 to 144 t/s (1.47x speedup). Ngram-mod also works and gives a ~20% boost on repetitive code tasks.

MoE vs Dense model on 16GB GPU — which is faster?

MoE is significantly faster. Qwen 3.6 35B-A3B (MoE) runs at 98 t/s while the 27B dense runs at 31 t/s on the same RTX 5060 Ti. The MoE only activates 3B parameters per token, making it less bandwidth-bound.

What is the maximum context for Qwen 3.6 27B dense on 16GB?

Qwen 3.6 27B dense reaches 65K usable context on 16GB VRAM with parallel slots, or 131K context with a single slot (-np 1). The MoE model supports the full 262K on the same hardware.

Qwen 3.6 27B Dense on a 5060 Ti: Speculative Decoding, Ngram-Mod, and Why the MoE Still Wins

TL;DR: Qwen 3.6 27B dense runs at 31 t/s on my RTX 5060 Ti (16 GB) with up to 65K usable context (131K with -np 1). The 35B-A3B MoE hits 98 t/s with full 262K context. Ngram-mod gives a 20% boost to 34 t/s on repetitive code tasks (requires -np 2). Draft-model speculative decoding doesn’t work on these hybrid DeltaNet models yet, and using aggressively quantized Unsloth models adds another compatibility barrier on top. Higher quants (IQ4_XS) are actually slower and support less context than IQ3_XXS. The MoE’s speed advantage comes from the architecture itself, not a missing optimization. [Update: MTP speculative decoding now works — the MoE jumps to 144 t/s (1.47x), but the dense model gets 42% slower.]

Juggling sessions across Claude Code, Copilot, Codex, and local models? I built VibeCockpit — one dashboard to search and resume all your AI coding sessions.

I had the 35B MoE running at ~98 tokens per second and thought the 27B dense might be even better. It scores higher on every coding benchmark. Same VRAM budget, same GPU. The only problem: 31 t/s instead of 98.

Speculative decoding promises to fix exactly this. A small draft model guesses ahead, the large model verifies all guesses in one parallel pass, and you get 6x speedup for free. People on X/Twitter report 154 t/s on a 4090 with this technique. That would put the dense model well ahead of the MoE.

I tried to get it running on my 16 GB GPU. It took two days of patching source code, testing two llama.cpp forks, and fixing GGUF metadata before I understood why it does not work on this hardware.

Dense vs MoE: What Actually Happens Per Token

Before diving into speculative decoding, it helps to understand why these two models run at such different speeds on the same GPU.

27B Dense: All Parameters Fire

35B MoE: Router Picks 8 of 256

The dense model streams all 10.9 GB of weights through the GPU for every token. The MoE activates only ~3B of its 35B parameters per token, though all weights stay loaded in VRAM. Both models occupy similar memory. The MoE does far less math per token, which is why it runs 3.2x faster in practice (the theoretical parameter ratio is higher, but shared attention layers and routing overhead eat into the savings).

Both models use Qwen’s hybrid Gated DeltaNet architecture, alternating transformer attention layers with DeltaNet recurrent layers. That hybrid design turns out to be the key reason speculative decoding fails here.

The Numbers

Same hardware as the previous posts: headless GMKtec mini PC, RTX 5060 Ti 16GB over OCuLink, Proxmox VM, no desktop environment. llama.cpp b8929. Models from Unsloth (27B IQ3_XXS, 35B-A3B IQ3_S).

	27B Dense (IQ3_XXS)	35B MoE (IQ3_S)
Parameters active/token	27B (all)	~3B (8 of 256 experts)
VRAM (model weights)	10.9 GB	13.0 GB
Short generation	31 t/s	98 t/s
Medium (600 tokens)	29 t/s	95 t/s
Generation @ 20K ctx	28 t/s	90 t/s
Generation @ 50K ctx	23 t/s	70 t/s
Generation @ 108K ctx	18 t/s	53 t/s
Prefill (20K tokens)	893 t/s	1,637 t/s
Reliable max context	65K (np=2) / 131K (np=1)	262K
+ ngram-mod (creative prompt)	29 t/s (no change)	not tested
+ ngram-mod (repetitive code)	34 t/s (+17%)	not tested
+ ngram-mod (code refactoring)	34 t/s (+20%)	not tested

The MoE wins on raw speed. Ngram self-speculation (--spec-type ngram-mod) gives a 20% boost on the dense model when generating repetitive code (refactoring, adding similar methods). On creative prompts it does nothing since there are no patterns to match.

Two things to know: you need -np 2 for it to work at all on hybrid models (with -np 1 it silently disables), and aggressive quantization limits the gains. The 137 t/s result from Reddit was Q8_0 on 40 GB VRAM. With IQ3_XXS on 16 GB, 20% is the practical ceiling.

Does a Higher Quant Help?

I tested three quantization levels on the 27B dense to see if less aggressive quantization improves ngram-mod or unlocks better spec-dec. I also downloaded Qwen3.6-27B-IQ4_XS (14.4 GB) for a direct comparison.

Quant	Model Size	Creative	Code (ngram)	Refactor (ngram)	Max Context	VRAM at Max
IQ3_XXS	11.7 GB	28.6 t/s	33.5 t/s	34.4 t/s	65K (np=2) / 131K (np=1)	14,559 MiB (131K)
Q3_K_M (Qwen3.5)	13.0 GB	21.3 t/s	25.7 t/s	25.8 t/s	65K	~15,000 MiB
IQ4_XS	14.4 GB	24.4 t/s	29.2 t/s	29.9 t/s	16K	15,669 MiB

The counterintuitive result: IQ3_XXS is the best overall choice. It is faster per token (smaller model = faster streaming), gets the same ~20% ngram-mod boost, and supports 4x more context than IQ4_XS (65K vs 16K with ngram-mod, or 131K vs 16K without). The higher quants are slower because they are larger models that take longer to stream through the GPU per token, while the quality improvement doesn’t affect ngram-mod acceptance (all hit 100% on repetitive code). IQ4_XS fits on 16 GB but leaves only 400 MiB for KV cache, limiting context to 16K which is impractical for coding sessions.

How Speed Degrades as Context Fills

As the KV cache fills, attention computation grows and generation slows. Here is the 27B dense (IQ3_XXS) at different fill levels, measured with dedicated prompts of the exact token count:

Context Fill	Gen Speed	Prefill	Slowdown vs Short
Short (~500 tok)	31 t/s	893 t/s	baseline
20K tokens	28 t/s	893 t/s	-10%
50K tokens	23 t/s	777 t/s	-26%
108K tokens	18 t/s	646 t/s	-42%

At 108K context (a realistic limit for a large coding session), the 27B dense drops to 18 t/s. Still usable, but noticeably slower. The MoE drops by a similar percentage but from a much higher starting point: 53 t/s at 108K, still nearly 3x faster than the dense model’s peak.

With -np 2 (required for ngram-mod), the context is split between 2 slots: 65K per slot. For single long sessions that need the full context, use -np 1 (no ngram-mod) to get the full 131K. I initially tested 196K allocation and it loaded, but it OOMs during actual generation past ~50K fill. 131K is the reliable maximum for the dense model on 16 GB.

Copy-Paste Server Configuration

If you just want to run these models, here are the commands. The “why” for each flag is below.

27B Dense (131K context, 65K per slot with ngram-mod):

llama-server \
  -m Qwen3.6-27B-UD-IQ3_XXS.gguf \
  -ngl 99 -c 131072 \
  -fa on -ctk q4_0 -ctv q4_0 \
  --spec-type ngram-mod --spec-ngram-size-n 24 \
  --draft-min 12 --draft-max 48 \
  --context-shift --cache-reuse 512 \
  --no-mmap -np 2 -t 6 --jinja \
  --host 0.0.0.0 --port 11433

35B MoE (262K context, 98 t/s):

llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
  -ngl 99 -c 262144 \
  -fa on -ctk q4_0 -ctv q4_0 \
  --context-shift --cache-reuse 512 \
  --no-mmap -np 1 -t 6 --jinja \
  --host 0.0.0.0 --port 11433

Key flags:

-fa on : Flash Attention. Non-negotiable for performance.
-ctk q4_0 -ctv q4_0 : KV cache quantization. Do not use q5_1 which causes a 14x prefill slowdown.
-np 2 : Required for ngram-mod on hybrid models. With -np 1, the recurrent state has only 1 cell, which silently disables all speculative decoding.
--context-shift : Graceful handling when hitting the context limit instead of a hard error.
--cache-reuse 512 : Reuses KV cache across similar prompts, saving prefill time.
--no-mmap : Keeps weights fully in VRAM instead of memory-mapping from disk.

Models: Qwen3.6-27B-UD-IQ3_XXS and Qwen3.6-35B-A3B-UD-IQ3_S from Unsloth. For the initial hardware setup, see my Qwen 3.5 post.

The Speculative Decoding Attempt

The promise: run a tiny 0.8B draft model to guess the next 12 tokens, verify all guesses in one forward pass. At 85% acceptance, that is ~6 tokens per expensive pass. On paper: 31 t/s x 6 = 186 t/s.

I tested every approach across mainline llama.cpp, a patched version, and ik_llama.cpp.

Approach	Speed	Acceptance	Problem
Baseline (no spec-dec)	31 t/s	N/A	Reference
Draft via llama-server	22 t/s	”100%” (fake)	Server doesn’t batch-verify
Patched llama-speculative	”93 t/s”	1%	Garbage output (DeltaNet corruption)
ik_llama.cpp + fixed draft	7 t/s	1.6%	4x slower than baseline
ik_llama.cpp + autotune	19 t/s	26%	Still slower than baseline
Ngram-mod (creative)	29 t/s	0 drafts	No patterns to match
Ngram-mod (repetitive code)	34 t/s	100%	+20% on code tasks
35B MoE (no tricks needed)	98 t/s	N/A	Just works

No draft-model approach beat the baseline. The only win was ngram-mod on repetitive code (+20%), but it does nothing for general-purpose prompts.

Why Draft-Model Spec-Dec Fails Here

Three problems stack on top of each other. First, Qwen3.5/3.6’s hybrid DeltaNet architecture can’t cleanly roll back recurrent state when draft tokens get rejected. ik_llama.cpp works around this with 149 MB GPU checkpoints per step, but the overhead makes it net negative.

Second, there is no small Qwen3.6 model to use as a draft. The Qwen3.5-0.8B shares the tokenizer but has mismatched BOS metadata and different training data. Even after fixing the metadata with gguf-new-metadata, acceptance stayed below 4%.

Third, aggressive quantization breaks alignment. IQ3_XXS preserves 97% of benchmark quality but shifts token distributions enough that draft models can’t predict the output. Community reports confirm that Unsloth quants trigger constant ngram-mod hash pool resets. The 137 t/s Reddit result was Q8_0 on 40 GB. On 16 GB, Q4_K_M doesn’t fit alongside a draft model.

What Is Coming

PR	What	Why It Matters
#18039	EAGLE3 speculative decoding	Closest to merge, ggerganov collaborating
#22105	DFlash block-diffusion drafting	Reads target hidden states, up to 8x speedup. Qwen3.6-27B drafter exists (FP16 only, no quantized version yet)
#20700	Native MTP for Qwen3.5 dense	Uses built-in prediction head, no separate draft model
#22673	MTP for Qwen 3.6 (working!)	1.47x speedup on MoE. Full benchmarks here.
#18886	Generic MTP API for libllama	Official framework for all MTP models

DFlash is the most promising for hybrid models. Instead of an independent draft model that has to match the target’s token distribution, DFlash uses a lightweight diffusion model conditioned on the target’s hidden states. When EAGLE3 merges (DFlash depends on it), I will rerun these benchmarks. Update: MTP speculative decoding now works on the MoE model — 98 to 144 t/s — but makes the dense model 42% slower. Both are bandwidth-bound, but the MoE has spare headroom while the dense model is near saturation.

The Takeaway

I started this experiment expecting to make the dense 27B faster than the MoE through speculative decoding. Instead I learned that the MoE’s speed advantage is architectural, not a missing optimization. The 35B-A3B activates only ~3B of its 35B parameters per token by routing through 8 of 256 experts, giving it a 3.2x speed advantage on the same hardware. Speculative decoding tries to achieve something similar through software (verify multiple tokens per forward pass), but the MoE gets it for free through model design.

On 16 GB, the choice comes down to what matters more: output quality (27B dense at 31 t/s, 65-131K context) or speed plus context (35B MoE at 98 t/s with 262K context). For daily coding use, the MoE wins. For the hardest reasoning tasks where every percentage point of quality matters, the dense model is worth the wait.

This is a fast-moving space. EAGLE3, DFlash, and native MTP support are all in active development, and the community is actively experimenting with new approaches on r/LocalLLaMA. I will keep an eye on whether anyone cracks speculative decoding for dense hybrid models on consumer GPUs and update this post when the tooling catches up.

This is post 7 in the Fast AI, Real Risks series. The previous post pushed the MoE to 400K context with TurboQuant. Next up: a code generation showdown — Gemma 4 vs Qwen 3.6, same prompt, one shot, visual results.

The views and opinions expressed here are my own and do not reflect those of my employer.