Qwen 3.6 27B Dense on a 5060 Ti: Speculative Decoding, Ngram-Mod, and Why the MoE Still Wins
TL;DR: Qwen 3.6 27B dense runs at 31 t/s on my RTX 5060 Ti (16 GB) with up to 65K usable context (131K with
-np 1). The 35B-A3B MoE hits 98 t/s with full 262K context. Ngram-mod gives a 20% boost to 34 t/s on repetitive code tasks (requires-np 2). Draft-model speculative decoding doesn’t work on these hybrid DeltaNet models yet, and using aggressively quantized Unsloth models adds another compatibility barrier on top. Higher quants (IQ4_XS) are actually slower and support less context than IQ3_XXS. The MoE’s speed advantage comes from the architecture itself, not a missing optimization. [Update: MTP speculative decoding now works — the MoE jumps to 144 t/s (1.47x), but the dense model gets 42% slower.]
Juggling sessions across Claude Code, Copilot, Codex, and local models? I built VibeCockpit — one dashboard to search and resume all your AI coding sessions.
I had the 35B MoE running at ~98 tokens per second and thought the 27B dense might be even better. It scores higher on every coding benchmark. Same VRAM budget, same GPU. The only problem: 31 t/s instead of 98.
Speculative decoding promises to fix exactly this. A small draft model guesses ahead, the large model verifies all guesses in one parallel pass, and you get 6x speedup for free. People on X/Twitter report 154 t/s on a 4090 with this technique. That would put the dense model well ahead of the MoE.
I tried to get it running on my 16 GB GPU. It took two days of patching source code, testing two llama.cpp forks, and fixing GGUF metadata before I understood why it does not work on this hardware.
Dense vs MoE: What Actually Happens Per Token
Before diving into speculative decoding, it helps to understand why these two models run at such different speeds on the same GPU.
27B Dense: All Parameters Fire
35B MoE: Router Picks 8 of 256
The dense model streams all 10.9 GB of weights through the GPU for every token. The MoE activates only ~3B of its 35B parameters per token, though all weights stay loaded in VRAM. Both models occupy similar memory. The MoE does far less math per token, which is why it runs 3.2x faster in practice (the theoretical parameter ratio is higher, but shared attention layers and routing overhead eat into the savings).
Both models use Qwen’s hybrid Gated DeltaNet architecture, alternating transformer attention layers with DeltaNet recurrent layers. That hybrid design turns out to be the key reason speculative decoding fails here.
The Numbers
Same hardware as the previous posts: headless GMKtec mini PC, RTX 5060 Ti 16GB over OCuLink, Proxmox VM, no desktop environment. llama.cpp b8929. Models from Unsloth (27B IQ3_XXS, 35B-A3B IQ3_S).
| 27B Dense (IQ3_XXS) | 35B MoE (IQ3_S) | |
|---|---|---|
| Parameters active/token | 27B (all) | ~3B (8 of 256 experts) |
| VRAM (model weights) | 10.9 GB | 13.0 GB |
| Short generation | 31 t/s | 98 t/s |
| Medium (600 tokens) | 29 t/s | 95 t/s |
| Generation @ 20K ctx | 28 t/s | 90 t/s |
| Generation @ 50K ctx | 23 t/s | 70 t/s |
| Generation @ 108K ctx | 18 t/s | 53 t/s |
| Prefill (20K tokens) | 893 t/s | 1,637 t/s |
| Reliable max context | 65K (np=2) / 131K (np=1) | 262K |
| + ngram-mod (creative prompt) | 29 t/s (no change) | not tested |
| + ngram-mod (repetitive code) | 34 t/s (+17%) | not tested |
| + ngram-mod (code refactoring) | 34 t/s (+20%) | not tested |
The MoE wins on raw speed. Ngram self-speculation (--spec-type ngram-mod) gives a 20% boost on the dense model when generating repetitive code (refactoring, adding similar methods). On creative prompts it does nothing since there are no patterns to match.
Two things to know: you need -np 2 for it to work at all on hybrid models (with -np 1 it silently disables), and aggressive quantization limits the gains. The 137 t/s result from Reddit was Q8_0 on 40 GB VRAM. With IQ3_XXS on 16 GB, 20% is the practical ceiling.
Does a Higher Quant Help?
I tested three quantization levels on the 27B dense to see if less aggressive quantization improves ngram-mod or unlocks better spec-dec. I also downloaded Qwen3.6-27B-IQ4_XS (14.4 GB) for a direct comparison.
| Quant | Model Size | Creative | Code (ngram) | Refactor (ngram) | Max Context | VRAM at Max |
|---|---|---|---|---|---|---|
| IQ3_XXS | 11.7 GB | 28.6 t/s | 33.5 t/s | 34.4 t/s | 65K (np=2) / 131K (np=1) | 14,559 MiB (131K) |
| Q3_K_M (Qwen3.5) | 13.0 GB | 21.3 t/s | 25.7 t/s | 25.8 t/s | 65K | ~15,000 MiB |
| IQ4_XS | 14.4 GB | 24.4 t/s | 29.2 t/s | 29.9 t/s | 16K | 15,669 MiB |
The counterintuitive result: IQ3_XXS is the best overall choice. It is faster per token (smaller model = faster streaming), gets the same ~20% ngram-mod boost, and supports 4x more context than IQ4_XS (65K vs 16K with ngram-mod, or 131K vs 16K without). The higher quants are slower because they are larger models that take longer to stream through the GPU per token, while the quality improvement doesn’t affect ngram-mod acceptance (all hit 100% on repetitive code). IQ4_XS fits on 16 GB but leaves only 400 MiB for KV cache, limiting context to 16K which is impractical for coding sessions.
How Speed Degrades as Context Fills
As the KV cache fills, attention computation grows and generation slows. Here is the 27B dense (IQ3_XXS) at different fill levels, measured with dedicated prompts of the exact token count:
| Context Fill | Gen Speed | Prefill | Slowdown vs Short |
|---|---|---|---|
| Short (~500 tok) | 31 t/s | 893 t/s | baseline |
| 20K tokens | 28 t/s | 893 t/s | -10% |
| 50K tokens | 23 t/s | 777 t/s | -26% |
| 108K tokens | 18 t/s | 646 t/s | -42% |
At 108K context (a realistic limit for a large coding session), the 27B dense drops to 18 t/s. Still usable, but noticeably slower. The MoE drops by a similar percentage but from a much higher starting point: 53 t/s at 108K, still nearly 3x faster than the dense model’s peak.
With -np 2 (required for ngram-mod), the context is split between 2 slots: 65K per slot. For single long sessions that need the full context, use -np 1 (no ngram-mod) to get the full 131K. I initially tested 196K allocation and it loaded, but it OOMs during actual generation past ~50K fill. 131K is the reliable maximum for the dense model on 16 GB.
Copy-Paste Server Configuration
If you just want to run these models, here are the commands. The “why” for each flag is below.
27B Dense (131K context, 65K per slot with ngram-mod):
llama-server \
-m Qwen3.6-27B-UD-IQ3_XXS.gguf \
-ngl 99 -c 131072 \
-fa on -ctk q4_0 -ctv q4_0 \
--spec-type ngram-mod --spec-ngram-size-n 24 \
--draft-min 12 --draft-max 48 \
--context-shift --cache-reuse 512 \
--no-mmap -np 2 -t 6 --jinja \
--host 0.0.0.0 --port 11433
35B MoE (262K context, 98 t/s):
llama-server \
-m Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
-ngl 99 -c 262144 \
-fa on -ctk q4_0 -ctv q4_0 \
--context-shift --cache-reuse 512 \
--no-mmap -np 1 -t 6 --jinja \
--host 0.0.0.0 --port 11433
Key flags:
-fa on: Flash Attention. Non-negotiable for performance.-ctk q4_0 -ctv q4_0: KV cache quantization. Do not use q5_1 which causes a 14x prefill slowdown.-np 2: Required for ngram-mod on hybrid models. With-np 1, the recurrent state has only 1 cell, which silently disables all speculative decoding.--context-shift: Graceful handling when hitting the context limit instead of a hard error.--cache-reuse 512: Reuses KV cache across similar prompts, saving prefill time.--no-mmap: Keeps weights fully in VRAM instead of memory-mapping from disk.
Models: Qwen3.6-27B-UD-IQ3_XXS and Qwen3.6-35B-A3B-UD-IQ3_S from Unsloth. For the initial hardware setup, see my Qwen 3.5 post.
The Speculative Decoding Attempt
The promise: run a tiny 0.8B draft model to guess the next 12 tokens, verify all guesses in one forward pass. At 85% acceptance, that is ~6 tokens per expensive pass. On paper: 31 t/s x 6 = 186 t/s.
I tested every approach across mainline llama.cpp, a patched version, and ik_llama.cpp.
| Approach | Speed | Acceptance | Problem |
|---|---|---|---|
| Baseline (no spec-dec) | 31 t/s | N/A | Reference |
| Draft via llama-server | 22 t/s | ”100%” (fake) | Server doesn’t batch-verify |
| Patched llama-speculative | ”93 t/s” | 1% | Garbage output (DeltaNet corruption) |
| ik_llama.cpp + fixed draft | 7 t/s | 1.6% | 4x slower than baseline |
| ik_llama.cpp + autotune | 19 t/s | 26% | Still slower than baseline |
| Ngram-mod (creative) | 29 t/s | 0 drafts | No patterns to match |
| Ngram-mod (repetitive code) | 34 t/s | 100% | +20% on code tasks |
| 35B MoE (no tricks needed) | 98 t/s | N/A | Just works |
No draft-model approach beat the baseline. The only win was ngram-mod on repetitive code (+20%), but it does nothing for general-purpose prompts.
Why Draft-Model Spec-Dec Fails Here
Three problems stack on top of each other. First, Qwen3.5/3.6’s hybrid DeltaNet architecture can’t cleanly roll back recurrent state when draft tokens get rejected. ik_llama.cpp works around this with 149 MB GPU checkpoints per step, but the overhead makes it net negative.
Second, there is no small Qwen3.6 model to use as a draft. The Qwen3.5-0.8B shares the tokenizer but has mismatched BOS metadata and different training data. Even after fixing the metadata with gguf-new-metadata, acceptance stayed below 4%.
Third, aggressive quantization breaks alignment. IQ3_XXS preserves 97% of benchmark quality but shifts token distributions enough that draft models can’t predict the output. Community reports confirm that Unsloth quants trigger constant ngram-mod hash pool resets. The 137 t/s Reddit result was Q8_0 on 40 GB. On 16 GB, Q4_K_M doesn’t fit alongside a draft model.
What Is Coming
| PR | What | Why It Matters |
|---|---|---|
| #18039 | EAGLE3 speculative decoding | Closest to merge, ggerganov collaborating |
| #22105 | DFlash block-diffusion drafting | Reads target hidden states, up to 8x speedup. Qwen3.6-27B drafter exists (FP16 only, no quantized version yet) |
| #20700 | Native MTP for Qwen3.5 dense | Uses built-in prediction head, no separate draft model |
| #22673 | MTP for Qwen 3.6 (working!) | 1.47x speedup on MoE. Full benchmarks here. |
| #18886 | Generic MTP API for libllama | Official framework for all MTP models |
DFlash is the most promising for hybrid models. Instead of an independent draft model that has to match the target’s token distribution, DFlash uses a lightweight diffusion model conditioned on the target’s hidden states. When EAGLE3 merges (DFlash depends on it), I will rerun these benchmarks. Update: MTP speculative decoding now works on the MoE model — 98 to 144 t/s — but makes the dense model 42% slower. Both are bandwidth-bound, but the MoE has spare headroom while the dense model is near saturation.
The Takeaway
I started this experiment expecting to make the dense 27B faster than the MoE through speculative decoding. Instead I learned that the MoE’s speed advantage is architectural, not a missing optimization. The 35B-A3B activates only ~3B of its 35B parameters per token by routing through 8 of 256 experts, giving it a 3.2x speed advantage on the same hardware. Speculative decoding tries to achieve something similar through software (verify multiple tokens per forward pass), but the MoE gets it for free through model design.
On 16 GB, the choice comes down to what matters more: output quality (27B dense at 31 t/s, 65-131K context) or speed plus context (35B MoE at 98 t/s with 262K context). For daily coding use, the MoE wins. For the hardest reasoning tasks where every percentage point of quality matters, the dense model is worth the wait.
This is a fast-moving space. EAGLE3, DFlash, and native MTP support are all in active development, and the community is actively experimenting with new approaches on r/LocalLLaMA. I will keep an eye on whether anyone cracks speculative decoding for dense hybrid models on consumer GPUs and update this post when the tooling catches up.
This is post 7 in the Fast AI, Real Risks series. The previous post pushed the MoE to 400K context with TurboQuant. Next up: a code generation showdown — Gemma 4 vs Qwen 3.6, same prompt, one shot, visual results.
The views and opinions expressed here are my own and do not reflect those of my employer.