Does Gemma 4 MTP speculative decoding work on a 16GB GPU?

Yes. Gemma 4 26B-A4B MoE at IQ3_XXS plus the Q8_0 drafter (441 MB) fits on an RTX 5060 Ti with 32K context. It reaches 133 t/s on code generation (1.32x over baseline) and 116 t/s on general text (1.15x). 65K context OOMs due to ik_llama.cpp's large compute buffers.

Gemma 4 MTP vs Qwen 3.6 MTP — which is faster?

Qwen 3.6 MTP is faster. Qwen hits 144 t/s (1.47x) with 81% acceptance and 65-98K context. Gemma 4 peaks at 133 t/s (1.32x) with 73% acceptance and only 32K context. Qwen's built-in MTP heads also work on mainline llama.cpp, while Gemma 4 requires the ik_llama.cpp fork.

How does Gemma 4 MTP differ from Qwen MTP?

Qwen 3.6 grafts MTP prediction heads directly into the model GGUF (~2.8 GB overhead). Gemma 4 uses a separate 441 MB drafter model loaded alongside the main model. Despite the smaller drafter, Gemma 4 MTP uses more total VRAM due to ik_llama.cpp's additional compute buffers.

Gemma 4 MTP vs Qwen 3.6: Same GPU, Different Speedups

TL;DR: Gemma 4 MTP works on the RTX 5060 Ti but doesn’t beat Qwen. Code generation hits 133 t/s (1.32x), general text 116 t/s (1.15x). Qwen 3.6 MTP still wins at 144 t/s (1.47x) with more context. The 441 MB drafter sounds lightweight, but the fork’s compute buffers eat the savings. Commands and benchmark data below.

Using multiple AI coding tools and losing track of sessions? I built VibeCockpit, one dashboard to search and resume sessions across Claude Code, Copilot, Codex, and more.

This is post 11 in the Fast AI, Real Risks series. Previous: MTP on Qwen 3.6 — 144 t/s, 1.47x speedup.

How Gemma 4 MTP Differs

Qwen 3.6 grafts MTP prediction heads directly into the model — one GGUF, --spec-type draft-mtp, done. Gemma 4 takes a different approach: Google trained a separate assistant model (~0.4B params, 441 MB at Q8_0) that consumes the target model’s hidden states and predicts the next token. You load both GGUFs together.

The separate drafter sounds like it should be lighter — 441 MB vs Qwen’s ~2.8 GB overhead. In practice, ik_llama.cpp’s additional compute buffers (~528 MB for the drafter) eat the savings, and context maxes out at 32K instead of Qwen’s 65-98K.

Mainline llama.cpp does not support Gemma 4 MTP yet. You need ik_llama.cpp (PR #1744, merged May 10, 2026).

The Benchmark

Same hardware as all previous posts: RTX 5060 Ti 16 GB, GMKtec mini PC, Proxmox VM. All benchmarks run on ik_llama.cpp (required for Gemma 4 MTP). Model: Gemma 4 26B-A4B IQ3_XXS (11.2 GB). Drafter: gemma-4-26B-A4B-it-assistant Q8_0 (441 MB). The same model runs at 99 t/s on mainline with 256K context — ik_llama.cpp’s baseline is ~101 t/s at burst, within margin.

Config	General	Technical	Code	History	Context
Baseline (no MTP)	101 t/s	101 t/s	101 t/s	101 t/s	256K
MTP draft-3, p-min 0.0	116 (59%)	116 (59%)	133 (73%)	112 (56%)	32K
MTP draft-2, p-min 0.0	114 (64%)	124 (74%)	133 (83%)	115 (65%)	32K
MTP draft-3, p-min 0.8	115 (69%)	123 (74%)	125 (76%)	120 (74%)	32K
MTP draft-8, p-min 0.0	73 (25%)	80 (29%)	90 (33%)	64 (21%)	32K
MTP draft-3, q8_0 KV, 16K	112 (54%)	113 (56%)	127 (66%)	113 (55%)	16K

Percentages are MTP acceptance rates. Higher draft-max values (8) collapse acceptance below 33% and actually make inference slower than the baseline. The sweet spot is draft-max 3 for code, draft-max 2 for mixed content.

vs Qwen 3.6 MTP

	Qwen 3.6 MoE	Gemma 4 MoE
Baseline	98 t/s	101 t/s
Best MTP	144 t/s (1.47x)	133 t/s (1.32x)
Best acceptance	81%	73%
Max context with MTP	65-98K	32K
Drafter overhead	~2.8 GB (built-in)	~0.5 GB + 0.5 GB buffers
Requires	Mainline llama.cpp	ik_llama.cpp fork

Qwen wins on every axis. The lower acceptance rate on Gemma 4 is likely due to the IQ3_XXS quantization — the drafter was trained on full-precision hidden states, and heavily quantized representations reduce prediction accuracy. Qwen’s built-in MTP heads were quantized together with the model weights, so they stay aligned.

Server Config

# Build ik_llama.cpp
git clone https://github.com/ikawrakow/ik_llama.cpp ~/ik_llama.cpp
cd ~/ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90 -DBUILD_SHARED_LIBS=OFF
cmake --build build -j$(nproc) --target llama-server

# Download models
hf download unsloth/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf --local-dir ~/.cache/llama.cpp/

hf download Radamanthys11/Gemma-4-26B-A4B-it-assistant-GGUF \
  gemma-4-26B-A4B-it-assistant-Q8_0.gguf --local-dir ~/.cache/llama.cpp/

# Run (best config: draft-max 3 for code, 2 for mixed)
~/ik_llama.cpp/build/bin/llama-server \
  -m ~/.cache/llama.cpp/gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf \
  -md ~/.cache/llama.cpp/gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
  --spec-type mtp --draft-max 3 --draft-p-min 0.0 \
  -ngl 99 -ngld 99 \
  -c 32768 -ctk q4_0 -ctv q4_0 \
  -fa on --jinja -np 1 \
  --host 0.0.0.0 --port 8080

Key flags: --spec-type mtp (not draft-mtp), -md loads the drafter GGUF, -ngld 99 offloads drafter to GPU, --draft-p-min 0.0 verifies all drafts (ik_llama.cpp defaults to 0.8 which hurts code throughput).

VRAM Budget

Component	Size
Main model (IQ3_XXS)	11.2 GB
Drafter (Q8_0)	0.4 GB
KV cache (q4_0, 32K)	1.98 GB
Compute buffers (main + drafter)	1.6 GB
Total	~15.2 GB

65K context OOMs. The compute buffers are the surprise — ik_llama.cpp allocates ~528 MB for the drafter alone, plus ~1,040 MB for the main model. On 15.5 GB usable VRAM, 32K is the practical ceiling.

The Takeaway

Gemma 4 MTP works, but it’s not worth switching from Qwen 3.6 MTP on 16 GB VRAM. You trade 224K of context window for a smaller speedup, need a fork instead of mainline, and get lower acceptance rates due to quantization mismatch. The separate drafter architecture is elegant but doesn’t translate to a VRAM advantage in practice.

If you’re already running Gemma 4 for other reasons (quality, licensing, 256K context), MTP at draft-max 3 is a free 1.32x on code. Otherwise, Qwen 3.6 MTP remains the better choice.

Previous posts: MTP on Qwen 3.6, Gemma 4 context window, code generation showdown.

The views and opinions expressed here are my own and do not reflect those of my employer.