Gemma 4 MTP vs Qwen 3.6: Same GPU, Different Speedups
TL;DR: Gemma 4 MTP works on the RTX 5060 Ti but doesn’t beat Qwen. Code generation hits 133 t/s (1.32x), general text 116 t/s (1.15x). Qwen 3.6 MTP still wins at 144 t/s (1.47x) with more context. The 441 MB drafter sounds lightweight, but the fork’s compute buffers eat the savings. Commands and benchmark data below.
Using multiple AI coding tools and losing track of sessions? I built VibeCockpit, one dashboard to search and resume sessions across Claude Code, Copilot, Codex, and more.
This is post 11 in the Fast AI, Real Risks series. Previous: MTP on Qwen 3.6 — 144 t/s, 1.47x speedup.
How Gemma 4 MTP Differs
Qwen 3.6 grafts MTP prediction heads directly into the model — one GGUF, --spec-type draft-mtp, done. Gemma 4 takes a different approach: Google trained a separate assistant model (~0.4B params, 441 MB at Q8_0) that consumes the target model’s hidden states and predicts the next token. You load both GGUFs together.
The separate drafter sounds like it should be lighter — 441 MB vs Qwen’s ~2.8 GB overhead. In practice, ik_llama.cpp’s additional compute buffers (~528 MB for the drafter) eat the savings, and context maxes out at 32K instead of Qwen’s 65-98K.
Mainline llama.cpp does not support Gemma 4 MTP yet. You need ik_llama.cpp (PR #1744, merged May 10, 2026).
The Benchmark
Same hardware as all previous posts: RTX 5060 Ti 16 GB, GMKtec mini PC, Proxmox VM. All benchmarks run on ik_llama.cpp (required for Gemma 4 MTP). Model: Gemma 4 26B-A4B IQ3_XXS (11.2 GB). Drafter: gemma-4-26B-A4B-it-assistant Q8_0 (441 MB). The same model runs at 99 t/s on mainline with 256K context — ik_llama.cpp’s baseline is ~101 t/s at burst, within margin.
| Config | General | Technical | Code | History | Context |
|---|---|---|---|---|---|
| Baseline (no MTP) | 101 t/s | 101 t/s | 101 t/s | 101 t/s | 256K |
| MTP draft-3, p-min 0.0 | 116 (59%) | 116 (59%) | 133 (73%) | 112 (56%) | 32K |
| MTP draft-2, p-min 0.0 | 114 (64%) | 124 (74%) | 133 (83%) | 115 (65%) | 32K |
| MTP draft-3, p-min 0.8 | 115 (69%) | 123 (74%) | 125 (76%) | 120 (74%) | 32K |
| MTP draft-8, p-min 0.0 | 73 (25%) | 80 (29%) | 90 (33%) | 64 (21%) | 32K |
| MTP draft-3, q8_0 KV, 16K | 112 (54%) | 113 (56%) | 127 (66%) | 113 (55%) | 16K |
Percentages are MTP acceptance rates. Higher draft-max values (8) collapse acceptance below 33% and actually make inference slower than the baseline. The sweet spot is draft-max 3 for code, draft-max 2 for mixed content.
vs Qwen 3.6 MTP
| Qwen 3.6 MoE | Gemma 4 MoE | |
|---|---|---|
| Baseline | 98 t/s | 101 t/s |
| Best MTP | 144 t/s (1.47x) | 133 t/s (1.32x) |
| Best acceptance | 81% | 73% |
| Max context with MTP | 65-98K | 32K |
| Drafter overhead | ~2.8 GB (built-in) | ~0.5 GB + 0.5 GB buffers |
| Requires | Mainline llama.cpp | ik_llama.cpp fork |
Qwen wins on every axis. The lower acceptance rate on Gemma 4 is likely due to the IQ3_XXS quantization — the drafter was trained on full-precision hidden states, and heavily quantized representations reduce prediction accuracy. Qwen’s built-in MTP heads were quantized together with the model weights, so they stay aligned.
Server Config
# Build ik_llama.cpp
git clone https://github.com/ikawrakow/ik_llama.cpp ~/ik_llama.cpp
cd ~/ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90 -DBUILD_SHARED_LIBS=OFF
cmake --build build -j$(nproc) --target llama-server
# Download models
hf download unsloth/gemma-4-26B-A4B-it-GGUF \
gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf --local-dir ~/.cache/llama.cpp/
hf download Radamanthys11/Gemma-4-26B-A4B-it-assistant-GGUF \
gemma-4-26B-A4B-it-assistant-Q8_0.gguf --local-dir ~/.cache/llama.cpp/
# Run (best config: draft-max 3 for code, 2 for mixed)
~/ik_llama.cpp/build/bin/llama-server \
-m ~/.cache/llama.cpp/gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf \
-md ~/.cache/llama.cpp/gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
--spec-type mtp --draft-max 3 --draft-p-min 0.0 \
-ngl 99 -ngld 99 \
-c 32768 -ctk q4_0 -ctv q4_0 \
-fa on --jinja -np 1 \
--host 0.0.0.0 --port 8080
Key flags: --spec-type mtp (not draft-mtp), -md loads the drafter GGUF, -ngld 99 offloads drafter to GPU, --draft-p-min 0.0 verifies all drafts (ik_llama.cpp defaults to 0.8 which hurts code throughput).
VRAM Budget
| Component | Size |
|---|---|
| Main model (IQ3_XXS) | 11.2 GB |
| Drafter (Q8_0) | 0.4 GB |
| KV cache (q4_0, 32K) | 1.98 GB |
| Compute buffers (main + drafter) | 1.6 GB |
| Total | ~15.2 GB |
65K context OOMs. The compute buffers are the surprise — ik_llama.cpp allocates ~528 MB for the drafter alone, plus ~1,040 MB for the main model. On 15.5 GB usable VRAM, 32K is the practical ceiling.
The Takeaway
Gemma 4 MTP works, but it’s not worth switching from Qwen 3.6 MTP on 16 GB VRAM. You trade 224K of context window for a smaller speedup, need a fork instead of mainline, and get lower acceptance rates due to quantization mismatch. The separate drafter architecture is elegant but doesn’t translate to a VRAM advantage in practice.
If you’re already running Gemma 4 for other reasons (quality, licensing, 256K context), MTP at draft-max 3 is a free 1.32x on code. Otherwise, Qwen 3.6 MTP remains the better choice.
Previous posts: MTP on Qwen 3.6, Gemma 4 context window, code generation showdown.
The views and opinions expressed here are my own and do not reflect those of my employer.