Skip to main content
NJannasch.Dev
llama.cppGemmaMTPRTX 5060 Ti

Gemma 4 MTP Server (ik_llama.cpp)

Gemma 4 MTP uses a separate drafter model. Mainline llama.cpp doesn’t support this yet — requires ik_llama.cpp.

# Download models
hf download unsloth/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf --local-dir ~/.cache/llama.cpp/

hf download Radamanthys11/Gemma-4-26B-A4B-it-assistant-GGUF \
  gemma-4-26B-A4B-it-assistant-Q8_0.gguf --local-dir ~/.cache/llama.cpp/
# Run server
~/ik_llama.cpp/build/bin/llama-server \
  -m ~/.cache/llama.cpp/gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf \
  -md ~/.cache/llama.cpp/gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
  --spec-type mtp --draft-max 3 --draft-p-min 0.0 \
  -ngl 99 -ngld 99 \
  -c 32768 -ctk q4_0 -ctv q4_0 \
  -fa on --jinja -np 1 \
  --host 0.0.0.0 --port 8080

Key differences from Qwen MTP:

  • --spec-type mtp (not draft-mtp)
  • -md loads the separate drafter GGUF (441 MB at Q8_0)
  • -ngld 99 offloads drafter layers to GPU
  • --draft-p-min 0.0 verifies all drafts (ik_llama.cpp defaults to 0.8, which hurts code throughput)
  • Max 32K context (65K OOMs due to compute buffers)