llama.cppGemmaMTPRTX 5060 Ti
Gemma 4 MTP Server (ik_llama.cpp)
Gemma 4 MTP uses a separate drafter model. Mainline llama.cpp doesn’t support this yet — requires ik_llama.cpp.
# Download models
hf download unsloth/gemma-4-26B-A4B-it-GGUF \
gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf --local-dir ~/.cache/llama.cpp/
hf download Radamanthys11/Gemma-4-26B-A4B-it-assistant-GGUF \
gemma-4-26B-A4B-it-assistant-Q8_0.gguf --local-dir ~/.cache/llama.cpp/
# Run server
~/ik_llama.cpp/build/bin/llama-server \
-m ~/.cache/llama.cpp/gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf \
-md ~/.cache/llama.cpp/gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
--spec-type mtp --draft-max 3 --draft-p-min 0.0 \
-ngl 99 -ngld 99 \
-c 32768 -ctk q4_0 -ctv q4_0 \
-fa on --jinja -np 1 \
--host 0.0.0.0 --port 8080
Key differences from Qwen MTP:
--spec-type mtp(notdraft-mtp)-mdloads the separate drafter GGUF (441 MB at Q8_0)-ngld 99offloads drafter layers to GPU--draft-p-min 0.0verifies all drafts (ik_llama.cpp defaults to 0.8, which hurts code throughput)- Max 32K context (65K OOMs due to compute buffers)