Skip to main content
NJannasch.Dev
llama.cppQwenMTPRTX 5060 Ti

Qwen 3.6 MTP Server (llama.cpp)

MTP speculative decoding with quantized KV cache. Requires the -MTP model variant from Unsloth.

llama-server \
  -m Qwen3.6-35B-A3B-MTP-UD-IQ3_S.gguf \
  -ngl 99 -c 65536 \
  -ctk q4_0 -ctv q4_0 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -np 1 --host 0.0.0.0 --port 11433

Key flags:

  • --spec-type draft-mtp (not mtp) enables Multi-Token Prediction
  • --spec-draft-n-max 2 — draft-3 drops acceptance from 81% to 70% with no speed gain
  • -ctk q4_0 -ctv q4_0 — quantized KV cache, fits 65K context on 16 GB
  • -np 1 — required, no parallel slots with MTP yet

First request after startup is slow (CUDA graph compilation). Send a throwaway prompt to warm up.