llama.cppQwenMTPRTX 5060 Ti
Qwen 3.6 MTP Server (llama.cpp)
MTP speculative decoding with quantized KV cache. Requires the -MTP model variant from Unsloth.
llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-IQ3_S.gguf \
-ngl 99 -c 65536 \
-ctk q4_0 -ctv q4_0 \
--spec-type draft-mtp --spec-draft-n-max 2 \
-np 1 --host 0.0.0.0 --port 11433
Key flags:
--spec-type draft-mtp(notmtp) enables Multi-Token Prediction--spec-draft-n-max 2— draft-3 drops acceptance from 81% to 70% with no speed gain-ctk q4_0 -ctv q4_0— quantized KV cache, fits 65K context on 16 GB-np 1— required, no parallel slots with MTP yet
First request after startup is slow (CUDA graph compilation). Send a throwaway prompt to warm up.