Tags: llama.cpp, Qwen, TurboQuant, RTX 5060 Ti
Qwen 3.6 with TurboQuant: 400K Context on 16 GB
Requires the TurboQuant fork of llama.cpp (feature/turboquant-kv-cache branch).
Baseline launch with a 262,144-token context and a q4_0 KV cache:
export LLAMA_ARG_CONTEXT_SHIFT=1
./build/bin/llama-server \
  -m ~/.cache/llama.cpp/Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
  -ngl 99 -fa 1 -c 262144 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --swa-full --ctx-checkpoints 64 --kv-unified \
  --context-shift --cache-reuse 512 \
  --perf --no-warmup --mlock \
  --slot-prompt-similarity 0.0 \
  -np 1 -b 512 -ub 256 -t 6 -tb 6 \
  --threads-http 8 --no-mmap --jinja \
  --port 11433 --host 0.0.0.0
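With --no-mmap and --mlock the model is read fully into memory before the server accepts requests, so a short readiness poll can help when scripting around the launch. The sketch below polls llama-server's /health endpoint on the port used above. The function name wait_for_server, the TRIES count, and the 2-second delay are illustrative assumptions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: wait until llama-server answers /health with a success status.
# HOST and PORT default to the values from the launch command above.
# TRIES and the 2 s delay are arbitrary assumptions; tune for your disk speed.
wait_for_server() {
  local url="http://${HOST:-127.0.0.1}:${PORT:-11433}/health"
  local tries="${TRIES:-60}"
  local i
  for i in $(seq 1 "$tries"); do
    # -s: quiet, -f: treat HTTP errors (e.g. 503 while loading) as failure
    if curl -sf "$url" >/dev/null; then
      echo "ready: $url"
      return 0
    fi
    sleep 2
  done
  echo "not ready after $tries attempts: $url" >&2
  return 1
}
```

Call `wait_for_server` after starting the server in the background, then send requests once it returns 0.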
For a 400K (409,600-token) context with YaRN RoPE scaling, change the context size and KV cache type and add the YaRN flags:
-c 409600 \
--cache-type-k turbo3 --cache-type-v turbo3 \
--yarn-ext-factor 1.0 --yarn-attn-factor 1.0 \
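To avoid hand-editing the command when switching between the two setups, the differing flags can be kept in a small helper. This is a minimal sketch: the function name kv_flags and the profile names 262k and 400k are hypothetical, and the flag values are copied from the commands above.

```shell
#!/usr/bin/env bash
# Sketch: print the context/KV-cache flags for a named profile.
# Profile names are illustrative; flag values come from the article's commands.
kv_flags() {
  case "$1" in
    262k)
      echo "-c 262144 --cache-type-k q4_0 --cache-type-v q4_0"
      ;;
    400k)
      echo "-c 409600 --cache-type-k turbo3 --cache-type-v turbo3 --yarn-ext-factor 1.0 --yarn-attn-factor 1.0"
      ;;
    *)
      echo "unknown profile: $1" >&2
      return 1
      ;;
  esac
}
```

In the launch command, replace the `-c` and `--cache-type-*` flags with an unquoted `$(kv_flags 400k)` so the shell splits the output into separate arguments.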
Key tradeoffs:
- q4_0 at 262K: 97 t/s burst, 756 MB VRAM free; best daily driver
- turbo3 at 262K: 92 t/s, 1,452 MB VRAM free; 696 MB more headroom than q4_0
- turbo3 at 400K + YaRN: 94 t/s, 674 MB VRAM free; max context, tight on VRAM
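The free-VRAM figures above can be checked on your own card once the server has loaded the model. The sketch below uses nvidia-smi's memory.free query; how the original numbers were measured is not stated, so treat this as one possible method. nvidia-smi reports MiB, and the helper name min_free_mib is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report the smallest free-memory value (MiB) across visible GPUs.
# Reads one integer per line on stdin, as printed by nvidia-smi with nounits.
min_free_mib() {
  sort -n | head -n 1
}

# Only query the driver if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | min_free_mib
fi
```

Run it while the server is idle and again during a long prompt to see how much headroom remains at each setting.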