NJannasch.Dev
Tags: llama.cpp, Qwen, TurboQuant, RTX 5060 Ti

Qwen 3.6 with TurboQuant: 400K Context on 16 GB

Requires the TurboQuant fork of llama.cpp (feature/turboquant-kv-cache branch).

export LLAMA_ARG_CONTEXT_SHIFT=1

./build/bin/llama-server \
  -m ~/.cache/llama.cpp/Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
  -ngl 99 -fa 1 -c 262144 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --swa-full --ctx-checkpoints 64 --kv-unified \
  --context-shift --cache-reuse 512 \
  --perf --no-warmup --mlock \
  --slot-prompt-similarity 0.0 \
  -np 1 -b 512 -ub 256 -t 6 -tb 6 \
  --threads-http 8 --no-mmap --jinja \
  --port 11433 --host 0.0.0.0

For 400K context with YaRN RoPE scaling, change the context size and KV cache type, and add the YaRN flags:

  -c 409600 \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  --yarn-ext-factor 1.0 --yarn-attn-factor 1.0 \
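Switching between the two configurations by hand is easy to get wrong, so one option is to keep the differing flags in a small helper. The following is only a sketch: the function name `kv_flags` and the profile names `daily` and `max` are made up for illustration, while the flag values are copied from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the context and KV cache flags for a named profile.
# The profile names are invented for this sketch; the flags come from the post.
kv_flags() {
  case "$1" in
    daily)
      # 256 * 1024 = 262144 tokens, q4_0 KV cache
      echo "-c $((256 * 1024)) --cache-type-k q4_0 --cache-type-v q4_0"
      ;;
    max)
      # 400 * 1024 = 409600 tokens, turbo3 KV cache plus the YaRN flags
      echo "-c $((400 * 1024)) --cache-type-k turbo3 --cache-type-v turbo3 --yarn-ext-factor 1.0 --yarn-attn-factor 1.0"
      ;;
    *)
      echo "unknown profile: $1" >&2
      return 1
      ;;
  esac
}
```

The output would be spliced into the launch command unquoted, for example `./build/bin/llama-server -m ... $(kv_flags max) ...`, so that the shell splits it into separate arguments.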

Key tradeoffs:

  • q4_0 at 262K: 97 t/s burst, 756 MB VRAM free — best daily driver
  • turbo3 at 262K: 92 t/s, 1,452 MB free — more headroom
  • turbo3 at 400K + YaRN: 94 t/s, 674 MB free — max context, tight on VRAM
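To see where a given setup lands on that VRAM scale, free memory can be read from `nvidia-smi` once the server has loaded the model. This is a rough sketch, not part of the original setup: the helper name `check_headroom` and the 700 MB threshold are assumptions chosen only to separate the "tight" 400K case from the other two.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify free VRAM (in MB) against a minimum (in MB).
check_headroom() {
  if [ "$1" -ge "$2" ]; then
    echo "ok"
  else
    echo "tight"
  fi
}

# Query the first GPU only when nvidia-smi is installed and returns a value.
if command -v nvidia-smi >/dev/null 2>&1; then
  free_mb=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits 2>/dev/null | head -n 1)
  if [ -n "$free_mb" ]; then
    # 700 MB is an assumed threshold, not a figure from the post.
    echo "free VRAM: ${free_mb} MB ($(check_headroom "$free_mb" 700))"
  fi
fi
```

With that assumed threshold, the figures listed above would classify as `ok` for 756 MB and 1,452 MB, and `tight` for 674 MB.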