What is TurboQuant KV cache in llama.cpp?

TurboQuant uses q4_0 quantization for both key and value caches (-ctk q4_0 -ctv q4_0), reducing KV cache memory by ~75%. This enables much longer context windows on limited VRAM — pushing Qwen 3.6 35B from 262K to 400K context on a 16GB GPU.

Can Qwen 3.6 35B reach its full context window on 16GB VRAM?

Yes. With IQ3_S quantization, Qwen 3.6-35B-A3B reaches its full 262K native context on an RTX 5060 Ti (16GB). With TurboQuant KV cache compression, it pushes to 400K context on the same hardware.

How fast is Qwen 3.6 35B on an RTX 5060 Ti?

Qwen 3.6-35B-A3B generates at approximately 47-51 tokens per second on an RTX 5060 Ti with 16GB VRAM, making it practical for local coding and agentic workflows.

Qwen 3.6 35B on RTX 5060 Ti: Full 262K Context, TurboQuant to 400K, and What Actually Matters

This is a follow-up to my Qwen 3.5 benchmarks. Same GPU, same setup, new model, new llama.cpp, and some surprising results.

Update: I also benchmarked the 27B dense variant with speculative decoding. The dense model scores higher on coding benchmarks but runs at 31 t/s vs 98 t/s for the MoE. After testing every spec-dec trick available, I still recommend the MoE as the better daily driver on 16 GB. See also: code generation showdown — Gemma 4 vs Qwen 3.6, same prompt, one shot, visual results.

When Qwen 3.6 dropped, my first question was simple: does the 35B finally reach full context on my 16 GB card? On 3.5, I had to choose between the 35B at 163K or the 9B at 250K. I wanted both: the smarter model at the full context window.

Turns out, it fits. And with TurboQuant, it goes further than I expected, but there’s a catch nobody benchmarked yet.

What Changed

	Qwen 3.5	Qwen 3.6
Model	Qwen3.5-35B-A3B-UD-IQ3_XXS	Qwen3.6-35B-A3B-UD-IQ3_S
llama.cpp	b8177	b8838 (+660 versions)
Max context (35B)	163K	262K (native limit)
VRAM free at max context	348 MB	756 MB (q4_0)
Thinking mode	`--reasoning-format deepseek-legacy`	native

I’m using Unsloth’s GGUF quantizations rather than the official ones. Unsloth produces optimized GGUF files with better quantization accuracy at the same bit width, and they’re typically among the first to publish quants for new models. The IQ3_S quant is slightly higher quality than the IQ3_XXS I used on 3.5, while being a bit smaller in file size for this model.

The 3.6 model is slightly smaller in weights than the 3.5 (~700 MB less), and b8838 has a more efficient memory allocator. Together that freed enough VRAM to fit the full native context window.

The Benchmark

I wrote a benchmark script to test each configuration systematically. For every config it starts the server, runs three tests (short burst, medium 500-token generation, and a large prompt with generation afterward), then records VRAM usage and speeds.

Setup

# Update llama.cpp to b8838
cd ~/llama.cpp && git fetch --tags && git checkout b8838
cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_BORINGSSL=ON
cmake --build build -j$(nproc)

# Download model via hf-mirror
wget -O ~/.cache/llama.cpp/Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
  "https://hf-mirror.com/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-IQ3_S.gguf"

Server Launch (recommended daily config)

export LLAMA_ARG_CONTEXT_SHIFT=1

./build/bin/llama-server \
  -m ~/.cache/llama.cpp/Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
  -ngl 99 -fa 1 -c 262144 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --swa-full --ctx-checkpoints 64 --kv-unified \
  --context-shift --cache-reuse 512 \
  --perf --no-warmup --mlock \
  --slot-prompt-similarity 0.0 \
  -np 1 -b 512 -ub 256 -t 6 -tb 6 \
  --threads-http 8 --no-mmap --jinja \
  --port 11433 --host 0.0.0.0

Results: Small and Medium Context

Config	VRAM Free	Short burst	Med (500t)	7.5K prefill	Gen @7.5K
b8838 163K q5_1/q4_1	1,240 MB	88 t/s	83 t/s	113 t/s	30 t/s
b8838 200K q5_1/q4_1	904 MB	88 t/s	83 t/s	114 t/s	30 t/s
b8838 262K q4_0/q4_0	756 MB	97 t/s	78 t/s	1,585 t/s	89 t/s
turbo3 262K	1,452 MB	92 t/s	92 t/s	1,581 t/s	74 t/s
turbo3 320K + YaRN	1,106 MB	90 t/s	92 t/s	1,576 t/s	74 t/s
turbo3 400K + YaRN	674 MB	94 t/s	92 t/s	1,577 t/s	75 t/s

Results: Heavy Context (~108K tokens in KV cache)

This is the test that matters for real coding sessions. Fill the KV cache with ~108K tokens, then generate:

Config	VRAM Free	108K prefill	Gen @108K context
b8838 262K q4_0/q4_0	756 MB	1,261 t/s	46 t/s
turbo3 262K	1,452 MB	1,247 t/s	21 t/s
turbo3 400K + YaRN	674 MB	1,239 t/s	21 t/s

The Surprise: KV Cache Type Matters More Than Expected

q5_1/q4_1 is a trap. It looks like the quality upgrade from q4_0, but it tanks performance:

Prefill: 113 t/s vs 1,585 t/s for q4_0. That’s 14x slower.
Generation after context: 30 t/s vs 89 t/s for q4_0.

I initially tested with q5_1/q4_1 because it sounded like better quality. The benchmarks show it’s catastrophically slower. The 14x prefill gap is so large that it’s likely not just an optimization difference but a different code path entirely, possibly q5_1 falling back to a non-flash-attention implementation or even CPU compute for certain operations. Worth investigating if you’re considering q5_1. For now: stick with q4_0.

TurboQuant turbo3 splits the difference:

Prefill matches q4_0 (~1,250 t/s vs ~1,580 t/s). Close enough.
Generation at light context (92 t/s) actually beats q4_0 (78 t/s).
But generation at heavy context (21 t/s) is half of q4_0 (46 t/s). The Hadamard decompress step gets expensive as the KV cache fills.

TurboQuant: Build and Test

TurboQuant implements Google DeepMind’s ICLR 2026 paper on KV cache compression. Not merged into mainline llama.cpp yet (Q3 2026 target), but the fork builds and runs.

# Clone and build the TurboQuant fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache

# Build with CUDA (use compute_90 for Blackwell/5060 Ti compatibility)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build -j$(nproc)

New KV cache types: turbo2 (2-bit, 6.4x compression), turbo3 (3.5-bit, ~5x), turbo4 (4.5-bit, ~4x).

Context Window Progression

Config	Context	VRAM Free	Gen @7.5K	Gen @108K
Qwen 3.5, q4_0, b8177	163K	348 MB	47-51 t/s	~47 t/s
Qwen 3.6, q4_0, b8838	262K	756 MB	89 t/s	46 t/s
Qwen 3.6, turbo3, TQ fork	262K	1,452 MB	74 t/s	21 t/s
Qwen 3.6, turbo3 + YaRN	400K	674 MB	75 t/s	21 t/s
Qwen 3.6, turbo3 + YaRN	450K	414 MB	—	OOM on large prompt
Qwen 3.6, turbo3 + YaRN	500K	68 MB	OOM	OOM

Qwen 3.6 supports up to 1,010,000 tokens via YaRN rope scaling. On 16 GB VRAM, the practical ceiling is 400K with turbo3 or 262K with q4_0.

Native Thinking Mode

Qwen 3.6 has built-in reasoning. The model thinks before answering, and the thinking content is returned separately in the API response:

{
  "reasoning_content": "Here's a thinking process: 1. Analyze...",
  "content": "eBPF is a Linux kernel technology that allows..."
}

Disable per-request for faster direct answers:

{"chat_template_kwargs": {"enable_thinking": false}}

Using It with OpenCode

Once the server is running, point your coding tool at it. Here’s the OpenCode config:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llamacpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server",
      "options": {
        "baseURL": "http://<your-server-ip>:11433/v1"
      },
      "models": {
        "home-qwen": {
          "name": "Home Qwen 3.6",
          "limit": {
            "context": 262144,
            "output": 248320
          }
        }
      }
    }
  }
}

Any OpenAI-compatible client works the same way. Point it at http://<server>:11433/v1 and it sees the model.

Recommendation

Use case	Config	Context	Gen @108K	Why
Daily coding	q4_0/q4_0, b8838	262K	46 t/s	Full native context, fast at all context sizes, stable, no fork needed
Maximum context	turbo3, TQ fork + YaRN	400K	21 t/s	2.4x more context, but generation halves when cache is full
Light read-heavy tasks	turbo3, TQ fork	262K	21 t/s	1.4 GB headroom, great for ingesting docs where you generate short answers

For my daily use, I’m running q4_0 at 262K. The 3.5 gave me 163K at 47 t/s. The 3.6 gives me 262K at 46 t/s. 60% more context at the same speed, on the same GPU. That’s the real upgrade.

TurboQuant is worth watching. When it merges into mainline llama.cpp and gets the same kernel optimizations as q4_0, the generation speed gap should close. At that point, 400K+ on a 16 GB card becomes the default rather than the experiment.

Update: I combined TurboQuant with MTP speculative decoding — turbo3 KV cache + MTP draft-2 hits 125 t/s at 98K context. The merged fork is on my GitHub.

Benchmark Script

The script cycles through configurations, starts the server for each, runs short/medium/large-context tests, and prints a comparison table. Adapt the model path and configs for your setup:

llama-benchmark.py

#!/usr/bin/env python3
"""
Benchmark script for llama.cpp server configurations.
Tests different context sizes, KV cache types, and measures generation speed
across four scenarios:

  1. Short burst    - 20 token gen, nearly empty KV cache (peak speed)
  2. Medium         - 600 token gen, KV fills during output (sustained speed)
  3. 20K prompt     - ~20K token prefill + 50 token gen (moderate cache)
  4. 108K prompt    - ~108K token prefill + 100 token gen (heavy cache load)

The heavy context test (#4) is what matters for real coding sessions where
the KV cache is full. Burst speeds (#1) are misleading for daily use.

Usage:
  scp scripts/llama-benchmark.py nils@<server>:/tmp/bench.py
  ssh nils@<server> 'python3 /tmp/bench.py'

Adapt the model path and configs list at the bottom for your setup.
"""

Want to put this model to work beyond benchmarks? I run the same Qwen 3.6 as a 24/7 autonomous agent with cron jobs, Telegram/Signal delivery, and kernel-level sandboxing.

The views and opinions expressed here are my own and do not reflect those of my employer.