Qwen 3.6 35B on RTX 5060 Ti: Full 262K Context, TurboQuant to 400K, and What Actually Matters
This is a follow-up to my Qwen 3.5 benchmarks. Same GPU, same setup, new model, new llama.cpp, and some surprising results.
Update: I also benchmarked the 27B dense variant with speculative decoding. The dense model scores higher on coding benchmarks but runs at 31 t/s vs 98 t/s for the MoE. After testing every spec-dec trick available, I still recommend the MoE as the better daily driver on 16 GB. See also: code generation showdown — Gemma 4 vs Qwen 3.6, same prompt, one shot, visual results.
When Qwen 3.6 dropped, my first question was simple: does the 35B finally reach full context on my 16 GB card? On 3.5, I had to choose between the 35B at 163K or the 9B at 250K. I wanted both: the smarter model at the full context window.
Turns out, it fits. And with TurboQuant, it goes further than I expected, but there’s a catch nobody benchmarked yet.
What Changed
| Qwen 3.5 | Qwen 3.6 | |
|---|---|---|
| Model | Qwen3.5-35B-A3B-UD-IQ3_XXS | Qwen3.6-35B-A3B-UD-IQ3_S |
| llama.cpp | b8177 | b8838 (+660 versions) |
| Max context (35B) | 163K | 262K (native limit) |
| VRAM free at max context | 348 MB | 756 MB (q4_0) |
| Thinking mode | --reasoning-format deepseek-legacy | native |
I’m using Unsloth’s GGUF quantizations rather than the official ones. Unsloth produces optimized GGUF files with better quantization accuracy at the same bit width, and they’re typically among the first to publish quants for new models. The IQ3_S quant is slightly higher quality than the IQ3_XXS I used on 3.5, while being a bit smaller in file size for this model.
The 3.6 model is slightly smaller in weights than the 3.5 (~700 MB less), and b8838 has a more efficient memory allocator. Together that freed enough VRAM to fit the full native context window.
The Benchmark
I wrote a benchmark script to test each configuration systematically. For every config it starts the server, runs three tests (short burst, medium 500-token generation, and a large prompt with generation afterward), then records VRAM usage and speeds.
Setup
# Update llama.cpp to b8838
cd ~/llama.cpp && git fetch --tags && git checkout b8838
cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_BORINGSSL=ON
cmake --build build -j$(nproc)
# Download model via hf-mirror
wget -O ~/.cache/llama.cpp/Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
"https://hf-mirror.com/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-IQ3_S.gguf"
Server Launch (recommended daily config)
export LLAMA_ARG_CONTEXT_SHIFT=1
./build/bin/llama-server \
-m ~/.cache/llama.cpp/Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
-ngl 99 -fa 1 -c 262144 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--swa-full --ctx-checkpoints 64 --kv-unified \
--context-shift --cache-reuse 512 \
--perf --no-warmup --mlock \
--slot-prompt-similarity 0.0 \
-np 1 -b 512 -ub 256 -t 6 -tb 6 \
--threads-http 8 --no-mmap --jinja \
--port 11433 --host 0.0.0.0
Results: Small and Medium Context
| Config | VRAM Free | Short burst | Med (500t) | 7.5K prefill | Gen @7.5K |
|---|---|---|---|---|---|
| b8838 163K q5_1/q4_1 | 1,240 MB | 88 t/s | 83 t/s | 113 t/s | 30 t/s |
| b8838 200K q5_1/q4_1 | 904 MB | 88 t/s | 83 t/s | 114 t/s | 30 t/s |
| b8838 262K q4_0/q4_0 | 756 MB | 97 t/s | 78 t/s | 1,585 t/s | 89 t/s |
| turbo3 262K | 1,452 MB | 92 t/s | 92 t/s | 1,581 t/s | 74 t/s |
| turbo3 320K + YaRN | 1,106 MB | 90 t/s | 92 t/s | 1,576 t/s | 74 t/s |
| turbo3 400K + YaRN | 674 MB | 94 t/s | 92 t/s | 1,577 t/s | 75 t/s |
Results: Heavy Context (~108K tokens in KV cache)
This is the test that matters for real coding sessions. Fill the KV cache with ~108K tokens, then generate:
| Config | VRAM Free | 108K prefill | Gen @108K context |
|---|---|---|---|
| b8838 262K q4_0/q4_0 | 756 MB | 1,261 t/s | 46 t/s |
| turbo3 262K | 1,452 MB | 1,247 t/s | 21 t/s |
| turbo3 400K + YaRN | 674 MB | 1,239 t/s | 21 t/s |
The Surprise: KV Cache Type Matters More Than Expected
q5_1/q4_1 is a trap. It looks like the quality upgrade from q4_0, but it tanks performance:
- Prefill: 113 t/s vs 1,585 t/s for q4_0. That’s 14x slower.
- Generation after context: 30 t/s vs 89 t/s for q4_0.
I initially tested with q5_1/q4_1 because it sounded like better quality. The benchmarks show it’s catastrophically slower. The 14x prefill gap is so large that it’s likely not just an optimization difference but a different code path entirely, possibly q5_1 falling back to a non-flash-attention implementation or even CPU compute for certain operations. Worth investigating if you’re considering q5_1. For now: stick with q4_0.
TurboQuant turbo3 splits the difference:
- Prefill matches q4_0 (~1,250 t/s vs ~1,580 t/s). Close enough.
- Generation at light context (92 t/s) actually beats q4_0 (78 t/s).
- But generation at heavy context (21 t/s) is half of q4_0 (46 t/s). The Hadamard decompress step gets expensive as the KV cache fills.
TurboQuant: Build and Test
TurboQuant implements Google DeepMind’s ICLR 2026 paper on KV cache compression. Not merged into mainline llama.cpp yet (Q3 2026 target), but the fork builds and runs.
# Clone and build the TurboQuant fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
# Build with CUDA (use compute_90 for Blackwell/5060 Ti compatibility)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build -j$(nproc)
New KV cache types: turbo2 (2-bit, 6.4x compression), turbo3 (3.5-bit, ~5x), turbo4 (4.5-bit, ~4x).
Context Window Progression
| Config | Context | VRAM Free | Gen @7.5K | Gen @108K |
|---|---|---|---|---|
| Qwen 3.5, q4_0, b8177 | 163K | 348 MB | 47-51 t/s | ~47 t/s |
| Qwen 3.6, q4_0, b8838 | 262K | 756 MB | 89 t/s | 46 t/s |
| Qwen 3.6, turbo3, TQ fork | 262K | 1,452 MB | 74 t/s | 21 t/s |
| Qwen 3.6, turbo3 + YaRN | 400K | 674 MB | 75 t/s | 21 t/s |
| Qwen 3.6, turbo3 + YaRN | 450K | 414 MB | — | OOM on large prompt |
| Qwen 3.6, turbo3 + YaRN | 500K | 68 MB | OOM | OOM |
Qwen 3.6 supports up to 1,010,000 tokens via YaRN rope scaling. On 16 GB VRAM, the practical ceiling is 400K with turbo3 or 262K with q4_0.
Native Thinking Mode
Qwen 3.6 has built-in reasoning. The model thinks before answering, and the thinking content is returned separately in the API response:
{
"reasoning_content": "Here's a thinking process: 1. Analyze...",
"content": "eBPF is a Linux kernel technology that allows..."
}
Disable per-request for faster direct answers:
{"chat_template_kwargs": {"enable_thinking": false}}
Using It with OpenCode
Once the server is running, point your coding tool at it. Here’s the OpenCode config:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"llamacpp": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama-server",
"options": {
"baseURL": "http://<your-server-ip>:11433/v1"
},
"models": {
"home-qwen": {
"name": "Home Qwen 3.6",
"limit": {
"context": 262144,
"output": 248320
}
}
}
}
}
}
Any OpenAI-compatible client works the same way. Point it at http://<server>:11433/v1 and it sees the model.
Recommendation
| Use case | Config | Context | Gen @108K | Why |
|---|---|---|---|---|
| Daily coding | q4_0/q4_0, b8838 | 262K | 46 t/s | Full native context, fast at all context sizes, stable, no fork needed |
| Maximum context | turbo3, TQ fork + YaRN | 400K | 21 t/s | 2.4x more context, but generation halves when cache is full |
| Light read-heavy tasks | turbo3, TQ fork | 262K | 21 t/s | 1.4 GB headroom, great for ingesting docs where you generate short answers |
For my daily use, I’m running q4_0 at 262K. The 3.5 gave me 163K at 47 t/s. The 3.6 gives me 262K at 46 t/s. 60% more context at the same speed, on the same GPU. That’s the real upgrade.
TurboQuant is worth watching. When it merges into mainline llama.cpp and gets the same kernel optimizations as q4_0, the generation speed gap should close. At that point, 400K+ on a 16 GB card becomes the default rather than the experiment.
Update: I combined TurboQuant with MTP speculative decoding — turbo3 KV cache + MTP draft-2 hits 125 t/s at 98K context. The merged fork is on my GitHub.
Benchmark Script
The script cycles through configurations, starts the server for each, runs short/medium/large-context tests, and prints a comparison table. Adapt the model path and configs for your setup:
#!/usr/bin/env python3
"""
Benchmark script for llama.cpp server configurations.
Tests different context sizes, KV cache types, and measures generation speed
across four scenarios:
1. Short burst - 20 token gen, nearly empty KV cache (peak speed)
2. Medium - 600 token gen, KV fills during output (sustained speed)
3. 20K prompt - ~20K token prefill + 50 token gen (moderate cache)
4. 108K prompt - ~108K token prefill + 100 token gen (heavy cache load)
The heavy context test (#4) is what matters for real coding sessions where
the KV cache is full. Burst speeds (#1) are misleading for daily use.
Usage:
scp scripts/llama-benchmark.py nils@<server>:/tmp/bench.py
ssh nils@<server> 'python3 /tmp/bench.py'
Adapt the model path and configs list at the bottom for your setup.
""" Want to put this model to work beyond benchmarks? I run the same Qwen 3.6 as a 24/7 autonomous agent with cron jobs, Telegram/Signal delivery, and kernel-level sandboxing.
The views and opinions expressed here are my own and do not reflect those of my employer.