Skip to main content
NJannasch.Dev

Qwen 3.6 35B on RTX 5060 Ti: Full 262K Context, TurboQuant to 400K, and What Actually Matters

· 9 min read
AIHomelabllama.cppBenchmarking

This is a follow-up to my Qwen 3.5 benchmarks. Same GPU, same setup, new model, new llama.cpp, and some surprising results.

Update: I also benchmarked the 27B dense variant with speculative decoding. The dense model scores higher on coding benchmarks but runs at 31 t/s vs 98 t/s for the MoE. After testing every spec-dec trick available, I still recommend the MoE as the better daily driver on 16 GB. See also: code generation showdown — Gemma 4 vs Qwen 3.6, same prompt, one shot, visual results.

When Qwen 3.6 dropped, my first question was simple: does the 35B finally reach full context on my 16 GB card? On 3.5, I had to choose between the 35B at 163K or the 9B at 250K. I wanted both: the smarter model at the full context window.

Turns out, it fits. And with TurboQuant, it goes further than I expected, but there’s a catch nobody benchmarked yet.

What Changed

Qwen 3.5Qwen 3.6
ModelQwen3.5-35B-A3B-UD-IQ3_XXSQwen3.6-35B-A3B-UD-IQ3_S
llama.cppb8177b8838 (+660 versions)
Max context (35B)163K262K (native limit)
VRAM free at max context348 MB756 MB (q4_0)
Thinking mode--reasoning-format deepseek-legacynative

I’m using Unsloth’s GGUF quantizations rather than the official ones. Unsloth produces optimized GGUF files with better quantization accuracy at the same bit width, and they’re typically among the first to publish quants for new models. The IQ3_S quant is slightly higher quality than the IQ3_XXS I used on 3.5, while being a bit smaller in file size for this model.

The 3.6 model is slightly smaller in weights than the 3.5 (~700 MB less), and b8838 has a more efficient memory allocator. Together that freed enough VRAM to fit the full native context window.

The Benchmark

I wrote a benchmark script to test each configuration systematically. For every config it starts the server, runs three tests (short burst, medium 500-token generation, and a large prompt with generation afterward), then records VRAM usage and speeds.

Setup

# Update llama.cpp to b8838
cd ~/llama.cpp && git fetch --tags && git checkout b8838
cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_BORINGSSL=ON
cmake --build build -j$(nproc)

# Download model via hf-mirror
wget -O ~/.cache/llama.cpp/Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
  "https://hf-mirror.com/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-IQ3_S.gguf"
export LLAMA_ARG_CONTEXT_SHIFT=1

./build/bin/llama-server \
  -m ~/.cache/llama.cpp/Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
  -ngl 99 -fa 1 -c 262144 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --swa-full --ctx-checkpoints 64 --kv-unified \
  --context-shift --cache-reuse 512 \
  --perf --no-warmup --mlock \
  --slot-prompt-similarity 0.0 \
  -np 1 -b 512 -ub 256 -t 6 -tb 6 \
  --threads-http 8 --no-mmap --jinja \
  --port 11433 --host 0.0.0.0

Results: Small and Medium Context

ConfigVRAM FreeShort burstMed (500t)7.5K prefillGen @7.5K
b8838 163K q5_1/q4_11,240 MB88 t/s83 t/s113 t/s30 t/s
b8838 200K q5_1/q4_1904 MB88 t/s83 t/s114 t/s30 t/s
b8838 262K q4_0/q4_0756 MB97 t/s78 t/s1,585 t/s89 t/s
turbo3 262K1,452 MB92 t/s92 t/s1,581 t/s74 t/s
turbo3 320K + YaRN1,106 MB90 t/s92 t/s1,576 t/s74 t/s
turbo3 400K + YaRN674 MB94 t/s92 t/s1,577 t/s75 t/s

Results: Heavy Context (~108K tokens in KV cache)

This is the test that matters for real coding sessions. Fill the KV cache with ~108K tokens, then generate:

ConfigVRAM Free108K prefillGen @108K context
b8838 262K q4_0/q4_0756 MB1,261 t/s46 t/s
turbo3 262K1,452 MB1,247 t/s21 t/s
turbo3 400K + YaRN674 MB1,239 t/s21 t/s

The Surprise: KV Cache Type Matters More Than Expected

q5_1/q4_1 is a trap. It looks like the quality upgrade from q4_0, but it tanks performance:

  • Prefill: 113 t/s vs 1,585 t/s for q4_0. That’s 14x slower.
  • Generation after context: 30 t/s vs 89 t/s for q4_0.

I initially tested with q5_1/q4_1 because it sounded like better quality. The benchmarks show it’s catastrophically slower. The 14x prefill gap is so large that it’s likely not just an optimization difference but a different code path entirely, possibly q5_1 falling back to a non-flash-attention implementation or even CPU compute for certain operations. Worth investigating if you’re considering q5_1. For now: stick with q4_0.

TurboQuant turbo3 splits the difference:

  • Prefill matches q4_0 (~1,250 t/s vs ~1,580 t/s). Close enough.
  • Generation at light context (92 t/s) actually beats q4_0 (78 t/s).
  • But generation at heavy context (21 t/s) is half of q4_0 (46 t/s). The Hadamard decompress step gets expensive as the KV cache fills.

TurboQuant: Build and Test

TurboQuant implements Google DeepMind’s ICLR 2026 paper on KV cache compression. Not merged into mainline llama.cpp yet (Q3 2026 target), but the fork builds and runs.

# Clone and build the TurboQuant fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache

# Build with CUDA (use compute_90 for Blackwell/5060 Ti compatibility)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build -j$(nproc)

New KV cache types: turbo2 (2-bit, 6.4x compression), turbo3 (3.5-bit, ~5x), turbo4 (4.5-bit, ~4x).

Context Window Progression

ConfigContextVRAM FreeGen @7.5KGen @108K
Qwen 3.5, q4_0, b8177163K348 MB47-51 t/s~47 t/s
Qwen 3.6, q4_0, b8838262K756 MB89 t/s46 t/s
Qwen 3.6, turbo3, TQ fork262K1,452 MB74 t/s21 t/s
Qwen 3.6, turbo3 + YaRN400K674 MB75 t/s21 t/s
Qwen 3.6, turbo3 + YaRN450K414 MBOOM on large prompt
Qwen 3.6, turbo3 + YaRN500K68 MBOOMOOM

Qwen 3.6 supports up to 1,010,000 tokens via YaRN rope scaling. On 16 GB VRAM, the practical ceiling is 400K with turbo3 or 262K with q4_0.

Native Thinking Mode

Qwen 3.6 has built-in reasoning. The model thinks before answering, and the thinking content is returned separately in the API response:

{
  "reasoning_content": "Here's a thinking process: 1. Analyze...",
  "content": "eBPF is a Linux kernel technology that allows..."
}

Disable per-request for faster direct answers:

{"chat_template_kwargs": {"enable_thinking": false}}

Using It with OpenCode

Once the server is running, point your coding tool at it. Here’s the OpenCode config:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llamacpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server",
      "options": {
        "baseURL": "http://<your-server-ip>:11433/v1"
      },
      "models": {
        "home-qwen": {
          "name": "Home Qwen 3.6",
          "limit": {
            "context": 262144,
            "output": 248320
          }
        }
      }
    }
  }
}

Any OpenAI-compatible client works the same way. Point it at http://<server>:11433/v1 and it sees the model.

Recommendation

Use caseConfigContextGen @108KWhy
Daily codingq4_0/q4_0, b8838262K46 t/sFull native context, fast at all context sizes, stable, no fork needed
Maximum contextturbo3, TQ fork + YaRN400K21 t/s2.4x more context, but generation halves when cache is full
Light read-heavy tasksturbo3, TQ fork262K21 t/s1.4 GB headroom, great for ingesting docs where you generate short answers

For my daily use, I’m running q4_0 at 262K. The 3.5 gave me 163K at 47 t/s. The 3.6 gives me 262K at 46 t/s. 60% more context at the same speed, on the same GPU. That’s the real upgrade.

TurboQuant is worth watching. When it merges into mainline llama.cpp and gets the same kernel optimizations as q4_0, the generation speed gap should close. At that point, 400K+ on a 16 GB card becomes the default rather than the experiment.

Update: I combined TurboQuant with MTP speculative decoding — turbo3 KV cache + MTP draft-2 hits 125 t/s at 98K context. The merged fork is on my GitHub.

Benchmark Script

The script cycles through configurations, starts the server for each, runs short/medium/large-context tests, and prints a comparison table. Adapt the model path and configs for your setup:

llama-benchmark.py
#!/usr/bin/env python3
"""
Benchmark script for llama.cpp server configurations.
Tests different context sizes, KV cache types, and measures generation speed
across four scenarios:

  1. Short burst    - 20 token gen, nearly empty KV cache (peak speed)
  2. Medium         - 600 token gen, KV fills during output (sustained speed)
  3. 20K prompt     - ~20K token prefill + 50 token gen (moderate cache)
  4. 108K prompt    - ~108K token prefill + 100 token gen (heavy cache load)

The heavy context test (#4) is what matters for real coding sessions where
the KV cache is full. Burst speeds (#1) are misleading for daily use.

Usage:
  scp scripts/llama-benchmark.py nils@<server>:/tmp/bench.py
  ssh nils@<server> 'python3 /tmp/bench.py'

Adapt the model path and configs list at the bottom for your setup.
"""

Want to put this model to work beyond benchmarks? I run the same Qwen 3.6 as a 24/7 autonomous agent with cron jobs, Telegram/Signal delivery, and kernel-level sandboxing.

The views and opinions expressed here are my own and do not reflect those of my employer.