Skip to main content
NJannasch.Dev

Gemma 4 on a 5060 Ti: 256K Context on 16GB — but Only if You Know the Architecture Trick

· 9 min read
AIHomelabllama.cppBenchmarking

Using multiple AI coding tools and losing track of sessions? I built VibeCockpit — one dashboard to search and resume sessions across Claude Code, Copilot, Codex, and more.

TL;DR: Gemma 4’s 26B MoE runs at 99 t/s with 256K context on my RTX 5060 Ti (16 GB). The 31B dense model manages 27 t/s and tops out at 65K context. The key insight: Gemma 4 uses iSWA (interleaved Sliding Window Attention), which creates a dual KV cache — and the --swa-full flag (commonly copied from Qwen configs) forces it to allocate the full context for every layer, OOMing immediately. Remove that flag and 256K fits with 2.7 GB to spare. Copy-paste server configs, full benchmark tables, and a head-to-head vs Qwen 3.6 below. [New: code generation showdown — how well do these models actually write code? Four rounds, visual results.]

After three rounds of Qwen benchmarks on this GPU, I wanted to see whether Google’s Gemma 4 could do anything different on the same hardware. The model family looked interesting: an MoE variant with 128 experts (same as Qwen), a dense 31B, multimodal vision support, and a claimed 256K context window. All under Apache 2.0.

The surprise was not the speed. It was the architecture.

Test Setup

Same machine as all my previous benchmarks:

  • RTX 5060 Ti 16 GB over OCuLink (PCIe 4.0 x4)
  • AMD Ryzen 7, headless (no compositor — ~15.5 GB usable VRAM)
  • llama.cpp mainline, built from source with CUDA
  • Unsloth GGUF quantizations via hf download

Models tested:

ModelTypeTotal / Active ParamsGGUF Size
gemma-4-26B-A4B-it-UD-IQ4_XSMoE25.2B / 3.8B13.4 GB
gemma-4-26B-A4B-it-UD-Q3_K_MMoE25.2B / 3.8B12.5 GB
gemma-4-26B-A4B-it-UD-IQ3_XXSMoE25.2B / 3.8B11.2 GB
gemma-4-31B-it-UD-IQ3_XXSDense30.7B / 30.7B11.8 GB
gemma-4-31B-it-Q3_K_MDense30.7B / 30.7B14.7 GB

The Result, Up Front

The MoE at IQ3_XXS is the winner. 256K context, 99 t/s sustained, 2.7 GB of VRAM to spare.

ConfigShortSustainedCode108K genMax ContextPeak VRAM
MoE IQ3_XXS q4_0-KV96 t/s99 t/s97 t/s49 t/s256K13.2 GB
MoE Q3_K_M q4_0-KV87 t/s89 t/s87 t/s46 t/s131K13.6 GB
MoE IQ4_XS q4_0-KV84 t/s85 t/s83 t/s65K13.9 GB
Dense IQ3_XXS q4_0-KV27 t/s26 t/s25 t/s65K13.8 GB
Dense Q3_K_M q4_0-KV19 t/s18 t/s18 t/s8K15.0 GB

The dense 31B is 3.5x slower and maxes out at 65K context. Same conclusion as Qwen’s dense vs MoE: on 16 GB, MoE is the only architecture that works for long context.

The Architecture Trick: iSWA and the Dual KV Cache

This is the part that took me a day to figure out, and it is the reason this post exists.

Gemma 4 uses iSWA — interleaved Sliding Window Attention. Instead of giving every layer full attention over the entire context (like a standard transformer), Gemma splits its layers into two types:

Layer TypeAttentionKV Cache SizeGemma 4 MoE (30 layers)
GlobalFull contextScales with context length5 layers
SWALast 1024 tokens onlyFixed at 1024 entries25 layers

The final layer is always global, so the model’s output has full context awareness. But the 25 SWA layers only look at a small local window. This is by design — Google found that most layers don’t need to see the entire history to do their job well.

The consequence for VRAM is dramatic. Toggle between modes to see why --swa-full — essential for Qwen, where it prevents sliding window amnesia — is a trap for Gemma 4:

With --swa-full, llama.cpp allocates the full context for ALL 30 layers — OOM. Without it, only the 5 global layers get full-context KV. The total KV cache drops from ~16 GB to ~2 GB. My first benchmark run had --swa-full on every config. 8 out of 10 failed to start. Once I removed it, all 11 completed — including the full 256K.

The VRAM scaling confirms this — the MoE at 256K uses less VRAM than the dense model at 131K:

ConfigStartupAfter 20KAfter 50KAfter 108K
MoE IQ3_XXS 256K13,15113,18913,30913,493
MoE IQ3_XXS 131K12,29112,32912,44912,633
MoE Q3_K_M 131K13,53713,57513,69513,877
MoE IQ4_XS 65K14,03114,06914,189
Dense IQ3_XXS 65K13,80713,86914,113
Dense IQ3_XXS 131K15,24715,30915,55315,845

All values in MiB. GPU total: 16,384 MiB.

Going from 65K to 256K context costs only ~1.3 GB extra at startup because the SWA cache is fixed-size — only the 5 global layers grow with context. And the 256K context is not just allocated — it actually works. A needle-in-a-haystack test (unique fact buried at 50% depth in filler text) passes at every depth from 5K to 200K tokens. No degradation, no hallucination.

Full MoE Benchmark Results

All tests run on the same machine, progressive context fill from empty to 108K tokens.

ConfigShortMedCode20K PP20K Gen50K PP50K Gen108K PP108K Gen
IQ4_XS fp16-KV 8K (swa-full)899289241072
IQ4_XS q4_0-KV 8K868584229481
IQ4_XS q4_0-KV 32K848583229281
IQ4_XS q4_0-KV 65K848583229381192362
Q3_K_M q4_0-KV 32K868987230084
Q3_K_M q4_0-KV 65K878987230084193863
Q3_K_M q4_0-KV 131K878987229984193763154846
IQ3_XXS q4_0-KV 65K969997201393173769
IQ3_XXS q4_0-KV 131K969997201294173768142249
IQ3_XXS q4_0-KV 196K969997201393173868142249
IQ3_XXS q4_0-KV 256K969997201293173769142249

All values in tokens/second. PP = prompt processing (prefill). Gen = generation after fill.

The IQ3_XXS is actually the fastest quantization — 99 t/s sustained vs 85 for IQ4_XS. The smaller model leaves more VRAM for compute buffers, which matters when the GPU memory controller is the bottleneck.

Generation degrades gracefully as context fills: 99 → 69 → 49 t/s from empty to 50K to 108K fill. Still very usable at 108K.

Dense 31B Results: Not Worth It on 16 GB

ConfigShortMedCode20K Gen50K Gen108K GenPeak VRAM
IQ3_XXS fp16-KV 8K2928272613.2 GB
IQ3_XXS q4_0-KV 32K2726252412.8 GB
IQ3_XXS q4_0-KV 65K272625241813.8 GB
IQ3_XXS q4_0-KV 131K27262418OOM15.5 GB
Q3_K_M q4_0-KV 8K1918181715.0 GB

3.5x slower, 65K max usable context. Same result as Qwen’s 27B dense — on 16 GB, this is a fundamental dense vs MoE tradeoff, not a Gemma vs Qwen thing.

Head-to-Head: Gemma 4 vs Qwen 3.6

I re-ran the Qwen 3.6 35B MoE as a reference alongside the Gemma benchmarks, same machine, same session.

MetricGemma 4 26B MoE (IQ3_XXS)Qwen 3.6 35B MoE (IQ3_S)
Architecture128 experts, 8 active + 1 shared128 experts, 8 active
Active params3.8B3B
Short burst96 t/s93 t/s
Sustained gen99 t/s94 t/s
Code gen97 t/s93 t/s
20K gen93 t/s90 t/s
50K gen69 t/s70 t/s
108K gen (131K ctx)49 t/s52 t/s
108K gen (max ctx)49 t/s46 t/s†
Prefill (20K)2012 t/s1631 t/s
Max context256K262K (400K with TurboQuant)
Peak VRAM (max ctx)13.2 GB~15.3 GB†

†Qwen 262K numbers from previous benchmarks. Same-session reference ran at 131K.

At short context, Gemma is consistently faster. At 131K context, Qwen edges ahead at 108K fill (52 vs 49 t/s). But at max context — 262K for Qwen, 256K for Gemma — the picture flips: Gemma holds 49 t/s while Qwen drops to 46 t/s. This is iSWA in action — Gemma’s generation speed barely changes regardless of how much context is allocated, because only 5 of 30 layers carry the full KV cache.

The prefill difference is significant — Gemma processes prompts 23% faster. Both models handle 256K+ context natively on this GPU.

PriorityPick
Fastest generation at 108K+ fillGemma 4
Fastest prefillGemma 4
Multimodal (image input)Gemma 4
Maximum context windowTie (256K Gemma / 262K Qwen, 400K Qwen with TurboQuant)
Coding benchmarksQwen 3.6
Ecosystem maturityQwen 3.6

I run Qwen as my daily driver because the ecosystem is more mature and coding performance edges ahead. But for document analysis and long-context retrieval, Gemma is the better pick. Having both downloaded and switchable is the right answer.

llama-server \
  -m ~/.cache/llama.cpp/gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf \
  -ngl 99 -fa 1 -c 262144 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --kv-unified \
  --perf --no-warmup --mlock \
  -np 1 -b 512 -ub 256 -t 6 -tb 6 \
  --threads-http 8 --no-mmap --jinja \
  --port 11433 --host 0.0.0.0

The critical flags: no --swa-full (let iSWA use the compact cache), -c 262144 (full 256K), and q4_0 KV cache for the global layers. For better output quality at the cost of context (65K max, 85 t/s), swap the model to IQ4_XS.

What I Actually Learned

Dense vs MoE is settled on 16 GB. This is the third time I have benchmarked dense vs MoE on this GPU: Qwen 3.5, Qwen 3.6, now Gemma 4. Dense models run at 26-31 t/s with 65-131K context. MoE models run at 85-99 t/s with 131-256K context. On constrained VRAM, MoE wins unconditionally.

Smaller quants can be faster. The IQ3_XXS quantization was faster than IQ4_XS on every test — 99 t/s vs 85 t/s. On a memory-bandwidth-bound GPU, smaller weights mean more compute buffer headroom. Do not assume higher quant = better experience.

TurboQuant has diminishing returns with iSWA. TurboQuant extended Qwen’s context from 262K to 400K by compressing KV cache per layer. For Gemma 4, iSWA already solves 83% of the problem — TurboQuant can only compress the 5 global layers. With 2.7 GB to spare at 256K, the headroom is already there.

The views and opinions expressed here are my own and do not reflect those of my employer.