Can Gemma 4 run at 256K context on 16GB VRAM?

Yes. Gemma 4's 26B MoE reaches the full 256K context on an RTX 5060 Ti (16GB) with 2.7 GB to spare. The key is not using the --swa-full flag, which forces full context allocation for every layer and causes OOM.

What is iSWA in Gemma 4?

iSWA (interleaved Sliding Window Attention) is Gemma 4's attention architecture. It alternates between local sliding window layers (1024-token window) and global full-attention layers, creating a dual KV cache that uses far less memory than full attention on every layer.

Gemma 4 vs Qwen 3.6 on a consumer GPU — which is better?

Both MoE variants perform similarly: Gemma 4 at 99 t/s and Qwen 3.6 at 94 t/s sustained on an RTX 5060 Ti. Gemma 4 supports 256K context vs Qwen's 262K. The choice comes down to task quality; see the code generation showdown for a head-to-head comparison.

Gemma 4 on a 5060 Ti: 256K Context on 16GB — but Only if You Know the Architecture Trick

TL;DR: Gemma 4’s 26B MoE runs at 99 t/s with 256K context on my RTX 5060 Ti (16 GB). The 31B dense model manages 27 t/s and tops out at 65K context. The key insight: Gemma 4 uses iSWA (interleaved Sliding Window Attention), which creates a dual KV cache — and the --swa-full flag (commonly copied from Qwen configs) forces it to allocate the full context for every layer, OOMing immediately. Remove that flag and 256K fits with 2.7 GB to spare. Copy-paste server configs, full benchmark tables, and a head-to-head vs Qwen 3.6 below. [New: code generation showdown — how well do these models actually write code? Four rounds, visual results.]

After three rounds of Qwen benchmarks on this GPU, I wanted to see whether Google’s Gemma 4 could do anything different on the same hardware. The model family looked interesting: an MoE variant with 128 experts (same as Qwen), a dense 31B, multimodal vision support, and a claimed 256K context window. All under Apache 2.0.

The surprise was not the speed. It was the architecture.

Test Setup

Same machine as all my previous benchmarks:

RTX 5060 Ti 16 GB over OCuLink (PCIe 4.0 x4)
AMD Ryzen 7, headless (no compositor — ~15.5 GB usable VRAM)
llama.cpp mainline, built from source with CUDA
Unsloth GGUF quantizations via hf download

Models tested:

Model	Type	Total / Active Params	GGUF Size
`gemma-4-26B-A4B-it-UD-IQ4_XS`	MoE	25.2B / 3.8B	13.4 GB
`gemma-4-26B-A4B-it-UD-Q3_K_M`	MoE	25.2B / 3.8B	12.5 GB
`gemma-4-26B-A4B-it-UD-IQ3_XXS`	MoE	25.2B / 3.8B	11.2 GB
`gemma-4-31B-it-UD-IQ3_XXS`	Dense	30.7B / 30.7B	11.8 GB
`gemma-4-31B-it-Q3_K_M`	Dense	30.7B / 30.7B	14.7 GB

The Result, Up Front

The MoE at IQ3_XXS is the winner. 256K context, 99 t/s sustained, 2.7 GB of VRAM to spare.

Config	Short	Sustained	Code	108K gen	Max Context	Peak VRAM
MoE IQ3_XXS q4_0-KV	96 t/s	99 t/s	97 t/s	49 t/s	256K	13.2 GB
MoE Q3_K_M q4_0-KV	87 t/s	89 t/s	87 t/s	46 t/s	131K	13.6 GB
MoE IQ4_XS q4_0-KV	84 t/s	85 t/s	83 t/s	—	65K	13.9 GB
Dense IQ3_XXS q4_0-KV	27 t/s	26 t/s	25 t/s	—	65K	13.8 GB
Dense Q3_K_M q4_0-KV	19 t/s	18 t/s	18 t/s	—	8K	15.0 GB

The dense 31B is 3.5x slower and maxes out at 65K context. Same conclusion as Qwen’s dense vs MoE: on 16 GB, MoE is the only architecture that works for long context.

The Architecture Trick: iSWA and the Dual KV Cache

This is the part that took me a day to figure out, and it is the reason this post exists.

Gemma 4 uses iSWA — interleaved Sliding Window Attention. Instead of giving every layer full attention over the entire context (like a standard transformer), Gemma splits its layers into two types:

Layer Type	Attention	KV Cache Size	Layers in Gemma 4 MoE (30 total)
Global	Full context	Scales with context length	5 layers
SWA	Last 1024 tokens only	Fixed at 1024 entries	25 layers

The final layer is always global, so the model’s output has full context awareness. But the 25 SWA layers only look at a small local window. This is by design — Google found that most layers don’t need to see the entire history to do their job well.

The consequence for VRAM is dramatic, and it is why --swa-full (essential for Qwen, where it prevents sliding window amnesia) is a trap for Gemma 4:

With --swa-full, llama.cpp allocates the full context for ALL 30 layers — OOM. Without it, only the 5 global layers get full-context KV. The total KV cache drops from ~16 GB to ~2 GB. My first benchmark run had --swa-full on every config. 8 out of 10 failed to start. Once I removed it, all 11 completed — including the full 256K.
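To make the saving concrete, here is a minimal sketch that counts cached token positions summed over layers (positions, not bytes). It assumes the layer split from the table above (5 global, 25 SWA, 1024-token window) and ignores head dimensions, KV quantization, and any padding llama.cpp adds to the SWA cache. The function name `kv_token_slots` is made up for illustration.

```python
def kv_token_slots(ctx, n_global=5, n_swa=25, window=1024, swa_full=False):
    """Count cached token positions summed over all layers (not bytes).

    Global layers always cache the whole context. SWA layers cache only
    the sliding window, unless swa_full forces them to the full context.
    """
    swa_len = ctx if swa_full else min(ctx, window)
    return n_global * ctx + n_swa * swa_len


ctx = 262144  # the -c value used for 256K
full = kv_token_slots(ctx, swa_full=True)  # every layer caches everything
compact = kv_token_slots(ctx)              # iSWA's dual cache
saving = 1 - compact / full

print(f"swa-full: {full:,} slots")
print(f"iSWA:     {compact:,} slots")
print(f"saving:   {saving:.0%}")
```

Under these assumptions the compact cache holds about 83% fewer positions at 256K. The byte totals depend on details this count leaves out, so treat it as a sanity check on the order of magnitude rather than a VRAM prediction.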

The VRAM scaling confirms this — the MoE at 256K uses less VRAM than the dense model at 131K:

Config	Startup	After 20K	After 50K	After 108K
MoE IQ3_XXS 256K	13,151	13,189	13,309	13,493
MoE IQ3_XXS 131K	12,291	12,329	12,449	12,633
MoE Q3_K_M 131K	13,537	13,575	13,695	13,877
MoE IQ4_XS 65K	14,031	14,069	14,189	—
Dense IQ3_XXS 65K	13,807	13,869	14,113	—
Dense IQ3_XXS 131K	15,247	15,309	15,553	15,845

All values in MiB. GPU total: 16,384 MiB.
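A quick way to read the table is to compute, per config, how much VRAM the fill adds and how much headroom is left. The sketch below copies three rows from the table; the only assumption I am adding is that "131K" and "256K" mean 131,072 and 262,144 allocated tokens.

```python
GPU_TOTAL_MIB = 16384

# (startup, after 108K fill) in MiB, copied from the table above
runs = {
    "MoE IQ3_XXS 256K": (13151, 13493),
    "MoE IQ3_XXS 131K": (12291, 12633),
    "Dense IQ3_XXS 131K": (15247, 15845),
}

growth = {name: filled - startup for name, (startup, filled) in runs.items()}
headroom = {name: GPU_TOTAL_MIB - filled for name, (_, filled) in runs.items()}

for name in runs:
    print(f"{name:20s} fill adds {growth[name]:4d} MiB, {headroom[name]:5d} MiB left")

# Startup cost of doubling the allocated context on the same MoE quant.
# Assumption: 131K = 131,072 tokens and 256K = 262,144 tokens.
extra_mib = runs["MoE IQ3_XXS 256K"][0] - runs["MoE IQ3_XXS 131K"][0]
kib_per_token = extra_mib * 1024 / (262144 - 131072)
print(f"131K -> 256K costs {extra_mib} MiB at startup, ~{kib_per_token:.1f} KiB per token")
```

The MoE adds the same 342 MiB during the fill whether 131K or 256K is allocated, which is what you would expect if only a handful of layers grow with context.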

Going from 65K to 256K context costs only ~1.3 GB extra at startup because the SWA cache is fixed-size and only the 5 global layers grow with context. And the 256K context is not just allocated, it actually works. A needle-in-a-haystack test (a unique fact buried at 50% depth in filler text) passes at every context length I tried, from 5K to 200K tokens. No degradation, no hallucination.
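For anyone who wants to repeat the retrieval check, a needle-in-a-haystack prompt can be built along these lines. This is a sketch, not the exact script behind the numbers above: `build_haystack`, the filler sentence, and the needle are all hypothetical, and it sizes the prompt in sentences rather than tokens, so you would need to scale `n_sentences` to hit a target token count for your tokenizer.

```python
def build_haystack(needle: str, filler: str, n_sentences: int, depth: float = 0.5) -> str:
    """Return filler text with the needle inserted at a relative depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_sentences
    sentences.insert(int(n_sentences * depth), needle)
    return " ".join(sentences)


# Hypothetical needle and filler, chosen so the needle cannot be guessed.
needle = "The secret launch code is 7413."
filler = "The sky was grey over the harbour that morning."
prompt = build_haystack(needle, filler, n_sentences=1000, depth=0.5)
question = "What is the secret launch code?"

print(len(prompt.split()), "words in the haystack prompt")
```

Send `prompt` followed by `question` to the server and check that the reply contains the code; repeat with larger `n_sentences` to cover longer contexts.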

Full MoE Benchmark Results

All tests run on the same machine, progressive context fill from empty to 108K tokens.

Config	Short	Sustained	Code	20K PP	20K Gen	50K PP	50K Gen	108K PP	108K Gen
IQ4_XS fp16-KV 8K (swa-full)	89	92	89	2410	72	—	—	—	—
IQ4_XS q4_0-KV 8K	86	85	84	2294	81	—	—	—	—
IQ4_XS q4_0-KV 32K	84	85	83	2292	81	—	—	—	—
IQ4_XS q4_0-KV 65K	84	85	83	2293	81	1923	62	—	—
Q3_K_M q4_0-KV 32K	86	89	87	2300	84	—	—	—	—
Q3_K_M q4_0-KV 65K	87	89	87	2300	84	1938	63	—	—
Q3_K_M q4_0-KV 131K	87	89	87	2299	84	1937	63	1548	46
IQ3_XXS q4_0-KV 65K	96	99	97	2013	93	1737	69	—	—
IQ3_XXS q4_0-KV 131K	96	99	97	2012	94	1737	68	1422	49
IQ3_XXS q4_0-KV 196K	96	99	97	2013	93	1738	68	1422	49
IQ3_XXS q4_0-KV 256K	96	99	97	2012	93	1737	69	1422	49

All values in tokens/second. PP = prompt processing (prefill). Gen = generation after fill.

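The IQ3_XXS is actually the fastest quantization for generation: 99 t/s sustained vs 85 for IQ4_XS. The likely reason is that token generation on this GPU is bound by memory bandwidth, so a smaller model means fewer weight bytes to read per token. Prefill goes the other way (about 2,013 vs 2,293 t/s at 20K), so the gain is specific to generation.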

Generation degrades gracefully as context fills: 99 → 69 → 49 t/s from empty to 50K to 108K fill. Still very usable at 108K.
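To put the slowdown in practical terms, the sketch below turns the generation rates from the IQ3_XXS 256K row into "fraction of empty-context speed retained" and "seconds for a 500-token reply". The 500-token reply length is an arbitrary example, and the fill levels are labelled with round numbers (20,000, 50,000, 108,000) as an assumption about what 20K, 50K and 108K mean.

```python
EMPTY_TPS = 99  # sustained generation with an empty context

# fill level (tokens) -> generation t/s, from the IQ3_XXS q4_0-KV 256K row
gen_tps = {20_000: 93, 50_000: 69, 108_000: 49}

retained = {fill: tps / EMPTY_TPS for fill, tps in gen_tps.items()}
reply_seconds = {fill: 500 / tps for fill, tps in gen_tps.items()}

for fill in gen_tps:
    print(f"{fill:>7,} tokens filled: {retained[fill]:.0%} of empty speed, "
          f"500-token reply in {reply_seconds[fill]:.1f} s")
```

At 108K fill the model keeps roughly half of its empty-context speed, which works out to about ten seconds for a 500-token answer.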

Dense 31B Results: Not Worth It on 16 GB

Config	Short	Sustained	Code	20K Gen	50K Gen	108K Gen	Peak VRAM
IQ3_XXS fp16-KV 8K	29	28	27	26	—	—	13.2 GB
IQ3_XXS q4_0-KV 32K	27	26	25	24	—	—	12.8 GB
IQ3_XXS q4_0-KV 65K	27	26	25	24	18	—	13.8 GB
IQ3_XXS q4_0-KV 131K	27	26	—	24	18	OOM	15.5 GB
Q3_K_M q4_0-KV 8K	19	18	18	17	—	—	15.0 GB

3.5x slower, 65K max usable context. Same result as Qwen’s 27B dense — on 16 GB, this is a fundamental dense vs MoE tradeoff, not a Gemma vs Qwen thing.
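A back-of-the-envelope way to see why the gap is structural: if generation is limited by how many weight bytes are read per token, a dense model reads the whole file while an MoE reads roughly its active share. The sketch below uses only the GGUF sizes and parameter counts from the model table. It is a deliberately crude model that assumes bytes read scale with the active parameter fraction, and it ignores shared tensors, KV cache reads, and kernel overhead.

```python
# GGUF sizes (GB) and parameter counts (billions) from the model table
dense_gb = 11.8                      # gemma-4-31B-it-UD-IQ3_XXS, all params active
moe_gb = 11.2                        # gemma-4-26B-A4B-it-UD-IQ3_XXS
moe_total_b, moe_active_b = 25.2, 3.8

# Crude assumption: weight bytes read per token scale with the active fraction.
moe_active_gb = moe_gb * moe_active_b / moe_total_b
bytes_ratio = dense_gb / moe_active_gb

# Measured sustained generation speeds from the results tables
measured_ratio = 99 / 26

print(f"MoE reads ~{moe_active_gb:.2f} GB of weights per token vs {dense_gb} GB for dense")
print(f"bytes-per-token ratio: {bytes_ratio:.1f}x, measured speed ratio: {measured_ratio:.1f}x")
```

The measured gap is smaller than the naive byte ratio, which is expected given everything the model leaves out, but the direction and rough size of the effect match.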

Head-to-Head: Gemma 4 vs Qwen 3.6

I re-ran the Qwen 3.6 35B MoE as a reference alongside the Gemma benchmarks, same machine, same session.

Metric	Gemma 4 26B MoE (IQ3_XXS)	Qwen 3.6 35B MoE (IQ3_S)
Architecture	128 experts, 8 active + 1 shared	128 experts, 8 active
Active params	3.8B	3B
Short burst	96 t/s	93 t/s
Sustained gen	99 t/s	94 t/s
Code gen	97 t/s	93 t/s
20K gen	93 t/s	90 t/s
50K gen	69 t/s	70 t/s
108K gen (131K ctx)	49 t/s	52 t/s
108K gen (max ctx)	49 t/s	46 t/s†
Prefill (20K)	2012 t/s	1631 t/s
Max context	256K	262K (400K with TurboQuant)
Peak VRAM (max ctx)	13.2 GB	~15.3 GB†

†Qwen 262K numbers from previous benchmarks. Same-session reference ran at 131K.

At short context, Gemma is consistently faster. At 131K context, Qwen edges ahead at 108K fill (52 vs 49 t/s). But at max context — 262K for Qwen, 256K for Gemma — the picture flips: Gemma holds 49 t/s while Qwen drops to 46 t/s. This is iSWA in action — Gemma’s generation speed barely changes regardless of how much context is allocated, because only 5 of 30 layers carry the full KV cache.

The prefill difference is significant — Gemma processes prompts 23% faster. Both models handle 256K+ context natively on this GPU.
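The percentages are easy to recompute from the head-to-head table. This small sketch does so for the rows discussed above; the dictionary keys are just labels I picked for the rows.

```python
# (Gemma 4, Qwen 3.6) in tokens/second, copied from the head-to-head table
rows = {
    "Short burst": (96, 93),
    "Sustained gen": (99, 94),
    "108K gen (131K ctx)": (49, 52),
    "108K gen (max ctx)": (49, 46),
    "Prefill (20K)": (2012, 1631),
}

# Positive means Gemma is faster, negative means Qwen is faster.
advantage_pct = {name: (gemma / qwen - 1) * 100 for name, (gemma, qwen) in rows.items()}

for name, pct in advantage_pct.items():
    print(f"{name:20s} Gemma {pct:+.1f}% vs Qwen")
```

Apart from prefill, every gap is within single-digit percentages, so the generation differences are small enough that workload and quality should drive the choice.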

Priority	Pick
Fastest generation at max context (108K fill)	Gemma 4
Fastest prefill	Gemma 4
Multimodal (image input)	Gemma 4
Maximum context window	Tie (256K Gemma / 262K Qwen, 400K Qwen with TurboQuant)
Coding benchmarks	Qwen 3.6
Ecosystem maturity	Qwen 3.6

I run Qwen as my daily driver because the ecosystem is more mature and coding performance edges ahead. But for document analysis and long-context retrieval, Gemma is the better pick. Having both downloaded and switchable is the right answer.

The Recommended Configuration

```shell
llama-server \
  -m ~/.cache/llama.cpp/gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf \
  -ngl 99 -fa 1 -c 262144 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --kv-unified \
  --perf --no-warmup --mlock \
  -np 1 -b 512 -ub 256 -t 6 -tb 6 \
  --threads-http 8 --no-mmap --jinja \
  --port 11433 --host 0.0.0.0
```

The critical flags: no --swa-full (let iSWA use the compact cache), -c 262144 (full 256K), and q4_0 KV cache for the global layers. For better output quality at the cost of context (65K max, 85 t/s), swap the model to IQ4_XS.
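If you launch the server from a script, it is cheap to guard against a copy-pasted --swa-full sneaking back in. The sketch below builds a subset of the flags from the command above and refuses any argument list that contains --swa-full. The helper names `gemma4_server_args` and `reject_swa_full` are hypothetical, and the model path simply mirrors the one in my config.

```python
from pathlib import Path


def gemma4_server_args(model: Path, ctx_tokens: int = 256 * 1024, port: int = 11433) -> list[str]:
    """Build a llama-server argument list using flags from the config above."""
    return [
        "llama-server", "-m", str(model),
        "-ngl", "99", "-fa", "1", "-c", str(ctx_tokens),
        "--cache-type-k", "q4_0", "--cache-type-v", "q4_0",
        "--kv-unified", "-np", "1",
        "--port", str(port), "--host", "0.0.0.0",
    ]


def reject_swa_full(args: list[str]) -> list[str]:
    """Raise if --swa-full is present, otherwise return the args unchanged."""
    if "--swa-full" in args:
        raise ValueError("--swa-full forces full-context KV on every layer; remove it for Gemma 4")
    return args


model_path = Path.home() / ".cache" / "llama.cpp" / "gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf"
args = reject_swa_full(gemma4_server_args(model_path))
print(" ".join(args))
```

Passing `args` to `subprocess.run(args, check=True)` would start the server; the snippet only prints the command so it is safe to run anywhere.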

What I Actually Learned

Dense vs MoE is settled on 16 GB. This is the third time I have benchmarked dense vs MoE on this GPU: Qwen 3.5, Qwen 3.6, now Gemma 4. Dense models run at 26-31 t/s with 65-131K context. MoE models run at 85-99 t/s with 131-256K context. On constrained VRAM, MoE wins unconditionally.

Smaller quants can be faster. The IQ3_XXS quantization generated faster than IQ4_XS on every test: 99 t/s vs 85 t/s. On a memory-bandwidth-bound GPU, smaller weights mean fewer bytes to read for every generated token. Do not assume a higher-bit quant means a better experience.
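One way to sanity-check the bandwidth explanation is to compare the file-size ratio of the two quants with their speed ratio. If generation were purely limited by weight reads, the two ratios would match. The sketch below uses only the GGUF sizes and sustained speeds reported in this post.

```python
# GGUF size in GB and sustained generation in t/s, from the tables in this post
iq4_xs = {"size_gb": 13.4, "tps": 85}
iq3_xxs = {"size_gb": 11.2, "tps": 99}

size_ratio = iq4_xs["size_gb"] / iq3_xxs["size_gb"]  # how much bigger IQ4_XS is
speed_ratio = iq3_xxs["tps"] / iq4_xs["tps"]         # how much faster IQ3_XXS is

print(f"IQ4_XS is {size_ratio:.2f}x larger, IQ3_XXS generates {speed_ratio:.2f}x faster")
```

The two ratios land close together (about 1.20x vs 1.16x), which is consistent with generation speed tracking the number of weight bytes read, though two data points are not proof.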

TurboQuant has diminishing returns with iSWA. TurboQuant extended Qwen’s context from 262K to 400K by compressing KV cache per layer. For Gemma 4, iSWA already solves 83% of the problem (25 of the 30 layers keep a fixed-size cache), so TurboQuant can only compress the 5 global layers. With 2.7 GB to spare at 256K, the headroom is already there.

The views and opinions expressed here are my own and do not reflect those of my employer.