NJannasch.Dev
Tags: llama.cpp, Gemma, RTX 5060 Ti

Gemma 4 256K Context Server (llama.cpp)

Gemma 4 uses interleaved Sliding Window Attention (iSWA). The critical mistake is adding --swa-full, which makes llama.cpp allocate a full-size KV cache for the sliding-window layers as well, and that runs out of memory at high context.

llama-server \
  -m ~/.cache/llama.cpp/gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf \
  -ngl 99 -fa 1 -c 262144 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --kv-unified \
  --perf --no-warmup --mlock \
  -np 1 -b 512 -ub 256 -t 6 -tb 6 \
  --threads-http 8 --no-mmap --jinja \
  --port 11433 --host 0.0.0.0

What matters:

  • No --swa-full: let iSWA use the compact sliding-window cache (only 5 of 44 layers use global attention)
  • -fa 1: flash attention, required for large context; llama.cpp also needs it for the quantized V cache (--cache-type-v q4_0)
  • -c 262144: the full 256K native context
  • IQ3_XXS (11.2 GB) is faster than IQ4_XS: 99 t/s vs 85 t/s. Smaller weights mean fewer bytes read per generated token, which leaves more memory-bandwidth headroom.
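
To get a feel for why the compact cache matters, here is a rough sketch that counts cached token slots per sequence with and without a full-size sliding-window cache. The context size and layer counts come from the list above. The window size of 1024 is an assumed placeholder, not a confirmed Gemma 4 value, and the sketch treats every layer as storing the same number of bytes per token, which may not hold in practice.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope KV slot count; not what llama.cpp computes internally.
ctx=262144          # -c 262144 (from the post)
total_layers=44     # from the post
global_layers=5     # from the post
window=1024         # ASSUMPTION: hypothetical sliding-window size

swa_layers=$(( total_layers - global_layers ))

# Full-size cache: every layer keeps the whole context.
slots_full=$(( total_layers * ctx ))

# Compact cache: global layers keep the whole context,
# sliding-window layers keep only the last `window` tokens.
slots_iswa=$(( global_layers * ctx + swa_layers * window ))

ratio=$(awk -v a="$slots_full" -v b="$slots_iswa" 'BEGIN { printf "%.1f", a / b }')

echo "full-size cache: $slots_full slots"
echo "compact cache:   $slots_iswa slots"
echo "ratio:           ${ratio}x"
```

Under these assumptions the full-size cache holds roughly 8.5 times as many token slots. llama.cpp pads the sliding-window cache beyond the bare window, so the real ratio will be somewhat smaller, but the order of magnitude is the point.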

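VRAM budget at 256K: model 11.2 GB + KV cache 1.6 GB + buffers 1.0 GB = ~13.8 GB, leaving about 1.7 GB free on a 16 GB card. The nominal difference is 2.2 GB; the other ~0.5 GB is presumably overhead not itemized here.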
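
The budget is easy to re-run for a different quant or card. This is a minimal sketch using the figures quoted above; the variable names are illustrative, and awk is used only because plain shell arithmetic is integer-only.

```shell
#!/usr/bin/env bash
# VRAM budget in GB, using the numbers quoted in the post.
model_gb=11.2
kv_gb=1.6
buffers_gb=1.0
card_gb=16

total_gb=$(awk -v m="$model_gb" -v k="$kv_gb" -v b="$buffers_gb" \
  'BEGIN { printf "%.1f", m + k + b }')
nominal_free_gb=$(awk -v c="$card_gb" -v t="$total_gb" \
  'BEGIN { printf "%.1f", c - t }')

echo "allocated:        ${total_gb} GB"
echo "nominal headroom: ${nominal_free_gb} GB"
```

The nominal headroom is an upper bound. To see what is actually free while the server is running, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv` reports the live numbers.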