llama.cppGemmaRTX 5060 Ti
Gemma 4 256K Context Server (llama.cpp)
Gemma 4 uses interleaved Sliding Window Attention (iSWA). The critical mistake is adding --swa-full, which forces full attention on all layers and OOMs at high context.
llama-server \
-m ~/.cache/llama.cpp/gemma-4-26B-A4B-it-UD-IQ3_XXS.gguf \
-ngl 99 -fa 1 -c 262144 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--kv-unified \
--perf --no-warmup --mlock \
-np 1 -b 512 -ub 256 -t 6 -tb 6 \
--threads-http 8 --no-mmap --jinja \
--port 11433 --host 0.0.0.0
What matters:
- No
--swa-full— let iSWA use the compact sliding window cache (only 5 of 44 layers are global attention) -fa 1— flash attention, required for large context-c 262144— full 256K native context- IQ3_XXS (11.2 GB) is faster than IQ4_XS — 99 t/s vs 85 t/s. Smaller weights = more bandwidth headroom.
VRAM budget at 256K: model 11.2 GB + KV cache 1.6 GB + buffers 1.0 GB = ~13.8 GB, leaving 1.7 GB free on 16 GB.