Snippets
Short, practical code references. Copy-paste ready.
OpenCode with Local llama.cpp (Qwen 3.6)
Connect OpenCode to a local llama.cpp server running Qwen 3.6 MTP. Zero API costs, 90K context, local-first AI coding.
Gemma 4 MTP Server (ik_llama.cpp)
Run Gemma 4 26B-A4B with MTP speculative decoding using ik_llama.cpp. Uses a separate drafter model and reaches 133 t/s on an NVIDIA RTX 5060 Ti 16 GB.
Qwen 3.6 MTP Server (llama.cpp)
Run Qwen 3.6 35B-A3B with MTP speculative decoding on llama.cpp. 144 t/s on an NVIDIA RTX 5060 Ti 16 GB.
Gemma 4 256K Context Server (llama.cpp)
llama-server config for Gemma 4 26B-A4B MoE with full 256K context on an NVIDIA RTX 5060 Ti 16 GB. The key: do NOT use --swa-full.
Qwen 3.6 with TurboQuant: 400K Context on 16 GB
llama-server config for Qwen 3.6 35B-A3B with TurboQuant turbo3 KV cache. 400K context window on an NVIDIA RTX 5060 Ti 16 GB.
Build llama.cpp with CUDA
CMake build commands for llama.cpp with CUDA GPU acceleration. Works for mainline, ik_llama.cpp, and TurboQuant forks.
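As a rough preview of the CUDA build, here is a minimal sketch assuming mainline llama.cpp, where the CUDA backend is switched on with the GGML_CUDA CMake option. The helper name build_llama_cuda and the example source path are hypothetical, and the forks may need extra or different options, so check each fork's README before relying on this.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not part of llama.cpp): configure and build an
# existing source checkout with the CUDA backend enabled.
#   $1 = path to the source checkout (mainline; forks assumed similar)
build_llama_cuda() {
  src="${1:?usage: build_llama_cuda <source-dir>}"
  # Use all CPU cores for the build; fall back to 4 if nproc is missing.
  jobs="$(nproc 2>/dev/null || echo 4)"
  # Configure: GGML_CUDA=ON enables the CUDA backend in mainline llama.cpp.
  cmake -S "$src" -B "$src/build" -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
  # Compile in Release mode.
  cmake --build "$src/build" --config Release -j "$jobs"
}

# Example (assumes the repository was already cloned to ./llama.cpp):
#   build_llama_cuda ./llama.cpp
```

On a mainline checkout the binaries, including llama-server, should end up under build/bin. The CUDA toolkit and a recent CMake need to be installed first.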