Build llama.cpp with CUDA

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=90 \
  -DBUILD_SHARED_LIBS=OFF

cmake --build build -j$(nproc) --target llama-server
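
Before running the commands above, it can help to confirm that the required tools are on your PATH. The following is a minimal sketch, not part of llama.cpp; the helper name missing_tools is made up for illustration, and it assumes the CUDA toolkit exposes nvcc on PATH.

```shell
# Hypothetical helper: print each named tool that is NOT found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# git and cmake drive the clone and build; nvcc is the CUDA compiler.
missing=$(missing_tools git cmake nvcc)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All build tools found."
fi
```

If nvcc is reported missing, the CUDA toolkit is either not installed or its bin directory is not on PATH. After a successful build, the binary is typically at build/bin/llama-server.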

Set -DCMAKE_CUDA_ARCHITECTURES to match your GPU:

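120 — Blackwell (RTX 50 series); needs CUDA 12.8 or newer. Note that 90 is Hopper (H100), not Blackwell. A build for 90 can still run on RTX 50 cards because the driver JIT-compiles the embedded PTX, but native 120 code avoids that step.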
89 — Ada Lovelace (RTX 40 series)
86 — Ampere (RTX 30 series)
75 — Turing (RTX 20 series)
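
If you are unsure which value applies, the GPU's compute capability can be read from the driver and converted. This is a sketch under two assumptions: your nvidia-smi is recent enough to support the compute_cap query field, and the helper name cap_to_arch is made up for illustration.

```shell
# Hypothetical helper: turn a compute capability such as "8.9"
# into the CMake value "89" by dropping the dot (and any spaces).
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '. '
}

# Query the first GPU, but only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
  if [ -n "$cap" ]; then
    echo "Use -DCMAKE_CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
  fi
fi
```

With CMake 3.24 or newer, -DCMAKE_CUDA_ARCHITECTURES=native is an alternative that targets whatever GPUs are present on the build machine.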

The same commands work for forks: change the git clone URL, and cd into the directory that the clone creates instead of llama.cpp:

ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp.git
TurboQuant: https://github.com/TheTom/llama-cpp-turboquant.git (use branch feature/turboquant-kv-cache)
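
Cloning a fork on a specific branch can be sketched as below. The clone_fork helper and its DRY_RUN switch are made up for illustration; only git clone and its --branch option are real git features.

```shell
# Hypothetical helper: clone_fork URL [BRANCH]
# Set DRY_RUN=1 to print the git command instead of running it.
clone_fork() {
  url=$1
  branch=${2:-}
  # Rebuild the positional parameters as the git command line.
  set -- git clone
  if [ -n "$branch" ]; then
    set -- "$@" --branch "$branch"
  fi
  set -- "$@" "$url"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage (not executed here):
#   clone_fork https://github.com/ikawrakow/ik_llama.cpp.git
#   clone_fork https://github.com/TheTom/llama-cpp-turboquant.git feature/turboquant-kv-cache
```

After cloning, cd into the new directory (for example ik_llama.cpp) and run the same cmake commands.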