---
title: "Building llama.cpp Locally (Backend Matrix)"
type: summary
tags: [build, cpu, cuda, metal, vulkan, rocm, sycl, musa, cann, blas, developer]
created: 2026-05-30
updated: 2026-05-30
sources: ["raw/docs-build.md"]
confidence: high
llama_build: "master (~2026-05)"
---

# Building llama.cpp Locally (Backend Matrix)

## Key Points
- Main product is the `llama` library (C-style interface in `include/llama.h`); repo also ships many example programs/tools (incl. an OpenAI-compatible HTTP server).
- Get the code: `git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp`.
- Canonical CMake build flow: `cmake -B build` then `cmake --build build --config Release`. Append `-j 8` to the build step (or use the Ninja generator) for parallel compilation; install `ccache` for faster repeated builds.
- Debug builds: single-config generators (e.g. Unix Makefiles) take `-DCMAKE_BUILD_TYPE=Debug` at configure time; multi-config generators (`-G "Xcode"`, Visual Studio) take `--config Debug` at build time.
- Static build: `-DBUILD_SHARED_LIBS=OFF`.
- Windows: install Visual Studio 2022 (Desktop development with C++, CMake tools, clang, MS-Build for LLVM). Windows on ARM (WoA/ARM64) uses the preset `arm64-windows-llvm-release` (with `-D GGML_OPENMP=OFF`); x64 builds with Ninja and clang use the preset `x64-windows-llvm-release`.
- Optional HTTPS/TLS: install OpenSSL dev libs (`libssl-dev` / `openssl-devel` / `openssl`); without it the project still builds and runs but with no SSL support.
- BLAS (`-DGGML_BLAS=ON`): helps prompt processing at batch sizes > 32; does not affect generation speed. Select implementation via `-DGGML_BLAS_VENDOR=...` (OpenBLAS, `Intel10_64lp` for oneMKL, `Generic`, BLIS, etc.). Apple Accelerate is enabled by default on Mac.
- Metal: enabled by default on macOS (runs compute on GPU). Disable at compile with `-DGGML_METAL=OFF`; disable GPU inference at runtime with `--n-gpu-layers 0`.
- SYCL: `-DGGML_SYCL=ON`; supports Intel GPUs (Data Center Max/Flex, Arc, built-in/iGPU). See `docs/backend/SYCL.md`.
- CUDA: `-DGGML_CUDA=ON`. Non-native (all GPUs) build adds `-DGGML_NATIVE=OFF`; specify archs via `-DCMAKE_CUDA_ARCHITECTURES="86;89"`; pick a CUDA install via `-DCMAKE_CUDA_COMPILER=/opt/cuda-11.7/bin/nvcc`.
- CUDA runtime env vars: `CUDA_VISIBLE_DEVICES`, `CUDA_SCALE_LAUNCH_QUEUES=4x` (helps multi-GPU pipeline parallelism), `GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F`/`_16F`, `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` (RAM fallback on Linux), `GGML_CUDA_P2P` (peer access). Compile-time perf options: `GGML_CUDA_FORCE_MMQ`, `GGML_CUDA_FORCE_CUBLAS`, `GGML_CUDA_PEER_MAX_BATCH_SIZE` (default 128), `GGML_CUDA_FA_ALL_QUANTS`.
- MUSA (Moore Threads GPU): `-DGGML_MUSA=ON`; archs via `-DMUSA_ARCHITECTURES="21"`; runtime `MUSA_VISIBLE_DEVICES`. Reuses many CUDA options.
- HIP (AMD ROCm GPUs): `-DGGML_HIP=ON` with `-DGPU_TARGETS=gfx1030` (optional; omit to build for all detected GPUs). rocWMMA flash-attn boost via `-DGGML_HIP_ROCWMMA_FATTN=ON`. Runtime `HIP_VISIBLE_DEVICES`, `HSA_OVERRIDE_GFX_VERSION` (not on Windows). UMA via `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`.
- Vulkan: `-DGGML_VULKAN=ON` (or `=1`). Needs Vulkan SDK + SPIRV-Headers (`spirv-headers` / `spirv-headers-devel`). On macOS uses MoltenVK or KosmicKrisp via `VK_ICD_FILENAMES`; combine with `-DGGML_METAL=OFF`.
- CANN (Ascend NPU): `-DGGML_CANN=on -DCMAKE_BUILD_TYPE=release`.
- ZenDNN (AMD EPYC CPUs): `-DGGML_ZENDNN=ON` (auto-downloads/builds ZenDNN on first build, 5-10 min).
- Arm KleidiAI (CPU microkernels): `-DGGML_CPU_KLEIDIAI=ON`; SME control via env `GGML_KLEIDIAI_SME`.
- OpenCL (Adreno GPU): `-DGGML_OPENCL=ON` (Android NDK / Windows ARM64 instructions provided).
- WebGPU: `-DGGML_WEBGPU=ON` (relies on Dawn; browser builds via Emscripten + emdawnwebgpu).
- OpenVINO (Intel CPU/GPU/NPU): see `docs/backend/OPENVINO.md` (in progress).
- Multiple backends can be built together (e.g. `-DGGML_CUDA=ON -DGGML_VULKAN=ON`); select at runtime with `--device` (`--list-devices` to enumerate). Fully disable GPU with `--device none` (even `-ngl 0` may still use GPU). Dynamic backend loading via `GGML_BACKEND_DL`.
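
The flags in the list above can be combined into a single configure step. The following is a minimal dry-run sketch, assuming a POSIX shell: it only prints the `cmake` commands and does not run them. The helper names `backend_flag` and `build_cmds` are hypothetical and not part of llama.cpp, and only a subset of the backends is mapped.

```shell
#!/bin/sh
# Dry-run sketch: print the CMake commands for a chosen set of backends.
# backend_flag and build_cmds are hypothetical helpers, not llama.cpp tooling.

# Map a backend name to the enable flag listed in the notes above.
backend_flag() {
    case "$1" in
        cpu)    echo "" ;;                     # default build, no extra flag
        blas)   echo "-DGGML_BLAS=ON" ;;
        cuda)   echo "-DGGML_CUDA=ON" ;;
        hip)    echo "-DGGML_HIP=ON" ;;
        musa)   echo "-DGGML_MUSA=ON" ;;
        sycl)   echo "-DGGML_SYCL=ON" ;;
        vulkan) echo "-DGGML_VULKAN=ON" ;;
        *)      echo "unknown backend: $1" >&2; return 1 ;;
    esac
}

# Print the configure and build commands for one or more backends.
build_cmds() {
    flags=""
    for b in "$@"; do
        f=$(backend_flag "$b") || return 1
        if [ -n "$f" ]; then
            flags="$flags $f"
        fi
    done
    echo "cmake -B build$flags"
    echo "cmake --build build --config Release -j 8"
}

# Example: a build with both CUDA and Vulkan enabled.
build_cmds cuda vulkan
```

Printing instead of executing keeps the sketch safe to run on a machine without the relevant SDKs; to actually build, paste the printed commands into a shell inside the `llama.cpp` checkout. Backend-specific extras such as `-DCMAKE_CUDA_ARCHITECTURES` or `-DGPU_TARGETS` would be appended to the configure line by hand.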

## Relevant Concepts
- [[concepts/build-and-backends]] — this is the canonical build/backend reference: cmake flow plus the per-backend enable flags.
- [[concepts/server-api]] — the OpenAI-compatible server is one of the built tools; SSL/OpenSSL note applies.
- [[entities/binary-llama-cli]] — used in backend verification examples (`-ngl`, `--device none`).
- [[entities/backend-cpu]] — default build target; BLAS/KleidiAI/ZenDNN augment it.
- [[entities/backend-cuda]] — `-DGGML_CUDA=ON`, NVIDIA.
- [[entities/backend-metal]] — default on macOS.
- [[entities/backend-vulkan]] — `-DGGML_VULKAN=ON`, cross-vendor GPU.
- [[entities/backend-rocm]] — HIP path, `-DGGML_HIP=ON`, AMD.
- [[concepts/build-and-backends]] — `-DGGML_SYCL=ON` path, Intel GPU.

## Source Metadata
- Type: official documentation (mirror)
- Repo/path: ggml-org/llama.cpp + docs/build.md
- Fetched: 2026-05-30 from master
- URL: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
