Building llama.cpp Locally (Backend Matrix)

type: summary
confidence: high
updated: 2026-05-30
llama_build: master (~2026-05)
sources: 1

Key Points

Main product is the llama library (C-style interface in include/llama.h); repo also ships many example programs/tools (incl. an OpenAI-compatible HTTP server).
Get the code: git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp.
Canonical CMake build flow: cmake -B build then cmake --build build --config Release. Add -j 8 (or use Ninja) for parallel compile; install ccache for faster repeated builds.
Debug builds: single-config generators take -DCMAKE_BUILD_TYPE=Debug at configure time; multi-config generators (-G "Xcode", Visual Studio) take --config Debug at build time.
Static build: -DBUILD_SHARED_LIBS=OFF.
Windows: install Visual Studio 2022 (Desktop dev with C++, CMake tools, clang, MS-Build for LLVM). Windows on ARM (WoA, arm64) builds use the preset arm64-windows-llvm-release (with -D GGML_OPENMP=OFF); the matching x64 preset is x64-windows-llvm-release.
Optional HTTPS/TLS: install OpenSSL dev libs (libssl-dev / openssl-devel / openssl); without it the project still builds and runs but with no SSL support.
BLAS (-DGGML_BLAS=ON): helps prompt processing at batch sizes > 32; does not affect generation speed. Select implementation via -DGGML_BLAS_VENDOR=... (OpenBLAS, Intel10_64lp for oneMKL, Generic, BLIS, etc.). Apple Accelerate is enabled by default on Mac.
Metal: enabled by default on macOS (runs compute on GPU). Disable at compile with -DGGML_METAL=OFF; disable GPU inference at runtime with --n-gpu-layers 0.
SYCL: supports Intel GPUs (Data Center Max/Flex, Arc, built-in/iGPU). See docs/backend/SYCL.md.
CUDA: -DGGML_CUDA=ON. Non-native (all GPUs) build adds -DGGML_NATIVE=OFF; specify archs via -DCMAKE_CUDA_ARCHITECTURES="86;89"; pick a CUDA install via -DCMAKE_CUDA_COMPILER=/opt/cuda-11.7/bin/nvcc.
CUDA runtime env vars: CUDA_VISIBLE_DEVICES, CUDA_SCALE_LAUNCH_QUEUES=4x (helps multi-GPU pipeline parallelism), GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F and GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F, GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 (Linux only; falls back to system RAM when VRAM is exhausted), GGML_CUDA_P2P (peer access). Compile-time perf options: GGML_CUDA_FORCE_MMQ, GGML_CUDA_FORCE_CUBLAS, GGML_CUDA_PEER_MAX_BATCH_SIZE (default 128), GGML_CUDA_FA_ALL_QUANTS.
MUSA (Moore Threads GPU): -DGGML_MUSA=ON; archs via -DMUSA_ARCHITECTURES="21"; runtime MUSA_VISIBLE_DEVICES. Reuses many CUDA options.
HIP (AMD ROCm GPUs): -DGGML_HIP=ON with -DGPU_TARGETS=gfx1030 (optional; omit to build for all detected GPUs). rocWMMA flash-attn boost via -DGGML_HIP_ROCWMMA_FATTN=ON. Runtime HIP_VISIBLE_DEVICES, HSA_OVERRIDE_GFX_VERSION (not on Windows). UMA via GGML_CUDA_ENABLE_UNIFIED_MEMORY=1.
Vulkan: -DGGML_VULKAN=ON (or =1). Needs Vulkan SDK + SPIRV-Headers (spirv-headers / spirv-headers-devel). On macOS uses MoltenVK or KosmicKrisp via VK_ICD_FILENAMES; combine with -DGGML_METAL=OFF.
CANN (Ascend NPU): -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release.
ZenDNN (AMD EPYC CPUs): -DGGML_ZENDNN=ON (auto-downloads/builds ZenDNN on first build, 5-10 min).
Arm KleidiAI (CPU microkernels): -DGGML_CPU_KLEIDIAI=ON; SME control via env GGML_KLEIDIAI_SME.
OpenCL (Adreno GPU): -DGGML_OPENCL=ON (Android NDK / Windows ARM64 instructions provided).
WebGPU: -DGGML_WEBGPU=ON (relies on Dawn; browser builds via Emscripten + emdawnwebgpu).
OpenVINO (Intel CPU/GPU/NPU): see docs/backend/OPENVINO.md (in progress).
Multiple backends can be built together (e.g. -DGGML_CUDA=ON -DGGML_VULKAN=ON); select at runtime with --device (--list-devices to enumerate). Fully disable GPU with --device none (even -ngl 0 may still use GPU). Dynamic backend loading via GGML_BACKEND_DL.
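
As a rough sketch of how the backend flags above combine with the canonical two-step CMake flow, the helper below only prints the commands for a chosen backend; it does not run them. The function name llama_build_cmds and the backend keys (cpu, cuda, vulkan, hip, blas) are hypothetical names chosen for illustration. The flags themselves are the ones listed in the key points.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of llama.cpp): prints the cmake
# configure and build commands for one backend. Nothing is executed.
llama_build_cmds() {
    backend="${1:-cpu}"   # default to the plain CPU build
    case "$backend" in
        cpu)    flags="" ;;
        cuda)   flags="-DGGML_CUDA=ON" ;;
        vulkan) flags="-DGGML_VULKAN=ON" ;;
        hip)    flags="-DGGML_HIP=ON" ;;
        blas)   flags="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" ;;
        *)      echo "unknown backend: $backend" >&2; return 1 ;;
    esac
    # Configure step: append backend flags only when there are any.
    echo "cmake -B build${flags:+ $flags}"
    # Build step: Release config, 8 parallel jobs as in the key points.
    echo "cmake --build build --config Release -j 8"
}

# Example: show the commands for a CUDA build.
llama_build_cmds cuda
```

Piping the printed lines to sh from the repository root would run the build. Backends not covered here would presumably follow the same pattern with their own -DGGML_... flag.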

Relevant Concepts

build and backends — this is the canonical build/backend reference: cmake flow plus the per-backend enable flags.
server api — the OpenAI-compatible server is one of the built tools; SSL/OpenSSL note applies.
binary llama cli — used in backend verification examples (-ngl, --device none).
backend cpu — default build target; BLAS/KleidiAI/ZenDNN augment it.
backend cuda — -DGGML_CUDA=ON, NVIDIA.
backend metal — default on macOS.
backend vulkan — -DGGML_VULKAN=ON, cross-vendor GPU.
backend rocm — HIP path, -DGGML_HIP=ON, AMD.
build and backends — -DGGML_SYCL=ON path, Intel GPU.

Source Metadata

Type: official documentation (mirror)
Repo/path: ggml-org/llama.cpp + docs/build.md
Fetched: 2026-05-30 from master
URL: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md