Building llama.cpp Locally (Backend Matrix)
Key Points
- Main product is the `llama` library (C-style interface in `include/llama.h`); the repo also ships many example programs and tools, including an OpenAI-compatible HTTP server.
- Get the code: `git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp`.
- Canonical CMake build flow: `cmake -B build`, then `cmake --build build --config Release`. Add `-j 8` (or use Ninja) for parallel compilation; install `ccache` for faster repeated builds.
- Debug builds: single-config generators use `-DCMAKE_BUILD_TYPE=Debug`; multi-config generators (`-G "Xcode"`, Visual Studio) use `--config Debug`.
- Static build: `-DBUILD_SHARED_LIBS=OFF`.
- Windows: install Visual Studio 2022 (Desktop development with C++, CMake tools, clang, MS-Build support for LLVM). Windows on ARM (WoA/ARM64) uses the preset `arm64-windows-llvm-release` (with `-D GGML_OPENMP=OFF`); x64 uses `x64-windows-llvm-release`.
- Optional HTTPS/TLS: install the OpenSSL development libraries (`libssl-dev` / `openssl-devel` / `openssl`); without them the project still builds and runs, but without SSL support.
- BLAS (`-DGGML_BLAS=ON`): helps prompt processing at batch sizes > 32; does not affect generation speed. Select the implementation via `-DGGML_BLAS_VENDOR=...` (`OpenBLAS`, `Intel10_64lp` for oneMKL, `Generic`, BLIS, etc.). Apple Accelerate is enabled by default on Mac.
- Metal: enabled by default on macOS (runs compute on the GPU). Disable at compile time with `-DGGML_METAL=OFF`; disable GPU inference at runtime with `--n-gpu-layers 0`.
- SYCL: supports Intel GPUs (Data Center Max/Flex, Arc, built-in/iGPU). See `docs/backend/SYCL.md`.
- CUDA: `-DGGML_CUDA=ON`. A non-native build (for all GPUs) adds `-DGGML_NATIVE=OFF`; specify architectures via `-DCMAKE_CUDA_ARCHITECTURES="86;89"`; pick a CUDA install via `-DCMAKE_CUDA_COMPILER=/opt/cuda-11.7/bin/nvcc`.
- CUDA runtime env vars: `CUDA_VISIBLE_DEVICES`, `CUDA_SCALE_LAUNCH_QUEUES=4x` (helps multi-GPU pipeline parallelism), `GGML_CUDA_FORCE_CUBLAS_COMPUTE_32F` / `_16F`, `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` (RAM fallback on Linux), `GGML_CUDA_P2P` (peer access). Compile-time performance options: `GGML_CUDA_FORCE_MMQ`, `GGML_CUDA_FORCE_CUBLAS`, `GGML_CUDA_PEER_MAX_BATCH_SIZE` (default 128), `GGML_CUDA_FA_ALL_QUANTS`.
- MUSA (Moore Threads GPU): `-DGGML_MUSA=ON`; architectures via `-DMUSA_ARCHITECTURES="21"`; runtime `MUSA_VISIBLE_DEVICES`. Reuses many CUDA options.
- HIP (AMD ROCm GPUs): `-DGGML_HIP=ON` with `-DGPU_TARGETS=gfx1030` (optional; omit to build for all detected GPUs). rocWMMA flash-attention boost via `-DGGML_HIP_ROCWMMA_FATTN=ON`. Runtime: `HIP_VISIBLE_DEVICES`, `HSA_OVERRIDE_GFX_VERSION` (not on Windows). UMA via `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`.
- Vulkan: `-DGGML_VULKAN=ON` (or `=1`). Needs the Vulkan SDK plus SPIRV-Headers (`spirv-headers` / `spirv-headers-devel`). On macOS it uses MoltenVK or KosmicKrisp via `VK_ICD_FILENAMES`; combine with `-DGGML_METAL=OFF`.
- CANN (Ascend NPU): `-DGGML_CANN=on -DCMAKE_BUILD_TYPE=release`.
- ZenDNN (AMD EPYC CPUs): `-DGGML_ZENDNN=ON` (auto-downloads and builds ZenDNN on the first build, 5-10 min).
- Arm KleidiAI (CPU microkernels): `-DGGML_CPU_KLEIDIAI=ON`; SME control via the env var `GGML_KLEIDIAI_SME`.
- OpenCL (Adreno GPU): `-DGGML_OPENCL=ON` (Android NDK / Windows ARM64 instructions provided).
- WebGPU: `-DGGML_WEBGPU=ON` (relies on Dawn; browser builds via Emscripten + emdawnwebgpu).
- OpenVINO (Intel CPU/GPU/NPU): see `docs/backend/OPENVINO.md` (in progress).
- Multiple backends can be built together (e.g. `-DGGML_CUDA=ON -DGGML_VULKAN=ON`); select one at runtime with `--device` (`--list-devices` to enumerate). Fully disable the GPU with `--device none` (even `-ngl 0` may still use the GPU). Dynamic backend loading via `GGML_BACKEND_DL`.
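The per-backend flags above can be combined mechanically. The following is a minimal sketch, not part of the llama.cpp repository: the function names `backend_flags` and `print_build_commands` are hypothetical, only a subset of backends is covered, and the script prints the `cmake` commands instead of running them so it can be tried without a checkout or toolchain.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with llama.cpp): maps a backend name
# to the CMake enable flag listed in the key points above.
backend_flags() {
    case "$1" in
        cpu)    echo "" ;;                  # default build, no extra flag
        blas)   echo "-DGGML_BLAS=ON" ;;
        cuda)   echo "-DGGML_CUDA=ON" ;;
        hip)    echo "-DGGML_HIP=ON" ;;
        vulkan) echo "-DGGML_VULKAN=ON" ;;
        musa)   echo "-DGGML_MUSA=ON" ;;
        sycl)   echo "-DGGML_SYCL=ON" ;;
        *)      echo "unknown backend: $1" >&2; return 1 ;;
    esac
}

# Prints (does not execute) the configure and build commands for one or
# more backends, following the canonical two-step CMake flow.
print_build_commands() {
    flags=""
    for b in "$@"; do
        f=$(backend_flags "$b") || return 1
        flags="$flags${f:+ $f}"             # append only non-empty flags
    done
    echo "cmake -B build$flags"
    echo "cmake --build build --config Release -j 8"
}

print_build_commands cuda vulkan
```

Printing rather than executing keeps the sketch side-effect free; from inside a llama.cpp checkout the printed lines could be piped into `sh` to run the build.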
Relevant Concepts
- build and backends: this is the canonical build/backend reference: cmake flow plus the per-backend enable flags.
- server api: the OpenAI-compatible server is one of the built tools; the SSL/OpenSSL note applies.
- binary llama cli: used in backend verification examples (`-ngl`, `--device none`).
- backend cpu: default build target; BLAS/KleidiAI/ZenDNN augment it.
- backend cuda: `-DGGML_CUDA=ON`, NVIDIA.
- backend metal: default on macOS.
- backend vulkan: `-DGGML_VULKAN=ON`, cross-vendor GPU.
- backend rocm: HIP path, `-DGGML_HIP=ON`, AMD.
- build and backends: `-DGGML_SYCL=ON` path, Intel GPU.
Source Metadata
- Type: official documentation (mirror)
- Repo/path: ggml-org/llama.cpp + docs/build.md
- Fetched: 2026-05-30 from master
- URL: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
