# vLLM — full corpus # LLM Wiki An open-source template for building LLM-powered knowledge bases, following [Andrej Karpathy's "LLM Wiki" pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f). You provide raw sources. The LLM reads them, writes structured wiki pages, cross-links everything, and maintains it over time. You never edit the wiki directly — you curate sources and ask questions. ## How It Works The system has three layers: ``` raw/ Sources you collect (articles, transcripts, notes, PDFs) wiki/ LLM-written & maintained pages (summaries, concepts, entities, syntheses) CLAUDE.md Schema that tells the LLM how to structure everything ``` Three operations drive the workflow: | Operation | Trigger | What happens | |-----------|---------|--------------| | **Ingest** | "ingest raw/my-source.txt" | LLM reads the source, creates a summary page, creates/updates concept and entity pages, adds cross-links, updates the index and log | | **Query** | Ask any question | LLM searches the wiki, synthesizes an answer with citations, optionally creates a synthesis page for novel insights | | **Lint** | "lint" or "health check" | LLM audits all pages for orphans, contradictions, missing links, incomplete sections, and low-confidence claims — fixes what it can, reports the rest | ## Quick Start 1. **Clone this repo** ```bash git clone https://github.com/YOUR_USERNAME/llm-wiki.git my-knowledge-base cd my-knowledge-base ``` 2. **Customize CLAUDE.md** for your domain - Update the Purpose section with your topic - Replace the placeholder tagging taxonomy with your own categories - Adjust confidence level descriptions if needed - Everything else (workflows, page formats, linking rules) works as-is 3. **Drop sources into `raw/`** - Text files, transcripts, articles, notes — any plain text - These are immutable once added; the LLM never modifies them 4. **Tell the LLM to ingest** ``` ingest raw/my-first-source.txt ``` The LLM will create summary pages, concept pages, entity pages, cross-links, and update the index. 5. **Ask questions** ``` What are the key differences between X and Y? ``` The LLM answers from the wiki, citing specific pages. 6. **Run health checks** ``` lint ``` The LLM audits the wiki and fixes issues. ## Directory Structure ``` . ├── CLAUDE.md # Schema — the LLM's instructions ├── raw/ # Your source documents (immutable) └── wiki/ ├── index.md # Master catalog of all pages ├── log.md # Append-only activity log ├── dashboard.md # Dataview dashboard (Obsidian) ├── analytics.md # Charts View analytics (Obsidian) ├── flashcards.md # Spaced repetition cards ├── summaries/ # One page per source document ├── concepts/ # Concept and framework pages ├── entities/ # People, tools, organizations, etc. ├── syntheses/ # Cross-cutting analyses and comparisons ├── journal/ # Research/session journal entries │ └── template.md # Journal entry template └── presentations/ # Marp slide decks ``` ## Enhancements This template includes several extras beyond the core wiki pattern: ### Dataview Dashboard (`wiki/dashboard.md`) Live queries that surface low-confidence pages, recent updates, concepts by tag, and pages with the most sources. Requires the [Dataview](https://github.com/blacksmithgu/obsidian-dataview) Obsidian plugin. ### Charts View Analytics (`wiki/analytics.md`) Visual analytics with pie charts, bar charts, and word clouds. Requires the [Charts View](https://github.com/caronchen/obsidian-chartsview-plugin) Obsidian plugin. ### Mermaid Diagrams Use Mermaid code blocks in any wiki page to create flowcharts, sequence diagrams, or concept maps. Native support in Obsidian and GitHub. ### Marp Slides (`wiki/presentations/`) Create slide decks from markdown using [Marp](https://marp.app/). Drop presentation files in this directory. ### Research Journal (`wiki/journal/`) Track your research sessions, experiments, or applied work with the included template. The LLM can reference journal entries when answering queries. ### Spaced Repetition (`wiki/flashcards.md`) Flashcards in the format used by the [Spaced Repetition](https://github.com/st3v3nmw/obsidian-spaced-repetition) Obsidian plugin. Ask the LLM to generate flashcards from any wiki page. ### MCP Server This repo works with Claude Code's MCP server capabilities. Point an MCP-compatible client at this repo and the LLM can read/write the wiki programmatically. ## Customizing for Your Domain The schema in `CLAUDE.md` is domain-agnostic. To adapt it: 1. **Purpose** — Describe your knowledge domain in one paragraph 2. **Tagging taxonomy** — Replace placeholder categories with your own (e.g., for a cooking KB: `cuisine`, `technique`, `ingredient`, `equipment`) 3. **Confidence levels** — Adjust the descriptions to match your domain's evidence standards 4. **Entity types** — Update the entity page description to match what entities mean in your domain (people, tools, companies, etc.) 5. **Journal template** — Customize `wiki/journal/template.md` for your workflow Everything else — page format, linking conventions, workflows, rules — is universal and works across domains. ## Example Domains This template works for any knowledge-intensive topic: - **Research notes** — papers, experiments, methodologies - **Book analysis** — themes, characters, author techniques - **Competitive analysis** — companies, products, market trends - **Course notes** — lectures, readings, key concepts - **Personal development** — frameworks, habits, book summaries - **Technical documentation** — APIs, architectures, design patterns - **Hobby deep-dives** — any subject you want to master ## License MIT --- title: "vLLM KB — Master Index" type: index updated: 2026-06-09 vllm_version: "v0.22.1" --- # vLLM KB — Master Index **Domain:** vLLM — high-throughput LLM serving engine (vllm-project/vllm; docs.vllm.ai). OpenAI-compatible server, offline `LLM` API, quantization, distributed serving. **Corpus:** 124 provenance-stamped sources in `raw/` (96 official docs from the repo's docs/ tree, 20 well-discussed tracker issues, 8 release notes). **Pages:** 15 (11 concepts · 1 summary · 3 syntheses) ## Concepts - [[concepts/install]] — CUDA 12.9 wheels (compute 7.5+), uv install, fresh-env rule, ROCm/XPU/CPU, no-MPS - [[concepts/quickstart-and-serving]] — offline `LLM` (generate/chat/enqueue) vs `vllm serve` - [[concepts/openai-compatible-server]] — endpoints + caveats, YAML config, one-model-per-server - [[concepts/configuration]] — engine args, `VLLM_` env vars (+K8s naming trap), conserving memory, `-O0..-O3`, model resolution - [[concepts/models-and-support]] — native vs Transformers backend (<5% penalty), hardware-specific lists - [[concepts/pooling-models]] — embed/classify/score/reward, score_types, FAQ model picks - [[concepts/quantization]] — full method map (LLM Compressor, online FP8, KV cache, GGUF warning, AWQ deprecation) + decision rules - [[concepts/multimodal-and-lora]] — multimodal inputs + SSRF guard, per-request LoRA, prompt embeds, disagg prefill - [[concepts/parallelism-and-scaling]] — TP/PP/DP/EP/CP decision ladder, Ray multi-node troubleshooting, insecure-by-default warning - [[concepts/cli-reference]] — vllm {serve,chat,complete,run-batch,bench,collect-env} - [[concepts/observability-and-ops]] — /metrics, reproducibility flags, usage-stats telemetry, V1-only world - [[concepts/integrations-and-clients]] — Claude Code & Codex on your own models, LangChain/LlamaIndex ## Summaries - [[summaries/release-digest]] — v0.19.0 → v0.22.1: model-support cadence, toolchain floors (CUDA 13, C++20) ## Syntheses - [[syntheses/serving-decisions]] — mode → memory ladder → scale ladder - [[syntheses/troubleshooting-playbook]] — hangs, CUDA errors, env mismatches, OOM, bug-filing - [[syntheses/model-notes-from-the-tracker]] — gpt-oss hardware floor, model FAQ mega-threads, Blackwell notes ## Statistics - **Total pages**: 15 · **Sources ingested**: 124 (immutable raw/) - **Confidence**: 13 high · 2 medium ## Coverage notes Strong: install, serving, configuration, quantization, parallelism, CLI, ops, model realities. Excluded from v1 by design (declared scope): deployment/ (k8s/docker guides, 35 docs), design/ internals, contributing/, benchmarking deep-dive, training. Recency: fetched 2026-06-09; vLLM releases ~weekly — expect drift. --- title: "CLI Reference — vllm {serve,chat,complete,bench,run-batch}" type: concept tags: [cli, commands, bench] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-cli-readme-md.md, raw/github_doc-docs-cli-serve-md.md, raw/github_doc-docs-cli-chat-md.md, raw/github_doc-docs-cli-run-batch-md.md, raw/github_doc-docs-cli-bench-serve-md.md, raw/github_doc-docs-cli-bench-throughput-md.md, raw/github_doc-docs-cli-bench-latency-md.md] --- # CLI Reference ```bash vllm --help vllm {chat,complete,serve,launch,bench,collect-env,run-batch} ``` | Command | Purpose | |---|---| | `vllm serve` | start the OpenAI-compatible API server ([[concepts/openai-compatible-server]]) | | `vllm chat` / `vllm complete` | interactive chat / completion against a server | | `vllm run-batch` | batch-process requests from a file | | `vllm bench {serve,throughput,latency}` | benchmarking suite (serving load, offline throughput, latency) | | `vllm collect-env` | environment report — attach it to bug reports ([[syntheses/troubleshooting-playbook]]) | Notes: - Complex/nested CLI values are passed as **JSON** (the docs' "JSON CLI arguments" tip applies to `serve`, `run-batch`, and the bench commands). - The full per-command flag references are auto-generated from argparse in the docs; engine flags overlap `LLM(...)` constructor args ([[concepts/configuration]]). - `vllm bench sweep` exists for parameter sweeps with plotting (see `raw/` cli/bench/sweep docs). ## Related [[concepts/quickstart-and-serving]] · [[concepts/configuration]] --- title: "Configuration — Engine Args, Env Vars, Memory" type: concept tags: [configuration, engine-args, env-vars, memory] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-configuration-readme-md.md, raw/github_doc-docs-configuration-engine-args-md.md, raw/github_doc-docs-configuration-env-vars-md.md, raw/github_doc-docs-configuration-conserving-memory-md.md, raw/github_doc-docs-configuration-optimization-md.md, raw/github_doc-docs-configuration-model-resolution-md.md] --- # Configuration Three priority levels (highest → lowest, per the config README): request-level > engine/server args > defaults. ## Engine arguments Control engine behavior in both modes: constructor args to `LLM` (offline) and flags to `vllm serve` (online). Source of truth = the config classes in `vllm.config` (`EngineArgs` / `AsyncEngineArgs`); the docs render the full argparse reference. ## Environment variables All vLLM env vars are prefixed **`VLLM_`**. Two official warnings: - **`VLLM_PORT` / `VLLM_HOST_IP` are for vLLM's *internal* coordination — NOT the API server's host/port** (use `--host`/`--port` flags for that). - **Kubernetes: don't name a service `vllm`** — K8s injects service-name-prefixed env vars that collide with vLLM's. ## Conserving memory (OOM toolbox) - **Tensor parallelism**: `tensor_parallel_size=N` splits the model across GPUs. - Set `CUDA_VISIBLE_DEVICES` to pick devices — and **don't call CUDA-initializing functions before vLLM init** or you'll hit `Cannot re-initialize CUDA in forked subprocess`. - More knobs: [[syntheses/serving-decisions]] § memory. ## Optimization levels `-O0`…`-O3` trade startup time for performance: `-O0` none/fastest-startup; `-O1` simple compilation + PIECEWISE cudagraphs; **`-O2` default** (more fusions, FULL_AND_PIECEWISE cudagraphs); `-O3` aggressive (currently = `-O2`). Preemption: when KV-cache space runs out, vLLM preempts requests and recomputes them later. ## Model resolution vLLM resolves models by the `architectures` field in the repo's `config.json` against registered implementations — resolution failures usually trace to that field. ## Related [[concepts/openai-compatible-server]] · [[concepts/parallelism-and-scaling]] --- title: "Installation (GPU / CPU / Platforms)" type: concept tags: [install, cuda, rocm, hardware] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-getting-started-installation-readme-md.md, raw/github_doc-docs-getting-started-installation-gpu-cuda-inc-md.md, raw/github_doc-docs-getting-started-installation-gpu-rocm-inc-md.md, raw/github_doc-docs-getting-started-quickstart-md.md, raw/github_issue-does-vllm-support-the-mac-metal-mps.md] --- # Installation **Prereqs:** Linux, Python **3.10–3.13**. (macOS: no MPS backend in core vLLM — maintainer-confirmed; Apple Silicon GPU acceleration comes via the separate **vLLM-Metal** project.) ## NVIDIA CUDA (the main path) vLLM ships pre-compiled C++/CUDA **12.9** binaries. GPU needs **compute capability 7.5+** (T4, RTX20xx, A100, L4, H100, B200…). Recommended install with `uv`: ```bash uv venv --python 3.12 --seed source .venv/bin/activate uv pip install vllm --torch-backend=auto ``` Gotchas (official): - **Use a fresh environment.** vLLM's compiled kernels are binary-incompatible across CUDA/PyTorch build variations; mixing with an existing torch install means building from source. - **Avoid conda-installed PyTorch** — it statically links NCCL, which can break vLLM's NCCL usage (issue #8420). ## AMD ROCm ROCm **6.3+** (MI350 needs 7.0+; Ryzen AI MAX needs 7.0.2+). Prebuilt wheels: `rocm700` (Python 3.12, vLLM 0.14.0–0.18.0) and `rocm721` (nightlies). Supported GPUs include MI200s (gfx90a), MI300 (gfx942), MI350 (gfx950), Radeon RX 7900/9000 series, Ryzen AI MAX / AI 300. ## Other platforms Intel XPU, and CPU (Intel/AMD x86, ARM AArch64, Apple Silicon, IBM Z) are supported; further hardware arrives as **out-of-tree hardware plugins** (Hardware-Pluggable RFC). ## Related [[concepts/quickstart-and-serving]] · [[syntheses/troubleshooting-playbook]] (build/run failures) --- title: "Integrations — Claude Code, Codex, LangChain, LlamaIndex" type: concept tags: [integrations, agents, langchain] updated: 2026-06-09 confidence: medium sources: [raw/github_doc-docs-serving-integrations-claude-code-md.md, raw/github_doc-docs-serving-integrations-codex-md.md, raw/github_doc-docs-serving-integrations-langchain-md.md, raw/github_doc-docs-serving-integrations-llamaindex-md.md] --- # Integrations vLLM's OpenAI-compatible server slots in as the backend for popular clients — official integration docs exist for: ## Agentic coding tools - **Claude Code** (Anthropic's terminal coding agent) — point it at a vLLM server to use **your own models instead of the Anthropic API**. - **Codex** (OpenAI's terminal coding agent) — same pattern, your models instead of the OpenAI API. Both follow from the compatible endpoints ([[concepts/openai-compatible-server]] — note vLLM also implements `/v1/responses`, which Codex-class tools use). ## Frameworks - **LangChain** and **LlamaIndex** — vLLM is available as a model backend in both (pip-installable integration; vLLM serves inference, the framework orchestrates — the division of labor the pooling docs spell out for RAG, [[concepts/pooling-models]]). ## Related [[concepts/openai-compatible-server]] · [[concepts/quickstart-and-serving]] --- title: "Models & Support (incl. Transformers Backend)" type: concept tags: [models, support, transformers] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-models-supported-models-md.md, raw/github_doc-docs-models-generative-models-md.md, raw/github_doc-docs-models-hardware-supported-models-cpu-md.md] --- # Models & Support vLLM supports **generative** and **pooling** models, listed per task with their architectures in the supported-models doc. ## Two implementation paths 1. **Native vLLM implementations** (`vllm/model_executor/models`) — the listed text and multimodal models. Generative models implement `VllmModelForTextGeneration` and output log-probabilities from final hidden states. 2. **Transformers modeling backend** — models *not* natively implemented can run via their HuggingFace Transformers implementation, at **<5% performance penalty** vs a dedicated implementation (official figure). Works for embedding/language/vision-language modalities and encoder-only/decoder-only/MoE architectures. That second path is the answer to "is X supported?" for long-tail models: often yes, via Transformers backend, before a native implementation lands ([[syntheses/model-notes-from-the-tracker]] for per-model realities). ## Hardware-specific model lists Separate validated-model lists exist per hardware class (e.g. CPU/Xeon, XPU) — check those when off the NVIDIA path. ## Related [[concepts/pooling-models]] · [[concepts/multimodal-and-lora]] · [[concepts/configuration]] (model resolution) --- title: "Multimodal Inputs, LoRA & Prompt Embeddings" type: concept tags: [multimodal, lora, vlm, features] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-features-multimodal-inputs-md.md, raw/github_doc-docs-features-lora-md.md, raw/github_doc-docs-features-prompt-embeds-md.md, raw/github_doc-docs-features-disagg-prefill-md.md] --- # Multimodal Inputs, LoRA & Prompt Embeddings ## Multimodal inputs Pass images/video/audio to multimodal models via `multi_modal_data` (schema: `vllm.inputs.MultiModalDataDict`) alongside the HF-format `prompt`. **Official security warning for serving VLMs:** set `--allowed-media-domains` (e.g. `upload.wikimedia.org github.com`) to stop the server fetching arbitrary URLs (**SSRF risk**), and `VLLM_MEDIA_URL_ALLOW_REDIRECTS=0` to block redirect-based bypasses — especially in containerized deployments with internal-network access. ## LoRA adapters Any model implementing `SupportsLoRA` can serve adapters **per-request with minimal overhead**: ```python llm = LLM(model=base, enable_lora=True) # per request: LoRARequest(name, id, path) ``` Download adapters locally first (e.g. `huggingface_hub.snapshot_download`). ## Prompt embeddings & disaggregated prefill - **Prompt embeds:** feed embedding tensors directly instead of token ids. - **Disaggregated prefill (experimental):** split prefill and decode across instances — pairs with the parallelism options in [[concepts/parallelism-and-scaling]]. ## Related [[concepts/models-and-support]] · [[syntheses/model-notes-from-the-tracker]] (VLM gotchas) --- title: "Observability & Ops — Metrics, Reproducibility, Usage Stats" type: concept tags: [metrics, monitoring, reproducibility, ops] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-usage-metrics-md.md, raw/github_doc-docs-usage-reproducibility-md.md, raw/github_doc-docs-usage-usage-stats-md.md, raw/github_doc-docs-usage-readme-md.md, raw/github_doc-docs-usage-v1-guide-md.md] --- # Observability & Ops ## Production metrics The OpenAI-compatible server exposes health/system metrics at **`/metrics`** (Prometheus-style) — the monitoring hook for production deployments. ## Reproducibility **Not guaranteed by default** (performance trade-off). To get reproducible results: offline, set `VLLM_ENABLE_V1_MULTIPROCESSING=0` (deterministic scheduling) — see the reproducibility doc for the full conditions. ## Anonymous usage stats vLLM **collects anonymous usage data by default** (hardware/model-config telemetry; aggregated subsets published, e.g. the 2024 report at 2024.vllm.ai). Opt-out documented in the usage-stats page. ## V1 engine **V0 is fully deprecated** (RFC #18571) — V1 re-architected the scheduler, KV-cache manager, worker, sampler, and API server while keeping V0's models/kernels. If guidance references V0-only behavior, it's outdated. ## Related [[concepts/configuration]] · [[syntheses/troubleshooting-playbook]] --- title: "OpenAI-Compatible Server" type: concept tags: [server, api, openai-compatible] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-serving-online-serving-openai-compatible-server-md.md, raw/github_doc-docs-configuration-serve-args-md.md, raw/github_doc-docs-usage-faq-md.md, raw/github_issue-support-multiple-models.md] --- # OpenAI-Compatible Server `vllm serve ` exposes an HTTP server implementing OpenAI's APIs against your local model. ## Supported endpoints (with documented caveats) | Endpoint | Notes (official) | |---|---| | `/v1/completions` | text-generation models; **`suffix` param not supported** | | `/v1/responses` | text-generation models | | `/v1/chat/completions` | needs a chat template; **`user` param ignored**; `parallel_tool_calls=false` forces ≤1 tool call (default `true` allows more, model-dependent) | | `/v1/embeddings` | pooling models ([[concepts/pooling-models]]) | ## Server arguments — three ways `vllm serve` flags (see [[concepts/cli-reference]]), or a **YAML config file**: ```yaml # config.yaml model: meta-llama/Llama-3.1-8B-Instruct host: "127.0.0.1" port: 6379 ``` Argument names must be the long form of the CLI flags. ## One model per server (FAQ) Serving **multiple models on one port is not supported** — run one server instance per model and put a router in front (official FAQ + a long-running feature request). Model swap = stop the old server, start a new one. ## Related [[concepts/configuration]] · [[syntheses/serving-decisions]] · [[concepts/integrations-and-clients]] --- title: "Parallelism & Scaling (TP / PP / DP / EP / CP)" type: concept tags: [distributed, tensor-parallel, scaling] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-serving-parallelism-scaling-md.md, raw/github_doc-docs-serving-data-parallel-deployment-md.md, raw/github_doc-docs-serving-context-parallel-deployment-md.md, raw/github_doc-docs-serving-expert-parallel-deployment-md.md, raw/github_doc-docs-serving-distributed-troubleshooting-md.md, raw/github_doc-docs-usage-security-md.md] --- # Parallelism & Scaling ## Choosing a strategy (official decision ladder) 1. **Fits on one GPU** → no distribution. 2. **Too big for one GPU, fits on one node** → tensor parallelism: `tensor_parallel_size=`. 3. **Too big for one node** → TP + pipeline parallelism: `tensor_parallel_size=8, pipeline_parallel_size=2` for 2×8-GPU nodes. Watch startup logs for `GPU KV cache size: N tokens` and `Maximum concurrency for X tokens per request: Y×` to size capacity. ## The other axes - **Data parallel** — replicate weights across instances/GPUs for independent batches (dense + MoE). - **Expert parallel (EP)** — MoE experts on separate GPUs; "more efficient when used in conjunction with DP". - **Context parallel (CP)** — long-context serving; prefill and decode handled separately (TTFT amortization for long prefill). ## Multi-node troubleshooting (Ray) - Verify **inter-node GPU communication** first; pass env like `NCCL_SOCKET_IFNAME=eth0` at *cluster creation* so it propagates to all nodes (issue #6803). - `No available node types can fulfill resource request` despite enough GPUs → multiple IPs per node; set `VLLM_HOST_IP` per node (different value each), verify with `ray status` / `ray list nodes` (issue #7815). ## Security (official) **Inter-node communication is insecure by default** — isolate the cluster network. ## Related [[concepts/configuration]] · [[syntheses/serving-decisions]] --- title: "Pooling Models — Embeddings, Classify, Score, Reward" type: concept tags: [embeddings, pooling, rerank, classification] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-models-pooling-models-readme-md.md, raw/github_doc-docs-models-pooling-models-embed-md.md, raw/github_doc-docs-models-pooling-models-classify-md.md, raw/github_doc-docs-models-pooling-models-scoring-md.md, raw/github_doc-docs-models-pooling-models-reward-md.md, raw/github_doc-docs-usage-faq-md.md] --- # Pooling Models Non-generative models for **NLU tasks** — classification and retrieval. Official caveat: pooling support exists "primarily for convenience" — **no performance guarantee over using HF Transformers / Sentence Transformers directly** (optimization is planned, issue #21796). ## The four usages | Usage | What it does | |---|---| | **Embed** | unstructured input → numerical embedding vectors (also via `/v1/embeddings`) | | **Classify** | predict the best label for an input | | **Score** | similarity between two prompts; three `score_type`s: `cross-encoder`, `late-interaction`, `bi-encoder` — the reranking piece of RAG | | **Reward** | score the quality of generated outputs (RM, human-preference proxy) | vLLM handles only the **model-inference component** of RAG (embedding + reranking); orchestration belongs to frameworks like LangChain ([[concepts/integrations-and-clients]]). ## Which embedding model? (FAQ) Officially suggested starters: `e5-mistral-7b-instruct`, `BAAI/bge-base-en-v1.5`. Generative models (Llama-3-8B etc.) *can* be auto-converted to embedders by extracting hidden states — but are "expected to be inferior" to purpose-trained embedding models. ## Related [[concepts/models-and-support]] · [[concepts/openai-compatible-server]] --- title: "Quantization — Methods & When to Use Which" type: concept tags: [quantization, fp8, int4, gguf, awq] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-features-quantization-readme-md.md, raw/github_doc-docs-features-quantization-llm-compressor-readme-md.md, raw/github_doc-docs-features-quantization-llm-compressor-fp8-md.md, raw/github_doc-docs-features-quantization-llm-compressor-int4-md.md, raw/github_doc-docs-features-quantization-online-md.md, raw/github_doc-docs-features-quantization-quantized-kvcache-md.md, raw/github_doc-docs-features-quantization-bnb-md.md, raw/github_doc-docs-features-quantization-gguf-md.md, raw/github_doc-docs-features-quantization-auto-awq-md.md, raw/github_doc-docs-features-quantization-gptqmodel-md.md, raw/github_doc-docs-features-quantization-modelopt-md.md, raw/github_doc-docs-features-quantization-inc-md.md, raw/github_doc-docs-features-quantization-quark-md.md, raw/github_doc-docs-features-quantization-fp8-vit-attn-md.md] --- # Quantization Trades precision for memory. **Official tip: start with LLM Compressor** (the vLLM project's own library — FP4/FP8/INT8/INT4). ## Method map | Method | Notes (from the docs) | |---|---| | **LLM Compressor** (recommended) | FP8 W8A8: hardware-accelerated on H100/MI300x; **W8A8 official only on Hopper/Ada**, Turing/Ampere get W8A16 weight-only via Marlin kernels. INT4 W4A16: memory savings + low-QPS latency; ready-made HF collection of INT4 checkpoints exists | | **Online quantization** | quantize BF16/FP16 → FP8 **at load time**, no pre-quantized checkpoint or calibration: `LLM(model, quantization="fp8_per_tensor")` or `"fp8_per_block"` (128×128 weight blocks) | | **Quantized KV cache** | FP8 KV cache → more tokens in memory, longer contexts. Per-tensor or per-attention-head scales (per-head requires Flash Attention backend + llm-compressor calibration); with FA3, attention itself runs in FP8 | | **BitsAndBytes** | no calibration data needed | | **GGUF** | **"highly experimental and under-optimized"** (official warning) — use only as a memory-footprint reducer | | **AutoAWQ** | **deprecated** — AWQ functionality absorbed into LLM Compressor | | **GPTQModel** | INT4/INT8 GPTQ checkpoints (ModelCloud) | | **NVIDIA Model Optimizer** | PTQ + QAT for LLMs/VLMs/diffusion on NVIDIA | | **Intel AutoRound / Neural Compressor** | INT2–8, MXFP8/4, NVFP4, GGUF; strong at 2–3 bits | | **AMD Quark** | the quantization toolkit for AMD GPUs | | **FP8 ViT encoder attention** | for big-image VLM workloads where the vision encoder bottlenecks | ## Decision rules (as documented) - NVIDIA Hopper/Ada, want throughput → **LLM Compressor FP8 W8A8**; older NVIDIA → W8A16 or INT4. - No pre-quantized checkpoint handy → **online quantization** at load. - Memory-bound on context length → add **FP8 KV cache**. - AMD → **Quark**; Intel → **AutoRound**. - Avoid GGUF in vLLM unless you specifically need it (experimental). ## Related [[concepts/configuration]] (conserving memory) · [[summaries/release-digest]] --- title: "Quickstart — Offline Inference & Online Serving" type: concept tags: [quickstart, offline, serving, llm-class] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-getting-started-quickstart-md.md, raw/github_doc-docs-serving-offline-inference-md.md, raw/github_doc-docs-serving-online-serving-readme-md.md] --- # Quickstart — Offline Inference & Online Serving vLLM has two usage modes: ## 1. Offline batched inference (the `LLM` class) ```python from vllm import LLM llm = LLM(model="...") ``` APIs by model type: - **Generative models:** `LLM.generate` (completions), `LLM.chat` (chat conversations) - **Async queue:** `LLM.enqueue` / `LLM.enqueue_chat` / `LLM.wait_for_completion` — enqueue without blocking, collect later - **Pooling models** (embeddings/classify/score — [[concepts/pooling-models]]) have their own APIs ## 2. Online serving (HTTP server) ```bash vllm serve ``` Starts the OpenAI-compatible server ([[concepts/openai-compatible-server]]) — compatible "with many interfaces" per the serving overview. Configuration via engine/server args ([[concepts/configuration]]). ## Related [[concepts/cli-reference]] · [[concepts/models-and-support]] --- title: "Activity Log" type: log --- # Activity Log Append-only record of all wiki changes. ## Format Each entry follows this format: ``` ### YYYY-MM-DD HH:MM — [Action Type] - **Source/Trigger**: what initiated the action - **Pages created**: list of new pages - **Pages updated**: list of updated pages - **Notes**: any contradictions flagged, decisions made ``` --- ### 2026-04-08 00:00 — Setup - **Source/Trigger**: Repository initialized - **Pages created**: index.md, log.md, dashboard.md, analytics.md, flashcards.md - **Pages updated**: none - **Notes**: Empty knowledge base ready for first source ingestion --- ## 2026-06-10 — removed Obsidian scaffolding from the served wiki Deleted `analytics.md`, `dashboard.md`, `flashcards.md` (Obsidian plugin pages — Dataview/Charts View/Spaced Repetition markup, unusable when served as plain Markdown to agents) and the `journal/` scaffold (template only). `CLAUDE.md` directory layout updated: production/planning material lives at repo root, never under `wiki/` (everything under `wiki/` is served publicly). --- title: "Release Digest — v0.19.0 → v0.22.1" type: summary tags: [releases, versions, changelog] updated: 2026-06-09 confidence: high sources: [raw/github_release-v0-19-0.md, raw/github_release-v0-19-1.md, raw/github_release-v0-20-0.md, raw/github_release-v0-20-1.md, raw/github_release-v0-20-2.md, raw/github_release-v0-21-0.md, raw/github_release-v0-22-0.md, raw/github_release-v0-22-1.md] --- # Release Digest — v0.19.0 → v0.22.1 (Apr–Jun 2026) Eight releases in ~9 weeks (full changelogs in `raw/github_release-*`). Current at fetch: **v0.22.1 (2026-06-05)**. | Release | Date | Highlights (from the notes) | |---|---|---| | v0.19.0 | 2026-04-03 | **Gemma 4 support**; zero-bubble async scheduling + speculative decoding | | v0.20.0 | 2026-04-27 | **DeepSeek V4 initial support**; **CUDA 13.0 becomes the default wheel/image** | | v0.20.1 | 2026-05-04 | base-model support; multi-stream pre-attention GEMM | | v0.20.2 | 2026-05-10 | DeepSeek V4 sparse-attention + KV-cache allocation fixes | | v0.21.0 | 2026-05-15 | **Transformers v4 deprecated**; **C++20 build requirement** | | v0.22.0 | 2026-05-29 | DeepSeek V4 hardening; **Model Runner V2 default for Qwen-family** | | v0.22.1 | 2026-06-05 | JetBrains **Mellum v2** (open-weights MoE); DeepSeek-V4 CUTLASS fix | ## Patterns worth knowing - **New-flagship-model support lands within weeks** of model releases (Gemma 4, DeepSeek V4, Mellum v2) and then hardens over 2–3 patch releases — if a brand-new model misbehaves, check whether you're on the release where its support *landed* vs. *matured* ([[syntheses/model-notes-from-the-tracker]]). - **Toolchain floors move fast**: CUDA 13.0 default, C++20 required, Transformers v4 deprecated — all within this window. Pin versions deliberately when building from source ([[concepts/install]]). --- title: "Model Notes from the Tracker — gpt-oss, Llama, Qwen, Gemma & Friends" type: synthesis tags: [models, gpt-oss, llama, qwen, compatibility] updated: 2026-06-09 confidence: medium sources: [raw/github_issue-bug-gpt-oss-on-ampere.md, raw/github_issue-bug-gpt-oss-fa3-not-detected-on-rtx-5090-blackwell-sinks-are.md, raw/github_issue-bug-vllm-vllm-openai-gptoss-assertionerror-sinks-are-only-su.md, raw/github_issue-bug-for-gpt-oss-120b-expected-2-output-messages-reasoning-an.md, raw/github_issue-model-meta-llama-3-1-know-issues-faq.md, raw/github_issue-llama3-2-vision-model-guides-and-issues.md, raw/github_issue-usage-qwen3-usage-guide.md, raw/github_issue-feature-support-gemma3-architecture.md, raw/github_issue-doc-steps-to-run-vllm-on-your-rtx5080-or-5090.md, raw/github_issue-bug-vllm-fails-to-run-internvl-hf-format-multimodal-model-bu.md, raw/github_issue-new-model-multimodal-embedding-model-gme.md, raw/github_issue-misc-throughput-latency-for-guided-json-with-100-gpu-cache-u.md] --- # Model Notes from the Tracker Per-model realities from well-discussed issues — the layer official docs don't capture. Confidence: medium (tracker threads, point-in-time). ## gpt-oss (a 4-issue cluster) - **Hardware floor:** the `0.10.1+gptoss` wheels were built **only for sm90/sm100 (Hopper/Blackwell)** — on Ampere (A100) or Ada (L40S) you hit `sinks are only supported...` assertions; a community-verified path was **building from source** per PR #22259. - **Blackwell (RTX 5090): FA3 not detected** reported even on supported hardware — attention-backend selection issues. - **Responses parsing:** "expected 2 output messages (reasoning and final)" errors reported for gpt-oss-120b — reasoning-output parsing is part of serving this model. - **Rule:** gpt-oss is hardware- and version-sensitive; check your compute capability against the wheel build before debugging anything else. ## Maintainer-run "known issues / FAQ" threads exist for major models Llama 3.1 (e.g. the `rope_scaling ... KeyError: 'type'` config error class), Llama 3.2 Vision (guides + issues), and Qwen3 (usage guide incl. MCP/spec-decode questions) each have a dedicated tracker mega-thread — **these threads are the fastest path when a major model misbehaves.** Gemma 3's architecture support arrived via a tracked feature request ([[summaries/release-digest]] for when each landed). ## Newer GPUs (RTX 5080/5090) A maintainer-authored doc-issue walks through running vLLM on Blackwell consumer cards; community alternative: NVIDIA's Triton image (`nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3`), tested on a 5090. ## Multimodal / embedding edge InternVL in HF format failing to run, and the GME multimodal-embedding model request, mark the multimodal frontier; structured/guided JSON at 100% GPU-cache utilization has a dedicated throughput/latency thread. ## Related [[concepts/models-and-support]] · [[syntheses/troubleshooting-playbook]] --- title: "Serving Decisions — Mode, Memory, Scale" type: synthesis tags: [serving, decision, memory, scale] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-serving-offline-inference-md.md, raw/github_doc-docs-serving-online-serving-openai-compatible-server-md.md, raw/github_doc-docs-serving-parallelism-scaling-md.md, raw/github_doc-docs-configuration-conserving-memory-md.md, raw/github_doc-docs-configuration-optimization-md.md, raw/github_doc-docs-usage-faq-md.md, raw/github_doc-docs-features-quantization-online-md.md, raw/github_doc-docs-features-quantization-quantized-kvcache-md.md, raw/github_issue-rfc-deprecating-vllm-v0.md, raw/github_issue-v1-feedback-thread.md] --- # Serving Decisions The cross-doc decision page: mode → memory → scale. ## 1. Mode | Situation | Use | |---|---| | Batch jobs in your own Python | offline `LLM` class (`generate`/`chat`/`enqueue`) | | Anything that speaks OpenAI | `vllm serve` (one model per server — multi-model needs N servers + a router, per FAQ) | | Many models, swap rarely | restart the server per swap (the documented answer) | Everything runs the **V1 engine** — V0 is deprecated (RFC + the V1 feedback mega-thread track the migration edge cases). ## 2. Memory ladder (cheapest first) 1. `tensor_parallel_size` across available GPUs 2. **Online FP8 quantization** at load (`quantization="fp8_per_tensor"`) — no checkpoint needed 3. **FP8 KV cache** — more tokens in memory, longer contexts 4. Pre-quantized INT4/INT8 checkpoints ([[concepts/quantization]] decision rules) 5. Accept preemption/recompute under KV pressure (engine does this automatically; `-O2` default optimization level) ## 3. Scale ladder Single GPU → TP (one node) → TP+PP (multi-node) → DP for replica throughput → EP for MoE → CP for long contexts. Details + sizing logs: [[concepts/parallelism-and-scaling]]. ## Related [[concepts/openai-compatible-server]] · [[concepts/configuration]] · [[concepts/observability-and-ops]] --- title: "Troubleshooting Playbook" type: synthesis tags: [troubleshooting, errors, hangs, oom] updated: 2026-06-09 confidence: high sources: [raw/github_doc-docs-usage-troubleshooting-md.md, raw/github_doc-docs-serving-distributed-troubleshooting-md.md, raw/github_doc-docs-configuration-conserving-memory-md.md, raw/github_issue-bug-docker-vllm-0-9-1-cuda-error-an-illegal-memory-access-sa.md, raw/github_issue-importerror-ramyapra-vllm-vllm-c-cpython-310-x86-64-linux-gn.md, raw/github_issue-arm-aarch-64-server-build-failed-host-os-ubuntu22-04-3.md, raw/github_doc-docs-getting-started-installation-gpu-cuda-inc-md.md] --- # Troubleshooting Playbook Built from the official troubleshooting docs + well-discussed tracker issues. First rule (official): **after debugging, unset debug env vars** — they slow the system if left on. ## Hangs - **Hang downloading a model** → pre-download with `huggingface-cli` and pass the local path; isolates network from vLLM. - **Hang loading from disk** → model on a slow shared/network filesystem; move to local disk; watch CPU memory (large models can swap-thrash). ## Crashes & errors - **`Cannot re-initialize CUDA in forked subprocess`** → something touched CUDA before vLLM init; don't call CUDA-initializing functions first, select devices with `CUDA_VISIBLE_DEVICES` instead. - **CUDA "illegal memory access" (Docker)** → version-specific bug class (e.g. the 0.9.1 issue); try the matching-CUDA image tag and check the issue tracker for your exact version. - **`ImportError: ..._C.cpython-310-x86_64-linux-gnu.so`** → binary/env mismatch — vLLM's compiled kernels are binary-incompatible across CUDA/PyTorch variants; reinstall in a **fresh environment** (the same rule the install docs state). - **ARM/aarch64 build failures** → reported on Ubuntu 22.04 hosts; check the issue for the build-flag state of play. - **NCCL problems with conda-installed PyTorch** → statically-linked NCCL conflict (issue #8420) — use pip/uv-installed torch. ## OOM Tensor-parallel across GPUs (`tensor_parallel_size`), then the conserving-memory knobs; quantization and FP8 KV cache are the bigger hammers ([[concepts/quantization]]). ## Multi-node See [[concepts/parallelism-and-scaling]] § troubleshooting (inter-node NCCL env propagation, `VLLM_HOST_IP` per node). ## When filing a bug Search existing issues first (official guidance); attach `vllm collect-env` output ([[concepts/cli-reference]]).