# vLLM — full corpus


<!-- ===== vllm/README.md ===== -->

# LLM Wiki

An open-source template for building LLM-powered knowledge bases, following [Andrej Karpathy's "LLM Wiki" pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).

You provide raw sources. The LLM reads them, writes structured wiki pages, cross-links everything, and maintains it over time. You never edit the wiki directly — you curate sources and ask questions.

## How It Works

The system has three layers:

```
raw/              Sources you collect (articles, transcripts, notes, PDFs)
wiki/             LLM-written & maintained pages (summaries, concepts, entities, syntheses)
CLAUDE.md         Schema that tells the LLM how to structure everything
```

Three operations drive the workflow:

| Operation | Trigger | What happens |
|-----------|---------|--------------|
| **Ingest** | "ingest raw/my-source.txt" | LLM reads the source, creates a summary page, creates/updates concept and entity pages, adds cross-links, updates the index and log |
| **Query** | Ask any question | LLM searches the wiki, synthesizes an answer with citations, optionally creates a synthesis page for novel insights |
| **Lint** | "lint" or "health check" | LLM audits all pages for orphans, contradictions, missing links, incomplete sections, and low-confidence claims — fixes what it can, reports the rest |

## Quick Start

1. **Clone this repo**
   ```bash
   git clone https://github.com/YOUR_USERNAME/llm-wiki.git my-knowledge-base
   cd my-knowledge-base
   ```

2. **Customize CLAUDE.md** for your domain
   - Update the Purpose section with your topic
   - Replace the placeholder tagging taxonomy with your own categories
   - Adjust confidence level descriptions if needed
   - Everything else (workflows, page formats, linking rules) works as-is

3. **Drop sources into `raw/`**
   - Text files, transcripts, articles, notes — any plain text
   - These are immutable once added; the LLM never modifies them

4. **Tell the LLM to ingest**
   ```
   ingest raw/my-first-source.txt
   ```
   The LLM will create summary pages, concept pages, entity pages, cross-links, and update the index.

5. **Ask questions**
   ```
   What are the key differences between X and Y?
   ```
   The LLM answers from the wiki, citing specific pages.

6. **Run health checks**
   ```
   lint
   ```
   The LLM audits the wiki and fixes issues.

## Directory Structure

```
.
├── CLAUDE.md                      # Schema — the LLM's instructions
├── raw/                           # Your source documents (immutable)
└── wiki/
    ├── index.md                   # Master catalog of all pages
    ├── log.md                     # Append-only activity log
    ├── dashboard.md               # Dataview dashboard (Obsidian)
    ├── analytics.md               # Charts View analytics (Obsidian)
    ├── flashcards.md              # Spaced repetition cards
    ├── summaries/                 # One page per source document
    ├── concepts/                  # Concept and framework pages
    ├── entities/                  # People, tools, organizations, etc.
    ├── syntheses/                 # Cross-cutting analyses and comparisons
    ├── journal/                   # Research/session journal entries
    │   └── template.md            # Journal entry template
    └── presentations/             # Marp slide decks
```

## Enhancements

This template includes several extras beyond the core wiki pattern:

### Dataview Dashboard (`wiki/dashboard.md`)
Live queries that surface low-confidence pages, recent updates, concepts by tag, and pages with the most sources. Requires the [Dataview](https://github.com/blacksmithgu/obsidian-dataview) Obsidian plugin.

### Charts View Analytics (`wiki/analytics.md`)
Visual analytics with pie charts, bar charts, and word clouds. Requires the [Charts View](https://github.com/caronchen/obsidian-chartsview-plugin) Obsidian plugin.

### Mermaid Diagrams
Use Mermaid code blocks in any wiki page to create flowcharts, sequence diagrams, or concept maps. Native support in Obsidian and GitHub.

### Marp Slides (`wiki/presentations/`)
Create slide decks from markdown using [Marp](https://marp.app/). Drop presentation files in this directory.

### Research Journal (`wiki/journal/`)
Track your research sessions, experiments, or applied work with the included template. The LLM can reference journal entries when answering queries.

### Spaced Repetition (`wiki/flashcards.md`)
Flashcards in the format used by the [Spaced Repetition](https://github.com/st3v3nmw/obsidian-spaced-repetition) Obsidian plugin. Ask the LLM to generate flashcards from any wiki page.

### MCP Server
This repo works with Claude Code's MCP server capabilities. Point an MCP-compatible client at this repo and the LLM can read/write the wiki programmatically.

## Customizing for Your Domain

The schema in `CLAUDE.md` is domain-agnostic. To adapt it:

1. **Purpose** — Describe your knowledge domain in one paragraph
2. **Tagging taxonomy** — Replace placeholder categories with your own (e.g., for a cooking KB: `cuisine`, `technique`, `ingredient`, `equipment`)
3. **Confidence levels** — Adjust the descriptions to match your domain's evidence standards
4. **Entity types** — Update the entity page description to match what entities mean in your domain (people, tools, companies, etc.)
5. **Journal template** — Customize `wiki/journal/template.md` for your workflow

Everything else — page format, linking conventions, workflows, rules — is universal and works across domains.

## Example Domains

This template works for any knowledge-intensive topic:

- **Research notes** — papers, experiments, methodologies
- **Book analysis** — themes, characters, author techniques
- **Competitive analysis** — companies, products, market trends
- **Course notes** — lectures, readings, key concepts
- **Personal development** — frameworks, habits, book summaries
- **Technical documentation** — APIs, architectures, design patterns
- **Hobby deep-dives** — any subject you want to master

## License

MIT


<!-- ===== vllm/wiki/index.md ===== -->

---
title: "vLLM KB — Master Index"
type: index
updated: 2026-06-09
vllm_version: "v0.22.1"
---

# vLLM KB — Master Index

**Domain:** vLLM — high-throughput LLM serving engine (vllm-project/vllm; docs.vllm.ai). OpenAI-compatible server, offline `LLM` API, quantization, distributed serving.
**Corpus:** 124 provenance-stamped sources in `raw/` (96 official docs from the repo's docs/ tree, 20 well-discussed tracker issues, 8 release notes).
**Pages:** 15 (11 concepts · 1 summary · 3 syntheses)

## Concepts

- [[concepts/install]] — CUDA 12.9 wheels (compute 7.5+), uv install, fresh-env rule, ROCm/XPU/CPU, no-MPS
- [[concepts/quickstart-and-serving]] — offline `LLM` (generate/chat/enqueue) vs `vllm serve`
- [[concepts/openai-compatible-server]] — endpoints + caveats, YAML config, one-model-per-server
- [[concepts/configuration]] — engine args, `VLLM_` env vars (+K8s naming trap), conserving memory, `-O0..-O3`, model resolution
- [[concepts/models-and-support]] — native vs Transformers backend (<5% penalty), hardware-specific lists
- [[concepts/pooling-models]] — embed/classify/score/reward, score_types, FAQ model picks
- [[concepts/quantization]] — full method map (LLM Compressor, online FP8, KV cache, GGUF warning, AWQ deprecation) + decision rules
- [[concepts/multimodal-and-lora]] — multimodal inputs + SSRF guard, per-request LoRA, prompt embeds, disagg prefill
- [[concepts/parallelism-and-scaling]] — TP/PP/DP/EP/CP decision ladder, Ray multi-node troubleshooting, insecure-by-default warning
- [[concepts/cli-reference]] — vllm {serve,chat,complete,run-batch,bench,collect-env}
- [[concepts/observability-and-ops]] — /metrics, reproducibility flags, usage-stats telemetry, V1-only world
- [[concepts/integrations-and-clients]] — Claude Code & Codex on your own models, LangChain/LlamaIndex

## Summaries

- [[summaries/release-digest]] — v0.19.0 → v0.22.1: model-support cadence, toolchain floors (CUDA 13, C++20)

## Syntheses

- [[syntheses/serving-decisions]] — mode → memory ladder → scale ladder
- [[syntheses/troubleshooting-playbook]] — hangs, CUDA errors, env mismatches, OOM, bug-filing
- [[syntheses/model-notes-from-the-tracker]] — gpt-oss hardware floor, model FAQ mega-threads, Blackwell notes

## Statistics

- **Total pages**: 15 · **Sources ingested**: 124 (immutable raw/)
- **Confidence**: 13 high · 2 medium

## Coverage notes

Strong: install, serving, configuration, quantization, parallelism, CLI, ops, model realities. Excluded from v1 by design (declared scope): deployment/ (k8s/docker guides, 35 docs), design/ internals, contributing/, benchmarking deep-dive, training. Recency: fetched 2026-06-09; vLLM releases ~weekly — expect drift.


<!-- ===== vllm/wiki/concepts/cli-reference.md ===== -->

---
title: "CLI Reference — vllm {serve,chat,complete,bench,run-batch}"
type: concept
tags: [cli, commands, bench]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-cli-readme-md.md, raw/github_doc-docs-cli-serve-md.md, raw/github_doc-docs-cli-chat-md.md, raw/github_doc-docs-cli-run-batch-md.md, raw/github_doc-docs-cli-bench-serve-md.md, raw/github_doc-docs-cli-bench-throughput-md.md, raw/github_doc-docs-cli-bench-latency-md.md]
---

# CLI Reference

```bash
vllm --help
vllm {chat,complete,serve,launch,bench,collect-env,run-batch}
```

| Command | Purpose |
|---|---|
| `vllm serve` | start the OpenAI-compatible API server ([[concepts/openai-compatible-server]]) |
| `vllm chat` / `vllm complete` | interactive chat / completion against a server |
| `vllm run-batch` | batch-process requests from a file |
| `vllm bench {serve,throughput,latency}` | benchmarking suite (serving load, offline throughput, latency) |
| `vllm collect-env` | environment report — attach it to bug reports ([[syntheses/troubleshooting-playbook]]) |

Notes:
- Complex/nested CLI values are passed as **JSON** (the docs' "JSON CLI arguments" tip applies to `serve`, `run-batch`, and the bench commands).
- The full per-command flag references are auto-generated from argparse in the docs; engine flags overlap `LLM(...)` constructor args ([[concepts/configuration]]).
- `vllm bench sweep` exists for parameter sweeps with plotting (see `raw/` cli/bench/sweep docs).

## Related

[[concepts/quickstart-and-serving]] · [[concepts/configuration]]


<!-- ===== vllm/wiki/concepts/configuration.md ===== -->

---
title: "Configuration — Engine Args, Env Vars, Memory"
type: concept
tags: [configuration, engine-args, env-vars, memory]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-configuration-readme-md.md, raw/github_doc-docs-configuration-engine-args-md.md, raw/github_doc-docs-configuration-env-vars-md.md, raw/github_doc-docs-configuration-conserving-memory-md.md, raw/github_doc-docs-configuration-optimization-md.md, raw/github_doc-docs-configuration-model-resolution-md.md]
---

# Configuration

Three priority levels (highest → lowest, per the config README): request-level > engine/server args > defaults.

## Engine arguments

Control engine behavior in both modes: constructor args to `LLM` (offline) and flags to `vllm serve` (online). Source of truth = the config classes in `vllm.config` (`EngineArgs` / `AsyncEngineArgs`); the docs render the full argparse reference.

## Environment variables

All vLLM env vars are prefixed **`VLLM_`**. Two official warnings:
- **`VLLM_PORT` / `VLLM_HOST_IP` are for vLLM's *internal* coordination — NOT the API server's host/port** (use `--host`/`--port` flags for that).
- **Kubernetes: don't name a service `vllm`** — K8s injects service-name-prefixed env vars that collide with vLLM's.

## Conserving memory (OOM toolbox)

- **Tensor parallelism**: `tensor_parallel_size=N` splits the model across GPUs.
- Set `CUDA_VISIBLE_DEVICES` to pick devices — and **don't call CUDA-initializing functions before vLLM init** or you'll hit `Cannot re-initialize CUDA in forked subprocess`.
- More knobs: [[syntheses/serving-decisions]] § memory.

## Optimization levels

`-O0`…`-O3` trade startup time for performance: `-O0` none/fastest-startup; `-O1` simple compilation + PIECEWISE cudagraphs; **`-O2` default** (more fusions, FULL_AND_PIECEWISE cudagraphs); `-O3` aggressive (currently = `-O2`). Preemption: when KV-cache space runs out, vLLM preempts requests and recomputes them later.

## Model resolution

vLLM resolves models by the `architectures` field in the repo's `config.json` against registered implementations — resolution failures usually trace to that field.

## Related

[[concepts/openai-compatible-server]] · [[concepts/parallelism-and-scaling]]


<!-- ===== vllm/wiki/concepts/install.md ===== -->

---
title: "Installation (GPU / CPU / Platforms)"
type: concept
tags: [install, cuda, rocm, hardware]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-getting-started-installation-readme-md.md, raw/github_doc-docs-getting-started-installation-gpu-cuda-inc-md.md, raw/github_doc-docs-getting-started-installation-gpu-rocm-inc-md.md, raw/github_doc-docs-getting-started-quickstart-md.md, raw/github_issue-does-vllm-support-the-mac-metal-mps.md]
---

# Installation

**Prereqs:** Linux, Python **3.10–3.13**. (macOS: no MPS backend in core vLLM — maintainer-confirmed; Apple Silicon GPU acceleration comes via the separate **vLLM-Metal** project.)

## NVIDIA CUDA (the main path)

vLLM ships pre-compiled C++/CUDA **12.9** binaries. GPU needs **compute capability 7.5+** (T4, RTX20xx, A100, L4, H100, B200…). Recommended install with `uv`:

```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
```

Gotchas (official):
- **Use a fresh environment.** vLLM's compiled kernels are binary-incompatible across CUDA/PyTorch build variations; mixing with an existing torch install means building from source.
- **Avoid conda-installed PyTorch** — it statically links NCCL, which can break vLLM's NCCL usage (issue #8420).

## AMD ROCm

ROCm **6.3+** (MI350 needs 7.0+; Ryzen AI MAX needs 7.0.2+). Prebuilt wheels: `rocm700` (Python 3.12, vLLM 0.14.0–0.18.0) and `rocm721` (nightlies). Supported GPUs include MI200s (gfx90a), MI300 (gfx942), MI350 (gfx950), Radeon RX 7900/9000 series, Ryzen AI MAX / AI 300.

## Other platforms

Intel XPU, and CPU (Intel/AMD x86, ARM AArch64, Apple Silicon, IBM Z) are supported; further hardware arrives as **out-of-tree hardware plugins** (Hardware-Pluggable RFC).

## Related

[[concepts/quickstart-and-serving]] · [[syntheses/troubleshooting-playbook]] (build/run failures)


<!-- ===== vllm/wiki/concepts/integrations-and-clients.md ===== -->

---
title: "Integrations — Claude Code, Codex, LangChain, LlamaIndex"
type: concept
tags: [integrations, agents, langchain]
updated: 2026-06-09
confidence: medium
sources: [raw/github_doc-docs-serving-integrations-claude-code-md.md, raw/github_doc-docs-serving-integrations-codex-md.md, raw/github_doc-docs-serving-integrations-langchain-md.md, raw/github_doc-docs-serving-integrations-llamaindex-md.md]
---

# Integrations

vLLM's OpenAI-compatible server slots in as the backend for popular clients — official integration docs exist for:

## Agentic coding tools

- **Claude Code** (Anthropic's terminal coding agent) — point it at a vLLM server to use **your own models instead of the Anthropic API**.
- **Codex** (OpenAI's terminal coding agent) — same pattern, your models instead of the OpenAI API.

Both follow from the compatible endpoints ([[concepts/openai-compatible-server]] — note vLLM also implements `/v1/responses`, which Codex-class tools use).

## Frameworks

- **LangChain** and **LlamaIndex** — vLLM is available as a model backend in both (pip-installable integration; vLLM serves inference, the framework orchestrates — the division of labor the pooling docs spell out for RAG, [[concepts/pooling-models]]).

## Related

[[concepts/openai-compatible-server]] · [[concepts/quickstart-and-serving]]


<!-- ===== vllm/wiki/concepts/models-and-support.md ===== -->

---
title: "Models & Support (incl. Transformers Backend)"
type: concept
tags: [models, support, transformers]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-models-supported-models-md.md, raw/github_doc-docs-models-generative-models-md.md, raw/github_doc-docs-models-hardware-supported-models-cpu-md.md]
---

# Models & Support

vLLM supports **generative** and **pooling** models, listed per task with their architectures in the supported-models doc.

## Two implementation paths

1. **Native vLLM implementations** (`vllm/model_executor/models`) — the listed text and multimodal models. Generative models implement `VllmModelForTextGeneration` and output log-probabilities from final hidden states.
2. **Transformers modeling backend** — models *not* natively implemented can run via their HuggingFace Transformers implementation, at **<5% performance penalty** vs a dedicated implementation (official figure). Works for embedding/language/vision-language modalities and encoder-only/decoder-only/MoE architectures.

That second path is the answer to "is X supported?" for long-tail models: often yes, via Transformers backend, before a native implementation lands ([[syntheses/model-notes-from-the-tracker]] for per-model realities).

## Hardware-specific model lists

Separate validated-model lists exist per hardware class (e.g. CPU/Xeon, XPU) — check those when off the NVIDIA path.

## Related

[[concepts/pooling-models]] · [[concepts/multimodal-and-lora]] · [[concepts/configuration]] (model resolution)


<!-- ===== vllm/wiki/concepts/multimodal-and-lora.md ===== -->

---
title: "Multimodal Inputs, LoRA & Prompt Embeddings"
type: concept
tags: [multimodal, lora, vlm, features]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-features-multimodal-inputs-md.md, raw/github_doc-docs-features-lora-md.md, raw/github_doc-docs-features-prompt-embeds-md.md, raw/github_doc-docs-features-disagg-prefill-md.md]
---

# Multimodal Inputs, LoRA & Prompt Embeddings

## Multimodal inputs

Pass images/video/audio to multimodal models via `multi_modal_data` (schema: `vllm.inputs.MultiModalDataDict`) alongside the HF-format `prompt`.

**Official security warning for serving VLMs:** set `--allowed-media-domains` (e.g. `upload.wikimedia.org github.com`) to stop the server fetching arbitrary URLs (**SSRF risk**), and `VLLM_MEDIA_URL_ALLOW_REDIRECTS=0` to block redirect-based bypasses — especially in containerized deployments with internal-network access.

## LoRA adapters

Any model implementing `SupportsLoRA` can serve adapters **per-request with minimal overhead**:

```python
llm = LLM(model=base, enable_lora=True)
# per request: LoRARequest(name, id, path)
```

Download adapters locally first (e.g. `huggingface_hub.snapshot_download`).

## Prompt embeddings & disaggregated prefill

- **Prompt embeds:** feed embedding tensors directly instead of token ids.
- **Disaggregated prefill (experimental):** split prefill and decode across instances — pairs with the parallelism options in [[concepts/parallelism-and-scaling]].

## Related

[[concepts/models-and-support]] · [[syntheses/model-notes-from-the-tracker]] (VLM gotchas)


<!-- ===== vllm/wiki/concepts/observability-and-ops.md ===== -->

---
title: "Observability & Ops — Metrics, Reproducibility, Usage Stats"
type: concept
tags: [metrics, monitoring, reproducibility, ops]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-usage-metrics-md.md, raw/github_doc-docs-usage-reproducibility-md.md, raw/github_doc-docs-usage-usage-stats-md.md, raw/github_doc-docs-usage-readme-md.md, raw/github_doc-docs-usage-v1-guide-md.md]
---

# Observability & Ops

## Production metrics

The OpenAI-compatible server exposes health/system metrics at **`/metrics`** (Prometheus-style) — the monitoring hook for production deployments.

## Reproducibility

**Not guaranteed by default** (performance trade-off). To get reproducible results: offline, set `VLLM_ENABLE_V1_MULTIPROCESSING=0` (deterministic scheduling) — see the reproducibility doc for the full conditions.

## Anonymous usage stats

vLLM **collects anonymous usage data by default** (hardware/model-config telemetry; aggregated subsets published, e.g. the 2024 report at 2024.vllm.ai). Opt-out documented in the usage-stats page.

## V1 engine

**V0 is fully deprecated** (RFC #18571) — V1 re-architected the scheduler, KV-cache manager, worker, sampler, and API server while keeping V0's models/kernels. If guidance references V0-only behavior, it's outdated.

## Related

[[concepts/configuration]] · [[syntheses/troubleshooting-playbook]]


<!-- ===== vllm/wiki/concepts/openai-compatible-server.md ===== -->

---
title: "OpenAI-Compatible Server"
type: concept
tags: [server, api, openai-compatible]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-serving-online-serving-openai-compatible-server-md.md, raw/github_doc-docs-configuration-serve-args-md.md, raw/github_doc-docs-usage-faq-md.md, raw/github_issue-support-multiple-models.md]
---

# OpenAI-Compatible Server

`vllm serve <model>` exposes an HTTP server implementing OpenAI's APIs against your local model.

## Supported endpoints (with documented caveats)

| Endpoint | Notes (official) |
|---|---|
| `/v1/completions` | text-generation models; **`suffix` param not supported** |
| `/v1/responses` | text-generation models |
| `/v1/chat/completions` | needs a chat template; **`user` param ignored**; `parallel_tool_calls=false` forces ≤1 tool call (default `true` allows more, model-dependent) |
| `/v1/embeddings` | pooling models ([[concepts/pooling-models]]) |

## Server arguments — three ways

`vllm serve` flags (see [[concepts/cli-reference]]), or a **YAML config file**:

```yaml
# config.yaml
model: meta-llama/Llama-3.1-8B-Instruct
host: "127.0.0.1"
port: 6379
```

Argument names must be the long form of the CLI flags.

## One model per server (FAQ)

Serving **multiple models on one port is not supported** — run one server instance per model and put a router in front (official FAQ + a long-running feature request). Model swap = stop the old server, start a new one.

## Related

[[concepts/configuration]] · [[syntheses/serving-decisions]] · [[concepts/integrations-and-clients]]


<!-- ===== vllm/wiki/concepts/parallelism-and-scaling.md ===== -->

---
title: "Parallelism & Scaling (TP / PP / DP / EP / CP)"
type: concept
tags: [distributed, tensor-parallel, scaling]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-serving-parallelism-scaling-md.md, raw/github_doc-docs-serving-data-parallel-deployment-md.md, raw/github_doc-docs-serving-context-parallel-deployment-md.md, raw/github_doc-docs-serving-expert-parallel-deployment-md.md, raw/github_doc-docs-serving-distributed-troubleshooting-md.md, raw/github_doc-docs-usage-security-md.md]
---

# Parallelism & Scaling

## Choosing a strategy (official decision ladder)

1. **Fits on one GPU** → no distribution.
2. **Too big for one GPU, fits on one node** → tensor parallelism: `tensor_parallel_size=<GPUs per node>`.
3. **Too big for one node** → TP + pipeline parallelism: `tensor_parallel_size=8, pipeline_parallel_size=2` for 2×8-GPU nodes.

Watch startup logs for `GPU KV cache size: N tokens` and `Maximum concurrency for X tokens per request: Y×` to size capacity.

## The other axes

- **Data parallel** — replicate weights across instances/GPUs for independent batches (dense + MoE).
- **Expert parallel (EP)** — MoE experts on separate GPUs; "more efficient when used in conjunction with DP".
- **Context parallel (CP)** — long-context serving; prefill and decode handled separately (TTFT amortization for long prefill).

## Multi-node troubleshooting (Ray)

- Verify **inter-node GPU communication** first; pass env like `NCCL_SOCKET_IFNAME=eth0` at *cluster creation* so it propagates to all nodes (issue #6803).
- `No available node types can fulfill resource request` despite enough GPUs → multiple IPs per node; set `VLLM_HOST_IP` per node (different value each), verify with `ray status` / `ray list nodes` (issue #7815).

## Security (official)

**Inter-node communication is insecure by default** — isolate the cluster network.

## Related

[[concepts/configuration]] · [[syntheses/serving-decisions]]


<!-- ===== vllm/wiki/concepts/pooling-models.md ===== -->

---
title: "Pooling Models — Embeddings, Classify, Score, Reward"
type: concept
tags: [embeddings, pooling, rerank, classification]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-models-pooling-models-readme-md.md, raw/github_doc-docs-models-pooling-models-embed-md.md, raw/github_doc-docs-models-pooling-models-classify-md.md, raw/github_doc-docs-models-pooling-models-scoring-md.md, raw/github_doc-docs-models-pooling-models-reward-md.md, raw/github_doc-docs-usage-faq-md.md]
---

# Pooling Models

Non-generative models for **NLU tasks** — classification and retrieval. Official caveat: pooling support exists "primarily for convenience" — **no performance guarantee over using HF Transformers / Sentence Transformers directly** (optimization is planned, issue #21796).

## The four usages

| Usage | What it does |
|---|---|
| **Embed** | unstructured input → numerical embedding vectors (also via `/v1/embeddings`) |
| **Classify** | predict the best label for an input |
| **Score** | similarity between two prompts; three `score_type`s: `cross-encoder`, `late-interaction`, `bi-encoder` — the reranking piece of RAG |
| **Reward** | score the quality of generated outputs (RM, human-preference proxy) |

vLLM handles only the **model-inference component** of RAG (embedding + reranking); orchestration belongs to frameworks like LangChain ([[concepts/integrations-and-clients]]).

## Which embedding model? (FAQ)

Officially suggested starters: `e5-mistral-7b-instruct`, `BAAI/bge-base-en-v1.5`. Generative models (Llama-3-8B etc.) *can* be auto-converted to embedders by extracting hidden states — but are "expected to be inferior" to purpose-trained embedding models.

## Related

[[concepts/models-and-support]] · [[concepts/openai-compatible-server]]


<!-- ===== vllm/wiki/concepts/quantization.md ===== -->

---
title: "Quantization — Methods & When to Use Which"
type: concept
tags: [quantization, fp8, int4, gguf, awq]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-features-quantization-readme-md.md, raw/github_doc-docs-features-quantization-llm-compressor-readme-md.md, raw/github_doc-docs-features-quantization-llm-compressor-fp8-md.md, raw/github_doc-docs-features-quantization-llm-compressor-int4-md.md, raw/github_doc-docs-features-quantization-online-md.md, raw/github_doc-docs-features-quantization-quantized-kvcache-md.md, raw/github_doc-docs-features-quantization-bnb-md.md, raw/github_doc-docs-features-quantization-gguf-md.md, raw/github_doc-docs-features-quantization-auto-awq-md.md, raw/github_doc-docs-features-quantization-gptqmodel-md.md, raw/github_doc-docs-features-quantization-modelopt-md.md, raw/github_doc-docs-features-quantization-inc-md.md, raw/github_doc-docs-features-quantization-quark-md.md, raw/github_doc-docs-features-quantization-fp8-vit-attn-md.md]
---

# Quantization

Trades precision for memory. **Official tip: start with LLM Compressor** (the vLLM project's own library — FP4/FP8/INT8/INT4).

## Method map

| Method | Notes (from the docs) |
|---|---|
| **LLM Compressor** (recommended) | FP8 W8A8: hardware-accelerated on H100/MI300x; **W8A8 official only on Hopper/Ada**, Turing/Ampere get W8A16 weight-only via Marlin kernels. INT4 W4A16: memory savings + low-QPS latency; ready-made HF collection of INT4 checkpoints exists |
| **Online quantization** | quantize BF16/FP16 → FP8 **at load time**, no pre-quantized checkpoint or calibration: `LLM(model, quantization="fp8_per_tensor")` or `"fp8_per_block"` (128×128 weight blocks) |
| **Quantized KV cache** | FP8 KV cache → more tokens in memory, longer contexts. Per-tensor or per-attention-head scales (per-head requires Flash Attention backend + llm-compressor calibration); with FA3, attention itself runs in FP8 |
| **BitsAndBytes** | no calibration data needed |
| **GGUF** | **"highly experimental and under-optimized"** (official warning) — use only as a memory-footprint reducer |
| **AutoAWQ** | **deprecated** — AWQ functionality absorbed into LLM Compressor |
| **GPTQModel** | INT4/INT8 GPTQ checkpoints (ModelCloud) |
| **NVIDIA Model Optimizer** | PTQ + QAT for LLMs/VLMs/diffusion on NVIDIA |
| **Intel AutoRound / Neural Compressor** | INT2–8, MXFP8/4, NVFP4, GGUF; strong at 2–3 bits |
| **AMD Quark** | the quantization toolkit for AMD GPUs |
| **FP8 ViT encoder attention** | for big-image VLM workloads where the vision encoder bottlenecks |

## Decision rules (as documented)

- NVIDIA Hopper/Ada, want throughput → **LLM Compressor FP8 W8A8**; older NVIDIA → W8A16 or INT4.
- No pre-quantized checkpoint handy → **online quantization** at load.
- Memory-bound on context length → add **FP8 KV cache**.
- AMD → **Quark**; Intel → **AutoRound**.
- Avoid GGUF in vLLM unless you specifically need it (experimental).

## Related

[[concepts/configuration]] (conserving memory) · [[summaries/release-digest]]


<!-- ===== vllm/wiki/concepts/quickstart-and-serving.md ===== -->

---
title: "Quickstart — Offline Inference & Online Serving"
type: concept
tags: [quickstart, offline, serving, llm-class]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-getting-started-quickstart-md.md, raw/github_doc-docs-serving-offline-inference-md.md, raw/github_doc-docs-serving-online-serving-readme-md.md]
---

# Quickstart — Offline Inference & Online Serving

vLLM has two usage modes:

## 1. Offline batched inference (the `LLM` class)

```python
from vllm import LLM
llm = LLM(model="...")
```

APIs by model type:
- **Generative models:** `LLM.generate` (completions), `LLM.chat` (chat conversations)
- **Async queue:** `LLM.enqueue` / `LLM.enqueue_chat` / `LLM.wait_for_completion` — enqueue without blocking, collect later
- **Pooling models** (embeddings/classify/score — [[concepts/pooling-models]]) have their own APIs

## 2. Online serving (HTTP server)

```bash
vllm serve <model>
```

Starts the OpenAI-compatible server ([[concepts/openai-compatible-server]]) — compatible "with many interfaces" per the serving overview. Configuration via engine/server args ([[concepts/configuration]]).

## Related

[[concepts/cli-reference]] · [[concepts/models-and-support]]


<!-- ===== vllm/wiki/log.md ===== -->

---
title: "Activity Log"
type: log
---

# Activity Log

Append-only record of all wiki changes.

## Format

Each entry follows this format:
```
### YYYY-MM-DD HH:MM — [Action Type]
- **Source/Trigger**: what initiated the action
- **Pages created**: list of new pages
- **Pages updated**: list of updated pages
- **Notes**: any contradictions flagged, decisions made
```

---

### 2026-04-08 00:00 — Setup

- **Source/Trigger**: Repository initialized
- **Pages created**: index.md, log.md, dashboard.md, analytics.md, flashcards.md
- **Pages updated**: none
- **Notes**: Empty knowledge base ready for first source ingestion

---

## 2026-06-10 — removed Obsidian scaffolding from the served wiki

Deleted `analytics.md`, `dashboard.md`, `flashcards.md` (Obsidian plugin pages — Dataview/Charts View/Spaced Repetition markup, unusable when served as plain Markdown to agents) and the `journal/` scaffold (template only). `CLAUDE.md` directory layout updated: production/planning material lives at repo root, never under `wiki/` (everything under `wiki/` is served publicly).


<!-- ===== vllm/wiki/summaries/release-digest.md ===== -->

---
title: "Release Digest — v0.19.0 → v0.22.1"
type: summary
tags: [releases, versions, changelog]
updated: 2026-06-09
confidence: high
sources: [raw/github_release-v0-19-0.md, raw/github_release-v0-19-1.md, raw/github_release-v0-20-0.md, raw/github_release-v0-20-1.md, raw/github_release-v0-20-2.md, raw/github_release-v0-21-0.md, raw/github_release-v0-22-0.md, raw/github_release-v0-22-1.md]
---

# Release Digest — v0.19.0 → v0.22.1 (Apr–Jun 2026)

Eight releases in ~9 weeks (full changelogs in `raw/github_release-*`). Current at fetch: **v0.22.1 (2026-06-05)**.

| Release | Date | Highlights (from the notes) |
|---|---|---|
| v0.19.0 | 2026-04-03 | **Gemma 4 support**; zero-bubble async scheduling + speculative decoding |
| v0.20.0 | 2026-04-27 | **DeepSeek V4 initial support**; **CUDA 13.0 becomes the default wheel/image** |
| v0.20.1 | 2026-05-04 | base-model support; multi-stream pre-attention GEMM |
| v0.20.2 | 2026-05-10 | DeepSeek V4 sparse-attention + KV-cache allocation fixes |
| v0.21.0 | 2026-05-15 | **Transformers v4 deprecated**; **C++20 build requirement** |
| v0.22.0 | 2026-05-29 | DeepSeek V4 hardening; **Model Runner V2 default for Qwen-family** |
| v0.22.1 | 2026-06-05 | JetBrains **Mellum v2** (open-weights MoE); DeepSeek-V4 CUTLASS fix |

## Patterns worth knowing

- **New-flagship-model support lands within weeks** of model releases (Gemma 4, DeepSeek V4, Mellum v2) and then hardens over 2–3 patch releases — if a brand-new model misbehaves, check whether you're on the release where its support *landed* vs. *matured* ([[syntheses/model-notes-from-the-tracker]]).
- **Toolchain floors move fast**: CUDA 13.0 default, C++20 required, Transformers v4 deprecated — all within this window. Pin versions deliberately when building from source ([[concepts/install]]).


<!-- ===== vllm/wiki/syntheses/model-notes-from-the-tracker.md ===== -->

---
title: "Model Notes from the Tracker — gpt-oss, Llama, Qwen, Gemma & Friends"
type: synthesis
tags: [models, gpt-oss, llama, qwen, compatibility]
updated: 2026-06-09
confidence: medium
sources: [raw/github_issue-bug-gpt-oss-on-ampere.md, raw/github_issue-bug-gpt-oss-fa3-not-detected-on-rtx-5090-blackwell-sinks-are.md, raw/github_issue-bug-vllm-vllm-openai-gptoss-assertionerror-sinks-are-only-su.md, raw/github_issue-bug-for-gpt-oss-120b-expected-2-output-messages-reasoning-an.md, raw/github_issue-model-meta-llama-3-1-know-issues-faq.md, raw/github_issue-llama3-2-vision-model-guides-and-issues.md, raw/github_issue-usage-qwen3-usage-guide.md, raw/github_issue-feature-support-gemma3-architecture.md, raw/github_issue-doc-steps-to-run-vllm-on-your-rtx5080-or-5090.md, raw/github_issue-bug-vllm-fails-to-run-internvl-hf-format-multimodal-model-bu.md, raw/github_issue-new-model-multimodal-embedding-model-gme.md, raw/github_issue-misc-throughput-latency-for-guided-json-with-100-gpu-cache-u.md]
---

# Model Notes from the Tracker

Per-model realities from well-discussed issues — the layer official docs don't capture. Confidence: medium (tracker threads, point-in-time).

## gpt-oss (a 4-issue cluster)

- **Hardware floor:** the `0.10.1+gptoss` wheels were built **only for sm90/sm100 (Hopper/Blackwell)** — on Ampere (A100) or Ada (L40S) you hit `sinks are only supported...` assertions; a community-verified path was **building from source** per PR #22259.
- **Blackwell (RTX 5090): FA3 not detected** reported even on supported hardware — attention-backend selection issues.
- **Responses parsing:** "expected 2 output messages (reasoning and final)" errors reported for gpt-oss-120b — reasoning-output parsing is part of serving this model.
- **Rule:** gpt-oss is hardware- and version-sensitive; check your compute capability against the wheel build before debugging anything else.

## Maintainer-run "known issues / FAQ" threads exist for major models

Llama 3.1 (e.g. the `rope_scaling ... KeyError: 'type'` config error class), Llama 3.2 Vision (guides + issues), and Qwen3 (usage guide incl. MCP/spec-decode questions) each have a dedicated tracker mega-thread — **these threads are the fastest path when a major model misbehaves.** Gemma 3's architecture support arrived via a tracked feature request ([[summaries/release-digest]] for when each landed).

## Newer GPUs (RTX 5080/5090)

A maintainer-authored doc-issue walks through running vLLM on Blackwell consumer cards; community alternative: NVIDIA's Triton image (`nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3`), tested on a 5090.

## Multimodal / embedding edge

InternVL in HF format failing to run, and the GME multimodal-embedding model request, mark the multimodal frontier; structured/guided JSON at 100% GPU-cache utilization has a dedicated throughput/latency thread.

## Related

[[concepts/models-and-support]] · [[syntheses/troubleshooting-playbook]]


<!-- ===== vllm/wiki/syntheses/serving-decisions.md ===== -->

---
title: "Serving Decisions — Mode, Memory, Scale"
type: synthesis
tags: [serving, decision, memory, scale]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-serving-offline-inference-md.md, raw/github_doc-docs-serving-online-serving-openai-compatible-server-md.md, raw/github_doc-docs-serving-parallelism-scaling-md.md, raw/github_doc-docs-configuration-conserving-memory-md.md, raw/github_doc-docs-configuration-optimization-md.md, raw/github_doc-docs-usage-faq-md.md, raw/github_doc-docs-features-quantization-online-md.md, raw/github_doc-docs-features-quantization-quantized-kvcache-md.md, raw/github_issue-rfc-deprecating-vllm-v0.md, raw/github_issue-v1-feedback-thread.md]
---

# Serving Decisions

The cross-doc decision page: mode → memory → scale.

## 1. Mode

| Situation | Use |
|---|---|
| Batch jobs in your own Python | offline `LLM` class (`generate`/`chat`/`enqueue`) |
| Anything that speaks OpenAI | `vllm serve` (one model per server — multi-model needs N servers + a router, per FAQ) |
| Many models, swap rarely | restart the server per swap (the documented answer) |

Everything runs the **V1 engine** — V0 is deprecated (RFC + the V1 feedback mega-thread track the migration edge cases).

## 2. Memory ladder (cheapest first)

1. `tensor_parallel_size` across available GPUs
2. **Online FP8 quantization** at load (`quantization="fp8_per_tensor"`) — no checkpoint needed
3. **FP8 KV cache** — more tokens in memory, longer contexts
4. Pre-quantized INT4/INT8 checkpoints ([[concepts/quantization]] decision rules)
5. Accept preemption/recompute under KV pressure (engine does this automatically; `-O2` default optimization level)

## 3. Scale ladder

Single GPU → TP (one node) → TP+PP (multi-node) → DP for replica throughput → EP for MoE → CP for long contexts. Details + sizing logs: [[concepts/parallelism-and-scaling]].

## Related

[[concepts/openai-compatible-server]] · [[concepts/configuration]] · [[concepts/observability-and-ops]]


<!-- ===== vllm/wiki/syntheses/troubleshooting-playbook.md ===== -->

---
title: "Troubleshooting Playbook"
type: synthesis
tags: [troubleshooting, errors, hangs, oom]
updated: 2026-06-09
confidence: high
sources: [raw/github_doc-docs-usage-troubleshooting-md.md, raw/github_doc-docs-serving-distributed-troubleshooting-md.md, raw/github_doc-docs-configuration-conserving-memory-md.md, raw/github_issue-bug-docker-vllm-0-9-1-cuda-error-an-illegal-memory-access-sa.md, raw/github_issue-importerror-ramyapra-vllm-vllm-c-cpython-310-x86-64-linux-gn.md, raw/github_issue-arm-aarch-64-server-build-failed-host-os-ubuntu22-04-3.md, raw/github_doc-docs-getting-started-installation-gpu-cuda-inc-md.md]
---

# Troubleshooting Playbook

Built from the official troubleshooting docs + well-discussed tracker issues. First rule (official): **after debugging, unset debug env vars** — they slow the system if left on.

## Hangs

- **Hang downloading a model** → pre-download with `huggingface-cli` and pass the local path; isolates network from vLLM.
- **Hang loading from disk** → model on a slow shared/network filesystem; move to local disk; watch CPU memory (large models can swap-thrash).

## Crashes & errors

- **`Cannot re-initialize CUDA in forked subprocess`** → something touched CUDA before vLLM init; don't call CUDA-initializing functions first, select devices with `CUDA_VISIBLE_DEVICES` instead.
- **CUDA "illegal memory access" (Docker)** → version-specific bug class (e.g. the 0.9.1 issue); try the matching-CUDA image tag and check the issue tracker for your exact version.
- **`ImportError: ..._C.cpython-310-x86_64-linux-gnu.so`** → binary/env mismatch — vLLM's compiled kernels are binary-incompatible across CUDA/PyTorch variants; reinstall in a **fresh environment** (the same rule the install docs state).
- **ARM/aarch64 build failures** → reported on Ubuntu 22.04 hosts; check the issue for the build-flag state of play.
- **NCCL problems with conda-installed PyTorch** → statically-linked NCCL conflict (issue #8420) — use pip/uv-installed torch.

## OOM

Tensor-parallel across GPUs (`tensor_parallel_size`), then the conserving-memory knobs; quantization and FP8 KV cache are the bigger hammers ([[concepts/quantization]]).

## Multi-node

See [[concepts/parallelism-and-scaling]] § troubleshooting (inter-node NCCL env propagation, `VLLM_HOST_IP` per node).

## When filing a bug

Search existing issues first (official guidance); attach `vllm collect-env` output ([[concepts/cli-reference]]).