{"slug":"vllm","title":"vLLM","description":"Wiki for vLLM: installation, the OpenAI-compatible server, configuration, quantization, distributed serving, and per-model realities.","tags":["vllm","serving","inference"],"category":"inference","scope":{"covers":"vLLM installation (CUDA/ROCm/CPU), offline LLM API and the OpenAI-compatible server, engine args and env vars, quantization methods, parallelism and scaling, CLI, metrics/ops, releases v0.19-v0.22, and tracker-sourced model notes (gpt-oss, Llama, Qwen).","notCovered":"Kubernetes/Docker deployment guides, engine internals/design docs, contributing, benchmarking deep-dives, and releases after the date below - use web search.","currentAs":"2026-06-09 (v0.22.1)"},"lastUpdated":"2026-06-09","documentCount":19,"raw_base":"/raw/vllm/","html_base":"/wiki/vllm/","documents":[{"path":"README.md","title":"LLM Wiki","type":null,"updated":null},{"path":"wiki/index.md","title":"vLLM KB — Master Index","type":"index","updated":"2026-06-09"},{"path":"wiki/concepts/cli-reference.md","title":"CLI Reference — vllm {serve,chat,complete,bench,run-batch}","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/configuration.md","title":"Configuration — Engine Args, Env Vars, Memory","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/install.md","title":"Installation (GPU / CPU / Platforms)","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/integrations-and-clients.md","title":"Integrations — Claude Code, Codex, LangChain, LlamaIndex","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/models-and-support.md","title":"Models & Support (incl. Transformers Backend)","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/multimodal-and-lora.md","title":"Multimodal Inputs, LoRA & Prompt Embeddings","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/observability-and-ops.md","title":"Observability & Ops — Metrics, Reproducibility, Usage Stats","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/openai-compatible-server.md","title":"OpenAI-Compatible Server","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/parallelism-and-scaling.md","title":"Parallelism & Scaling (TP / PP / DP / EP / CP)","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/pooling-models.md","title":"Pooling Models — Embeddings, Classify, Score, Reward","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/quantization.md","title":"Quantization — Methods & When to Use Which","type":"concept","updated":"2026-06-09"},{"path":"wiki/concepts/quickstart-and-serving.md","title":"Quickstart — Offline Inference & Online Serving","type":"concept","updated":"2026-06-09"},{"path":"wiki/log.md","title":"Activity Log","type":"log","updated":null},{"path":"wiki/summaries/release-digest.md","title":"Release Digest — v0.19.0 → v0.22.1","type":"summary","updated":"2026-06-09"},{"path":"wiki/syntheses/model-notes-from-the-tracker.md","title":"Model Notes from the Tracker — gpt-oss, Llama, Qwen, Gemma & Friends","type":"synthesis","updated":"2026-06-09"},{"path":"wiki/syntheses/serving-decisions.md","title":"Serving Decisions — Mode, Memory, Scale","type":"synthesis","updated":"2026-06-09"},{"path":"wiki/syntheses/troubleshooting-playbook.md","title":"Troubleshooting Playbook","type":"synthesis","updated":"2026-06-09"}]}