# Ollama — full corpus


<!-- ===== ollama/README.md ===== -->

# LLM Wiki

An open-source template for building LLM-powered knowledge bases, following [Andrej Karpathy's "LLM Wiki" pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).

You provide raw sources. The LLM reads them, writes structured wiki pages, cross-links everything, and maintains it over time. You never edit the wiki directly — you curate sources and ask questions.

## How It Works

The system has three layers:

```
raw/              Sources you collect (articles, transcripts, notes, PDFs)
wiki/             LLM-written & maintained pages (summaries, concepts, entities, syntheses)
CLAUDE.md         Schema that tells the LLM how to structure everything
```

Three operations drive the workflow:

| Operation | Trigger | What happens |
|-----------|---------|--------------|
| **Ingest** | "ingest raw/my-source.txt" | LLM reads the source, creates a summary page, creates/updates concept and entity pages, adds cross-links, updates the index and log |
| **Query** | Ask any question | LLM searches the wiki, synthesizes an answer with citations, optionally creates a synthesis page for novel insights |
| **Lint** | "lint" or "health check" | LLM audits all pages for orphans, contradictions, missing links, incomplete sections, and low-confidence claims — fixes what it can, reports the rest |
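
To make the Lint row concrete, here is a minimal sketch of one of its checks, orphan detection. In this template the LLM performs the audit itself; `find_orphans` is a hypothetical helper, and the sketch assumes pages are held in a dict mapping page name to Markdown text and linked with `[[wikilinks]]`.

```python
import re

# Matches [[target]] and [[target|alias]] wikilinks; captures the target only.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def find_orphans(pages, roots=("index",)):
    """Return sorted page names that no other page links to.

    `pages` maps page name -> Markdown text. `roots` are entry points
    (hypothetical default: the index) that are never reported as orphans.
    """
    linked = set()
    for name, text in pages.items():
        for target in WIKILINK.findall(text):
            target = target.strip()
            if target != name:  # a self-link does not rescue a page
                linked.add(target)
    return sorted(n for n in pages if n not in linked and n not in roots)

pages = {
    "index": "- [[concepts/a]]",
    "concepts/a": "See [[concepts/b|B]].",
    "concepts/b": "Leaf page.",
    "concepts/c": "Nobody links here.",
}
print(find_orphans(pages))
```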

## Quick Start

1. **Clone this repo**
   ```bash
   git clone https://github.com/YOUR_USERNAME/llm-wiki.git my-knowledge-base
   cd my-knowledge-base
   ```

2. **Customize CLAUDE.md** for your domain
   - Update the Purpose section with your topic
   - Replace the placeholder tagging taxonomy with your own categories
   - Adjust confidence level descriptions if needed
   - Everything else (workflows, page formats, linking rules) works as-is

3. **Drop sources into `raw/`**
   - Text files, transcripts, articles, notes — any plain text
   - These are immutable once added; the LLM never modifies them

4. **Tell the LLM to ingest**
   ```
   ingest raw/my-first-source.txt
   ```
   The LLM will create summary pages, concept pages, entity pages, cross-links, and update the index.

5. **Ask questions**
   ```
   What are the key differences between X and Y?
   ```
   The LLM answers from the wiki, citing specific pages.

6. **Run health checks**
   ```
   lint
   ```
   The LLM audits the wiki and fixes issues.

## Directory Structure

```
.
├── CLAUDE.md                      # Schema — the LLM's instructions
├── raw/                           # Your source documents (immutable)
└── wiki/
    ├── index.md                   # Master catalog of all pages
    ├── log.md                     # Append-only activity log
    ├── dashboard.md               # Dataview dashboard (Obsidian)
    ├── analytics.md               # Charts View analytics (Obsidian)
    ├── flashcards.md              # Spaced repetition cards
    ├── summaries/                 # One page per source document
    ├── concepts/                  # Concept and framework pages
    ├── entities/                  # People, tools, organizations, etc.
    ├── syntheses/                 # Cross-cutting analyses and comparisons
    ├── journal/                   # Research/session journal entries
    │   └── template.md            # Journal entry template
    └── presentations/             # Marp slide decks
```

## Enhancements

This template includes several extras beyond the core wiki pattern:

### Dataview Dashboard (`wiki/dashboard.md`)
Live queries that surface low-confidence pages, recent updates, concepts by tag, and pages with the most sources. Requires the [Dataview](https://github.com/blacksmithgu/obsidian-dataview) Obsidian plugin.

### Charts View Analytics (`wiki/analytics.md`)
Visual analytics with pie charts, bar charts, and word clouds. Requires the [Charts View](https://github.com/caronchen/obsidian-chartsview-plugin) Obsidian plugin.

### Mermaid Diagrams
Use Mermaid code blocks in any wiki page to create flowcharts, sequence diagrams, or concept maps. Native support in Obsidian and GitHub.

### Marp Slides (`wiki/presentations/`)
Create slide decks from markdown using [Marp](https://marp.app/). Drop presentation files in this directory.

### Research Journal (`wiki/journal/`)
Track your research sessions, experiments, or applied work with the included template. The LLM can reference journal entries when answering queries.

### Spaced Repetition (`wiki/flashcards.md`)
Flashcards in the format used by the [Spaced Repetition](https://github.com/st3v3nmw/obsidian-spaced-repetition) Obsidian plugin. Ask the LLM to generate flashcards from any wiki page.

### MCP Server
This repo works with Claude Code's MCP server capabilities. Point an MCP-compatible client at this repo and the LLM can read/write the wiki programmatically.

## Customizing for Your Domain

The schema in `CLAUDE.md` is domain-agnostic. To adapt it:

1. **Purpose** — Describe your knowledge domain in one paragraph
2. **Tagging taxonomy** — Replace placeholder categories with your own (e.g., for a cooking KB: `cuisine`, `technique`, `ingredient`, `equipment`)
3. **Confidence levels** — Adjust the descriptions to match your domain's evidence standards
4. **Entity types** — Update the entity page description to match what entities mean in your domain (people, tools, companies, etc.)
5. **Journal template** — Customize `wiki/journal/template.md` for your workflow

Everything else — page format, linking conventions, workflows, rules — is universal and works across domains.

## Example Domains

This template works for any knowledge-intensive topic:

- **Research notes** — papers, experiments, methodologies
- **Book analysis** — themes, characters, author techniques
- **Competitive analysis** — companies, products, market trends
- **Course notes** — lectures, readings, key concepts
- **Personal development** — frameworks, habits, book summaries
- **Technical documentation** — APIs, architectures, design patterns
- **Hobby deep-dives** — any subject you want to master

## License

MIT


<!-- ===== ollama/wiki/index.md ===== -->

---
title: "Ollama KB — Master Index"
type: index
updated: 2026-06-23
ollama_version: "0.30.10"
---

# Ollama KB — Master Index

**Domain:** Ollama — run open-weight LLMs locally: install & serve, the CLI, Modelfiles, the REST API and OpenAI/Anthropic-compatible endpoints, capabilities (tools, vision, embeddings, structured outputs, thinking, web search), GPU/hardware, and Ollama Cloud.
**Corpus:** 121 provenance-stamped sources in `raw/` — the official docs (docs.ollama.com llms.txt), the README, 14 release notes (v0.23–v0.30), and 40 solved GitHub issues.
**Pages:** 17 (13 concepts · 1 entity · 1 summary · 2 syntheses) — the user ring plus the operator/developer ring.

## Concepts (core ideas + operational how-tos)

- [[concepts/what-is-ollama]] — what Ollama is, how it runs models locally, first steps
- [[concepts/installation]] — install on macOS / Windows / Linux / Docker
- [[concepts/cli-reference]] — the full `ollama` command set, verbatim
- [[concepts/modelfile]] — Modelfile instructions (`FROM`, `PARAMETER`, `TEMPLATE`, `SYSTEM`, `ADAPTER`…) and creating/importing models
- [[concepts/rest-api]] — the native REST API: `/api/generate`, `/api/chat`, `/api/embed`, options, streaming
- [[concepts/openai-and-anthropic-compat]] — the OpenAI-compatible (`/v1/...`) and Anthropic-compatible endpoints
- [[concepts/tool-calling]] — function/tool calling: the `tools` array and `tool_calls`
- [[concepts/structured-outputs]] — JSON-schema-constrained responses via the `format` parameter
- [[concepts/vision-and-multimodal]] — multimodal models and passing images
- [[concepts/embeddings]] — `/api/embed`, embedding models, RAG/semantic search
- [[concepts/thinking-and-web-search]] — reasoning/"thinking" control and the web search API
- [[concepts/gpu-and-hardware]] — NVIDIA / AMD ROCm / Apple Metal support, VRAM, GPU selection
- [[concepts/configuration-and-serving]] — `ollama serve`, env vars (`OLLAMA_HOST`, `OLLAMA_MODELS`, `OLLAMA_KEEP_ALIVE`, `OLLAMA_NUM_PARALLEL`, `OLLAMA_CONTEXT_LENGTH`), context window, networking

## Entities

- [[entities/ollama-cloud]] — Ollama Cloud: hosted models, `ollama signin`, the cloud API and web search

## Summaries

- [[summaries/model-library-and-integrations-catalog]] — the model-library tags and the integration ecosystem (editors/agents/tools) — mapped, not paged

## Syntheses (decisions & casebooks)

- [[syntheses/api-surfaces-compared]] — native REST vs OpenAI-compatible vs Anthropic-compatible: pick by need
- [[syntheses/troubleshooting-playbook]] — symptom → cause → fix from 40 solved issues (GPU not detected, AMD ROCm, OOM, stalled downloads, memory growth, stops-serving)

## Statistics

- **Total pages**: 17
- **Concepts**: 13 · **Entities**: 1 · **Summaries**: 1 · **Syntheses**: 2
- **Sources ingested**: 121 (raw/, immutable)
- **High confidence**: 15 · **Medium confidence**: 2 · **Low confidence**: 0

## Coverage notes

Strong: install/serve, the CLI and Modelfile, the native + compatible APIs, all the capability surfaces (tools/vision/embeddings/structured outputs/thinking/web search), GPU/hardware, and a solved-issues casebook. Latest release seen: v0.30.10 (14 releases v0.23–v0.30 in `raw/`); freshness = source fetch date 2026-06-23.

Mapped, not paged (see [[summaries/model-library-and-integrations-catalog]]): the full model library and the per-integration setup docs (Claude Code, Cline, Codex, Goose, Zed, VS Code, JetBrains, n8n, etc.). For live model availability and post-date releases, use `ollama.com` and web search.


<!-- ===== ollama/wiki/concepts/cli-reference.md ===== -->

---
title: "Ollama CLI Reference"
type: concept
tags: [cli, commands, reference, run, pull]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-cli-reference.md, raw/github_doc-readme-md.md, raw/llms_txt_doc-usage.md]
---
# Ollama CLI Reference

The `ollama` command manages and runs models locally. Run `ollama` with no arguments for the interactive menu.

## Running models

```
ollama run gemma4
```

Multimodal input — pass an image path in the prompt:

```
ollama run gemma4 "What's in this image? /Users/jmorgan/Desktop/smile.png"
```

Multiline input — wrap text with `"""`:

```
>>> """Hello,
... world!
... """
```

## Managing models

```
ollama pull gemma4          # Download a model
ollama rm gemma4            # Remove a model
ollama ls                   # List models (also: ollama list)
ollama ps                   # List running models
ollama stop gemma4          # Stop a running model
ollama cp mymodel myuser/mymodel   # Copy a model
ollama push myuser/mymodel         # Push a model to ollama.com
ollama show --modelfile llama3.2   # Show a model's Modelfile
```

## Creating a model

First create a `Modelfile`:

```
FROM gemma4
SYSTEM """You are a happy cat."""
```

Then run `ollama create`:

```
ollama create mymodel -f Modelfile
```

See [[concepts/modelfile]] for the full Modelfile syntax.

## Embeddings

```
ollama run embeddinggemma "Hello world"
echo "Hello world" | ollama run nomic-embed-text
```

Output is a JSON array. See [[concepts/embeddings]].

## Serving

```
ollama serve
```

Starts the Ollama server. To view a list of environment variables that can be set, run `ollama serve --help`. See [[concepts/configuration-and-serving]].

## Launching integrations

```
ollama launch                          # interactive
ollama launch claude                   # specific integration
ollama launch claude --model qwen3.5   # with a specific model
ollama launch droid --config           # configure without launching
```

Supported integrations include **OpenCode**, **Claude Code**, **Codex**, **VS Code**, and **Droid**.

## Authentication

```
ollama signin     # Sign in to Ollama
ollama signout    # Sign out of Ollama
ollama -v         # Print version
```

See [[concepts/configuration-and-serving]] for sign-in details and API keys.


<!-- ===== ollama/wiki/concepts/configuration-and-serving.md ===== -->

---
title: "Configuration and Serving"
type: concept
tags: [serve, configuration, environment-variables, networking, context-length]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-faq.md, raw/llms_txt_doc-context-length.md, raw/llms_txt_doc-authentication.md, raw/llms_txt_doc-troubleshooting.md]
---
# Configuration and Serving

Start the server with `ollama serve`; it is configured entirely through environment variables (`ollama serve --help` lists them).

## Key environment variables

| Variable | Purpose |
| --- | --- |
| `OLLAMA_HOST` | Bind address. Default binds `127.0.0.1` port `11434`. Set e.g. `0.0.0.0:11434` to expose on the network. |
| `OLLAMA_MODELS` | Directory where downloaded models are stored. |
| `OLLAMA_KEEP_ALIVE` | How long models stay loaded in memory (duration string, seconds, `-1` to keep loaded, `0` to unload immediately). |
| `OLLAMA_NUM_PARALLEL` | Max parallel requests per model (default 1). RAM scales by `OLLAMA_NUM_PARALLEL` * `OLLAMA_CONTEXT_LENGTH`. |
| `OLLAMA_MAX_LOADED_MODELS` | Max models loaded concurrently. Default is 3 * number of GPUs, or 3 for CPU inference. |
| `OLLAMA_MAX_QUEUE` | Max queued requests before returning a 503 (default 512). |
| `OLLAMA_CONTEXT_LENGTH` | Default context window in tokens (default 4096). |
| `OLLAMA_FLASH_ATTENTION` | Set to `1` to enable Flash Attention (reduces memory as context grows). |
| `OLLAMA_KV_CACHE_TYPE` | K/V cache quantization type: `f16` (default), `q8_0`, `q4_0`. Global; requires Flash Attention. |
| `OLLAMA_ORIGINS` | Additional allowed CORS origins (defaults allow `127.0.0.1` and `0.0.0.0`). |
| `OLLAMA_NO_CLOUD` | Set to `1` to disable cloud features (local-only mode). |
| `HTTPS_PROXY` | Proxy for outbound model pulls. Avoid setting `HTTP_PROXY` — Ollama pulls over HTTPS only. |

### Setting environment variables per platform

- **macOS** (run as app): `launchctl setenv OLLAMA_HOST "0.0.0.0:11434"`, then restart Ollama.
- **Linux** (systemd): `systemctl edit ollama.service`, add `Environment="OLLAMA_HOST=0.0.0.0:11434"` under `[Service]`, then `systemctl daemon-reload && systemctl restart ollama`.
- **Windows**: Quit Ollama, edit your user environment variables (`OLLAMA_HOST`, `OLLAMA_MODELS`, etc.), then relaunch from the Start menu.
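
The Linux step can be scripted. This is a minimal sketch, assuming you would rather generate the drop-in text than type it by hand; `render_systemd_override` is a hypothetical helper and the values are examples only.

```python
def render_systemd_override(env):
    """Render a systemd drop-in that sets environment variables for ollama.service.

    `env` maps variable names to values, e.g. {"OLLAMA_HOST": "0.0.0.0:11434"}.
    """
    lines = ["[Service]"]
    for name, value in env.items():
        # One Environment= line per variable, quoted as in the Linux bullet.
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"

override = render_systemd_override({
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_KEEP_ALIVE": "10m",
})
print(override)
```

Paste the output into the editor opened by `systemctl edit ollama.service`, then reload and restart the service.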

## Context window control

Default context is 4096 tokens; web search, agents, and coding tools should use at least 64000. Override with `OLLAMA_CONTEXT_LENGTH=8192 ollama serve`. In `ollama run`, use `/set parameter num_ctx 4096`; via the API, `"options": { "num_ctx": 4096 }`. Verify with `ollama ps` (`PROCESSOR`, `CONTEXT` columns).
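
As an illustration of the per-request override, the sketch below builds (without sending) an `/api/chat` request that carries `options.num_ctx`. It uses only the Python standard library; `chat_request` is a hypothetical helper, and the commented-out send assumes a server is listening locally.

```python
import json
import urllib.request

def chat_request(model, prompt, num_ctx, host="http://localhost:11434"):
    """Build (but do not send) a non-streaming /api/chat request with a custom context size."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context window
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("gemma4", "Summarize this repo.", num_ctx=64000)
print(req.full_url, json.loads(req.data)["options"])
# To send it (assumption: a local server is running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["message"]["content"])
```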

## Where models are stored

macOS `~/.ollama/models`; Linux `/usr/share/ollama/.ollama/models`; Windows `C:\Users\%username%\.ollama\models`. Relocate with `OLLAMA_MODELS`; on Linux the `ollama` user needs access: `sudo chown -R ollama:ollama <directory>`.

## Keeping models loaded

Models stay in memory 5 minutes by default. `ollama stop <model>` unloads immediately, or use the API `keep_alive` on `/api/generate` and `/api/chat` (`"10m"`, `3600`, negative to keep loaded, or `0` to unload). `keep_alive` overrides `OLLAMA_KEEP_ALIVE`.
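
A small sketch of the `keep_alive` values described above, written as request bodies for `/api/generate`. It assumes a body with only `model` and `keep_alive` (no prompt) is enough to load or unload a model; `keep_alive_body` is a hypothetical helper.

```python
import json

def keep_alive_body(model, keep_alive):
    """Body for /api/generate that only changes how long `model` stays loaded.

    keep_alive: duration string ("10m"), seconds (3600), a negative number
    (keep loaded), or 0 (unload immediately).
    """
    return {"model": model, "keep_alive": keep_alive}

preload_forever = keep_alive_body("gemma4", -1)
unload_now = keep_alive_body("gemma4", 0)
print(json.dumps(unload_now))
```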

## Networking and remote access

Ollama binds `127.0.0.1:11434` by default; set `OLLAMA_HOST` to change the address. Behind a reverse proxy such as Nginx, use `proxy_pass http://localhost:11434;` with `proxy_set_header Host localhost:11434;`. Tunneling tools need the same host header:

```shell
ngrok http 11434 --host-header="localhost:11434"
cloudflared tunnel --url http://localhost:11434 --http-host-header="localhost:11434"
```

## Authentication

A local server needs no authentication. Signing in is required for cloud models, publishing, and downloading private models; do so with `ollama signin`. For direct access to `https://ollama.com/api`, set `export OLLAMA_API_KEY=your_api_key` and pass `-H "Authorization: Bearer $OLLAMA_API_KEY"`.
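
The same bearer header can be built in Python. This is a sketch only: `cloud_headers` is a hypothetical helper, and it takes the environment as a parameter so the example runs without a real key.

```python
import os

def cloud_headers(environ=os.environ):
    """Return HTTP headers for https://ollama.com/api, reading OLLAMA_API_KEY."""
    key = environ.get("OLLAMA_API_KEY")
    if not key:
        raise RuntimeError("OLLAMA_API_KEY is not set")
    return {"Authorization": f"Bearer {key}"}

# Demonstrated with an explicit mapping instead of the real environment.
print(cloud_headers({"OLLAMA_API_KEY": "your_api_key"}))
```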

## Logs

macOS `cat ~/.ollama/logs/server.log`; Linux `journalctl -u ollama --no-pager --follow --pager-end`; Docker `docker logs <container-name>`; Windows `explorer %LOCALAPPDATA%\Ollama` (`server.log`). Debug: `OLLAMA_DEBUG=1`.

## See also

- [[concepts/cli-reference]] — `ollama serve`, `ollama ps`, `ollama stop`
- [[concepts/rest-api]] — `keep_alive`, `num_ctx` via the API
- [[syntheses/troubleshooting-playbook]]
- [[concepts/gpu-and-hardware]]


<!-- ===== ollama/wiki/concepts/embeddings.md ===== -->

---
title: "Embeddings"
type: concept
tags: [embeddings, rag, semantic-search, api]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-embeddings.md, raw/llms_txt_doc-generate-embeddings.md]
---
# Embeddings

Embeddings turn text into numeric vectors for vector databases, cosine-similarity search, or RAG pipelines. Vector length depends on the model (typically 384–1024 dimensions).

## Endpoint

`POST /api/embed` creates vector embeddings for the input text. Required: `model` and `input`. See [[concepts/rest-api]].

```shell
curl http://localhost:11434/api/embed -d '{
  "model": "embeddinggemma",
  "input": "Why is the sky blue?"
}'
```

Fields: `model` (required); `input` (string or array of strings, required — pass an array for batch embeddings); `truncate` (boolean, default `true`; `false` errors on over-long input); `dimensions` (integer); `keep_alive` (string); `options` (e.g. `num_ctx`). Response: `model`, `embeddings` (array of vectors), `total_duration`, `load_duration`, `prompt_eval_count`. Vectors are L2-normalized (unit length).
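
For batch input, a sketch of the request body and of pairing each input with its vector. It assumes the server returns one vector per input, in input order; `embed_body` and `pair_embeddings` are hypothetical helpers, and the response shown is hand-written rather than real output.

```python
def embed_body(model, texts, truncate=True):
    """Body for POST /api/embed; a list of strings requests a batch."""
    return {"model": model, "input": list(texts), "truncate": truncate}

def pair_embeddings(texts, response):
    """Zip each input text with its vector from an /api/embed response."""
    vectors = response["embeddings"]
    if len(vectors) != len(texts):
        raise ValueError("expected one vector per input")
    return dict(zip(texts, vectors))

texts = ["Why is the sky blue?", "Why is grass green?"]
body = embed_body("embeddinggemma", texts)
# Hand-written stand-in for a server response (real vectors are much longer).
fake_response = {"model": "embeddinggemma", "embeddings": [[0.6, 0.8], [1.0, 0.0]]}
print(pair_embeddings(texts, fake_response))
```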

## SDK and CLI

```python
import ollama
single = ollama.embed(model='embeddinggemma', input='The quick brown fox...')
print(len(single['embeddings'][0]))  # vector length
```

CLI — directly or by piping: `ollama run embeddinggemma "Hello world"` or `echo "Hello world" | ollama run embeddinggemma`.

## Recommended models

* `embeddinggemma`
* `qwen3-embedding`
* `all-minilm`

## Tips

* Use cosine similarity for most semantic search.
* Use the same embedding model for indexing and querying.
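
A minimal cosine-similarity ranking in plain Python, using toy 2-D vectors in place of real embeddings. `cosine_similarity` and `rank` are illustrative names; a vector database would normally do this step for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy 2-D vectors standing in for real embeddings.
docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(rank([0.0, 1.0], docs))
```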

Embeddings are also exposed through the OpenAI-compatible `/v1/embeddings` endpoint (supports `model`, `input`, `encoding_format`, `dimensions`); see [[concepts/openai-and-anthropic-compat]].


<!-- ===== ollama/wiki/concepts/gpu-and-hardware.md ===== -->

---
title: "GPU and Hardware Support"
type: concept
tags: [gpu, hardware, cuda, rocm, metal, vram]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-hardware-support.md, raw/llms_txt_doc-faq.md, raw/llms_txt_doc-troubleshooting.md]
---
# GPU and Hardware Support

Ollama accelerates inference on NVIDIA (CUDA), AMD (ROCm), Apple (Metal), and other GPUs via Vulkan, falling back to CPU when no GPU is usable. See [[concepts/configuration-and-serving]] for server environment variables.

## NVIDIA (CUDA)

Supports NVIDIA GPUs with **compute capability 5.0+** and driver 531+. Cards with compute capability 5.0–6.2 require driver 570+. Check your card at `https://developer.nvidia.com/cuda-gpus`.

Examples by compute capability: 12.0 = RTX 50xx (`RTX 5090`, `RTX 5080`...);
9.0 = `H200`, `H100`; 8.9 = RTX 40xx; 8.6 = RTX 30xx; 8.0 = `A100`, `A30`;
7.5 = RTX 20xx / `T4`; 5.0 = `GTX 750 Ti`.

### GPU selection

Limit Ollama to a subset of NVIDIA GPUs with `CUDA_VISIBLE_DEVICES` (comma-separated). Numeric IDs work but ordering may vary, so UUIDs (from `nvidia-smi -L`) are more reliable. Force CPU with an invalid GPU ID (e.g. `-1`).
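
A sketch of preparing the environment for a server process restricted to specific GPUs. `serve_env` is a hypothetical helper and the UUIDs are placeholders; real ones come from `nvidia-smi -L`.

```python
import os

def serve_env(gpu_ids, base=None):
    """Copy an environment and restrict Ollama to the given NVIDIA GPUs.

    `gpu_ids` is a list of UUIDs (preferred) or numeric IDs; an empty list
    forces CPU by using the invalid ID "-1".
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(gpu_ids) if gpu_ids else "-1"
    return env

# Placeholder UUIDs for illustration.
env = serve_env(["GPU-aaaa", "GPU-bbbb"], base={})
print(env["CUDA_VISIBLE_DEVICES"])
# To launch the server with it: subprocess.Popen(["ollama", "serve"], env=env)
```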

## AMD Radeon (ROCm)

AMD GPUs are supported via ROCm; Ollama requires the **AMD ROCm v7 driver** on Linux (install/upgrade with `amdgpu-install`) and a ROCm v7 / HIP7-capable driver stack on Windows. Supported families: Radeon RX (`7900 XTX`, `9070 XT`...), Radeon PRO (`W7900`...), Radeon AI PRO, Ryzen AI, and AMD Instinct (`MI300X`...).

### GPU selection and overrides

* Limit to a subset: set `ROCR_VISIBLE_DEVICES` (list devices with `rocminfo`; prefer `Uuid` over numeric IDs; `-1` forces CPU).
* Unsupported card: force a close LLVM target with `HSA_OVERRIDE_GFX_VERSION` using `x.y.z` syntax (e.g. `HSA_OVERRIDE_GFX_VERSION="10.3.0"` for an RX 5400). For multiple GPUs, suffix the device number, e.g. `HSA_OVERRIDE_GFX_VERSION_0=10.3.0`.

## Apple (Metal) and Vulkan

Apple devices accelerate via the Metal API. Windows/Linux also support Vulkan (enabled by default when installed): select GPUs with `GGML_VK_VISIBLE_DEVICES` (numeric IDs); disable all with `OLLAMA_VULKAN=0` or `GGML_VK_VISIBLE_DEVICES=-1`.

## Placement and VRAM

The `PROCESSOR` column of `ollama ps` shows `100% GPU`, `100% CPU`, or a split (`48%/52% CPU/GPU`). On load, a model that fits on a single GPU is placed there; otherwise it is spread across GPUs. `OLLAMA_MAX_LOADED_MODELS` (default 3 × GPUs, or 3 for CPU) and `OLLAMA_NUM_PARALLEL` (default 1) control concurrency; RAM scales by `OLLAMA_NUM_PARALLEL` × `OLLAMA_CONTEXT_LENGTH`.
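
For monitoring scripts, the processor values above can be parsed into numbers. This sketch handles only the three forms shown here and assumes no others occur; `parse_processor` is a hypothetical helper.

```python
import re

def parse_processor(cell):
    """Parse an `ollama ps` processor cell into {"CPU": pct, "GPU": pct}."""
    cell = cell.strip()
    split = re.fullmatch(r"(\d+)%/(\d+)% CPU/GPU", cell)
    if split:
        return {"CPU": int(split.group(1)), "GPU": int(split.group(2))}
    single = re.fullmatch(r"(\d+)% (CPU|GPU)", cell)
    if single:
        pct, device = int(single.group(1)), single.group(2)
        other = "GPU" if device == "CPU" else "CPU"
        return {device: pct, other: 100 - pct}
    raise ValueError(f"unrecognized processor value: {cell!r}")

for cell in ["100% GPU", "100% CPU", "48%/52% CPU/GPU"]:
    print(cell, "->", parse_processor(cell))
```

A GPU share below 100 means part of the model is running on the CPU.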

## CPU fallback and library override

Ollama auto-picks among bundled LLM libraries. CPU order: `cpu_avx2` > `cpu_avx` > `cpu`. Force one with `OLLAMA_LLM_LIBRARY`, e.g. `OLLAMA_LLM_LIBRARY="cpu_avx2" ollama serve`. For GPU-discovery failures, see [[syntheses/troubleshooting-playbook]].


<!-- ===== ollama/wiki/concepts/installation.md ===== -->

---
title: "Installing Ollama"
type: concept
tags: [installation, macos, windows, linux, docker]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-macos.md, raw/llms_txt_doc-windows.md, raw/llms_txt_doc-linux.md, raw/llms_txt_doc-docker.md, raw/llms_txt_doc-quickstart.md, raw/github_doc-readme-md.md]
---
# Installing Ollama

Ollama is available on macOS, Windows, and Linux. After install, the API is served on `http://localhost:11434`.

## macOS

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

Or [download manually](https://ollama.com/download/Ollama.dmg). Requires macOS Sonoma (v14)+; Apple M series (CPU+GPU) or x86 (CPU only). Preferred: mount `ollama.dmg` and drag the app to `Applications`. On startup the app verifies the `ollama` CLI is in PATH and, if not, prompts to create a link in `/usr/local/bin`.

Models and configuration live in `~/.ollama`; logs in `~/.ollama/logs`.

## Windows

```shell
irm https://ollama.com/install.ps1 | iex
```

Or [download manually](https://ollama.com/download/OllamaSetup.exe). Requires Windows 10 22H2+ (Home or Pro); NVIDIA 452.39+ drivers for NVIDIA cards. No Administrator needed; installs in your home directory (needs ≥4GB for the binary). The `ollama` command works in `cmd`, `powershell`, or your terminal.

Install to a different location:

```powershell
OllamaSetup.exe /DIR="d:\some\location"
```

A standalone `ollama-windows-amd64.zip` (CLI + GPU libs) is available for embedding or running as a service via `ollama serve`. Models and configuration live under `%HOMEPATH%\.ollama`.

## Linux

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

### Manual install

If upgrading, first `sudo rm -rf /usr/lib/ollama`, then extract and run:

```shell
curl -fsSL https://ollama.com/download/ollama-linux-amd64.tar.zst | sudo tar x -C /usr
ollama serve
# In another terminal, verify that Ollama is running:
ollama -v
```

AMD GPU adds `ollama-linux-amd64-rocm.tar.zst`; ARM64 uses `ollama-linux-arm64.tar.zst`. Pin a version: `curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.5.7 sh`.

### Startup service (recommended)

Create an `ollama` user, then `/etc/systemd/system/ollama.service`, and `sudo systemctl daemon-reload && sudo systemctl enable --now ollama`:

```ini
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=$PATH"

[Install]
WantedBy=multi-user.target
```

## Docker

Official image `ollama/ollama` on Docker Hub (Vulkan bundled, enabled when the container can access the GPU). Run a model after start: `docker exec -it ollama ollama run llama3.2`.

```shell
# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# NVIDIA GPU (requires NVIDIA Container Toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# AMD GPU
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
```

## See also

- [[concepts/cli-reference]] — full command set
- [[concepts/configuration-and-serving]] — env vars, model storage, networking
- [[concepts/gpu-and-hardware]] — GPU support details


<!-- ===== ollama/wiki/concepts/modelfile.md ===== -->

---
title: "Modelfile Reference"
type: concept
tags: [modelfile, create-model, customization, parameters, import]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-modelfile-reference.md, raw/llms_txt_doc-create-a-model.md, raw/llms_txt_doc-importing-a-model.md]
---
# Modelfile Reference

A Modelfile is the blueprint for creating and sharing customized models. Each line has the form `INSTRUCTION arguments` (`#` starts a comment); instructions are **not case sensitive** and may appear in any order. A minimal example, followed by the build-and-run command:

```
FROM llama3.2
PARAMETER temperature 1
PARAMETER num_ctx 4096
SYSTEM You are Mario from super mario bros, acting as an assistant.
```

```shell
ollama create choose-a-model-name -f ./Modelfile && ollama run choose-a-model-name
```

## Instructions

| Instruction | Description |
| --- | --- |
| `FROM` (required) | Defines the base model to use. |
| `PARAMETER` | Sets the parameters for how Ollama will run the model. |
| `TEMPLATE` | The full prompt template to be sent to the model. |
| `SYSTEM` | Specifies the system message set in the template. |
| `ADAPTER` | Defines the (Q)LoRA adapters to apply to the model. |
| `LICENSE` | Specifies the legal license. |
| `MESSAGE` | Specify message history. |
| `REQUIRES` | Specify the minimum version of Ollama required by the model. |

### FROM (required)

`FROM <model name>:<tag>` (existing model, e.g. `FROM llama3.2`), `FROM <model directory>` (Safetensors dir), or `FROM ./ollama-model.gguf` (GGUF, absolute or relative path). Supported Safetensors architectures: Llama (incl. 2/3/3.1/3.2), Mistral (incl. Mistral 1/2 and Mixtral), Gemma (incl. Gemma 1 and 2), Phi3.

### PARAMETER

`PARAMETER <parameter> <parametervalue>`:

| Parameter | Description | Default |
| --- | --- | --- |
| num_ctx | Size of the context window used to generate the next token | 2048 |
| repeat_last_n | How far back to look to prevent repetition (0 = disabled, -1 = num_ctx) | 64 |
| repeat_penalty | How strongly to penalize repetitions | 1.1 |
| temperature | Higher = more creative | 0.8 |
| seed | Random seed; fixed value gives reproducible output | 0 |
| stop | Stop sequence; set multiple `stop` params for multiple sequences | — |
| num_predict | Max tokens to predict (-1 = infinite) | -1 |
| draft_num_predict | Max speculative draft tokens per step (0 to disable) | 4 |
| top_k | Higher = more diverse | 40 |
| top_p | Works with top-k; higher = more diverse | 0.9 |
| min_p | Minimum token probability relative to the most likely token | 0.0 |
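
When Modelfiles are generated by a script, the `PARAMETER` lines can be rendered from a dict. A minimal sketch: `render_modelfile` is a hypothetical helper, and quoting string values (as for `stop`) is an assumption about style.

```python
def render_modelfile(base, parameters=None, system=None):
    """Render a Modelfile string with FROM, PARAMETER, and SYSTEM instructions."""
    lines = [f"FROM {base}"]
    for name, value in (parameters or {}).items():
        values = value if isinstance(value, list) else [value]
        for v in values:  # several `stop` sequences become several PARAMETER lines
            rendered = f'"{v}"' if isinstance(v, str) else str(v)
            lines.append(f"PARAMETER {name} {rendered}")
    if system:
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"

text = render_modelfile(
    "llama3.2",
    parameters={"temperature": 0.8, "num_ctx": 4096, "stop": ["User:", "Assistant:"]},
    system="You are a concise assistant.",
)
print(text)
```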

### TEMPLATE

The full prompt template, using Go [template syntax](https://pkg.go.dev/text/template). Variables: `{{ .System }}`, `{{ .Prompt }}`, `{{ .Response }}` (text after `.Response` is omitted when generating).

```
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
```
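
To see what this template produces, here is a plain-Python imitation of its logic. It is not Go's template engine and `render_chatml` is a hypothetical function; it only mirrors the two `if` blocks and the trailing assistant header.

```python
def render_chatml(system, prompt):
    """Plain-Python imitation of the ChatML TEMPLATE (not Go's template engine)."""
    out = ""
    if system:   # {{ if .System }} ... {{ end }}
        out += f"<|im_start|>system\n{system}<|im_end|>\n"
    if prompt:   # {{ if .Prompt }} ... {{ end }}
        out += f"<|im_start|>user\n{prompt}<|im_end|>\n"
    out += "<|im_start|>assistant\n"   # the model continues from here
    return out

print(render_chatml("You are a happy cat.", "Hello!"))
```

With an empty system message the first block is skipped, so the prompt starts directly at the user turn.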

### SYSTEM, ADAPTER, LICENSE

`SYSTEM """<system message>"""`; `ADAPTER ./ollama-lora.gguf` (or a Safetensor adapter path); `LICENSE """<license text>"""`. `ADAPTER` must be absolute or relative to the Modelfile, and `FROM` must use the **same base model** the adapter was tuned from or behavior will be erratic.

### MESSAGE

`MESSAGE <role> <message>` (roles `system`, `user`, `assistant`) builds a conversation to guide the model:

```
MESSAGE user Is Toronto in Canada?
MESSAGE assistant yes
```

### REQUIRES

`REQUIRES <version>` — e.g. `REQUIRES 0.14.0`.

## Importing and quantizing

Point `FROM` at a Safetensors/GGUF file or directory (`FROM .` if the Modelfile sits with the weights), then run `ollama create my-model`. Quantize an FP16/FP32 model with `-q`/`--quantize`, e.g. `ollama create --quantize q4_K_M mymodel`. Supported quantizations: `q8_0` and the K-means types `q4_K_S` and `q4_K_M`.

## See also

- [[concepts/cli-reference]] — `ollama create`, `ollama show`, `ollama push`
- [[concepts/rest-api]] — the `/api/create` endpoint


<!-- ===== ollama/wiki/concepts/openai-and-anthropic-compat.md ===== -->

---
title: "OpenAI- and Anthropic-Compatible APIs"
type: concept
tags: [openai, anthropic, compatibility, api, claude-code]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-openai-compatibility.md, raw/llms_txt_doc-anthropic-compatibility.md]
---
# OpenAI- and Anthropic-Compatible APIs

Drop-in compatible endpoints so existing OpenAI/Anthropic SDK code can point at a local Ollama server. See [[concepts/rest-api]] (native) and [[syntheses/api-surfaces-compared]].

## OpenAI compatibility

Base URL `http://localhost:11434/v1/`; `api_key` required by the SDK but ignored (use `'ollama'`):

```python
from openai import OpenAI
client = OpenAI(base_url='http://localhost:11434/v1/', api_key='ollama')
client.chat.completions.create(model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': 'Say this is a test'}])
```

Endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/models/{model}`, `/v1/embeddings`, `/v1/images/generations` (experimental), `/v1/responses` (added in v0.13.3; non-stateful only — no `previous_response_id` or `conversation`).

`/v1/chat/completions` supports streaming, JSON mode, reproducible outputs, vision, tools, reasoning control. Fields: `model`, `messages`, `frequency_penalty`, `presence_penalty`, `response_format`, `seed`, `stop`, `stream`, `stream_options.include_usage`, `temperature`, `top_p`, `max_tokens`, `tools`, `reasoning_effort` (`"high"`/`"medium"`/`"low"`/`"max"`/`"none"`), `reasoning.effort`. Not supported: `tool_choice`, `logit_bias`, `user`, `n`, `logprobs`.
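
When porting existing OpenAI client code, it can help to see which parameters fall in the unsupported list. A minimal sketch based only on the fields named above; `split_request` is a hypothetical helper, and how the server treats an unsupported field is not covered here.

```python
# Fields this page lists as not supported by /v1/chat/completions.
UNSUPPORTED = {"tool_choice", "logit_bias", "user", "n", "logprobs"}

def split_request(params):
    """Split OpenAI chat params into (supported, dropped) for an Ollama server."""
    kept = {k: v for k, v in params.items() if k not in UNSUPPORTED}
    dropped = sorted(k for k in params if k in UNSUPPORTED)
    return kept, dropped

kept, dropped = split_request({
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Say this is a test"}],
    "temperature": 0.2,
    "n": 2,
    "user": "alice",
})
print(sorted(kept), dropped)
```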

The OpenAI-compatible API has no way to set the context size per request. Instead, put `PARAMETER num_ctx <size>` in a Modelfile, build it with `ollama create mymodel`, and pass the new model name in requests. See [[concepts/modelfile]].

## Anthropic compatibility

Anthropic Messages API at `/v1/messages` (base URL `http://localhost:11434`), enabling tools like Claude Code. Set `ANTHROPIC_AUTH_TOKEN=ollama` (ignored) and `ANTHROPIC_BASE_URL=http://localhost:11434`.

```shell
curl -X POST http://localhost:11434/v1/messages \
-H "x-api-key: ollama" -H "anthropic-version: 2023-06-01" \
-d '{ "model": "qwen3-coder", "max_tokens": 1024,
  "messages": [{ "role": "user", "content": "Hello, how are you?" }] }'
```

`/v1/messages` supports streaming, system prompts, multi-turn, vision (base64), `tool_use`/`tool_result` blocks, `thinking` blocks. Fields: `model`, `max_tokens`, `messages`, `system`, `stream`, `temperature`, `top_p`, `top_k`, `stop_sequences`, `tools`, `thinking`.

**Claude Code:** `ollama launch claude` auto-configures and launches; `--config` configures without launching; or set the env vars and run `claude --model qwen3-coder`. Recommended: `glm-4.7`, `minimax-m2.1`, `qwen3-coder`.

**Differences from the real Anthropic API:** the API key is not validated, `anthropic-version` is unused, and token counts are approximate. Not supported: `/v1/messages/count_tokens`, `tool_choice`, `metadata`, prompt caching (`cache_control`), Batches API, citations, PDF (`document`) blocks. Images are base64-only (no URLs), and extended thinking is basic (`budget_tokens` is accepted but not enforced).
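
Because images must be sent as base64, a client has to encode the bytes itself. The sketch below builds a user message in the Anthropic Messages content-block shape, on the assumption that Ollama's compatible endpoint accepts the same shape; `image_block` and `user_message` are hypothetical helpers, and the bytes are a placeholder rather than a real PNG.

```python
import base64

def image_block(data, media_type="image/png"):
    """Wrap raw image bytes as an Anthropic-style base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }

def user_message(text, image_bytes):
    """A user turn carrying one image followed by a text question."""
    return {
        "role": "user",
        "content": [image_block(image_bytes), {"type": "text", "text": text}],
    }

msg = user_message("What is in this image?", b"\x89PNG placeholder bytes")
print(msg["content"][0]["source"]["type"])
```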


<!-- ===== ollama/wiki/concepts/rest-api.md ===== -->

---
title: "Ollama REST API"
type: concept
tags: [rest-api, api, generate, chat, embeddings, streaming]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-generate-a-response.md, raw/llms_txt_doc-generate-a-chat-message.md, raw/llms_txt_doc-generate-embeddings.md, raw/llms_txt_doc-usage.md, raw/llms_txt_doc-streaming.md, raw/llms_txt_doc-list-models.md, raw/llms_txt_doc-list-running-models.md, raw/llms_txt_doc-show-model-details.md, raw/llms_txt_doc-pull-a-model.md, raw/llms_txt_doc-create-a-model.md, raw/llms_txt_doc-get-version.md]
---
# Ollama REST API

Native API at `http://localhost:11434` (base path `/api`); cloud at `https://ollama.com/api`. No local auth.

## POST /api/generate

Prompt → response. Required: `model`. Fields: `prompt`, `suffix`, `images` (base64), `system`, `format` (`"json"` or a JSON schema), `stream` (default `true`), `think` (boolean or `"high"`/`"medium"`/`"low"`/`"max"`), `raw`, `keep_alive`, `options`, `logprobs`, `top_logprobs`. Response: `response`, `done`, `done_reason`, `thinking`, usage metrics (below).

## POST /api/chat

Next chat message. Required: `model`, `messages`. Fields: `tools`, `format`, `options`, `stream` (default `true`), `think`, `keep_alive`, `logprobs`, `top_logprobs`.

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [{ "role": "user", "content": "why is the sky blue?" }]
}'
```

Each message has `role` (`system`/`user`/`assistant`/`tool`), `content`, optional `images` (base64) and `tool_calls`. The response `message.role` is always `assistant`. See [[concepts/tool-calling]], [[concepts/structured-outputs]], [[concepts/vision-and-multimodal]], [[concepts/thinking-and-web-search]].

## The `options` object

Controls generation: `seed`, `temperature`, `top_k`, `top_p`, `min_p`, `stop` (string or array), `num_ctx` (context length), `num_predict` (max tokens). Additional properties allowed.
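As an illustration of where `options` sits in a request, the sketch below assembles a `/api/chat` body and includes only the options that were actually set. `chat_payload` is a hypothetical helper, and the three options it exposes are just a subset of the fields listed above.

```python
def chat_payload(model, prompt, *, num_ctx=None, temperature=None,
                 seed=None, stream=False):
    """Return a /api/chat request body; unset options are left out."""
    candidates = {"num_ctx": num_ctx, "temperature": temperature, "seed": seed}
    options = {key: value for key, value in candidates.items() if value is not None}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    if options:
        payload["options"] = options  # nested object, not top-level fields
    return payload
```

Serialize the result with `json.dumps` and POST it to `/api/chat`.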

## POST /api/embed

Vector embeddings. Required: `model`, `input` (string or array). Fields: `truncate` (default `true`), `dimensions`, `keep_alive`, `options`. Returns `embeddings` (array of vectors) plus `total_duration`, `load_duration`, `prompt_eval_count`. See [[concepts/embeddings]].
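A common next step with the returned `embeddings` array is ranking documents against a query by cosine similarity. The sketch below assumes the vectors have already been fetched; `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, not part of any Ollama library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With a response `r` from `/api/embed`, `rank_by_similarity(r["embeddings"][0], r["embeddings"][1:])` would rank the remaining inputs against the first one.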

## Model management endpoints

```shell
curl http://localhost:11434/api/tags        # GET — list local models
curl http://localhost:11434/api/ps           # GET — list running models
curl http://localhost:11434/api/version      # GET — Ollama version
curl http://localhost:11434/api/show -d '{ "model": "gemma4" }'   # POST — model details
curl http://localhost:11434/api/pull -d '{ "model": "gemma4" }'   # POST — pull a model
curl http://localhost:11434/api/create -d '{ "from": "gemma4", "model": "alpaca", "system": "You are Alpaca." }'  # POST — create
```

- `/api/tags` returns each model's `name`, `model`, `modified_at`, `size`, `digest`, and `details` (format, family, parameter_size, quantization_level).
- `/api/ps` adds `expires_at`, `size_vram`, and `context_length` per running model.
- `/api/show` returns `parameters`, `license`, `template`, `capabilities`, `details`, and `model_info`; pass `"verbose": true` for large fields.
- `/api/pull` and `/api/create` stream status events (`status`, `digest`, `total`, `completed`); pass `"stream": false` to disable.
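The streamed status events from `/api/pull` can be turned into a progress readout using the `total` and `completed` fields. This is a sketch over already-received lines; `pull_progress` is a hypothetical name, and the status strings in any sample input are illustrative only.

```python
import json


def pull_progress(ndjson_lines):
    """Yield (status, percent) per event; percent is None when no totals are given."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        total = event.get("total")
        completed = event.get("completed")
        if total and completed is not None:
            percent = round(100 * completed / total, 1)
        else:
            percent = None
        yield event.get("status"), percent
```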

## Streaming

Endpoints stream by default as newline-delimited JSON (`application/x-ndjson`), one chunk per line. Each chunk carries `"done": false` until the final one, which carries `"done": true`:

```json
{"model":"gemma4","created_at":"2025-10-26T17:15:24.166576Z","response":"!","done":true,"done_reason":"stop"}
```

Set `{"stream": false}` for a single `application/json` response. Usage fields appear in the final chunk.
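Consuming the stream amounts to parsing one JSON object per line and stopping at the chunk with `"done": true`. The sketch below does this for `/api/generate` chunks that have already been read as text lines; `collect_stream` is a hypothetical helper name.

```python
import json


def collect_stream(lines):
    """Join streamed `response` fragments; return (text, final_chunk)."""
    parts = []
    final_chunk = None
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            final_chunk = chunk  # usage fields live here
            break
    return "".join(parts), final_chunk
```

Because it takes any iterable of lines, the same function works on a decoded HTTP response body or on a saved log.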

## Usage metrics

All timing values in **nanoseconds**: `total_duration` (total), `load_duration` (model load), `prompt_eval_count` (input tokens) / `prompt_eval_duration`, `eval_count` (output tokens) / `eval_duration`.
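Since durations are reported in nanoseconds, throughput is the token count divided by the duration converted to seconds. A minimal sketch follows; `tokens_per_second` and `summarize_usage` are hypothetical helper names, and the output keys are chosen for illustration.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def tokens_per_second(token_count, duration_ns):
    """Convert a token count and a nanosecond duration into tokens/second."""
    if not duration_ns:
        return 0.0
    return token_count / (duration_ns / NANOSECONDS_PER_SECOND)


def summarize_usage(final_chunk):
    """Summarize the usage fields found in a final response chunk."""
    return {
        "total_seconds": final_chunk.get("total_duration", 0) / NANOSECONDS_PER_SECOND,
        "load_seconds": final_chunk.get("load_duration", 0) / NANOSECONDS_PER_SECOND,
        "prompt_tokens_per_second": tokens_per_second(
            final_chunk.get("prompt_eval_count", 0),
            final_chunk.get("prompt_eval_duration", 0)),
        "output_tokens_per_second": tokens_per_second(
            final_chunk.get("eval_count", 0),
            final_chunk.get("eval_duration", 0)),
    }
```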

## See also

- [[concepts/configuration-and-serving]] — `keep_alive`, env vars, networking
- [[concepts/openai-and-anthropic-compat]] — OpenAI/Anthropic-compatible endpoints
- [[syntheses/api-surfaces-compared]]


<!-- ===== ollama/wiki/concepts/structured-outputs.md ===== -->

---
title: "Structured Outputs (JSON Schema)"
type: concept
tags: [structured-outputs, json, schema, format]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-structured-outputs.md]
---
# Structured Outputs (JSON Schema)

Structured outputs enforce a JSON schema on responses so you can reliably extract data, describe images, or keep replies consistent. Set the `format` parameter on the request; the examples here use `/api/chat`, and `/api/generate` accepts the same field (see [[concepts/rest-api]]).

> Note: Ollama Cloud currently does not support structured outputs. See
> [[entities/ollama-cloud]].

## JSON mode

Pass `"format": "json"` in the `/api/chat` body to force valid JSON output.

## JSON with a schema

Provide a full JSON schema to `format` (also passing it as a string in the prompt helps ground the model):

```shell
curl -X POST http://localhost:11434/api/chat -d '{
  "model": "gpt-oss",
  "messages": [{"role": "user", "content": "Tell me about Canada."}],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "capital": {"type": "string"},
      "languages": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name", "capital", "languages"]
  }
}'
```

In Python, pass a Pydantic model's `model_json_schema()` to `format` and
validate the response with `model_validate_json()`:

```python
from ollama import chat
from pydantic import BaseModel

class Country(BaseModel):
  name: str
  capital: str
  languages: list[str]

response = chat(
  model='gpt-oss',
  messages=[{'role': 'user', 'content': 'Tell me about Canada.'}],
  format=Country.model_json_schema(),
)
country = Country.model_validate_json(response.message.content)
```

In JavaScript, serialize a Zod schema with `z.toJSONSchema(schema)` and parse
the result.

## Vision with structured outputs

Vision models accept the same `format` parameter for deterministic image
descriptions (pass `images` and a schema; set `options={'temperature': 0}`).
See [[concepts/vision-and-multimodal]].

## Tips

* Define schemas with Pydantic (Python) or Zod (JavaScript) so they can be
  reused for validation.
* Lower the temperature (e.g. `0`) for more deterministic completions.
* Through the OpenAI-compatible API, structured outputs work via
  `response_format`. See [[concepts/openai-and-anthropic-compat]].


<!-- ===== ollama/wiki/concepts/thinking-and-web-search.md ===== -->

---
title: "Thinking and Web Search"
type: concept
tags: [thinking, reasoning, web-search, agents, api]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-thinking.md, raw/llms_txt_doc-web-search.md]
---
# Thinking and Web Search

Two capabilities that augment generation: controlling a model's reasoning trace ("thinking"), and grounding answers with Ollama's web search API.

## Thinking (reasoning control)

Thinking-capable models emit a `thinking` field separating their reasoning trace from the final answer — use it to audit steps, animate "thinking" in a UI, or hide the trace.

Set the `think` field on chat or generate requests. Most models accept booleans (`true`/`false`) or levels (`low`, `medium`, `high`, `max`), where `max` requests the highest level. GPT-OSS instead requires one of `low`, `medium`, `high` — `true`/`false` is ignored.

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "How many letter r are in strawberry?"}],
  "think": true,
  "stream": false
}'
```

The `message.thinking` (chat) or `thinking` (generate) field holds the reasoning trace; `message.content` / `response` holds the final answer. When streaming, thinking tokens precede answer tokens: detect the first `thinking` chunk, then switch once `message.content` arrives. Supported models: Qwen 3, GPT-OSS (levels only), DeepSeek-v3.1, and DeepSeek R1. On these models, thinking is enabled by default in both the CLI and the API.
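Separating the trace from the answer in a streamed `/api/chat` response can be sketched as two accumulators fed from each chunk's `message`. The helper name `split_thinking_stream` is hypothetical, and the input is assumed to be NDJSON lines that were already received.

```python
import json


def split_thinking_stream(lines):
    """Return (thinking_text, answer_text) from streamed /api/chat chunks."""
    thinking, answer = [], []
    for line in lines:
        if not line.strip():
            continue
        message = json.loads(line).get("message", {})
        # A chunk may carry either field; treat missing or null as empty.
        thinking.append(message.get("thinking") or "")
        answer.append(message.get("content") or "")
    return "".join(thinking), "".join(answer)
```

A UI could render the first string in a collapsible "thinking" panel and the second as the reply, or discard the first entirely.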

### CLI quick reference

* Enable: `ollama run deepseek-r1 --think "..."`; disable: `--think=false`; hide trace: `--hidethinking`.
* Interactive toggle: `/set think` or `/set nothink`. GPT-OSS levels: `ollama run gpt-oss --think=low "..."`.

See [[concepts/cli-reference]] and [[concepts/tool-calling]] (SDK examples combine `think=True` with tools).

## Web search API

Augments models with current information. Hosted at `ollama.com` (not local); needs an API key — create at `https://ollama.com/settings/keys` (free account). Set `OLLAMA_API_KEY` or pass it in the `Authorization` header.

* **`POST https://ollama.com/api/web_search`** — `query` (string, required); `max_results` (integer, optional; default 5, max 10). Returns `results`, each with `title`, `url`, `content`. (Example in [[entities/ollama-cloud]].)
* **`POST https://ollama.com/api/web_fetch`** — fetches a single page by `url`; returns `title`, `content`, `links`.

Libraries expose `web_search`/`web_fetch` (Python) and `webSearch`/`webFetch` (JS), which can be passed as tools in an agent loop. Results can run to thousands of tokens, so raise the model's context length to roughly 32000 tokens or more. Web search can also be enabled in any MCP client via the Python MCP server. See [[entities/ollama-cloud]].
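For callers not using the libraries, the search request can be built directly. The sketch below uses only the standard library; `build_web_search_request` is a hypothetical helper, and capping `max_results` at the documented maximum of 10 on the client side is a design choice of this example, not documented server behavior.

```python
import json
import os
import urllib.request

MAX_RESULTS_LIMIT = 10  # documented maximum for max_results


def build_web_search_request(query, max_results=5, api_key=None):
    """Build (but do not send) a POST to the hosted web_search endpoint."""
    key = api_key or os.environ.get("OLLAMA_API_KEY")
    if not key:
        raise ValueError("set OLLAMA_API_KEY or pass api_key")
    body = {"query": query, "max_results": min(max_results, MAX_RESULTS_LIMIT)}
    return urllib.request.Request(
        "https://ollama.com/api/web_search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```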


<!-- ===== ollama/wiki/concepts/tool-calling.md ===== -->

---
title: "Tool Calling (Function Calling)"
type: concept
tags: [tools, function-calling, agents, api]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-tool-calling.md]
---
# Tool Calling (Function Calling)

Ollama supports tool calling (function calling) so a model can invoke tools and incorporate their results. Tools are passed to the `/api/chat` endpoint (see [[concepts/rest-api]]).

## Defining tools

The `tools` array contains objects of `type: "function"`, each with a `function` holding `name`, `description`, and a JSON-Schema `parameters` object:

```shell
curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "What is the temperature in New York?"}],
  "stream": false,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_temperature",
        "description": "Get the current temperature for a city",
        "parameters": {
          "type": "object",
          "required": ["city"],
          "properties": {
            "city": {"type": "string", "description": "The name of the city"}
          }
        }
      }
    }
  ]
}'
```

## Returning tool results

The model replies with `message.tool_calls` — each a `{"type":"function","function":{"index":0,"name":...,"arguments":{...}}}`. Execute each call, then re-send `messages` with the assistant message (carrying its `tool_calls`) plus one `{"role":"tool","tool_name":...,"content":...}` message per call, in order. Parallel calls return multiple `tool_calls` entries (each with an `index`).
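The round trip described above can be sketched with plain dictionaries shaped like the REST payloads. `append_tool_results` is a hypothetical helper, and the `functions` mapping from tool name to a Python callable is an assumption of this example.

```python
def append_tool_results(messages, assistant_message, functions):
    """Return a new message list: history, assistant turn, then one tool message per call."""
    updated = list(messages)
    updated.append(assistant_message)  # must keep its tool_calls
    for call in assistant_message.get("tool_calls", []):
        function = call["function"]
        result = functions[function["name"]](**function["arguments"])
        updated.append({
            "role": "tool",
            "tool_name": function["name"],
            "content": str(result),
        })
    return updated
```

The returned list is what gets sent as `messages` in the follow-up `/api/chat` request.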

## Agent loop and SDKs

A multi-turn agent loop calls the model repeatedly, executing returned tool calls and appending results until `tool_calls` is empty:

```python
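from ollama import chat, ChatResponse

def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

# Map tool names to callables so returned tool calls can be dispatched.
available_functions = {'add': add, 'multiply': multiply}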
messages = [{'role': 'user', 'content': 'What is (11434+12341)*412?'}]
while True:
    response: ChatResponse = chat(model='qwen3', messages=messages,
                                  tools=[add, multiply], think=True)
    messages.append(response.message)
    if response.message.tool_calls:
        for tc in response.message.tool_calls:
            result = available_functions[tc.function.name](**tc.function.arguments)
            messages.append({'role': 'tool', 'tool_name': tc.function.name, 'content': str(result)})
    else:
        break
```

The Python SDK auto-parses Python functions into a tool schema, so you can pass functions directly in the `tools` list (raw JSON schemas also work). Install with `pip install ollama -U` (Python) or `npm i ollama` (JavaScript).

When streaming, gather every `thinking`, `content`, and `tool_calls` chunk, then send those fields back with the tool results in the follow-up request. Tool calling pairs naturally with [[concepts/thinking-and-web-search]] (`think=True`).
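Gathering the streamed fields into one assistant message might look like the sketch below. It works on already-parsed chunk dictionaries; `merge_stream_chunks` is a hypothetical name, and omitting empty `thinking`/`tool_calls` keys is a choice made for this example.

```python
def merge_stream_chunks(chunks):
    """Fold streamed /api/chat chunks into a single assistant message."""
    thinking, content, tool_calls = "", "", []
    for chunk in chunks:
        message = chunk.get("message", {})
        thinking += message.get("thinking") or ""
        content += message.get("content") or ""
        tool_calls.extend(message.get("tool_calls") or [])
    merged = {"role": "assistant", "content": content}
    if thinking:
        merged["thinking"] = thinking
    if tool_calls:
        merged["tool_calls"] = tool_calls
    return merged
```

Append the merged message to `messages`, followed by the `tool` role results, before making the next request.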


<!-- ===== ollama/wiki/concepts/vision-and-multimodal.md ===== -->

---
title: "Vision and Multimodal Models"
type: concept
tags: [vision, multimodal, images, api]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-vision.md, raw/llms_txt_doc-openai-compatibility.md]
---
# Vision and Multimodal Models

Vision models accept images alongside text to describe, classify, and answer questions about what they see.

## Quick start (CLI)

```shell
ollama run gemma4 "./image.png what's in this image?"
```

See [[concepts/cli-reference]] for `ollama run`.

## Passing images via the native API

Provide an `images` array on the message. SDKs accept file paths, URLs, or raw bytes; the REST API (`/api/chat`) expects base64-encoded data (`IMG=$(base64 < test.jpg | tr -d '\n')`):

```shell
curl -X POST http://localhost:11434/api/chat -d '{
    "model": "gemma4",
    "messages": [{
      "role": "user",
      "content": "What is in this image?",
      "images": ["'"$IMG"'"]
    }],
    "stream": false
}'
```

In the Python SDK, `images` accepts a path, base64 string, or raw bytes (`messages=[{'role':'user','content':'...','images':[path]}]`). See [[concepts/rest-api]] for the full `/api/chat` shape.

## Passing images via the OpenAI-compatible API

`/v1/chat/completions` accepts vision input as a content part of type `image_url`, where `image_url` is a base64 data URI (image URLs are not supported):

```shell
curl -X POST http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen3-vl:8b",
  "messages": [{"role": "user", "content": [
    {"type": "text", "text": "What is this an image of?"},
    {"type": "image_url", "image_url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUg..."}
  ]}]
}'
```

See [[concepts/openai-and-anthropic-compat]].
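Building the data URI from raw image bytes is a small base64 step. A minimal sketch, assuming the caller knows the image's MIME type; `image_part` is a hypothetical helper name.

```python
import base64


def image_part(image_bytes, mime_type="image/png"):
    """Return an `image_url` content part holding a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": f"data:{mime_type};base64,{encoded}",
    }
```

For a file on disk, read it in binary mode (`open(path, "rb").read()`) and pass the bytes in.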

## Example models

`gemma4`, `qwen3-vl:8b`. Structured image descriptions are possible by combining
vision with a JSON schema — see [[concepts/structured-outputs]].


<!-- ===== ollama/wiki/concepts/what-is-ollama.md ===== -->

---
title: "What is Ollama"
type: concept
tags: [ollama, overview, getting-started, local-llm]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-introduction.md, raw/llms_txt_doc-quickstart.md, raw/llms_txt_doc-overview.md, raw/github_doc-readme-md.md]
---
# What is Ollama

Ollama runs open models locally and exposes a REST API to build with them programmatically. Available on macOS, Windows, and Linux.

## Getting started

Run `ollama` to open the interactive menu:

```sh
ollama
```

Navigate with `↑/↓`, `enter` to launch, `→` to change model, `esc` to quit. The menu gives quick access to **Run a model**, **Launch tools** (Claude Code, Codex, OpenClaw, and more), and **Additional integrations** (under "More...").

Chat with a model directly:

```sh
ollama run gemma4
```

See [ollama.com/library](https://ollama.com/library) for the full model list.

## Launching integrations

```sh
ollama launch claude
ollama launch codex
ollama launch opencode
ollama launch openclaw
```

See [[summaries/model-library-and-integrations-catalog]] for the full catalog.

## REST API

After installation, the API is served by default at `http://localhost:11434/api`. For cloud models on **ollama.com**, the same API is at `https://ollama.com/api`. Access via `curl`:

```sh
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [{ "role": "user", "content": "Hello!" }]
}'
```

See [[concepts/rest-api]] for all endpoints.

## Libraries

Official libraries for Python (`pip install ollama`) and JavaScript (`npm i ollama`); community libraries also exist.

## Backends and versioning

Ollama is built on [llama.cpp](https://github.com/ggml-org/llama.cpp). The API isn't strictly versioned but is expected to be stable and backwards compatible; deprecations are rare and announced in the release notes.

## Next steps

- [[concepts/installation]] — install on your platform
- [[concepts/cli-reference]] — the full `ollama` command set
- [[concepts/configuration-and-serving]] — `ollama serve` and environment variables


<!-- ===== ollama/wiki/entities/ollama-cloud.md ===== -->

---
title: "Ollama Cloud"
type: entity
tags: [ollama-cloud, hosted-models, web-search, authentication, api-keys]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-cloud.md, raw/llms_txt_doc-web-search.md, raw/llms_txt_doc-authentication.md]
---
# Ollama Cloud

Ollama Cloud is Ollama's hosted service: it runs large models that wouldn't fit locally by offloading them to Ollama's servers while you keep using the same local tools. It also provides a hosted web search / web fetch API. Cloud features are optional and can be disabled to run [[concepts/what-is-ollama|Ollama]] in local-only mode.

## Cloud models

Cloud models "run without a powerful GPU" — auto-offloaded to Ollama's cloud, same capabilities as local models, full context length. Supported list: filter the library at `https://ollama.com/search?c=cloud`. Requires an [ollama.com](https://ollama.com) account (`ollama signin`).

Once signed in, use them like local models — the name carries a `-cloud` (or `:cloud`) suffix (e.g. `ollama run gpt-oss:120b-cloud`). For SDK/`curl` against the **local** endpoint, first `ollama pull gpt-oss:120b-cloud`; Ollama then authenticates cloud requests automatically.

## Cloud API access (ollama.com as a remote host)

Cloud models can also be hit directly on ollama.com's API — "ollama.com acts as a remote Ollama host," served at `https://ollama.com/api`, with the same native endpoints. Here the model name drops the `-cloud` suffix (e.g. `gpt-oss:120b`). List with `curl https://ollama.com/api/tags`; generate:

```shell
curl https://ollama.com/api/chat \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -d '{
    "model": "gpt-oss:120b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": false
  }'
```

The Python/JavaScript libraries accept `host="https://ollama.com"` plus an `Authorization: Bearer` header to target the cloud host.
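The naming difference between the two access paths can be captured in a small helper. This sketch handles only the `-cloud` suffix described above and leaves every other name untouched; `to_cloud_api_name` and `cloud_headers` are hypothetical names.

```python
CLOUD_SUFFIX = "-cloud"


def to_cloud_api_name(local_name):
    """Drop a trailing `-cloud` for use against https://ollama.com/api."""
    if local_name.endswith(CLOUD_SUFFIX):
        return local_name[:-len(CLOUD_SUFFIX)]
    return local_name


def cloud_headers(api_key):
    """Headers for authenticating directly against the cloud host."""
    return {"Authorization": f"Bearer {api_key}"}
```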

## Authentication

No auth locally; **required** for running cloud models, publishing, and private downloads. Two methods: **sign in** (`ollama signin` — then ollama.com requests authenticate automatically), or an **API key** for programmatic access to `https://ollama.com/api`. Create a key at `https://ollama.com/settings/keys`, `export OLLAMA_API_KEY=your_api_key`, and pass `-H "Authorization: Bearer $OLLAMA_API_KEY"`. Keys don't expire but can be revoked anytime. (Your local instance also has an Ollama Public Key, `id_ed25519.pub`, for pushing/pulling private models — see [[concepts/configuration-and-serving]].)

## Web search and web fetch

A REST API (free account + API key required) that augments models with current information:

* **`POST https://ollama.com/api/web_search`** — `query` (string, required), `max_results` (integer, optional; default `5`, max `10`). Returns `results[]`, each with `title`, `url`, `content`.
* **`POST https://ollama.com/api/web_fetch`** — `url` (string, required). Returns `title`, `content`, `links[]`.

```bash
curl https://ollama.com/api/web_search \
  --header "Authorization: Bearer $OLLAMA_API_KEY" \
  -d '{"query":"what is ollama?"}'
```

The Python (`ollama.web_search(...)`, `web_fetch(...)`) and JS (`client.webSearch(...)`, `client.webFetch(...)`) libraries expose these as callable tools for an agent loop (see [[concepts/thinking-and-web-search]], [[concepts/tool-calling]]). Results can run to thousands of tokens, so raise the model's context length to roughly 32000 tokens or more. Web search can also be wired into any MCP client via the Python MCP server (Cline, Codex, Goose).

## Local-only mode

Disable cloud entirely — set `disable_ollama_cloud` in `~/.ollama/server.json`:

```json
{ "disable_ollama_cloud": true }
```

…or set `OLLAMA_NO_CLOUD=1`, then restart. Logs then show `Ollama cloud disabled: true`; cloud models and web search become unavailable.
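If `server.json` is managed by a script, the flag can be merged in without discarding other content. The sketch below assumes the file, when present, is a single JSON object and may hold other keys; that assumption and the helper name `with_cloud_disabled` are not from the docs.

```python
import json


def with_cloud_disabled(existing_text=""):
    """Return server.json text with disable_ollama_cloud set to true."""
    config = json.loads(existing_text) if existing_text.strip() else {}
    config["disable_ollama_cloud"] = True
    return json.dumps(config, indent=2)
```

Write the returned text to `~/.ollama/server.json`, then restart Ollama for the change to take effect.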

## Privacy and deprecations

Ollama processes cloud prompts/responses to serve the request but states it does "not store or log that content and never train on it"; only basic account info and limited usage metadata are collected. Local models never send prompt data.

Older cloud models are occasionally retired as better ones ship; impacted users are notified in advance by email and on the website, and **retirement does not affect local models.** A schedule with replacements is in the cloud docs — e.g. `kimi-k2-thinking`, `kimi-k2:1t`, `minimax-m2`, `glm-4.6`, `qwen3-next:80b`, `cogito-2.1:671b` were listed for a June 16, 2026 retirement.

## Related

[[summaries/model-library-and-integrations-catalog]] · [[syntheses/api-surfaces-compared]] · [[concepts/rest-api]] · [[concepts/thinking-and-web-search]]


<!-- ===== ollama/wiki/log.md ===== -->

---
title: "Activity Log"
type: log
---

# Activity Log

Append-only record of all wiki changes.

## Format

Each entry follows this format:
```
### YYYY-MM-DD HH:MM — [Action Type]
- **Source/Trigger**: what initiated the action
- **Pages created**: list of new pages
- **Pages updated**: list of updated pages
- **Notes**: any contradictions flagged, decisions made
```

---

### 2026-04-08 00:00 — Setup

- **Source/Trigger**: Repository initialized
- **Pages created**: index.md, log.md, dashboard.md, analytics.md, flashcards.md
- **Pages updated**: none
- **Notes**: Empty knowledge base ready for first source ingestion

---

### 2026-06-23 — Initial curation (factory build)

- **Source/Trigger**: `new_wiki.py init ollama` — 121 sources gathered (docs.ollama.com llms.txt, README, 14 releases v0.23–v0.30, 40 solved GitHub issues)
- **Pages created**: 17 — 13 concepts (what-is-ollama, installation, cli-reference, modelfile, rest-api, openai-and-anthropic-compat, tool-calling, structured-outputs, vision-and-multimodal, embeddings, thinking-and-web-search, gpu-and-hardware, configuration-and-serving), 1 entity (ollama-cloud), 1 summary (model-library-and-integrations-catalog), 2 syntheses (api-surfaces-compared, troubleshooting-playbook)
- **Pages updated**: index.md (master catalog + stats), log.md
- **Notes**: Curated to the medium rung per RECIPE. Folded context-length into configuration-and-serving. Troubleshooting playbook built from the 40 solved issues. Noted a source discrepancy on default context window (FAQ says 4096; context-length doc gives VRAM-tiered 4k/32k/256k) — both presented as written, attributed to their sources.


<!-- ===== ollama/wiki/summaries/model-library-and-integrations-catalog.md ===== -->

---
title: "Model Library and Integrations Catalog"
type: summary
tags: [models, model-tags, integrations, ecosystem, catalog, coding-agents]
updated: 2026-06-23
confidence: high
sources: [raw/github_doc-readme-md.md, raw/llms_txt-llms-txt-index.md, raw/llms_txt_doc-list-models.md, raw/llms_txt_doc-overview.md, raw/llms_txt_doc-quickstart.md, raw/llms_txt_doc-claude-code.md, raw/llms_txt_doc-codex-cli.md, raw/llms_txt_doc-zed.md, raw/llms_txt_doc-cloud.md, raw/llms_txt_doc-openai-compatibility.md, raw/llms_txt_doc-anthropic-compatibility.md, raw/llms_txt_doc-web-search.md]
---
# Model Library and Integrations Catalog

Maps (a) the exact documented model tags and (b) the integration ecosystem. Points to the right tag and integration page; does not reproduce every setup.

## Running and finding models

Pull/run by `<name>:<tag>`; default registry is the library at `ollama.com/library` (cloud filter: `https://ollama.com/search?c=cloud`).

```shell
ollama run gemma4               # run + chat
ollama pull llama3.2            # pull only
ollama run gpt-oss:120b-cloud   # cloud model (offloaded)
```

`GET /api/tags` (CLI `ollama list`) reports `name`, `size`, `digest`, `details` — incl. `parameter_size` (`7B`, `13B`), `quantization_level` (`Q4_K_M`), `family`/`format` (`gguf`). See [[concepts/cli-reference]], [[syntheses/api-surfaces-compared]].

## Model tags referenced in the docs (verbatim)

Exact tags from the source docs. Suffixes encode parameter size (`:20b`, `:120b`, `:8b`) or cloud hosting (`-cloud`, `:cloud`); none of the tags documented here carries a quantization suffix. Documented working set only; the library is far larger.

| Tag (verbatim) | Where documented |
| --- | --- |
| `gemma4` | Default chat example (README, quickstart, API) |
| `gpt-oss:20b` | OpenAI-compat; "Strong general-purpose model" |
| `gpt-oss:120b` | Cloud API (no suffix when hitting ollama.com) |
| `gpt-oss:120b-cloud` | Cloud model run locally; Codex `--oss -m` |
| `qwen3:4b` | Web-search agent example (Qwen 3, 4B params) |
| `qwen3:8b` | `/v1/responses` example |
| `qwen3-vl:8b` | OpenAI-compat vision example |
| `qwen3-coder` | Claude Code default; "30B, ≥24GB VRAM" |
| `qwen3.5`, `qwen3.5:cloud` | Claude Code recommended |
| `glm-4.7`, `glm-4.7-flash`, `glm-4.7:cloud` | Claude Code / coding |
| `glm-5:cloud` | Claude Code recommended (cloud) |
| `kimi-k2.5:cloud` | `ollama launch claude --model kimi-k2.5:cloud` |
| `minimax-m2.1:cloud`, `minimax-m2.7:cloud` | Claude Code recommended (cloud) |
| `llama3.2` | OpenAI-compat pull; FAQ keep-alive/preload |
| `mistral` | FAQ preload example |

Cloud tags (`-cloud`/`:cloud`) follow Ollama Cloud's deprecation schedule — see [[entities/ollama-cloud]]. Quantization/context tradeoffs in [[concepts/configuration-and-serving]] and [[concepts/modelfile]].
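The `<name>:<tag>` convention and the cloud suffixes can be checked programmatically. This sketch covers only plain library names like those in the table, not references with a registry host or namespace; `parse_model_ref` is a hypothetical helper, and the `latest` fallback reflects Ollama's default tag rather than anything stated in the sources above.

```python
def parse_model_ref(ref):
    """Split `<name>:<tag>` and flag cloud tags (`-cloud` or `:cloud`)."""
    name, separator, tag = ref.partition(":")
    if not separator:
        tag = "latest"  # default tag when none is given
    is_cloud = tag == "cloud" or tag.endswith("-cloud")
    return {"name": name, "tag": tag, "cloud": is_cloud}
```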

## Integration ecosystem map

Ollama connects via the interactive menu (`ollama`), launchers (`ollama launch <tool>`), the native REST API, the OpenAI-/Anthropic-compatible APIs, the Python/JavaScript libraries, and MCP servers. Official integration docs live at `docs.ollama.com/integrations/*`.

* **Coding agents** (`ollama launch <tool>` where noted): Claude Code (`ollama launch claude`, Anthropic-compat; setup in [[concepts/openai-and-anthropic-compat]]), Codex CLI (`ollama launch codex` or `codex --oss [-m <model>]`) + Codex App, Copilot CLI, Cline CLI, OpenCode (`ollama launch opencode`), Droid, Goose (also a web-search MCP target), Oh My Pi, Pi, Pool.
* **Assistants:** OpenClaw (`ollama launch openclaw`, "100+ skills"), Hermes Agent, Hermes Desktop, NemoClaw.
* **IDEs & editors:** VS Code, JetBrains, Xcode, Cline, Roo Code; **Zed** — provider Ollama, Host URL `http://localhost:11434` (or API URL `https://ollama.com` for cloud).
* **Chat/RAG, automation, notebooks:** Onyx, n8n, marimo.
* **Connection methods:** native REST / `/v1` compat at `http://localhost:11434` (see [[concepts/rest-api]], [[concepts/openai-and-anthropic-compat]]); libraries `pip install ollama` / `npm i ollama`; MCP servers (web search via the Python MCP server, configs for Cline, Codex, Goose — see [[entities/ollama-cloud]]).

## Community integrations (from the README, not exhaustive)

* **Chat UIs:** Open WebUI, LibreChat, Lobe Chat, NextChat, AnythingLLM, Cherry Studio, Enchanted, Msty, Chatbox, Alpaca, SwiftChat.
* **Code editors & dev:** Continue, Void, twinny, gptel/Ellama (Emacs), AI Toolkit for VS Code, Open Interpreter, QodeAssist (Qt Creator).
* **Libraries & SDKs:** LiteLLM, LangChain / LangChain.js / LangChain4j / LangChainGo / LangChainRust / LangChainDart, LlamaIndex, Haystack, Semantic Kernel, Spring AI, OllamaSharp (.NET), Ollama4j (Java), ollama-swift, Firebase Genkit, Testcontainers, Portkey.
* **Frameworks & agents:** AutoGPT, crewAI, Strands Agents (AWS), Cheshire Cat, any-agent (Mozilla).
* **RAG & KBs:** RAGFlow, R2R, MaxKB, Minima, Casibase, Archyve.
* **Terminal/CLI:** aichat, oterm, gollama, tlm, ParLlama, llm-ollama.
* **Database & embeddings:** pgai (Postgres), MindsDB, chromem-go, Kangaroo.
* **Observability:** Opik, OpenLIT, Lunary, Langfuse, HoneyHive, MLflow Tracing.
* **Infra/deploy & packaging:** Google Cloud, Fly.io, Koyeb, Harbor; Homebrew, Pacman, Nix, Helm Chart, Gentoo, Flox. Official Docker image `ollama/ollama` on Docker Hub.

## Related

[[entities/ollama-cloud]] · [[syntheses/api-surfaces-compared]] · [[syntheses/troubleshooting-playbook]] · [[concepts/what-is-ollama]] · [[concepts/installation]] · [[concepts/cli-reference]] · [[concepts/modelfile]] · [[concepts/configuration-and-serving]] · [[concepts/vision-and-multimodal]] · [[concepts/embeddings]]


<!-- ===== ollama/wiki/syntheses/api-surfaces-compared.md ===== -->

---
title: "API Surfaces Compared: Native REST vs OpenAI-compatible vs Anthropic-compatible"
type: synthesis
tags: [rest-api, openai-compat, anthropic-compat, endpoints, compatibility]
updated: 2026-06-23
confidence: medium
sources: [raw/llms_txt_doc-generate-a-chat-message.md, raw/llms_txt_doc-generate-a-response.md, raw/llms_txt_doc-openai-compatibility.md, raw/llms_txt_doc-anthropic-compatibility.md, raw/llms_txt_doc-list-models.md]
---
# API Surfaces Compared

Ollama exposes **three** HTTP API surfaces on one server (`http://localhost:11434`, no local auth): the **native REST API** (`/api/*`, full feature set), the **OpenAI-compatible** shim (`/v1/*`), and the **Anthropic-compatible** shim (`/v1/messages`, notably [[concepts/openai-and-anthropic-compat|Claude Code]]). Choose native for new code; choose a compat surface for existing OpenAI/Anthropic tooling.

## Endpoints at a glance

| Surface | Chat endpoint | Other endpoints | Auth |
| --- | --- | --- | --- |
| Native REST | `POST /api/chat` | `POST /api/generate`, `GET /api/tags`, `/api/ps`, `/api/pull`, `/api/push`, `/api/embed`, `/api/show`, `/api/create`, `/api/copy`, `/api/delete`, `GET /api/version` | none (local); Bearer key for ollama.com |
| OpenAI-compat | `POST /v1/chat/completions` | `/v1/completions`, `/v1/responses`, `/v1/models`, `/v1/models/{model}`, `/v1/embeddings`, `/v1/images/generations` (experimental) | `api_key` "required but ignored" |
| Anthropic-compat | `POST /v1/messages` | — | `x-api-key` / `ANTHROPIC_AUTH_TOKEN` accepted but not validated |

See [[concepts/rest-api]] (native) and [[concepts/openai-and-anthropic-compat]] (compat).
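Because all three surfaces share one server, switching between them is mostly a matter of path. The lookup below restates the chat column of the table; the surface keys (`native`, `openai`, `anthropic`) and the helper name `chat_url` are labels invented for this sketch.

```python
CHAT_PATHS = {
    "native": "/api/chat",
    "openai": "/v1/chat/completions",
    "anthropic": "/v1/messages",
}


def chat_url(surface, base_url="http://localhost:11434"):
    """Return the full chat endpoint URL for one of the three surfaces."""
    if surface not in CHAT_PATHS:
        raise ValueError(f"unknown surface: {surface!r}")
    return base_url.rstrip("/") + CHAT_PATHS[surface]
```

Request and response bodies still differ per surface, so only the URL is interchangeable here.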

## 1. Native Ollama REST API

Two text-generation endpoints plus model-management endpoints (CLI equivalents in [[concepts/cli-reference]]).

* **`POST /api/generate`** — prompt → response. Request: `model` (required), `prompt`, `suffix` (fill-in-the-middle), `images` (base64), `system`, `format`, `stream` (default `true`), `think`, `raw`, `keep_alive`, `options`, `logprobs`/`top_logprobs`. Response: `response`, optional `thinking`, `done`, `done_reason`, usage fields (`total_duration`, `load_duration`, `prompt_eval_count`, `eval_count`; durations are in nanoseconds).
* **`POST /api/chat`** — multi-turn `messages[]` (roles `system`/`user`/`assistant`/`tool`) → assistant message. Adds `tools` and per-message `images`/`tool_calls`. Response `message` has `content`, optional `thinking`, `tool_calls`, `images`.

Native-only (or partial on compat) capabilities:

* **`think`** — boolean **or** `"high" | "medium" | "low" | "max"` (see [[concepts/thinking-and-web-search]]).
* **`format`** — `"json"` or a full JSON Schema for [[concepts/structured-outputs|structured outputs]].
* **`options`** — `seed`, `temperature`, `top_k`, `top_p`, `min_p`, `stop`, `num_ctx`, `num_predict`. The only way to set context size per-request (OpenAI surface can't).
* **`keep_alive`** — unload timing (`5m`, `0`, `-1`); see [[concepts/configuration-and-serving]].
* **`logprobs` / `top_logprobs`** — token log-probabilities.
* **Streaming** is `application/x-ndjson` and on by default.

`GET /api/tags` also carries `remote_model`/`remote_host` for cloud/remote models — see [[entities/ollama-cloud]].

## 2. OpenAI-compatible API (`/v1`)

Point any OpenAI SDK at `base_url='http://localhost:11434/v1/'`, `api_key='ollama'` (ignored). Pull first; for hardcoded names alias with `ollama cp llama3.2 gpt-3.5-turbo`. Full field lists in [[concepts/openai-and-anthropic-compat]]; comparison highlights:

* `/v1/chat/completions` — vision is base64-only (**not** image URL); supports `reasoning_effort`/`reasoning.effort` (`"high"|"medium"|"low"|"max"|"none"`); **not** supported: `logprobs`, `tool_choice`, `logit_bias`, `user`, `n`.
* `/v1/responses` (added v0.13.3) — **non-stateful only** (no `previous_response_id`/`conversation`); supports `instructions`, `max_output_tokens`.
* `/v1/completions` (legacy, `prompt` string-only, `suffix`); `/v1/embeddings` (`encoding_format`, `dimensions` — see [[concepts/embeddings]]); `/v1/models`(`/{model}`) (`created` = last-modified, `owned_by` = `"library"`); `/v1/images/generations` (experimental, `response_format: b64_json` only).

**Key limitation vs native:** no per-request context size — bake `num_ctx` into a Modelfile (`PARAMETER num_ctx <size>`, `ollama create mymodel`). See [[concepts/modelfile]].
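The Modelfile workaround is two lines of text, which a script can generate. A minimal sketch: `modelfile_with_context` is a hypothetical helper, and the base model and context size are whatever the caller chooses.

```python
def modelfile_with_context(base_model, num_ctx):
    """Return Modelfile text that pins num_ctx on top of a base model."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive integer")
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"
```

Save the output as `Modelfile`, run `ollama create mymodel` in that directory, and use `mymodel` as the model name on the `/v1` endpoints.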

## 3. Anthropic-compatible API (`/v1/messages`)

Anthropic Messages API so tools like Claude Code can use open models (`ANTHROPIC_AUTH_TOKEN=ollama`, `ANTHROPIC_BASE_URL=http://localhost:11434` — full setup in [[concepts/openai-and-anthropic-compat]]).

`POST /v1/messages` requires `model` + `max_tokens` + `messages`; supports streaming, system prompts, multi-turn, vision (base64), `tool_use`/`tool_result` blocks, `thinking` blocks; honors `temperature`, `top_p`, `top_k`, `stop_sequences`. Streaming emits the full Anthropic event set (`message_start`, `content_block_delta` with `text_delta`/`input_json_delta`/`thinking_delta`, `message_stop`, etc.). Aliasing: `ollama cp qwen3-coder claude-3-5-sonnet`.

**vs the real Anthropic API:** key **not validated**, `anthropic-version` **not used**, token counts approximate. **Not supported:** `/v1/messages/count_tokens`, `tool_choice`, `metadata`, prompt caching (`cache_control`), Batches API, citations, PDF/`document` blocks, server-sent `error` events. **Partial:** image base64-only; extended thinking (`budget_tokens` **not enforced**).
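
A client-side lint for `/v1/messages` bodies, as a sketch built only from the required and unsupported fields named above. Whether the server rejects or silently ignores an unsupported field is not stated here, so the helper only reports; the function name and the sample bodies are hypothetical.

```python
REQUIRED = ("model", "max_tokens", "messages")
# Top-level request fields this page lists as not supported.
UNSUPPORTED = ("tool_choice", "metadata")

def lint_messages_request(body):
    """Return a list of problems found in a /v1/messages body (no network call)."""
    problems = [f"missing required field: {k}" for k in REQUIRED if k not in body]
    problems += [f"unsupported field: {k}" for k in UNSUPPORTED if k in body]
    return problems

ok = {"model": "qwen3-coder", "max_tokens": 256,
      "messages": [{"role": "user", "content": "hi"}]}
bad = {"model": "qwen3-coder", "messages": [], "tool_choice": {"type": "auto"}}
print(lint_messages_request(ok))
print(lint_messages_request(bad))
```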

For Claude Code setup (`ollama launch claude`, models, context ≥64k), see [[summaries/model-library-and-integrations-catalog]].

## Related

[[concepts/rest-api]] · [[concepts/openai-and-anthropic-compat]] · [[concepts/tool-calling]] · [[concepts/structured-outputs]] · [[concepts/thinking-and-web-search]] · [[entities/ollama-cloud]] · [[summaries/model-library-and-integrations-catalog]]


<!-- ===== ollama/wiki/syntheses/troubleshooting-playbook.md ===== -->

---
title: "Troubleshooting Playbook"
type: synthesis
tags: [troubleshooting, gpu, rocm, out-of-memory, downloads, configuration]
updated: 2026-06-23
confidence: medium
sources: [raw/llms_txt_doc-troubleshooting.md, raw/llms_txt_doc-faq.md, raw/github_issue-ollama-serve-fails-to-detect-nvidia-gpus-after-updating-to-t.md, raw/github_issue-ollama-not-using-nvidia-gpus-with-gpt-oss-models.md, raw/github_issue-amd-7900xtx-fails-with-could-not-initialize-tensile-host-no-.md, raw/github_issue-amd-gpu-rocm-support.md, raw/github_issue-integrated-amd-gpu-support.md, raw/github_issue-out-of-memory-errors-when-running-gemma3.md, raw/github_issue-ollama-500-error-on-larger-models.md, raw/github_issue-qwen3-5-35b-error-500-internal-server-error.md, raw/github_issue-preview-0-5-13-rc2-uses-5-times-more-ram.md, raw/github_issue-pull-model-manifest-500.md, raw/github_issue-downloading-a-model-with-ollama-pull-or-ollama-run-stalls.md, raw/github_issue-issue-with-ollama-model-download-progress-reverting-during-d.md, raw/github_issue-ollama-0-6-6-memory-leak-with-different-models.md, raw/github_issue-ollama-stops-serving-requests-after-10-15-minutes.md, raw/github_issue-ollama-stuck-after-few-runs.md, raw/github_issue-ollama-stops-generating-output-and-fails-to-run-models-after.md, raw/github_issue-llama3-instruct-models-not-stopping-at-stop-token.md, raw/github_issue-allow-listening-on-all-local-interfaces.md, raw/github_issue-support-gpu-runners-on-cpus-without-avx.md, raw/github_issue-ollama-ai-certificate-has-expired-not-possible-to-download-m.md]
---
# Troubleshooting Playbook

Symptom → cause → fix, from the official troubleshooting/FAQ docs plus 20 solved GitHub issues. **First step: read the logs.**

## Logs and debug

* Logs — macOS `~/.ollama/logs/server.log`; Linux `journalctl -u ollama --no-pager --follow --pager-end`; Docker `docker logs <container-name>`; Windows `explorer %LOCALAPPDATA%\Ollama` (`server.log`); manual `ollama serve` prints to terminal.
* Debug — set `OLLAMA_DEBUG=1` (or `2`); for driver-level detail add NVIDIA `CUDA_ERROR_LEVEL=50` or AMD `AMD_LOG_LEVEL=3`. On Windows, quit the tray app, then run `$env:OLLAMA_DEBUG="1"; & "ollama app.exe"` in PowerShell.
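
Once a log is in hand, the error strings quoted in this playbook can be matched mechanically. The sketch below maps a few of them to their sections; the `triage` helper and the sample log text are hypothetical, and the patterns cover only a subset of the signatures on this page.

```python
import re

# (signature regex, playbook section), all quoted from this page.
SIGNATURES = [
    (r"total vram=0 B", "GPU not detected / falls back to CPU"),
    (r"offloaded 0/\d+ layers to GPU", "Model loads on CPU though GPU detected"),
    (r"Could not initialize Tensile host", "AMD ROCm"),
    (r"failed to finish discovery before timeout", "AMD ROCm"),
    (r"no slots available after \d+ retries", "Server stops responding"),
    (r"part \d+ stalled; retrying", "Downloads: stall / manifest 500 / progress reverts"),
]

def triage(log_text):
    """Return the sections whose signatures appear in log_text, deduplicated."""
    hits = []
    for pattern, section in SIGNATURES:
        if re.search(pattern, log_text) and section not in hits:
            hits.append(section)
    return hits

# Made-up log excerpt for illustration.
sample = "level=INFO total vram=0 B\nlevel=INFO offloaded 0/25 layers to GPU\n"
print(triage(sample))
```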

## GPU not detected / falls back to CPU

* **NVIDIA gone after update; "low vram mode" `total vram=0 B`** (#12618): stale/misformatted `CUDA_VISIBLE_DEVICES` (e.g. `CUDA_VISIBLE_DEVICES:0,1,2`). Fix: **unset `CUDA_VISIBLE_DEVICES`**; scope with UUIDs from `nvidia-smi -L`.
* **NVIDIA discovery failures** (codes "3"/"46"/"100"/"999"): latest driver; in containers verify `docker run --gpus all ubuntu nvidia-smi`; load UVM `sudo nvidia-modprobe -u` or reload `sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm`, reboot.
* **Linux after suspend/resume:** reload `sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm`.
* **Docker GPU→CPU drift over time:** add `"exec-opts": ["native.cgroupdriver=cgroupfs"]` to `/etc/docker/daemon.json`.
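
A sketch of the UUID-scoping fix: pull UUIDs out of `nvidia-smi -L` output and sanity-check a `CUDA_VISIBLE_DEVICES` assignment before exporting it. The sample output uses placeholder device names and UUIDs, and the validity check is a simplification (plain indices or `GPU-...` UUIDs only), not the driver's full grammar.

```python
import re

ITEM = re.compile(r"\d+|GPU-[0-9a-fA-F-]+")

def gpu_uuids(nvidia_smi_l_output):
    """Extract GPU UUIDs from `nvidia-smi -L` text."""
    return re.findall(r"\(UUID:\s*(GPU-[0-9a-fA-F-]+)\)", nvidia_smi_l_output)

def check_assignment(line):
    """True if line looks like CUDA_VISIBLE_DEVICES=<indices or UUIDs, comma-separated>."""
    name, sep, value = line.partition("=")
    if name != "CUDA_VISIBLE_DEVICES" or not sep:
        return False
    return all(ITEM.fullmatch(item.strip()) for item in value.split(","))

# Placeholder output in the shape printed by `nvidia-smi -L`.
sample = (
    "GPU 0: Example GPU A (UUID: GPU-11111111-2222-3333-4444-555555555555)\n"
    "GPU 1: Example GPU B (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)\n"
)
uuids = gpu_uuids(sample)
assignment = "CUDA_VISIBLE_DEVICES=" + ",".join(uuids)
print(assignment, check_assignment(assignment))
print(check_assignment("CUDA_VISIBLE_DEVICES:0,1,2"))
```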

## Model loads on CPU though GPU detected

* **`gpt-oss:20b`/`gpt-oss:120b` all-CPU while e.g. `qwen3:30b` uses GPU** (#11676, `offloaded 0/NN layers to GPU`): a raised `OLLAMA_NUM_PARALLEL` inflated the VRAM requirement, so no layers were offloaded. Fix: **leave `OLLAMA_NUM_PARALLEL` at default**; confirm with the `PROCESSOR` column of `ollama ps`.
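
To check placement in a script, the `PROCESSOR` cell can be parsed into percentages. This is a sketch that assumes the cell looks like `100% GPU` or `7%/93% CPU/GPU`; the helper name is hypothetical.

```python
def parse_processor(cell):
    """Parse an `ollama ps` PROCESSOR cell into {'CPU': pct, 'GPU': pct}."""
    percents, _, devices = cell.strip().partition(" ")
    split = {"CPU": 0, "GPU": 0}
    for pct, dev in zip(percents.split("/"), devices.split("/")):
        split[dev] = int(pct.rstrip("%"))
    return split

print(parse_processor("100% GPU"))
print(parse_processor("7%/93% CPU/GPU"))
```

A result with `GPU` at 0 for a model that should fit is the cue to revisit `OLLAMA_NUM_PARALLEL`.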

## AMD ROCm

* **7900XTX `Could not initialize Tensile host: No devices found`** (#6685, ROCm 6.2, `ollama/ollama:rocm`): container device permissions. Fix: pass `--device /dev/kfd --device /dev/dri`, add numeric group IDs from `ls -lnd /dev/kfd /dev/dri /dev/dri/*` via `--group-add`; on SELinux set `container_use_devices` on.
* **AMD driver too old** (`failed to finish discovery before timeout`, `bootstrap discovery took duration=30s`): Ollama bundles **ROCm 7** libs; with a ROCm 6.x driver, discovery hangs and Ollama falls back to CPU. Fix: update the driver with `amdgpu-install`, reboot, restart Ollama.
* **Self-built binary on CPU** (#738, `Not compiled with GPU offload support`): pass `-tags rocm` to **both** `go generate` and `go build`, set `ROCM_PATH` (e.g. `/opt/rocm`). Debug `AMD_LOG_LEVEL=3` + `OLLAMA_DEBUG=1`.
* **Integrated AMD GPU** (#2637): ROCm iGPU support limited ("detects Radeon then says no GPU" fixed in latest binary). Best-effort.
* **Multiple AMD GPUs — gibberish on Linux:** see AMD's multi-GPU known-issues guide.
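
The container-permission fix can be sketched as two helpers: one reads numeric group IDs from `ls -lnd` output, the other assembles the `docker run` arguments. The sample listing is made up, volume and port flags are omitted for brevity, and taking group IDs only from character-device lines is an assumption of this sketch.

```python
def group_ids(ls_lnd_output):
    """Collect numeric group IDs (4th column) of device nodes, in first-seen order."""
    gids = []
    for line in ls_lnd_output.splitlines():
        fields = line.split()
        # Character devices start with 'c'; directories are skipped.
        if line.startswith("c") and len(fields) >= 4 and fields[3].isdigit():
            if fields[3] not in gids:
                gids.append(fields[3])
    return gids

def docker_args(gids, image="ollama/ollama:rocm"):
    """Build a `docker run` argv with the device and group flags described above."""
    args = ["docker", "run", "-d", "--device", "/dev/kfd", "--device", "/dev/dri"]
    for gid in gids:
        args += ["--group-add", gid]
    return args + [image]

# Made-up listing in the shape of `ls -lnd /dev/kfd /dev/dri /dev/dri/*`.
sample = (
    "crw-rw---- 1 0 993 235, 0 Jan  1 00:00 /dev/kfd\n"
    "drwxr-xr-x 3 0 0 100 Jan  1 00:00 /dev/dri\n"
    "crw-rw---- 1 0 44 226, 0 Jan  1 00:00 /dev/dri/card0\n"
    "crw-rw---- 1 0 993 226, 128 Jan  1 00:00 /dev/dri/renderD128\n"
)
print(" ".join(docker_args(group_ids(sample))))
```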

## OOM / 500 "unable to load model"

* **Models above ~7B/8B return `500 Internal Server Error` on `/api/chat`** (#5892, `check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape`): the installed Ollama version is too old to support the model's architecture. **Fix: upgrade Ollama.**
* **`500 ... unable to load model` for `qwen3.5:35b`** (#14419): needs newer Ollama (0.17.0 lacked qwen3.5). Fix: `curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.17.1-rc1 sh`.
* **Crashes/freezes at higher context** (`gemma3:12b`, #9791): spilled to CPU (`ollama ps` `7%/93% CPU/GPU`), 8k context crashed the box. Mitigate: lower `OLLAMA_CONTEXT_LENGTH`/`num_ctx`; `OLLAMA_FLASH_ATTENTION=1`; `OLLAMA_KV_CACHE_TYPE=q8_0` (≈½ f16) or `q4_0` (≈¼). See [[concepts/gpu-and-hardware]], [[concepts/configuration-and-serving]].
* **RAM/VRAM far larger than model file** ("5× more RAM", #9457): not a leak — weights + **context buffer + model graph** (2.5G model can show ~5.2G in `ollama ps`). Lower `num_ctx` or K/V quantize.
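
A back-of-the-envelope estimate shows why context length and K/V quantization matter. The formula is the usual K/V cache size for a transformer (two tensors per layer); the model shape below is hypothetical, and `q8_0` is approximated as one byte per element to match the "≈½ f16" figure above. Real usage also includes the weights and the compute graph.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_elem=2.0):
    """K and V tensors: 2 * layers * kv_heads * head_dim * context * element size."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_elem

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
f16_8k = kv_cache_bytes(num_ctx=8192, **shape)
f16_2k = kv_cache_bytes(num_ctx=2048, **shape)
q8_8k = kv_cache_bytes(num_ctx=8192, bytes_per_elem=1.0, **shape)

for label, size in [("f16 @ 8k", f16_8k), ("f16 @ 2k", f16_2k), ("q8_0 @ 8k", q8_8k)]:
    print(f"{label}: {size / 2**30:.2f} GiB")
```

Under these assumptions, quartering the context or halving the element size shrinks the cache proportionally, which is what the `num_ctx` and `OLLAMA_KV_CACHE_TYPE` mitigations rely on.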

## Downloads: stall / manifest 500 / progress reverts

* **`pull model manifest: 500 {"errors":[{"code":"INTERNAL_ERROR"...}]}`** (#8873): registry **overloaded**. Fix: wait, retry.
* **`error max retries exceeded: EOF` / `r2.cloudflarestorage.com ... server misbehaving`, stalls at fixed %** (#8632): **DNS** failure on the Cloudflare R2 host (`127.0.0.53:53: server misbehaving`). Fix: working resolver (e.g. 1.1.1.1), verify `nslookup`.
* **Progress reverts (drops after 10–60%); `part N stalled; retrying`** (#8484): no data >5s (stall threshold) on a flaky link. Fix: `Ctrl+C` within ~5s of the drop, re-run to resume.
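
The "wait, retry" advice can be automated with a small backoff wrapper. This is a generic sketch, not Ollama behavior: the attempt count and delays are arbitrary, and `flaky_pull` is a stand-in for a real pull command.

```python
import time

def retry(action, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Stand-in for a pull that fails twice before succeeding (hypothetical).
calls = {"n": 0}

def flaky_pull():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("pull model manifest: 500")
    return "success"

waits = []
result = retry(flaky_pull, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

In real use, `action` could wrap `subprocess.run(["ollama", "pull", name], check=True)`; re-running the pull resumes the download, as noted above.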

## Server stops responding

* **VRAM/RAM held after model done; `ollama ps` empty but runners linger** (#10433): **orphaned runner processes** (server crashed). Fix: check logs, restart service. Control unload with `keep_alive`/`OLLAMA_KEEP_ALIVE`, `ollama stop <model>`.
* **Heavy parallel load: fine 10–15 min then `failed to generate embedding` / `Failed to acquire semaphore: context canceled` / `no slots available after 10 retries`** (#4545; `OLLAMA_NUM_PARALLEL=10/20`, `OLLAMA_MAX_QUEUE=1024`): saturated slots. Levers: `OLLAMA_NUM_PARALLEL` (RAM scales by `NUM_PARALLEL` × `CONTEXT_LENGTH`), `OLLAMA_MAX_QUEUE` (default 512; over-queue → 503), `OLLAMA_MAX_LOADED_MODELS`.
* **Hangs/"stuck" after a few runs** (#1863, #2225): version-era bugs (~0.1.16–0.1.22). Fix: **upgrade** (#2225 fixed on 0.1.22); `systemctl restart ollama` restores temporarily.
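
The sizing levers above reduce to simple arithmetic. The sketch below multiplies parallel slots by context length and models queue overflow; the 4096-token context is an assumed example value, and `admit` is a deliberately simplified model that ignores requests already being served.

```python
def context_tokens(num_parallel, context_length):
    """Total context held for one model: parallel slots x context length."""
    return num_parallel * context_length

def admit(pending, max_queue=512):
    """Split pending requests into (queued, rejected with 503); simplified model."""
    queued = min(pending, max_queue)
    return queued, pending - queued

print(context_tokens(1, 4096), context_tokens(10, 4096))
print(admit(600))
print(admit(600, max_queue=1024))
```

Raising `OLLAMA_MAX_QUEUE` trades 503s for longer waits, while raising `OLLAMA_NUM_PARALLEL` trades memory for throughput.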

## Runaway generation (won't stop at stop token)

* **`llama3`/`llama3:70b` keeps emitting `<|eot_id|><|start_header_id|>assistant...`** (#3759): (a) over the **OpenAI-compatible endpoint** Modelfile `PARAMETER stop` is ignored — send stop token(s) in the request `stop` field; (b) raw-GGUF import had a wrong `TEMPLATE` — use `PARAMETER stop "<|start_header_id|>"`, `"<|end_header_id|>"`, `"<|eot_id|>"`. See [[concepts/modelfile]], [[syntheses/api-surfaces-compared]].
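
For case (a), a sketch of adding the stop sequences to an OpenAI-style request body so they travel with the request rather than relying on the Modelfile. The helper name and sample payload are hypothetical; the stop strings are the ones listed above.

```python
LLAMA3_STOPS = ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]

def with_stop(payload, stops=LLAMA3_STOPS):
    """Return a copy of a /v1/chat/completions body with explicit stop sequences."""
    existing = payload.get("stop") or []
    if isinstance(existing, str):      # the field may be a single string
        existing = [existing]
    merged = list(existing) + [s for s in stops if s not in existing]
    return {**payload, "stop": merged}

request = with_stop({
    "model": "llama3",
    "messages": [{"role": "user", "content": "Say hi."}],
})
print(request["stop"])
```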

## CPU without AVX (SIGILL)

* **`CPU does not have vector extensions` then `SIGILL: illegal instruction`** (#2187): runners built for CPU features the host lacks. Newer Ollama **falls back to CPU** (`CPU does not have AVX or AVX2, disabling GPU support`). Force via `OLLAMA_LLM_LIBRARY` (`cpu_avx2` > `cpu_avx` > `cpu`), e.g. `OLLAMA_LLM_LIBRARY="cpu" ollama serve`. Check `cat /proc/cpuinfo | grep flags | head -1`.
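
The preference order above can be applied to the `flags` line of `/proc/cpuinfo`. This sketch only picks a library name from text; it does not reflect how Ollama itself selects a runner, and the sample flags are made up.

```python
def pick_llm_library(cpuinfo_text):
    """Pick the most capable CPU library name from /proc/cpuinfo 'flags' text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.partition(":")[2].split())
            break
    if "avx2" in flags:
        return "cpu_avx2"
    if "avx" in flags:
        return "cpu_avx"
    return "cpu"

# Made-up excerpt for a CPU without AVX.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2\n"
print(f'OLLAMA_LLM_LIBRARY="{pick_llm_library(sample)}" ollama serve')
```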

## Network binding (OLLAMA_HOST)

* **Only listens on loopback; unreachable from containers/proxies** (#703): binds `127.0.0.1:11434` by default. **Fix: `OLLAMA_HOST`**, e.g. `OLLAMA_HOST=0.0.0.0:8080` (no separate `OLLAMA_PORT`); on systemd add `Environment="OLLAMA_HOST=0.0.0.0:8080"`. For proxies/tunnels and `OLLAMA_ORIGINS` (CORS), see [[concepts/configuration-and-serving]].
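
For the systemd case, a sketch that renders the override text. The host and port defaults mirror the example above; the helper name is hypothetical.

```python
def systemd_override(host="0.0.0.0", port=8080):
    """Render a drop-in that sets OLLAMA_HOST for the ollama service."""
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    return f'[Service]\nEnvironment="OLLAMA_HOST={host}:{port}"\n'

print(systemd_override())
# Paste the text via `systemctl edit ollama.service`, then run:
#   systemctl daemon-reload && systemctl restart ollama
```

Binding to `0.0.0.0` exposes the API on every interface, so pair it with a firewall or reverse proxy when the host is reachable from untrusted networks.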

## TLS / certificate errors on pull

* **`tls: failed to verify certificate: x509: certificate has expired`** (#3336): the **registry cert genuinely expired** (service incident). Fix: wait for renewal, retry; if only your machine is affected, check the local clock and CA trust store.
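
To tell a genuinely expired certificate from a wrong local clock, compare the certificate's `notAfter` value with the time the machine believes it is. This sketch uses the standard library's `ssl.cert_time_to_seconds`; the date and the two clock readings are illustrative, not taken from the incident.

```python
import ssl

def cert_expired(not_after, now_epoch):
    """True if the certificate's notAfter time lies before now_epoch (Unix seconds)."""
    return ssl.cert_time_to_seconds(not_after) < now_epoch

# Illustrative notAfter string in the format Python's ssl module reports.
not_after = "Jan  1 00:00:00 2024 GMT"
before = cert_expired(not_after, now_epoch=1_700_000_000)  # clock reads Nov 2023
after = cert_expired(not_after, now_epoch=1_710_000_000)   # clock reads Mar 2024
print(before, after)
```

Passing `time.time()` as `now_epoch` checks against the local clock; if the result disagrees with a trusted time source, the clock is the likelier culprit.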

## Related

[[summaries/model-library-and-integrations-catalog]] · [[entities/ollama-cloud]] · [[syntheses/api-surfaces-compared]] · [[concepts/gpu-and-hardware]] · [[concepts/configuration-and-serving]] · [[concepts/installation]] · [[concepts/modelfile]] · [[concepts/cli-reference]]