# Ollama — full corpus

# LLM Wiki

An open-source template for building LLM-powered knowledge bases, following [Andrej Karpathy's "LLM Wiki" pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).

You provide raw sources. The LLM reads them, writes structured wiki pages, cross-links everything, and maintains it over time. You never edit the wiki directly — you curate sources and ask questions.

## How It Works

The system has three layers:

```
raw/          Sources you collect (articles, transcripts, notes, PDFs)
wiki/         LLM-written & maintained pages (summaries, concepts, entities, syntheses)
CLAUDE.md     Schema that tells the LLM how to structure everything
```

Three operations drive the workflow:

| Operation | Trigger | What happens |
|-----------|---------|--------------|
| **Ingest** | "ingest raw/my-source.txt" | LLM reads the source, creates a summary page, creates/updates concept and entity pages, adds cross-links, updates the index and log |
| **Query** | Ask any question | LLM searches the wiki, synthesizes an answer with citations, optionally creates a synthesis page for novel insights |
| **Lint** | "lint" or "health check" | LLM audits all pages for orphans, contradictions, missing links, incomplete sections, and low-confidence claims — fixes what it can, reports the rest |

## Quick Start

1. **Clone this repo**
   ```bash
   git clone https://github.com/YOUR_USERNAME/llm-wiki.git my-knowledge-base
   cd my-knowledge-base
   ```

2. **Customize CLAUDE.md** for your domain
   - Update the Purpose section with your topic
   - Replace the placeholder tagging taxonomy with your own categories
   - Adjust confidence level descriptions if needed
   - Everything else (workflows, page formats, linking rules) works as-is

3. **Drop sources into `raw/`**
   - Text files, transcripts, articles, notes — any plain text
   - These are immutable once added; the LLM never modifies them

4. **Tell the LLM to ingest**
   ```
   ingest raw/my-first-source.txt
   ```
   The LLM will create summary pages, concept pages, entity pages, cross-links, and update the index.

5. **Ask questions**
   ```
   What are the key differences between X and Y?
   ```
   The LLM answers from the wiki, citing specific pages.

6. **Run health checks**
   ```
   lint
   ```
   The LLM audits the wiki and fixes issues.

## Directory Structure

```
.
├── CLAUDE.md              # Schema — the LLM's instructions
├── raw/                   # Your source documents (immutable)
└── wiki/
    ├── index.md           # Master catalog of all pages
    ├── log.md             # Append-only activity log
    ├── dashboard.md       # Dataview dashboard (Obsidian)
    ├── analytics.md       # Charts View analytics (Obsidian)
    ├── flashcards.md      # Spaced repetition cards
    ├── summaries/         # One page per source document
    ├── concepts/          # Concept and framework pages
    ├── entities/          # People, tools, organizations, etc.
    ├── syntheses/         # Cross-cutting analyses and comparisons
    ├── journal/           # Research/session journal entries
    │   └── template.md    # Journal entry template
    └── presentations/     # Marp slide decks
```

## Enhancements

This template includes several extras beyond the core wiki pattern:

### Dataview Dashboard (`wiki/dashboard.md`)

Live queries that surface low-confidence pages, recent updates, concepts by tag, and pages with the most sources. Requires the [Dataview](https://github.com/blacksmithgu/obsidian-dataview) Obsidian plugin.

### Charts View Analytics (`wiki/analytics.md`)

Visual analytics with pie charts, bar charts, and word clouds. Requires the [Charts View](https://github.com/caronchen/obsidian-chartsview-plugin) Obsidian plugin.

### Mermaid Diagrams

Use Mermaid code blocks in any wiki page to create flowcharts, sequence diagrams, or concept maps. Native support in Obsidian and GitHub.

### Marp Slides (`wiki/presentations/`)

Create slide decks from markdown using [Marp](https://marp.app/). Drop presentation files in this directory.
### Research Journal (`wiki/journal/`)

Track your research sessions, experiments, or applied work with the included template. The LLM can reference journal entries when answering queries.

### Spaced Repetition (`wiki/flashcards.md`)

Flashcards in the format used by the [Spaced Repetition](https://github.com/st3v3nmw/obsidian-spaced-repetition) Obsidian plugin. Ask the LLM to generate flashcards from any wiki page.

### MCP Server

This repo works with Claude Code's MCP server capabilities. Point an MCP-compatible client at this repo and the LLM can read/write the wiki programmatically.

## Customizing for Your Domain

The schema in `CLAUDE.md` is domain-agnostic. To adapt it:

1. **Purpose** — Describe your knowledge domain in one paragraph
2. **Tagging taxonomy** — Replace placeholder categories with your own (e.g., for a cooking KB: `cuisine`, `technique`, `ingredient`, `equipment`)
3. **Confidence levels** — Adjust the descriptions to match your domain's evidence standards
4. **Entity types** — Update the entity page description to match what entities mean in your domain (people, tools, companies, etc.)
5. **Journal template** — Customize `wiki/journal/template.md` for your workflow

Everything else — page format, linking conventions, workflows, rules — is universal and works across domains.
## Example Domains

This template works for any knowledge-intensive topic:

- **Research notes** — papers, experiments, methodologies
- **Book analysis** — themes, characters, author techniques
- **Competitive analysis** — companies, products, market trends
- **Course notes** — lectures, readings, key concepts
- **Personal development** — frameworks, habits, book summaries
- **Technical documentation** — APIs, architectures, design patterns
- **Hobby deep-dives** — any subject you want to master

## License

MIT

---
title: "Ollama KB — Master Index"
type: index
updated: 2026-06-23
ollama_version: "0.30.10"
---

# Ollama KB — Master Index

**Domain:** Ollama — run open-weight LLMs locally: install & serve, the CLI, Modelfiles, the REST API and OpenAI/Anthropic-compatible endpoints, capabilities (tools, vision, embeddings, structured outputs, thinking, web search), GPU/hardware, and Ollama Cloud.

**Corpus:** 121 provenance-stamped sources in `raw/` — the official docs (docs.ollama.com llms.txt), the README, 14 release notes (v0.23–v0.30), and 40 solved GitHub issues.

**Pages:** 17 (13 concepts · 1 entity · 1 summary · 2 syntheses) — the user ring plus the operator/developer ring.
## Concepts (core ideas + operational how-tos)

- [[concepts/what-is-ollama]] — what Ollama is, how it runs models locally, first steps
- [[concepts/installation]] — install on macOS / Windows / Linux / Docker
- [[concepts/cli-reference]] — the full `ollama` command set, verbatim
- [[concepts/modelfile]] — Modelfile instructions (`FROM`, `PARAMETER`, `TEMPLATE`, `SYSTEM`, `ADAPTER`…) and creating/importing models
- [[concepts/rest-api]] — the native REST API: `/api/generate`, `/api/chat`, `/api/embed`, options, streaming
- [[concepts/openai-and-anthropic-compat]] — the OpenAI-compatible (`/v1/...`) and Anthropic-compatible endpoints
- [[concepts/tool-calling]] — function/tool calling: the `tools` array and `tool_calls`
- [[concepts/structured-outputs]] — JSON-schema-constrained responses via the `format` parameter
- [[concepts/vision-and-multimodal]] — multimodal models and passing images
- [[concepts/embeddings]] — `/api/embed`, embedding models, RAG/semantic search
- [[concepts/thinking-and-web-search]] — reasoning/"thinking" control and the web search API
- [[concepts/gpu-and-hardware]] — NVIDIA / AMD ROCm / Apple Metal support, VRAM, GPU selection
- [[concepts/configuration-and-serving]] — `ollama serve`, env vars (`OLLAMA_HOST`, `OLLAMA_MODELS`, `OLLAMA_KEEP_ALIVE`, `OLLAMA_NUM_PARALLEL`, `OLLAMA_CONTEXT_LENGTH`), context window, networking

## Entities

- [[entities/ollama-cloud]] — Ollama Cloud: hosted models, `ollama signin`, the cloud API and web search

## Summaries

- [[summaries/model-library-and-integrations-catalog]] — the model-library tags and the integration ecosystem (editors/agents/tools) — mapped, not paged

## Syntheses (decisions & casebooks)

- [[syntheses/api-surfaces-compared]] — native REST vs OpenAI-compatible vs Anthropic-compatible: pick by need
- [[syntheses/troubleshooting-playbook]] — symptom → cause → fix from 40 solved issues (GPU not detected, AMD ROCm, OOM, stalled downloads, memory growth, stops-serving)

## Statistics

- **Total pages**: 17
- **Concepts**: 13 · **Entities**: 1 · **Summaries**: 1 · **Syntheses**: 2
- **Sources ingested**: 121 (raw/, immutable)
- **High confidence**: 15 · **Medium confidence**: 2 · **Low confidence**: 0

## Coverage notes

Strong: install/serve, the CLI and Modelfile, the native + compatible APIs, all the capability surfaces (tools/vision/embeddings/structured outputs/thinking/web search), GPU/hardware, and a solved-issues casebook.

Latest release seen: v0.30.10 (14 releases v0.23–v0.30 in `raw/`); freshness = source fetch date 2026-06-23.

Mapped, not paged (see [[summaries/model-library-and-integrations-catalog]]): the full model library and the per-integration setup docs (Claude Code, Cline, Codex, Goose, Zed, VS Code, JetBrains, n8n, etc.). For live model availability and post-date releases, use `ollama.com` and web search.

---
title: "Ollama CLI Reference"
type: concept
tags: [cli, commands, reference, run, pull]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-cli-reference.md, raw/github_doc-readme-md.md, raw/llms_txt_doc-usage.md]
---

# Ollama CLI Reference

The `ollama` command manages and runs models locally. Run `ollama` with no arguments for the interactive menu.

## Running models

```
ollama run gemma4
```

Multimodal input — pass an image path in the prompt:

```
ollama run gemma4 "What's in this image? /Users/jmorgan/Desktop/smile.png"
```

Multiline input — wrap text with `"""`:

```
>>> """Hello,
... world!
... """
```

## Managing models

```
ollama pull gemma4                 # Download a model
ollama rm gemma4                   # Remove a model
ollama ls                          # List models (also: ollama list)
ollama ps                          # List running models
ollama stop gemma4                 # Stop a running model
ollama cp mymodel myuser/mymodel   # Copy a model
ollama push myuser/mymodel         # Push a model to ollama.com
ollama show --modelfile llama3.2   # Show a model's Modelfile
```

## Creating a model

First create a `Modelfile`:

```
FROM gemma4
SYSTEM """You are a happy cat."""
```

Then run `ollama create`:

```
ollama create -f Modelfile
```

See [[concepts/modelfile]] for the full Modelfile syntax.

## Embeddings

```
ollama run embeddinggemma "Hello world"
echo "Hello world" | ollama run nomic-embed-text
```

Output is a JSON array. See [[concepts/embeddings]].

## Serving

```
ollama serve
```

Starts the Ollama server. To view a list of environment variables that can be set, run `ollama serve --help`. See [[concepts/configuration-and-serving]].

## Launching integrations

```
ollama launch                          # interactive
ollama launch claude                   # specific integration
ollama launch claude --model qwen3.5   # with a specific model
ollama launch droid --config           # configure without launching
```

Supported integrations include **OpenCode**, **Claude Code**, **Codex**, **VS Code**, and **Droid**.

## Authentication

```
ollama signin    # Sign in to Ollama
ollama signout   # Sign out of Ollama
ollama -v        # Print version
```

See [[concepts/configuration-and-serving]] for sign-in details and API keys.

---
title: "Configuration and Serving"
type: concept
tags: [serve, configuration, environment-variables, networking, context-length]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-faq.md, raw/llms_txt_doc-context-length.md, raw/llms_txt_doc-authentication.md, raw/llms_txt_doc-troubleshooting.md]
---

# Configuration and Serving

Start the server with `ollama serve`; it is configured entirely through environment variables (`ollama serve --help` lists them).
## Key environment variables

| Variable | Purpose |
| --- | --- |
| `OLLAMA_HOST` | Bind address. Default binds `127.0.0.1` port `11434`. Set e.g. `0.0.0.0:11434` to expose on the network. |
| `OLLAMA_MODELS` | Directory where downloaded models are stored. |
| `OLLAMA_KEEP_ALIVE` | How long models stay loaded in memory (duration string, seconds, `-1` to keep loaded, `0` to unload immediately). |
| `OLLAMA_NUM_PARALLEL` | Max parallel requests per model (default 1). RAM scales by `OLLAMA_NUM_PARALLEL` * `OLLAMA_CONTEXT_LENGTH`. |
| `OLLAMA_MAX_LOADED_MODELS` | Max models loaded concurrently. Default is 3 * number of GPUs, or 3 for CPU inference. |
| `OLLAMA_MAX_QUEUE` | Max queued requests before returning a 503 (default 512). |
| `OLLAMA_CONTEXT_LENGTH` | Default context window in tokens (default 4096). |
| `OLLAMA_FLASH_ATTENTION` | Set to `1` to enable Flash Attention (reduces memory as context grows). |
| `OLLAMA_KV_CACHE_TYPE` | K/V cache quantization type: `f16` (default), `q8_0`, `q4_0`. Global; requires Flash Attention. |
| `OLLAMA_ORIGINS` | Additional allowed CORS origins (defaults allow `127.0.0.1` and `0.0.0.0`). |
| `OLLAMA_NO_CLOUD` | Set to `1` to disable cloud features (local-only mode). |
| `HTTPS_PROXY` | Proxy for outbound model pulls. Avoid setting `HTTP_PROXY` — Ollama pulls over HTTPS only. |

### Setting environment variables per platform

- **macOS** (run as app): `launchctl setenv OLLAMA_HOST "0.0.0.0:11434"`, then restart Ollama.
- **Linux** (systemd): `systemctl edit ollama.service`, add `Environment="OLLAMA_HOST=0.0.0.0:11434"` under `[Service]`, then `systemctl daemon-reload && systemctl restart ollama`.
- **Windows**: Quit Ollama, edit your user environment variables (`OLLAMA_HOST`, `OLLAMA_MODELS`, etc.), then relaunch from the Start menu.

## Context window control

Default context is 4096 tokens; web search, agents, and coding tools should use at least 64000. Override with `OLLAMA_CONTEXT_LENGTH=8192 ollama serve`.
In `ollama run`, use `/set parameter num_ctx 4096`; via the API, `"options": { "num_ctx": 4096 }`. Verify with `ollama ps` (`PROCESSOR`, `CONTEXT` columns).

## Where models are stored

macOS `~/.ollama/models`; Linux `/usr/share/ollama/.ollama/models`; Windows `C:\Users\%username%\.ollama\models`. Relocate with `OLLAMA_MODELS`; on Linux the `ollama` user needs access: `sudo chown -R ollama:ollama `.

## Keeping models loaded

Models stay in memory 5 minutes by default. `ollama stop ` unloads immediately, or use the API `keep_alive` on `/api/generate` and `/api/chat` (`"10m"`, `3600`, negative to keep loaded, or `0` to unload). `keep_alive` overrides `OLLAMA_KEEP_ALIVE`.

## Networking and remote access

Binds `127.0.0.1:11434` by default; set `OLLAMA_HOST` to change. Reverse proxy (Nginx) — `proxy_pass http://localhost:11434;` with `proxy_set_header Host localhost:11434;`. Tunnels:

```shell
ngrok http 11434 --host-header="localhost:11434"
cloudflared tunnel --url http://localhost:11434 --http-host-header="localhost:11434"
```

## Authentication

No auth locally; required for cloud models, publishing, and private downloads. Sign in with `ollama signin`. For direct access to `https://ollama.com/api`, set `export OLLAMA_API_KEY=your_api_key` and pass `-H "Authorization: Bearer $OLLAMA_API_KEY"`.

## Logs

macOS `cat ~/.ollama/logs/server.log`; Linux `journalctl -u ollama --no-pager --follow --pager-end`; Docker `docker logs `; Windows `explorer %LOCALAPPDATA%\Ollama` (`server.log`). Debug: `OLLAMA_DEBUG=1`.
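
The per-request overrides described on this page (`num_ctx` under `options`, and `keep_alive`) can be combined in one request body. The following is a minimal sketch using only the Python standard library; `build_generate_request` and `send` are hypothetical helper names, and the URL assumes the default bind address with no reverse proxy in front.

```python
import json
import urllib.request

# Assumption: the server listens on the default address described above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, num_ctx=None, keep_alive=None):
    """Build a non-streaming /api/generate body with optional per-request overrides."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        body["options"] = {"num_ctx": num_ctx}  # context window for this request
    if keep_alive is not None:
        body["keep_alive"] = keep_alive  # e.g. "10m", 3600, a negative number, or 0
    return body


def send(body, url=OLLAMA_URL):
    """POST the body to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the body needs no server; send(body) does.
body = build_generate_request("gemma4", "Why is the sky blue?", num_ctx=8192, keep_alive="10m")
print(json.dumps(body, indent=2))
```

Keeping the body construction separate from the network call makes the overrides easy to inspect before anything is sent. After a real call, `ollama ps` should reflect the requested context size.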
## See also

- [[concepts/cli-reference]] — `ollama serve`, `ollama ps`, `ollama stop`
- [[concepts/rest-api]] — `keep_alive`, `num_ctx` via the API
- [[syntheses/troubleshooting-playbook]]
- [[concepts/gpu-and-hardware]]

---
title: "Embeddings"
type: concept
tags: [embeddings, rag, semantic-search, api]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-embeddings.md, raw/llms_txt_doc-generate-embeddings.md]
---

# Embeddings

Embeddings turn text into numeric vectors for vector databases, cosine-similarity search, or RAG pipelines. Vector length depends on the model (typically 384–1024 dimensions).

## Endpoint

`POST /api/embed` creates vector embeddings for the input text. Required: `model` and `input`. See [[concepts/rest-api]].

```shell
curl http://localhost:11434/api/embed -d '{
  "model": "embeddinggemma",
  "input": "Why is the sky blue?"
}'
```

Fields: `model` (required); `input` (string or array of strings, required — pass an array for batch embeddings); `truncate` (boolean, default `true`; `false` errors on over-long input); `dimensions` (integer); `keep_alive` (string); `options` (e.g. `num_ctx`).

Response: `model`, `embeddings` (array of vectors), `total_duration`, `load_duration`, `prompt_eval_count`. Vectors are L2-normalized (unit length).

## SDK and CLI

```python
import ollama

single = ollama.embed(model='embeddinggemma', input='The quick brown fox...')
print(len(single['embeddings'][0]))  # vector length
```

CLI — directly or by piping: `ollama run embeddinggemma "Hello world"` or `echo "Hello world" | ollama run embeddinggemma`.

## Recommended models

* `embeddinggemma`
* `qwen3-embedding`
* `all-minilm`

## Tips

* Use cosine similarity for most semantic search.
* Use the same embedding model for indexing and querying.

Embeddings are also exposed through the OpenAI-compatible `/v1/embeddings` endpoint (supports `model`, `input`, `encoding_format`, `dimensions`) — see [[concepts/openai-and-anthropic-compat]].
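
The cosine-similarity tip above can be illustrated without a running server. This is a small sketch with hand-written toy vectors standing in for the `embeddings` array returned by `/api/embed`; `cosine_similarity` and `rank_documents` are illustrative names, not part of any Ollama SDK.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
query = [1.0, 0.0]
documents = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
ranking = rank_documents(query, documents)
print([index for index, _ in ranking])  # document indices, best match first
```

Because the returned vectors are unit length, the plain dot product would give the same ranking; the explicit normalization only guards against vectors from other sources.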

---
title: "GPU and Hardware Support"
type: concept
tags: [gpu, hardware, cuda, rocm, metal, vram]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-hardware-support.md, raw/llms_txt_doc-faq.md, raw/llms_txt_doc-troubleshooting.md]
---

# GPU and Hardware Support

Ollama accelerates inference on NVIDIA (CUDA), AMD (ROCm), Apple (Metal), and other GPUs via Vulkan, falling back to CPU when no GPU is usable. See [[concepts/configuration-and-serving]] for server environment variables.

## NVIDIA (CUDA)

Supports NVIDIA GPUs with **compute capability 5.0+** and driver 531+. Cards with compute capability 5.0–6.2 require driver 570+. Check your card at `https://developer.nvidia.com/cuda-gpus`.

Examples by compute capability: 12.0 = RTX 50xx (`RTX 5090`, `RTX 5080`...); 9.0 = `H200`, `H100`; 8.9 = RTX 40xx; 8.6 = RTX 30xx; 8.0 = `A100`, `A30`; 7.5 = RTX 20xx / `T4`; 5.0 = `GTX 750 Ti`.

### GPU selection

Limit Ollama to a subset of NVIDIA GPUs with `CUDA_VISIBLE_DEVICES` (comma-separated). Numeric IDs work but ordering may vary, so UUIDs (from `nvidia-smi -L`) are more reliable. Force CPU with an invalid GPU ID (e.g. `-1`).

## AMD Radeon (ROCm)

AMD GPUs are supported via ROCm; Ollama requires the **AMD ROCm v7 driver** on Linux (install/upgrade with `amdgpu-install`) and a ROCm v7 / HIP7-capable driver stack on Windows. Supported families: Radeon RX (`7900 XTX`, `9070 XT`...), Radeon PRO (`W7900`...), Radeon AI PRO, Ryzen AI, and AMD Instinct (`MI300X`...).

### GPU selection and overrides

* Limit to a subset: set `ROCR_VISIBLE_DEVICES` (list devices with `rocminfo`; prefer `Uuid` over numeric IDs; `-1` forces CPU).
* Unsupported card: force a close LLVM target with `HSA_OVERRIDE_GFX_VERSION` using `x.y.z` syntax (e.g. `HSA_OVERRIDE_GFX_VERSION="10.3.0"` for an RX 5400). For multiple GPUs, suffix the device number, e.g. `HSA_OVERRIDE_GFX_VERSION_0=10.3.0`.

## Apple (Metal) and Vulkan

Apple devices accelerate via the Metal API.
Windows/Linux also support Vulkan (enabled by default when installed): select GPUs with `GGML_VK_VISIBLE_DEVICES` (numeric IDs); disable all with `OLLAMA_VULKAN=0` or `GGML_VK_VISIBLE_DEVICES=-1`.

## Placement and VRAM

The `Processor` column of `ollama ps` shows `100% GPU`, `100% CPU`, or a split (`48%/52% CPU/GPU`). On load, if a model fits a single GPU it loads there; otherwise it spreads across GPUs. `OLLAMA_MAX_LOADED_MODELS` (default 3 × GPUs, or 3 for CPU) and `OLLAMA_NUM_PARALLEL` (default 1) control concurrency; RAM scales by `OLLAMA_NUM_PARALLEL` × `OLLAMA_CONTEXT_LENGTH`.

## CPU fallback and library override

Ollama auto-picks among bundled LLM libraries. CPU order: `cpu_avx2` > `cpu_avx` > `cpu`. Force one with `OLLAMA_LLM_LIBRARY`, e.g. `OLLAMA_LLM_LIBRARY="cpu_avx2" ollama serve`. For GPU-discovery failures, see [[syntheses/troubleshooting-playbook]].

---
title: "Installing Ollama"
type: concept
tags: [installation, macos, windows, linux, docker]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-macos.md, raw/llms_txt_doc-windows.md, raw/llms_txt_doc-linux.md, raw/llms_txt_doc-docker.md, raw/llms_txt_doc-quickstart.md, raw/github_doc-readme-md.md]
---

# Installing Ollama

Ollama is available on macOS, Windows, and Linux. After install, the API is served on `http://localhost:11434`.

## macOS

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

Or [download manually](https://ollama.com/download/Ollama.dmg). Requires macOS Sonoma (v14)+; Apple M series (CPU+GPU) or x86 (CPU only). Preferred: mount `ollama.dmg` and drag the app to `Applications`. On startup the app verifies the `ollama` CLI is in PATH and, if not, prompts to create a link in `/usr/local/bin`. Models and configuration live in `~/.ollama`; logs in `~/.ollama/logs`.

## Windows

```shell
irm https://ollama.com/install.ps1 | iex
```

Or [download manually](https://ollama.com/download/OllamaSetup.exe). Requires Windows 10 22H2+ (Home or Pro); NVIDIA 452.39+ drivers for NVIDIA cards.
No Administrator needed; installs in your home directory (needs ≥4GB for the binary). The `ollama` command works in `cmd`, `powershell`, or your terminal. Install to a different location:

```powershell
OllamaSetup.exe /DIR="d:\some\location"
```

A standalone `ollama-windows-amd64.zip` (CLI + GPU libs) is available for embedding or running as a service via `ollama serve`. Models and configuration live under `%HOMEPATH%\.ollama`.

## Linux

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

### Manual install

If upgrading, first `sudo rm -rf /usr/lib/ollama`, then extract, start the server, and verify from a second terminal:

```shell
curl -fsSL https://ollama.com/download/ollama-linux-amd64.tar.zst | sudo tar x -C /usr
ollama serve   # runs in the foreground until stopped
ollama -v      # run in another terminal to verify the install
```

AMD GPU adds `ollama-linux-amd64-rocm.tar.zst`; ARM64 uses `ollama-linux-arm64.tar.zst`. Pin a version: `curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.5.7 sh`.

### Startup service (recommended)

Create an `ollama` user, then `/etc/systemd/system/ollama.service`, and `sudo systemctl daemon-reload && sudo systemctl enable --now ollama`:

```ini
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=$PATH"

[Install]
WantedBy=multi-user.target
```

## Docker

Official image `ollama/ollama` on Docker Hub (Vulkan bundled, enabled when the container can access the GPU). Run a model after start: `docker exec -it ollama ollama run llama3.2`.
```shell
# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# NVIDIA GPU (requires NVIDIA Container Toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# AMD GPU
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
```

## See also

- [[concepts/cli-reference]] — full command set
- [[concepts/configuration-and-serving]] — env vars, model storage, networking
- [[concepts/gpu-and-hardware]] — GPU support details

---
title: "Modelfile Reference"
type: concept
tags: [modelfile, create-model, customization, parameters, import]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-modelfile-reference.md, raw/llms_txt_doc-create-a-model.md, raw/llms_txt_doc-importing-a-model.md]
---

# Modelfile Reference

A Modelfile is the blueprint to create and share customized models. Format is `INSTRUCTION arguments` (one per line, `#` for comments); **not case sensitive**, any order. Example then build:

```
FROM llama3.2
PARAMETER temperature 1
PARAMETER num_ctx 4096
SYSTEM You are Mario from super mario bros, acting as an assistant.
```

```shell
ollama create choose-a-model-name -f ./Modelfile && ollama run choose-a-model-name
```

## Instructions

| Instruction | Description |
| --- | --- |
| `FROM` (required) | Defines the base model to use. |
| `PARAMETER` | Sets the parameters for how Ollama will run the model. |
| `TEMPLATE` | The full prompt template to be sent to the model. |
| `SYSTEM` | Specifies the system message set in the template. |
| `ADAPTER` | Defines the (Q)LoRA adapters to apply to the model. |
| `LICENSE` | Specifies the legal license. |
| `MESSAGE` | Specify message history. |
| `REQUIRES` | Specify the minimum version of Ollama required by the model. |

### FROM (required)

`FROM :` (existing model, e.g.
`FROM llama3.2`), `FROM ` (Safetensors dir), or `FROM ./ollama-model.gguf` (GGUF, absolute or relative path). Supported Safetensors architectures: Llama (incl. 2/3/3.1/3.2), Mistral (incl. Mistral 1/2 and Mixtral), Gemma (incl. Gemma 1 and 2), Phi3.

### PARAMETER

`PARAMETER `:

| Parameter | Description | Default |
| --- | --- | --- |
| num_ctx | Size of the context window used to generate the next token | 2048 |
| repeat_last_n | How far back to look to prevent repetition (0 = disabled, -1 = num_ctx) | 64 |
| repeat_penalty | How strongly to penalize repetitions | 1.1 |
| temperature | Higher = more creative | 0.8 |
| seed | Random seed; fixed value gives reproducible output | 0 |
| stop | Stop sequence; set multiple `stop` params for multiple sequences | — |
| num_predict | Max tokens to predict (-1 = infinite) | -1 |
| draft_num_predict | Max speculative draft tokens per step (0 to disable) | 4 |
| top_k | Higher = more diverse | 40 |
| top_p | Works with top-k; higher = more diverse | 0.9 |
| min_p | Minimum token probability relative to the most likely token | 0.0 |

### TEMPLATE

The full prompt template, using Go [template syntax](https://pkg.go.dev/text/template). Variables: `{{ .System }}`, `{{ .Prompt }}`, `{{ .Response }}` (text after `.Response` is omitted when generating).

```
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
```

### SYSTEM, ADAPTER, LICENSE

`SYSTEM """"""`; `ADAPTER ./ollama-lora.gguf` (or a Safetensor adapter path); `LICENSE """"""`. `ADAPTER` must be absolute or relative to the Modelfile, and `FROM` must use the **same base model** the adapter was tuned from or behavior will be erratic.

### MESSAGE

`MESSAGE ` (roles `system`, `user`, `assistant`) builds a conversation to guide the model:

```
MESSAGE user Is Toronto in Canada?
MESSAGE assistant yes
```

### REQUIRES

`REQUIRES ` — e.g. `REQUIRES 0.14.0`.
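
Because a Modelfile is just one `INSTRUCTION arguments` pair per line, it can be generated from structured settings. The sketch below is a hypothetical helper (`render_modelfile` is not an Ollama API) that covers only `FROM`, `PARAMETER`, `SYSTEM`, and `MESSAGE`; it writes values verbatim and assumes they need no quoting or escaping.

```python
VALID_ROLES = ("system", "user", "assistant")


def render_modelfile(base, system=None, parameters=None, messages=None):
    """Render a Modelfile string from a base model and optional settings."""
    lines = [f"FROM {base}"]
    for name, value in (parameters or {}).items():
        # A list value emits one PARAMETER line per entry (useful for `stop`).
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            lines.append(f"PARAMETER {name} {item}")
    if system:
        lines.append(f'SYSTEM """{system}"""')
    for role, text in (messages or []):
        if role not in VALID_ROLES:
            raise ValueError(f"unknown MESSAGE role: {role}")
        lines.append(f"MESSAGE {role} {text}")
    return "\n".join(lines) + "\n"


modelfile = render_modelfile(
    "llama3.2",
    system="You are Mario from super mario bros, acting as an assistant.",
    parameters={"temperature": 1, "num_ctx": 4096, "stop": ["AI:", "User:"]},
    messages=[("user", "Is Toronto in Canada?"), ("assistant", "yes")],
)
print(modelfile)
```

The rendered text can be saved as `Modelfile` and built with `ollama create choose-a-model-name -f ./Modelfile`, as shown earlier on this page.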
## Importing and quantizing

Point `FROM` at a Safetensors/GGUF file or directory (`FROM .` if the Modelfile sits with the weights), then `ollama create my-model`. Quantize an FP16/FP32 model with `-q`/`--quantize`, e.g. `ollama create --quantize q4_K_M mymodel`. Supported: `q8_0`, K-means `q4_K_S`, `q4_K_M`.

## See also

- [[concepts/cli-reference]] — `ollama create`, `ollama show`, `ollama push`
- [[concepts/rest-api]] — the `/api/create` endpoint

---
title: "OpenAI- and Anthropic-Compatible APIs"
type: concept
tags: [openai, anthropic, compatibility, api, claude-code]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-openai-compatibility.md, raw/llms_txt_doc-anthropic-compatibility.md]
---

# OpenAI- and Anthropic-Compatible APIs

Drop-in compatible endpoints so existing OpenAI/Anthropic SDK code can point at a local Ollama server. See [[concepts/rest-api]] (native) and [[syntheses/api-surfaces-compared]].

## OpenAI compatibility

Base URL `http://localhost:11434/v1/`; `api_key` required by the SDK but ignored (use `'ollama'`):

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1/', api_key='ollama')
client.chat.completions.create(
    model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': 'Say this is a test'}],
)
```

Endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/models/{model}`, `/v1/embeddings`, `/v1/images/generations` (experimental), `/v1/responses` (added in v0.13.3; non-stateful only — no `previous_response_id` or `conversation`).

`/v1/chat/completions` supports streaming, JSON mode, reproducible outputs, vision, tools, reasoning control. Fields: `model`, `messages`, `frequency_penalty`, `presence_penalty`, `response_format`, `seed`, `stop`, `stream`, `stream_options.include_usage`, `temperature`, `top_p`, `max_tokens`, `tools`, `reasoning_effort` (`"high"`/`"medium"`/`"low"`/`"max"`/`"none"`), `reasoning.effort`. Not supported: `tool_choice`, `logit_bias`, `user`, `n`, `logprobs`.
No way to set context size here; bake `PARAMETER num_ctx ` into a Modelfile, `ollama create mymodel`, call the new name. See [[concepts/modelfile]].

## Anthropic compatibility

Anthropic Messages API at `/v1/messages` (base URL `http://localhost:11434`), enabling tools like Claude Code. Set `ANTHROPIC_AUTH_TOKEN=ollama` (ignored) and `ANTHROPIC_BASE_URL=http://localhost:11434`.

```shell
curl -X POST http://localhost:11434/v1/messages \
  -H "x-api-key: ollama" -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3-coder",
    "max_tokens": 1024,
    "messages": [{ "role": "user", "content": "Hello, how are you?" }]
  }'
```

`/v1/messages` supports streaming, system prompts, multi-turn, vision (base64), `tool_use`/`tool_result` blocks, `thinking` blocks. Fields: `model`, `max_tokens`, `messages`, `system`, `stream`, `temperature`, `top_p`, `top_k`, `stop_sequences`, `tools`, `thinking`.

**Claude Code:** `ollama launch claude` auto-configures and launches; `--config` configures without launching; or set the env vars and run `claude --model qwen3-coder`. Recommended: `glm-4.7`, `minimax-m2.1`, `qwen3-coder`.

**Differences from the real Anthropic API:** API key not validated; `anthropic-version` unused; token counts approximate. Not supported: `/v1/messages/count_tokens`, `tool_choice`, `metadata`, prompt caching (`cache_control`), Batches API, citations, PDF (`document`) blocks. Image base64-only (no URLs); extended thinking basic (`budget_tokens` accepted but not enforced).
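
When existing SDK code is pointed at Ollama, one cautious option is to drop the top-level request fields this page lists as unsupported before sending. The sketch below is illustrative only: `strip_unsupported` is a hypothetical helper, the field sets are copied from the lists above, and this page does not say whether the server rejects or silently ignores such fields. Nested items such as `cache_control` inside content blocks are not handled.

```python
# Top-level fields listed as unsupported on this page, per compatibility surface.
UNSUPPORTED_FIELDS = {
    "openai": {"tool_choice", "logit_bias", "user", "n", "logprobs"},
    "anthropic": {"tool_choice", "metadata"},
}


def strip_unsupported(request, surface):
    """Return (cleaned_request, dropped_field_names) for the given surface."""
    unsupported = UNSUPPORTED_FIELDS[surface]
    cleaned = {key: value for key, value in request.items() if key not in unsupported}
    dropped = sorted(key for key in request if key in unsupported)
    return cleaned, dropped


original = {
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Say this is a test"}],
    "temperature": 0.2,
    "n": 2,
    "user": "demo-user",
}
cleaned, dropped = strip_unsupported(original, "openai")
print(dropped)  # names of the fields that were removed
```

Returning the dropped names alongside the cleaned body lets a caller log what changed instead of silently altering behaviour.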

---
title: "Ollama REST API"
type: concept
tags: [rest-api, api, generate, chat, embeddings, streaming]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-generate-a-response.md, raw/llms_txt_doc-generate-a-chat-message.md, raw/llms_txt_doc-generate-embeddings.md, raw/llms_txt_doc-usage.md, raw/llms_txt_doc-streaming.md, raw/llms_txt_doc-list-models.md, raw/llms_txt_doc-list-running-models.md, raw/llms_txt_doc-show-model-details.md, raw/llms_txt_doc-pull-a-model.md, raw/llms_txt_doc-create-a-model.md, raw/llms_txt_doc-get-version.md]
---

# Ollama REST API

Native API at `http://localhost:11434` (base path `/api`); cloud at `https://ollama.com/api`. No local auth.

## POST /api/generate

Prompt → response. Required: `model`. Fields: `prompt`, `suffix`, `images` (base64), `system`, `format` (`"json"` or a JSON schema), `stream` (default `true`), `think` (boolean or `"high"`/`"medium"`/`"low"`/`"max"`), `raw`, `keep_alive`, `options`, `logprobs`, `top_logprobs`. Response: `response`, `done`, `done_reason`, `thinking`, usage metrics (below).

## POST /api/chat

Next chat message. Required: `model`, `messages`. Fields: `tools`, `format`, `options`, `stream` (default `true`), `think`, `keep_alive`, `logprobs`, `top_logprobs`.

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [{ "role": "user", "content": "why is the sky blue?" }]
}'
```

Each message has `role` (`system`/`user`/`assistant`/`tool`), `content`, optional `images` (base64) and `tool_calls`. The response `message.role` is always `assistant`. See [[concepts/tool-calling]], [[concepts/structured-outputs]], [[concepts/vision-and-multimodal]], [[concepts/thinking-and-web-search]].

## The `options` object

Controls generation: `seed`, `temperature`, `top_k`, `top_p`, `min_p`, `stop` (string or array), `num_ctx` (context length), `num_predict` (max tokens). Additional properties allowed.

## POST /api/embed

Vector embeddings. Required: `model`, `input` (string or array).
Fields: `truncate` (default `true`), `dimensions`, `keep_alive`, `options`. Returns `embeddings` (array of vectors) plus `total_duration`, `load_duration`, `prompt_eval_count`. See [[concepts/embeddings]]. ## Model management endpoints ```shell curl http://localhost:11434/api/tags # GET — list local models curl http://localhost:11434/api/ps # GET — list running models curl http://localhost:11434/api/version # GET — Ollama version curl http://localhost:11434/api/show -d '{ "model": "gemma4" }' # POST — model details curl http://localhost:11434/api/pull -d '{ "model": "gemma4" }' # POST — pull a model curl http://localhost:11434/api/create -d '{ "from": "gemma4", "model": "alpaca", "system": "You are Alpaca." }' # POST — create ``` - `/api/tags` returns each model's `name`, `model`, `modified_at`, `size`, `digest`, and `details` (format, family, parameter_size, quantization_level). - `/api/ps` adds `expires_at`, `size_vram`, and `context_length` per running model. - `/api/show` returns `parameters`, `license`, `template`, `capabilities`, `details`, and `model_info`; pass `"verbose": true` for large fields. - `/api/pull` and `/api/create` stream status events (`status`, `digest`, `total`, `completed`); pass `"stream": false` to disable. ## Streaming Endpoints stream by default as newline-delimited JSON (`application/x-ndjson`), one chunk per line with `"done": false` until the final chunk: ```json {"model":"gemma4","created_at":"2025-10-26T17:15:24.166576Z","response":"!","done":true,"done_reason":"stop"} ``` Set `{"stream": false}` for a single `application/json` response. Usage fields appear in the final chunk. ## Usage metrics All timing values in **nanoseconds**: `total_duration` (total), `load_duration` (model load), `prompt_eval_count` (input tokens) / `prompt_eval_duration`, `eval_count` (output tokens) / `eval_duration`. 
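
The streaming and usage notes above can be combined into a small reader. This is a minimal sketch, assuming `/api/generate`-style chunks; the helper names and the sample numbers are made up for illustration, and only the field names come from the sections above.

```python
import json


def collect_stream(lines):
    """Join `response` fragments from NDJSON lines; return (text, final_chunk)."""
    parts, final = [], None
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            final = chunk  # usage fields appear in the final chunk
            break
    return "".join(parts), final


def tokens_per_second(final_chunk):
    """Output tokens per second; `eval_duration` is reported in nanoseconds."""
    seconds = final_chunk["eval_duration"] / 1e9
    return final_chunk["eval_count"] / seconds if seconds else 0.0


# Hypothetical two-chunk stream (values invented for the example).
sample = [
    '{"model":"gemma4","response":"Hi","done":false}',
    '{"model":"gemma4","response":"!","done":true,"done_reason":"stop",'
    '"eval_count":2,"eval_duration":1000000000}',
]
text, final = collect_stream(sample)
print(text, tokens_per_second(final))
```

The same function should work on an HTTP response body iterated line by line, since each line is one JSON object.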
## See also

- [[concepts/configuration-and-serving]]: `keep_alive`, env vars, networking
- [[concepts/openai-and-anthropic-compat]]: OpenAI/Anthropic-compatible endpoints
- [[syntheses/api-surfaces-compared]]

---
title: "Structured Outputs (JSON Schema)"
type: concept
tags: [structured-outputs, json, schema, format]
updated: 2026-06-23
confidence: high
sources: [raw/llms_txt_doc-structured-outputs.md]
---

# Structured Outputs (JSON Schema)

Structured outputs enforce a JSON schema on responses so you can reliably extract data, describe images, or keep replies consistent. Set the `format` parameter on the `/api/chat` request (see [[concepts/rest-api]]).

> Note: Ollama's Cloud currently does not support structured outputs. See
> [[entities/ollama-cloud]].

## JSON mode

Pass `"format": "json"` in the `/api/chat` body to force valid JSON output.

## JSON with a schema

Provide a full JSON schema to `format` (also passing it as a string in the prompt helps ground the model):

```shell
curl -X POST http://localhost:11434/api/chat -d '{
  "model": "gpt-oss",
  "messages": [{"role": "user", "content": "Tell me about Canada."}],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "capital": {"type": "string"},
      "languages": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name", "capital", "languages"]
  }
}'
```

In Python, pass a Pydantic model's `model_json_schema()` to `format` and validate the response with `model_validate_json()`:

```python
from ollama import chat
from pydantic import BaseModel


class Country(BaseModel):
    name: str
    capital: str
    languages: list[str]


response = chat(
    model='gpt-oss',
    messages=[{'role': 'user', 'content': 'Tell me about Canada.'}],
    format=Country.model_json_schema(),
)

country = Country.model_validate_json(response.message.content)
```

In JavaScript, serialize a Zod schema with `z.toJSONSchema(schema)` and parse the result.
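
Where Pydantic is not available, the same flow can be approximated with the standard library. The sketch below is an assumption-laden stand-in, not a JSON Schema validator: the hypothetical `parse_structured_reply` only checks that the reply is a JSON object containing the schema's `required` keys.

```python
import json

# Same schema as the curl example above.
COUNTRY_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "capital": {"type": "string"},
        "languages": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "capital", "languages"],
}


def build_structured_request(model, prompt, schema):
    """Body for POST /api/chat with a JSON schema in `format`."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "format": schema,
    }


def parse_structured_reply(content, schema):
    """Parse `message.content` and check required keys (shallow check only)."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data


body = build_structured_request("gpt-oss", "Tell me about Canada.", COUNTRY_SCHEMA)
# Hypothetical reply content, written by hand for the example.
reply = '{"name": "Canada", "capital": "Ottawa", "languages": ["English", "French"]}'
print(parse_structured_reply(reply, COUNTRY_SCHEMA)["capital"])
```

For type-level validation (for example, that `languages` is a list of strings), a real validator such as Pydantic remains the better fit.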
## Vision with structured outputs Vision models accept the same `format` parameter for deterministic image descriptions (pass `images` and a schema; set `options={'temperature': 0}`). See [[concepts/vision-and-multimodal]]. ## Tips * Define schemas with Pydantic (Python) or Zod (JavaScript) so they can be reused for validation. * Lower the temperature (e.g. `0`) for more deterministic completions. * Through the OpenAI-compatible API, structured outputs work via `response_format`. See [[concepts/openai-and-anthropic-compat]]. --- title: "Thinking and Web Search" type: concept tags: [thinking, reasoning, web-search, agents, api] updated: 2026-06-23 confidence: high sources: [raw/llms_txt_doc-thinking.md, raw/llms_txt_doc-web-search.md] --- # Thinking and Web Search Two capabilities that augment generation: controlling a model's reasoning trace ("thinking"), and grounding answers with Ollama's web search API. ## Thinking (reasoning control) Thinking-capable models emit a `thinking` field separating their reasoning trace from the final answer — use it to audit steps, animate "thinking" in a UI, or hide the trace. Set the `think` field on chat or generate requests. Most models accept booleans (`true`/`false`) or levels (`low`, `medium`, `high`, `max`), where `max` requests the highest level. GPT-OSS instead requires one of `low`, `medium`, `high` — `true`/`false` is ignored. ```shell curl http://localhost:11434/api/chat -d '{ "model": "qwen3", "messages": [{"role": "user", "content": "How many letter r are in strawberry?"}], "think": true, "stream": false }' ``` The `message.thinking` (chat) or `thinking` (generate) field holds the reasoning trace; `message.content` / `response` holds the final answer. When streaming, thinking tokens precede answer tokens — detect the first `thinking` chunk, then switch once `message.content` arrives. Supported models: Qwen 3, GPT-OSS (levels only), DeepSeek-v3.1, DeepSeek R1; enabled by default in CLI and API. 
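
The streaming behaviour described above (thinking fragments first, then answer fragments) can be handled with a small generator. This is a sketch over already-parsed `/api/chat` chunks; the helper names and the sample chunks are hypothetical, and only the `message.thinking` / `message.content` field names come from the text above.

```python
def iter_phases(chunks):
    """Yield ("thinking" | "content", text) for each non-empty fragment."""
    for chunk in chunks:
        message = chunk.get("message", {})
        if message.get("thinking"):
            yield "thinking", message["thinking"]
        if message.get("content"):
            yield "content", message["content"]


def split_trace(chunks):
    """Return (reasoning_trace, final_answer) as two joined strings."""
    thinking, content = [], []
    for phase, text in iter_phases(chunks):
        (thinking if phase == "thinking" else content).append(text)
    return "".join(thinking), "".join(content)


# Hand-written chunks standing in for a parsed stream.
sample_chunks = [
    {"message": {"role": "assistant", "thinking": "Count ", "content": ""}, "done": False},
    {"message": {"role": "assistant", "thinking": "the letters.", "content": ""}, "done": False},
    {"message": {"role": "assistant", "content": "Three."}, "done": True},
]
trace, answer = split_trace(sample_chunks)
print(answer)
```

A UI could consume `iter_phases` directly and switch from a "thinking" indicator to the answer view on the first `content` event.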
### CLI quick reference * Enable: `ollama run deepseek-r1 --think "..."`; disable: `--think=false`; hide trace: `--hidethinking`. * Interactive toggle: `/set think` or `/set nothink`. GPT-OSS levels: `ollama run gpt-oss --think=low "..."`. See [[concepts/cli-reference]] and [[concepts/tool-calling]] (SDK examples combine `think=True` with tools). ## Web search API Augments models with current information. Hosted at `ollama.com` (not local); needs an API key — create at `https://ollama.com/settings/keys` (free account). Set `OLLAMA_API_KEY` or pass it in the `Authorization` header. * **`POST https://ollama.com/api/web_search`** — `query` (string, required); `max_results` (integer, optional; default 5, max 10). Returns `results`, each with `title`, `url`, `content`. (Example in [[entities/ollama-cloud]].) * **`POST https://ollama.com/api/web_fetch`** — fetches a single page by `url`; returns `title`, `content`, `links`. Libraries expose `web_search`/`web_fetch` (Python) and `webSearch`/`webFetch` (JS), passable as tools in an agent loop. Results can be thousands of tokens — raise context to ≥~32000. Also enableable in any MCP client via the Python MCP server. See [[entities/ollama-cloud]]. --- title: "Tool Calling (Function Calling)" type: concept tags: [tools, function-calling, agents, api] updated: 2026-06-23 confidence: high sources: [raw/llms_txt_doc-tool-calling.md] --- # Tool Calling (Function Calling) Ollama supports tool calling (function calling) so a model can invoke tools and incorporate their results. Tools are passed to the `/api/chat` endpoint (see [[concepts/rest-api]]). 
## Defining tools

The `tools` array contains objects of `type: "function"`, each with a `function` holding `name`, `description`, and a JSON-Schema `parameters` object:

```shell
curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "What is the temperature in New York?"}],
  "stream": false,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_temperature",
        "description": "Get the current temperature for a city",
        "parameters": {
          "type": "object",
          "required": ["city"],
          "properties": {
            "city": {"type": "string", "description": "The name of the city"}
          }
        }
      }
    }
  ]
}'
```

## Returning tool results

The model replies with `message.tool_calls`, each a `{"type":"function","function":{"index":0,"name":...,"arguments":{...}}}`. Execute each call, then re-send `messages` with the assistant message (carrying its `tool_calls`) plus one `{"role":"tool","tool_name":...,"content":...}` message per call, in order. Parallel calls return multiple `tool_calls` entries (each with an `index`).

## Agent loop and SDKs

A multi-turn agent loop calls the model repeatedly, executing returned tool calls and appending results until `tool_calls` is empty:

```python
from ollama import chat, ChatResponse


def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b


def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b


# Map tool names to the callables that implement them.
available_functions = {'add': add, 'multiply': multiply}
messages = [{'role': 'user', 'content': 'What is (11434+12341)*412?'}]

while True:
    response: ChatResponse = chat(model='qwen3', messages=messages, tools=[add, multiply], think=True)
    messages.append(response.message)
    if response.message.tool_calls:
        # Run every requested tool and append one tool message per call.
        for tc in response.message.tool_calls:
            result = available_functions[tc.function.name](**tc.function.arguments)
            messages.append({'role': 'tool', 'tool_name': tc.function.name, 'content': str(result)})
    else:
        break
```

The Python SDK auto-parses Python functions into a tool schema, so you can pass functions directly in the `tools` list (raw JSON schemas also work).
Install with `pip install ollama -U` (Python) or `npm i ollama` (JavaScript). When streaming, gather every `thinking`, `content`, and `tool_calls` chunk, then send those fields back with the tool results in the follow-up request. Tool calling pairs naturally with [[concepts/thinking-and-web-search]] (`think=True`). --- title: "Vision and Multimodal Models" type: concept tags: [vision, multimodal, images, api] updated: 2026-06-23 confidence: high sources: [raw/llms_txt_doc-vision.md, raw/llms_txt_doc-openai-compatibility.md] --- # Vision and Multimodal Models Vision models accept images alongside text to describe, classify, and answer questions about what they see. ## Quick start (CLI) ```shell ollama run gemma4 ./image.png whats in this image? ``` See [[concepts/cli-reference]] for `ollama run`. ## Passing images via the native API Provide an `images` array on the message. SDKs accept file paths, URLs, or raw bytes; the REST API (`/api/chat`) expects base64-encoded data (`IMG=$(base64 < test.jpg | tr -d '\n')`): ```shell curl -X POST http://localhost:11434/api/chat -d '{ "model": "gemma4", "messages": [{ "role": "user", "content": "What is in this image?", "images": ["'"$IMG"'"] }], "stream": false }' ``` In the Python SDK, `images` accepts a path, base64 string, or raw bytes (`messages=[{'role':'user','content':'...','images':[path]}]`). See [[concepts/rest-api]] for the full `/api/chat` shape. ## Passing images via the OpenAI-compatible API `/v1/chat/completions` accepts vision input as a content part of type `image_url`, where `image_url` is a base64 data URI (image URLs are not supported): ```shell curl -X POST http://localhost:11434/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "qwen3-vl:8b", "messages": [{"role": "user", "content": [ {"type": "text", "text": "What is this an image of?"}, {"type": "image_url", "image_url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUg..."} ]}] }' ``` See [[concepts/openai-and-anthropic-compat]]. 
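
Building the base64 data URI by hand is error-prone, so a small helper may be useful. The sketch below mirrors the request shape shown in the `curl` example above (a plain string under `image_url`); the function names are hypothetical and the four sample bytes are only the start of a PNG signature, not a real image.

```python
import base64


def to_data_uri(raw, mime="image/png"):
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def vision_message(text, data_uri):
    """User message with a text part and an image part for /v1/chat/completions."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": data_uri},
        ],
    }


# Placeholder bytes; in practice read them with open(path, "rb").read().
uri = to_data_uri(b"\x89PNG")
message = vision_message("What is this an image of?", uri)
print(message["content"][1]["image_url"])
```

For the native `/api/chat` endpoint, only the bare base64 string (without the `data:` prefix) goes into the `images` array.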
## Example models `gemma4`, `qwen3-vl:8b`. Structured image descriptions are possible by combining vision with a JSON schema — see [[concepts/structured-outputs]]. --- title: "What is Ollama" type: concept tags: [ollama, overview, getting-started, local-llm] updated: 2026-06-23 confidence: high sources: [raw/llms_txt_doc-introduction.md, raw/llms_txt_doc-quickstart.md, raw/llms_txt_doc-overview.md, raw/github_doc-readme-md.md] --- # What is Ollama Ollama runs open models locally and exposes a REST API to build with them programmatically. Available on macOS, Windows, and Linux. ## Getting started Run `ollama` to open the interactive menu: ```sh ollama ``` Navigate with `↑/↓`, `enter` to launch, `→` to change model, `esc` to quit. The menu gives quick access to **Run a model**, **Launch tools** (Claude Code, Codex, OpenClaw, and more), and **Additional integrations** (under "More..."). Chat with a model directly: ```sh ollama run gemma4 ``` See [ollama.com/library](https://ollama.com/library) for the full model list. ## Launching integrations ```sh ollama launch claude ollama launch codex ollama launch opencode ollama launch openclaw ``` See [[summaries/model-library-and-integrations-catalog]] for the full catalog. ## REST API After installation, the API is served by default at `http://localhost:11434/api`. For cloud models on **ollama.com**, the same API is at `https://ollama.com/api`. Access via `curl`: ```sh curl http://localhost:11434/api/chat -d '{ "model": "gemma4", "messages": [{ "role": "user", "content": "Hello!" }] }' ``` See [[concepts/rest-api]] for all endpoints. ## Libraries Official libraries for Python (`pip install ollama`) and JavaScript (`npm i ollama`); community libraries also exist. ## Backends and versioning Ollama is built on [llama.cpp](https://github.com/ggml-org/llama.cpp). The API isn't strictly versioned but is expected to be stable and backwards compatible; deprecations are rare and announced in the release notes. 
## Next steps - [[concepts/installation]] — install on your platform - [[concepts/cli-reference]] — the full `ollama` command set - [[concepts/configuration-and-serving]] — `ollama serve` and environment variables --- title: "Ollama Cloud" type: entity tags: [ollama-cloud, hosted-models, web-search, authentication, api-keys] updated: 2026-06-23 confidence: high sources: [raw/llms_txt_doc-cloud.md, raw/llms_txt_doc-web-search.md, raw/llms_txt_doc-authentication.md] --- # Ollama Cloud Ollama Cloud is Ollama's hosted service: it runs large models that wouldn't fit locally by offloading them to Ollama's servers while you keep using the same local tools. It also provides a hosted web search / web fetch API. Cloud features are optional and can be disabled to run [[concepts/what-is-ollama|Ollama]] in local-only mode. ## Cloud models Cloud models "run without a powerful GPU" — auto-offloaded to Ollama's cloud, same capabilities as local models, full context length. Supported list: filter the library at `https://ollama.com/search?c=cloud`. Requires an [ollama.com](https://ollama.com) account (`ollama signin`). Once signed in, use them like local models — the name carries a `-cloud` (or `:cloud`) suffix (e.g. `ollama run gpt-oss:120b-cloud`). For SDK/`curl` against the **local** endpoint, first `ollama pull gpt-oss:120b-cloud`; Ollama then authenticates cloud requests automatically. ## Cloud API access (ollama.com as a remote host) Cloud models can also be hit directly on ollama.com's API — "ollama.com acts as a remote Ollama host," served at `https://ollama.com/api`, with the same native endpoints. Here the model name drops the `-cloud` suffix (e.g. `gpt-oss:120b`). 
List with `curl https://ollama.com/api/tags`; generate: ``` curl https://ollama.com/api/chat \ -H "Authorization: Bearer $OLLAMA_API_KEY" \ -d '{ "model": "gpt-oss:120b", "messages": [{"role": "user", "content": "Why is the sky blue?"}], "stream": false }' ``` The Python/JavaScript libraries accept `host="https://ollama.com"` plus an `Authorization: Bearer` header to target the cloud host. ## Authentication No auth locally; **required** for running cloud models, publishing, and private downloads. Two methods: **sign in** (`ollama signin` — then ollama.com requests authenticate automatically), or an **API key** for programmatic access to `https://ollama.com/api`. Create a key at `https://ollama.com/settings/keys`, `export OLLAMA_API_KEY=your_api_key`, and pass `-H "Authorization: Bearer $OLLAMA_API_KEY"`. Keys don't expire but can be revoked anytime. (Your local instance also has an Ollama Public Key, `id_ed25519.pub`, for pushing/pulling private models — see [[concepts/configuration-and-serving]].) ## Web search and web fetch A REST API (free account + API key required) that augments models with current information: * **`POST https://ollama.com/api/web_search`** — `query` (string, required), `max_results` (integer, optional; default `5`, max `10`). Returns `results[]`, each with `title`, `url`, `content`. * **`POST https://ollama.com/api/web_fetch`** — `url` (string, required). Returns `title`, `content`, `links[]`. ```bash curl https://ollama.com/api/web_search \ --header "Authorization: Bearer $OLLAMA_API_KEY" \ -d '{"query":"what is ollama?"}' ``` The Python (`ollama.web_search(...)`, `web_fetch(...)`) and JS (`client.webSearch(...)`, `client.webFetch(...)`) libraries expose these as callable tools for an agent loop (see [[concepts/thinking-and-web-search]], [[concepts/tool-calling]]). Results can be thousands of tokens — raise context to ≥~32000. Also wireable into any MCP client via the Python MCP server (Cline, Codex, Goose). 
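
For callers not using the official libraries, the web search call above can be prepared with the standard library. This is a minimal sketch: the helper names are hypothetical, the lower bound of 1 on `max_results` is an assumption (the text above only states the default and the maximum), and the request is built without being sent.

```python
import json
import os
import urllib.request

WEB_SEARCH_URL = "https://ollama.com/api/web_search"


def build_web_search_request(query, max_results=5, api_key=None):
    """Build (but do not send) a POST to the hosted web search endpoint."""
    if not 1 <= max_results <= 10:  # documented max is 10; min of 1 is assumed
        raise ValueError("max_results must be between 1 and 10")
    key = api_key or os.environ.get("OLLAMA_API_KEY", "")
    if not key:
        raise ValueError("set OLLAMA_API_KEY or pass api_key")
    body = {"query": query, "max_results": max_results}
    return urllib.request.Request(
        WEB_SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {key}",
        },
        method="POST",
    )


def format_results(payload):
    """Flatten `results` (title, url, content) into text for a tool message."""
    blocks = []
    for item in payload.get("results", []):
        blocks.append(f"{item['title']} <{item['url']}>\n{item['content']}")
    return "\n\n".join(blocks)


search_req = build_web_search_request("what is ollama?", api_key="example-key")
print(search_req.full_url)
```

Because formatted results can run to thousands of tokens, the context-size advice above still applies when feeding them back to a model.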
## Local-only mode Disable cloud entirely — set `disable_ollama_cloud` in `~/.ollama/server.json`: ```json { "disable_ollama_cloud": true } ``` …or set `OLLAMA_NO_CLOUD=1`, then restart. Logs then show `Ollama cloud disabled: true`; cloud models and web search become unavailable. ## Privacy and deprecations Ollama processes cloud prompts/responses to serve the request but states it does "not store or log that content and never train on it"; only basic account info and limited usage metadata are collected. Local models never send prompt data. Older cloud models are occasionally retired as better ones ship; impacted users are notified in advance by email and on the website, and **retirement does not affect local models.** A schedule with replacements is in the cloud docs — e.g. `kimi-k2-thinking`, `kimi-k2:1t`, `minimax-m2`, `glm-4.6`, `qwen3-next:80b`, `cogito-2.1:671b` were listed for a June 16, 2026 retirement. ## Related [[summaries/model-library-and-integrations-catalog]] · [[syntheses/api-surfaces-compared]] · [[concepts/rest-api]] · [[concepts/thinking-and-web-search]] --- title: "Activity Log" type: log --- # Activity Log Append-only record of all wiki changes. 
## Format Each entry follows this format: ``` ### YYYY-MM-DD HH:MM — [Action Type] - **Source/Trigger**: what initiated the action - **Pages created**: list of new pages - **Pages updated**: list of updated pages - **Notes**: any contradictions flagged, decisions made ``` --- ### 2026-04-08 00:00 — Setup - **Source/Trigger**: Repository initialized - **Pages created**: index.md, log.md, dashboard.md, analytics.md, flashcards.md - **Pages updated**: none - **Notes**: Empty knowledge base ready for first source ingestion --- ### 2026-06-23 — Initial curation (factory build) - **Source/Trigger**: `new_wiki.py init ollama` — 121 sources gathered (docs.ollama.com llms.txt, README, 14 releases v0.23–v0.30, 40 solved GitHub issues) - **Pages created**: 17 — 13 concepts (what-is-ollama, installation, cli-reference, modelfile, rest-api, openai-and-anthropic-compat, tool-calling, structured-outputs, vision-and-multimodal, embeddings, thinking-and-web-search, gpu-and-hardware, configuration-and-serving), 1 entity (ollama-cloud), 1 summary (model-library-and-integrations-catalog), 2 syntheses (api-surfaces-compared, troubleshooting-playbook) - **Pages updated**: index.md (master catalog + stats), log.md - **Notes**: Curated to the medium rung per RECIPE. Folded context-length into configuration-and-serving. Troubleshooting playbook built from the 40 solved issues. Noted a source discrepancy on default context window (FAQ says 4096; context-length doc gives VRAM-tiered 4k/32k/256k) — both presented as written, attributed to their sources. 
--- title: "Model Library and Integrations Catalog" type: summary tags: [models, model-tags, integrations, ecosystem, catalog, coding-agents] updated: 2026-06-23 confidence: high sources: [raw/github_doc-readme-md.md, raw/llms_txt-llms-txt-index.md, raw/llms_txt_doc-list-models.md, raw/llms_txt_doc-overview.md, raw/llms_txt_doc-quickstart.md, raw/llms_txt_doc-claude-code.md, raw/llms_txt_doc-codex-cli.md, raw/llms_txt_doc-zed.md, raw/llms_txt_doc-cloud.md, raw/llms_txt_doc-openai-compatibility.md, raw/llms_txt_doc-anthropic-compatibility.md, raw/llms_txt_doc-web-search.md] --- # Model Library and Integrations Catalog Maps (a) the exact documented model tags and (b) the integration ecosystem. Points to the right tag and integration page; does not reproduce every setup. ## Running and finding models Pull/run by `:`; default registry is the library at `ollama.com/library` (cloud filter: `https://ollama.com/search?c=cloud`). ``` ollama run gemma4 # run + chat ollama pull llama3.2 # pull only ollama run gpt-oss:120b-cloud # cloud model (offloaded) ``` `GET /api/tags` (CLI `ollama list`) reports `name`, `size`, `digest`, `details` — incl. `parameter_size` (`7B`, `13B`), `quantization_level` (`Q4_K_M`), `family`/`format` (`gguf`). See [[concepts/cli-reference]], [[syntheses/api-surfaces-compared]]. ## Model tags referenced in the docs (verbatim) Exact tags from the source docs. Suffixes encode size/quantization (`:20b`, `:120b`, `:8b`, `-cloud`, `:cloud`). Documented working set only; the library is far larger. 
| Tag (verbatim) | Where documented | | --- | --- | | `gemma4` | Default chat example (README, quickstart, API) | | `gpt-oss:20b` | OpenAI-compat; "Strong general-purpose model" | | `gpt-oss:120b` | Cloud API (no suffix when hitting ollama.com) | | `gpt-oss:120b-cloud` | Cloud model run locally; Codex `--oss -m` | | `qwen3:4b` | Web-search agent example (Qwen 3, 4B params) | | `qwen3:8b` | `/v1/responses` example | | `qwen3-vl:8b` | OpenAI-compat vision example | | `qwen3-coder` | Claude Code default; "30B, ≥24GB VRAM" | | `qwen3.5`, `qwen3.5:cloud` | Claude Code recommended | | `glm-4.7`, `glm-4.7-flash`, `glm-4.7:cloud` | Claude Code / coding | | `glm-5:cloud` | Claude Code recommended (cloud) | | `kimi-k2.5:cloud` | `ollama launch claude --model kimi-k2.5:cloud` | | `minimax-m2.1:cloud`, `minimax-m2.7:cloud` | Claude Code recommended (cloud) | | `llama3.2` | OpenAI-compat pull; FAQ keep-alive/preload | | `mistral` | FAQ preload example | Cloud tags (`-cloud`/`:cloud`) follow Ollama Cloud's deprecation schedule — see [[entities/ollama-cloud]]. Quantization/context tradeoffs in [[concepts/configuration-and-serving]] and [[concepts/modelfile]]. ## Integration ecosystem map Ollama connects via the interactive menu (`ollama`), launchers (`ollama launch `), the native REST API, the OpenAI-/Anthropic-compatible APIs, the Python/JavaScript libraries, and MCP servers. Official integration docs live at `docs.ollama.com/integrations/*`. * **Coding agents** (`ollama launch ` where noted): Claude Code (`ollama launch claude`, Anthropic-compat — setup in [[syntheses/api-surfaces-compared]]), Codex CLI (`ollama launch codex` or `codex --oss [-m ]`) + Codex App, Copilot CLI, Cline CLI, OpenCode (`ollama launch opencode`), Droid, Goose (also a web-search MCP target), Oh My Pi, Pi, Pool. * **Assistants:** OpenClaw (`ollama launch openclaw`, "100+ skills"), Hermes Agent, Hermes Desktop, NemoClaw. 
* **IDEs & editors:** VS Code, JetBrains, Xcode, Cline, Roo Code; **Zed** — provider Ollama, Host URL `http://localhost:11434` (or API URL `https://ollama.com` for cloud). * **Chat/RAG, automation, notebooks:** Onyx, n8n, marimo. * **Connection methods:** native REST / `/v1` compat at `http://localhost:11434` (see [[concepts/rest-api]], [[concepts/openai-and-anthropic-compat]]); libraries `pip install ollama` / `npm i ollama`; MCP servers (web search via the Python MCP server, configs for Cline, Codex, Goose — see [[entities/ollama-cloud]]). ## Community integrations (from the README, not exhaustive) * **Chat UIs:** Open WebUI, LibreChat, Lobe Chat, NextChat, AnythingLLM, Cherry Studio, Enchanted, Msty, Chatbox, Alpaca, SwiftChat. * **Code editors & dev:** Continue, Void, twinny, gptel/Ellama (Emacs), AI Toolkit for VS Code, Open Interpreter, QodeAssist (Qt Creator). * **Libraries & SDKs:** LiteLLM, LangChain / LangChain.js / LangChain4j / LangChainGo / LangChainRust / LangChainDart, LlamaIndex, Haystack, Semantic Kernel, Spring AI, OllamaSharp (.NET), Ollama4j (Java), ollama-swift, Firebase Genkit, Testcontainers, Portkey. * **Frameworks & agents:** AutoGPT, crewAI, Strands Agents (AWS), Cheshire Cat, any-agent (Mozilla). * **RAG & KBs:** RAGFlow, R2R, MaxKB, Minima, Casibase, Archyve. * **Terminal/CLI:** aichat, oterm, gollama, tlm, ParLlama, llm-ollama. * **Database & embeddings:** pgai (Postgres), MindsDB, chromem-go, Kangaroo. * **Observability:** Opik, OpenLIT, Lunary, Langfuse, HoneyHive, MLflow Tracing. * **Infra/deploy & packaging:** Google Cloud, Fly.io, Koyeb, Harbor; Homebrew, Pacman, Nix, Helm Chart, Gentoo, Flox. Official Docker image `ollama/ollama` on Docker Hub. 
## Related [[entities/ollama-cloud]] · [[syntheses/api-surfaces-compared]] · [[syntheses/troubleshooting-playbook]] · [[concepts/what-is-ollama]] · [[concepts/installation]] · [[concepts/cli-reference]] · [[concepts/modelfile]] · [[concepts/configuration-and-serving]] · [[concepts/vision-and-multimodal]] · [[concepts/embeddings]] --- title: "API Surfaces Compared: Native REST vs OpenAI-compatible vs Anthropic-compatible" type: synthesis tags: [rest-api, openai-compat, anthropic-compat, endpoints, compatibility] updated: 2026-06-23 confidence: medium sources: [raw/llms_txt_doc-generate-a-chat-message.md, raw/llms_txt_doc-generate-a-response.md, raw/llms_txt_doc-openai-compatibility.md, raw/llms_txt_doc-anthropic-compatibility.md, raw/llms_txt_doc-list-models.md] --- # API Surfaces Compared Ollama exposes **three** HTTP API surfaces on one server (`http://localhost:11434`, no local auth): the **native REST API** (`/api/*`, full feature set), the **OpenAI-compatible** shim (`/v1/*`), and the **Anthropic-compatible** shim (`/v1/messages`, notably [[concepts/openai-and-anthropic-compat|Claude Code]]). Choose native for new code; choose a compat surface for existing OpenAI/Anthropic tooling. ## Endpoints at a glance | Surface | Chat endpoint | Other endpoints | Auth | | --- | --- | --- | --- | | Native REST | `POST /api/chat` | `POST /api/generate`, `GET /api/tags`, `/api/ps`, `/api/pull`, `/api/push`, `/api/embed`, `/api/show`, `/api/create`, `/api/copy`, `/api/delete`, `GET /api/version` | none (local); Bearer key for ollama.com | | OpenAI-compat | `POST /v1/chat/completions` | `/v1/completions`, `/v1/responses`, `/v1/models`, `/v1/models/{model}`, `/v1/embeddings`, `/v1/images/generations` (experimental) | `api_key` "required but ignored" | | Anthropic-compat | `POST /v1/messages` | — | `x-api-key` / `ANTHROPIC_AUTH_TOKEN` accepted but not validated | See [[concepts/rest-api]] (native) and [[concepts/openai-and-anthropic-compat]] (compat). ## 1. 
Native Ollama REST API Two text-generation endpoints plus model-management endpoints (CLI equivalents in [[concepts/cli-reference]]). * **`POST /api/generate`** — prompt → response. Request: `model` (required), `prompt`, `suffix` (fill-in-the-middle), `images` (base64), `system`, `format`, `stream` (default `true`), `think`, `raw`, `keep_alive`, `options`, `logprobs`/`top_logprobs`. Response: `response`, optional `thinking`, `done`, `done_reason`, timing fields (`total_duration`, `load_duration`, `prompt_eval_count`, `eval_count`). * **`POST /api/chat`** — multi-turn `messages[]` (roles `system`/`user`/`assistant`/`tool`) → assistant message. Adds `tools` and per-message `images`/`tool_calls`. Response `message` has `content`, optional `thinking`, `tool_calls`, `images`. Native-only (or partial on compat) capabilities: * **`think`** — boolean **or** `"high" | "medium" | "low" | "max"` (see [[concepts/thinking-and-web-search]]). * **`format`** — `"json"` or a full JSON Schema for [[concepts/structured-outputs|structured outputs]]. * **`options`** — `seed`, `temperature`, `top_k`, `top_p`, `min_p`, `stop`, `num_ctx`, `num_predict`. The only way to set context size per-request (OpenAI surface can't). * **`keep_alive`** — unload timing (`5m`, `0`, `-1`); see [[concepts/configuration-and-serving]]. * **`logprobs` / `top_logprobs`** — token log-probabilities. Streaming is `application/x-ndjson`, default on. `GET /api/tags` also carries `remote_model`/`remote_host` for cloud/remote models — see [[entities/ollama-cloud]]. ## 2. OpenAI-compatible API (`/v1`) Point any OpenAI SDK at `base_url='http://localhost:11434/v1/'`, `api_key='ollama'` (ignored). Pull first; for hardcoded names alias with `ollama cp llama3.2 gpt-3.5-turbo`. 
Full field lists in [[concepts/openai-and-anthropic-compat]]; comparison highlights: * `/v1/chat/completions` — vision is base64-only (**not** image URL); supports `reasoning_effort`/`reasoning.effort` (`"high"|"medium"|"low"|"max"|"none"`); **not** supported: `logprobs`, `tool_choice`, `logit_bias`, `user`, `n`. * `/v1/responses` (added v0.13.3) — **non-stateful only** (no `previous_response_id`/`conversation`); supports `instructions`, `max_output_tokens`. * `/v1/completions` (legacy, `prompt` string-only, `suffix`); `/v1/embeddings` (`encoding_format`, `dimensions` — see [[concepts/embeddings]]); `/v1/models`(`/{model}`) (`created` = last-modified, `owned_by` = `"library"`); `/v1/images/generations` (experimental, `response_format: b64_json` only). **Key limitation vs native:** no per-request context size — bake `num_ctx` into a Modelfile (`PARAMETER num_ctx `, `ollama create mymodel`). See [[concepts/modelfile]]. ## 3. Anthropic-compatible API (`/v1/messages`) Anthropic Messages API so tools like Claude Code can use open models (`ANTHROPIC_AUTH_TOKEN=ollama`, `ANTHROPIC_BASE_URL=http://localhost:11434` — full setup in [[concepts/openai-and-anthropic-compat]]). `POST /v1/messages` requires `model` + `max_tokens` + `messages`; supports streaming, system prompts, multi-turn, vision (base64), `tool_use`/`tool_result` blocks, `thinking` blocks; honors `temperature`, `top_p`, `top_k`, `stop_sequences`. Streaming emits the full Anthropic event set (`message_start`, `content_block_delta` with `text_delta`/`input_json_delta`/`thinking_delta`, `message_stop`, etc.). Aliasing: `ollama cp qwen3-coder claude-3-5-sonnet`. **vs the real Anthropic API:** key **not validated**, `anthropic-version` **not used**, token counts approximate. **Not supported:** `/v1/messages/count_tokens`, `tool_choice`, `metadata`, prompt caching (`cache_control`), Batches API, citations, PDF/`document` blocks, server-sent `error` events. 
**Partial:** image base64-only; extended thinking (`budget_tokens` **not enforced**). For Claude Code setup (`ollama launch claude`, models, context ≥64k), see [[summaries/model-library-and-integrations-catalog]]. ## Related [[concepts/rest-api]] · [[concepts/openai-and-anthropic-compat]] · [[concepts/tool-calling]] · [[concepts/structured-outputs]] · [[concepts/thinking-and-web-search]] · [[entities/ollama-cloud]] · [[summaries/model-library-and-integrations-catalog]] --- title: "Troubleshooting Playbook" type: synthesis tags: [troubleshooting, gpu, rocm, out-of-memory, downloads, configuration] updated: 2026-06-23 confidence: medium sources: [raw/llms_txt_doc-troubleshooting.md, raw/llms_txt_doc-faq.md, raw/github_issue-ollama-serve-fails-to-detect-nvidia-gpus-after-updating-to-t.md, raw/github_issue-ollama-not-using-nvidia-gpus-with-gpt-oss-models.md, raw/github_issue-amd-7900xtx-fails-with-could-not-initialize-tensile-host-no-.md, raw/github_issue-amd-gpu-rocm-support.md, raw/github_issue-integrated-amd-gpu-support.md, raw/github_issue-out-of-memory-errors-when-running-gemma3.md, raw/github_issue-ollama-500-error-on-larger-models.md, raw/github_issue-qwen3-5-35b-error-500-internal-server-error.md, raw/github_issue-preview-0-5-13-rc2-uses-5-times-more-ram.md, raw/github_issue-pull-model-manifest-500.md, raw/github_issue-downloading-a-model-with-ollama-pull-or-ollama-run-stalls.md, raw/github_issue-issue-with-ollama-model-download-progress-reverting-during-d.md, raw/github_issue-ollama-0-6-6-memory-leak-with-different-models.md, raw/github_issue-ollama-stops-serving-requests-after-10-15-minutes.md, raw/github_issue-ollama-stuck-after-few-runs.md, raw/github_issue-ollama-stops-generating-output-and-fails-to-run-models-after.md, raw/github_issue-llama3-instruct-models-not-stopping-at-stop-token.md, raw/github_issue-allow-listening-on-all-local-interfaces.md, raw/github_issue-support-gpu-runners-on-cpus-without-avx.md, 
  raw/github_issue-ollama-ai-certificate-has-expired-not-possible-to-download-m.md]
---

# Troubleshooting Playbook

Symptom → cause → fix, from the official troubleshooting/FAQ docs plus 19 solved GitHub issues. **First step: read the logs.**

## Logs and debug

* Logs — macOS `~/.ollama/logs/server.log`; Linux `journalctl -u ollama --no-pager --follow --pager-end`; Docker `docker logs `; Windows `explorer %LOCALAPPDATA%\Ollama` (`server.log`); manual `ollama serve` prints to terminal.
* Debug — Windows: quit tray, `$env:OLLAMA_DEBUG="1"; & "ollama app.exe"`. `OLLAMA_DEBUG=1`/`2`; NVIDIA `CUDA_ERROR_LEVEL=50`; AMD `AMD_LOG_LEVEL=3`.

## GPU not detected / falls back to CPU

* **NVIDIA gone after update; "low vram mode" `total vram=0 B`** (#12618): stale/misformatted `CUDA_VISIBLE_DEVICES` (e.g. `CUDA_VISIBLE_DEVICES:0,1,2`). Fix: **unset `CUDA_VISIBLE_DEVICES`**; scope with UUIDs from `nvidia-smi -L`.
* **NVIDIA discovery failures** (codes "3"/"46"/"100"/"999"): latest driver; in containers verify `docker run --gpus all ubuntu nvidia-smi`; load UVM `sudo nvidia-modprobe -u` or reload `sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm`, reboot.
* **Linux after suspend/resume:** reload `sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm`.
* **Docker GPU→CPU drift over time:** add `"exec-opts": ["native.cgroupdriver=cgroupfs"]` to `/etc/docker/daemon.json`.

## Model loads on CPU though GPU detected

* **`gpt-oss:20b`/`gpt-oss:120b` all-CPU while e.g. `qwen3:30b` use GPU** (#11676, `offloaded 0/NN layers to GPU`): raised `OLLAMA_NUM_PARALLEL` inflated VRAM. Fix: **leave `OLLAMA_NUM_PARALLEL` at default**; confirm with `ollama ps` `PROCESSOR` column.

## AMD ROCm

* **7900XTX `Could not initialize Tensile host: No devices found`** (#6685, ROCm 6.2, `ollama/ollama:rocm`): container device permissions. Fix: pass `--device /dev/kfd --device /dev/dri`, add numeric group IDs from `ls -lnd /dev/kfd /dev/dri /dev/dri/*` via `--group-add`; on SELinux set `container_use_devices` on.
* **AMD driver too old** (`failed to finish discovery before timeout`, `bootstrap discovery took duration=30s`): Ollama bundles **ROCm 7** libs; ROCm 6.x hangs → CPU. Fix: `amdgpu-install`, reboot, restart.
* **Self-built binary on CPU** (#738, `Not compiled with GPU offload support`): pass `-tags rocm` to **both** `go generate` and `go build`, set `ROCM_PATH` (e.g. `/opt/rocm`). Debug `AMD_LOG_LEVEL=3` + `OLLAMA_DEBUG=1`.
* **Integrated AMD GPU** (#2637): ROCm iGPU support limited ("detects Radeon then says no GPU" fixed in latest binary). Best-effort.
* **Multiple AMD GPUs — gibberish on Linux:** see AMD's multi-GPU known-issues guide.

## OOM / 500 "unable to load model"

* **>~7B/8B `500 Internal Server Error` on `/api/chat`** (#5892, `check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape`): architecture unsupported by old version. **Fix: upgrade Ollama.**
* **`500 ... unable to load model` for `qwen3.5:35b`** (#14419): needs newer Ollama (0.17.0 lacked qwen3.5). Fix: `curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.17.1-rc1 sh`.
* **Crashes/freezes at higher context** (`gemma3:12b`, #9791): spilled to CPU (`ollama ps` `7%/93% CPU/GPU`), 8k context crashed the box. Mitigate: lower `OLLAMA_CONTEXT_LENGTH`/`num_ctx`; `OLLAMA_FLASH_ATTENTION=1`; `OLLAMA_KV_CACHE_TYPE=q8_0` (≈½ f16) or `q4_0` (≈¼). See [[concepts/gpu-and-hardware]], [[concepts/configuration-and-serving]].
* **RAM/VRAM far larger than model file** ("5× more RAM", #9457): not a leak — weights + **context buffer + model graph** (2.5G model can show ~5.2G in `ollama ps`). Lower `num_ctx` or K/V quantize.

## Downloads: stall / manifest 500 / progress reverts

* **`pull model manifest: 500 {"errors":[{"code":"INTERNAL_ERROR"...}]}`** (#8873): registry **overloaded**. Fix: wait, retry.
* **`error max retries exceeded: EOF` / `r2.cloudflarestorage.com ... server misbehaving`, stalls at fixed %** (#8632): **DNS** failure on the Cloudflare R2 host (`127.0.0.53:53: server misbehaving`).
  Fix: working resolver (e.g. 1.1.1.1), verify `nslookup`.
* **Progress reverts (drops after 10–60%); `part N stalled; retrying`** (#8484): no data >5s (stall threshold) on a flaky link. Fix: `Ctrl+C` within ~5s of the drop, re-run to resume.

## Server stops responding

* **VRAM/RAM held after model done; `ollama ps` empty but runners linger** (#10433): **orphaned runner processes** (server crashed). Fix: check logs, restart service. Control unload with `keep_alive`/`OLLAMA_KEEP_ALIVE`, `ollama stop `.
* **Heavy parallel load: fine 10–15 min then `failed to generate embedding` / `Failed to acquire semaphore: context canceled` / `no slots available after 10 retries`** (#4545; `OLLAMA_NUM_PARALLEL=10/20`, `OLLAMA_MAX_QUEUE=1024`): saturated slots. Levers: `OLLAMA_NUM_PARALLEL` (RAM scales by `NUM_PARALLEL` × `CONTEXT_LENGTH`), `OLLAMA_MAX_QUEUE` (default 512; over-queue → 503), `OLLAMA_MAX_LOADED_MODELS`.
* **Hangs/"stuck" after a few runs** (#1863, #2225): version-era bugs (~0.1.16–0.1.22). Fix: **upgrade** (#2225 fixed on 0.1.22); `systemctl restart ollama` restores temporarily.

## Runaway generation (won't stop at stop token)

* **`llama3`/`llama3:70b` keeps emitting `<|eot_id|><|start_header_id|>assistant...`** (#3759): (a) over the **OpenAI-compatible endpoint** Modelfile `PARAMETER stop` is ignored — send stop token(s) in the request `stop` field; (b) raw-GGUF import had a wrong `TEMPLATE` — use `PARAMETER stop "<|start_header_id|>"`, `"<|end_header_id|>"`, `"<|eot_id|>"`. See [[concepts/modelfile]], [[syntheses/api-surfaces-compared]].

## CPU without AVX (SIGILL)

* **`CPU does not have vector extensions` then `SIGILL: illegal instruction`** (#2187): runners built for CPU features the host lacks. Newer Ollama **falls back to CPU** (`CPU does not have AVX or AVX2, disabling GPU support`). Force via `OLLAMA_LLM_LIBRARY` (`cpu_avx2` > `cpu_avx` > `cpu`), e.g. `OLLAMA_LLM_LIBRARY="cpu" ollama serve`. Check `cat /proc/cpuinfo | grep flags | head -1`.
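
The preference order in the AVX entry (`cpu_avx2` > `cpu_avx` > `cpu`) can be derived from the CPU flags instead of read by eye. This is a minimal sketch, assuming an x86 Linux host whose `/proc/cpuinfo` contains a `flags` line. `pick_llm_library` is a hypothetical helper name, not part of Ollama, and the runner names are taken from this playbook, so check them against your Ollama version.

```python
def pick_llm_library(cpuinfo_text):
    """Return the most capable CPU runner name that the CPU flags allow.

    Assumes the x86 /proc/cpuinfo layout, where one line looks like
    "flags : fpu vme ... avx avx2 ...". Runner names come from the
    playbook entry above and are not verified against any Ollama release.
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags = set(value.split())
            break  # every core repeats the same flags line; the first is enough

    if "avx2" in flags:
        return "cpu_avx2"
    if "avx" in flags:
        return "cpu_avx"
    # No AVX at all (or no flags line found): fall back to the plain runner.
    return "cpu"
```

To use it, pass the contents of `/proc/cpuinfo` and export the result as `OLLAMA_LLM_LIBRARY` before running `ollama serve`. Non-x86 hosts have no `flags` line, so they get `cpu`.
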
## Network binding (OLLAMA_HOST)

* **Only listens on loopback; unreachable from containers/proxies** (#703): binds `127.0.0.1:11434` by default. **Fix: `OLLAMA_HOST`**, e.g. `OLLAMA_HOST=0.0.0.0:8080` (no separate `OLLAMA_PORT`); on systemd add `Environment="OLLAMA_HOST=0.0.0.0:8080"`. For proxies/tunnels and `OLLAMA_ORIGINS` (CORS), see [[concepts/configuration-and-serving]].

## TLS / certificate errors on pull

* **`tls: failed to verify certificate: x509: certificate has expired`** (#3336): the **registry cert genuinely expired** (service incident). Fix: wait for renewal, retry; if only you, check local clock and CA trust.

## Related

[[summaries/model-library-and-integrations-catalog]] · [[entities/ollama-cloud]] · [[syntheses/api-surfaces-compared]] · [[concepts/gpu-and-hardware]] · [[concepts/configuration-and-serving]] · [[concepts/installation]] · [[concepts/modelfile]] · [[concepts/cli-reference]]