--- title: "Ollama Local Provider" type: entity tags: [provider, local-models, model-switching, well-established, beginner] created: 2026-06-10 updated: 2026-06-10 sources: ["raw/docs-integrations-providers.md", "raw/docs-guides-local-llm-on-mac.md", "raw/docs-user-guide-features-fallback-providers.md", "raw/02-install-and-setup.md", "raw/release-v0.11.0.md", "raw/release-v0.15.0.md"] confidence: high hermes_version: "v0.15.0" --- ## Overview **Ollama** runs open-weight models locally with one command and is the lowest-friction local provider for Hermes: full privacy, zero API cost, offline operation. Hermes talks to it through the **custom endpoint** path — Ollama exposes an OpenAI-compatible API at `http://localhost:11434/v1` with tool-calling support, and any server implementing `/v1/chat/completions` works. The #1 integration pitfall is Ollama's **default context length** (as low as 4,096 tokens), which is too small for an agent whose system prompt + tool schemas alone can fill the window. [[entities/version-v0.11.0]] shipped a batch of Ollama improvements: Cloud provider support, GLM continuation, `think=false` control, surrogate sanitization, and a `/v1` hint. ## Characteristics - **Setup:** `ollama pull ` + `ollama serve` (port **11434**); no API key required for local use - **Hermes side:** custom endpoint — `base_url: http://localhost:11434/v1`, `provider: custom`; configured via `hermes model` → "Custom endpoint" or directly in `config.yaml` - **Context defaults (per VRAM):** <24 GB → **4,096 tokens**; 24-48 GB → 32,768; 48+ GB → 256,000. For agent use with tools you need **at least 16k-32k**. Context length **cannot** be set through the OpenAI-compatible API — it must be set server-side (`OLLAMA_CONTEXT_LENGTH`) or baked into a Modelfile (`PARAMETER num_ctx`). - **Timeouts:** Hermes auto-detects local endpoints (localhost, LAN IPs) and relaxes streaming timeouts — stream read raised 120s → 1800s, stale-stream detection disabled. Manual override: `HERMES_STREAM_READ_TIMEOUT=1800` in `.env`. - **Credential pools:** custom endpoints get their own pools keyed by the auto-generated endpoint name (stored under a `custom:` prefix in `auth.json`) - **Ollama Cloud:** `OLLAMA_API_KEY` is a recognized provider key, and an `ollama-cloud` plugin exists among the 28 provider plugins ([[entities/version-v0.15.0]]). The exact setup flow for the hosted Ollama Cloud path is not documented in current sources (confidence: medium for that flow; the env var and plugin existence are confirmed). - **GPU offloading:** automatic — no configuration for most setups ## How to Use ```bash # Install and run a model ollama pull qwen2.5-coder:32b ollama serve # Starts on port 11434 # Fix the context window FIRST (pick one): OLLAMA_CONTEXT_LENGTH=32768 ollama serve # server-wide # or bake it into a model: echo -e "FROM qwen2.5-coder:32b\nPARAMETER num_ctx 32768" > Modelfile ollama create qwen2.5-coder-32k -f Modelfile # Verify the CONTEXT column shows your value ollama ps ``` Then configure Hermes: ```bash hermes model # Select "Custom endpoint (self-hosted / VLLM / etc.)" # Enter URL: http://localhost:11434/v1 # Skip API key (Ollama doesn't need one) # Enter model name (e.g. qwen2.5-coder:32b) ``` Or in `~/.hermes/config.yaml`: ```yaml model: default: qwen2.5-coder:32b provider: custom base_url: http://localhost:11434/v1 context_length: 32768 ``` Mid-session switching: `/model custom:qwen2.5-coder:32b`, or bare `/model custom` to auto-detect when the server has exactly one model loaded. With named custom providers, use the triple syntax: `/model custom:local:qwen-2.5`. As an airplane-mode fallback for a cloud primary: ```yaml fallback_model: provider: custom model: qwen2.5-coder:32b base_url: http://localhost:11434/v1 ``` ## Related Entities - [[entities/provider-openrouter]], [[entities/provider-nous-portal]] — cloud primaries that Ollama typically backs up - [[entities/version-v0.11.0]] — Ollama improvements batch - [[concepts/local-models-airplane-mode]] — the full local-stack concept - [[concepts/model-switching]] — custom providers, pools, `/model` syntax - [[syntheses/local-stack-playbook]] — end-to-end local recipe (see also the llama.cpp/omlx Mac guide it draws on) - [[entities/backend-local]] — pair a local model with the local backend for a fully offline agent