---
title: "Ollama Local Provider"
type: entity
tags: [provider, local-models, model-switching, well-established, beginner]
created: 2026-06-10
updated: 2026-06-10
sources: ["raw/docs-integrations-providers.md", "raw/docs-guides-local-llm-on-mac.md", "raw/docs-user-guide-features-fallback-providers.md", "raw/02-install-and-setup.md", "raw/release-v0.11.0.md", "raw/release-v0.15.0.md"]
confidence: high
hermes_version: "v0.15.0"
---

## Overview

**Ollama** runs open-weight models locally with one command and is the lowest-friction local provider for Hermes: full privacy, zero API cost, offline operation. Hermes talks to it through the **custom endpoint** path — Ollama exposes an OpenAI-compatible API at `http://localhost:11434/v1` with tool-calling support, and any server implementing `/v1/chat/completions` works. The #1 integration pitfall is Ollama's **default context length** (as low as 4,096 tokens), which is too small for an agent whose system prompt + tool schemas alone can fill the window. [[entities/version-v0.11.0]] shipped a batch of Ollama improvements: Cloud provider support, GLM continuation, `think=false` control, surrogate sanitization, and a `/v1` hint.

## Characteristics

- **Setup:** `ollama pull <model>` + `ollama serve` (port **11434**); no API key required for local use
- **Hermes side:** custom endpoint — `base_url: http://localhost:11434/v1`, `provider: custom`; configured via `hermes model` → "Custom endpoint" or directly in `config.yaml`
- **Context defaults (per VRAM):** <24 GB → **4,096 tokens**; 24-48 GB → 32,768; 48+ GB → 256,000. For agent use with tools you need **at least 16k-32k**. Context length **cannot** be set through the OpenAI-compatible API — it must be set server-side (`OLLAMA_CONTEXT_LENGTH`) or baked into a Modelfile (`PARAMETER num_ctx`).
- **Timeouts:** Hermes auto-detects local endpoints (localhost, LAN IPs) and relaxes streaming timeouts — stream read raised 120s → 1800s, stale-stream detection disabled. Manual override: `HERMES_STREAM_READ_TIMEOUT=1800` in `.env`.
- **Credential pools:** custom endpoints get their own pools keyed by the auto-generated endpoint name (stored under a `custom:` prefix in `auth.json`)
- **Ollama Cloud:** `OLLAMA_API_KEY` is a recognized provider key, and an `ollama-cloud` plugin exists among the 28 provider plugins ([[entities/version-v0.15.0]]). The exact setup flow for the hosted Ollama Cloud path is not documented in current sources (confidence: medium for that flow; the env var and plugin existence are confirmed).
- **GPU offloading:** automatic — no configuration for most setups

## How to Use

```bash
# Install and run a model
ollama pull qwen2.5-coder:32b
ollama serve   # Starts on port 11434

# Fix the context window FIRST (pick one):
OLLAMA_CONTEXT_LENGTH=32768 ollama serve              # server-wide
# or bake it into a model:
echo -e "FROM qwen2.5-coder:32b\nPARAMETER num_ctx 32768" > Modelfile
ollama create qwen2.5-coder-32k -f Modelfile

# Verify the CONTEXT column shows your value
ollama ps
```

Then configure Hermes:

```bash
hermes model
# Select "Custom endpoint (self-hosted / VLLM / etc.)"
# Enter URL: http://localhost:11434/v1
# Skip API key (Ollama doesn't need one)
# Enter model name (e.g. qwen2.5-coder:32b)
```

Or in `~/.hermes/config.yaml`:

```yaml
model:
  default: qwen2.5-coder:32b
  provider: custom
  base_url: http://localhost:11434/v1
  context_length: 32768
```

Mid-session switching: `/model custom:qwen2.5-coder:32b`, or bare `/model custom` to auto-detect when the server has exactly one model loaded. With named custom providers, use the triple syntax: `/model custom:local:qwen-2.5`.

As an airplane-mode fallback for a cloud primary:

```yaml
fallback_model:
  provider: custom
  model: qwen2.5-coder:32b
  base_url: http://localhost:11434/v1
```

## Related Entities

- [[entities/provider-openrouter]], [[entities/provider-nous-portal]] — cloud primaries that Ollama typically backs up
- [[entities/version-v0.11.0]] — Ollama improvements batch
- [[concepts/local-models-airplane-mode]] — the full local-stack concept
- [[concepts/model-switching]] — custom providers, pools, `/model` syntax
- [[syntheses/local-stack-playbook]] — end-to-end local recipe (see also the llama.cpp/omlx Mac guide it draws on)
- [[entities/backend-local]] — pair a local model with the local backend for a fully offline agent
