Ollama Local Provider — Hermes

wikis / Hermes / wiki / entities / provider-ollama-local.md view as markdown

type: entityconfidence: highupdated: 2026-06-10hermes_version: v0.15.0sources: 6

Overview

Ollama runs open-weight models locally with one command and is the lowest-friction local provider for Hermes: full privacy, zero API cost, offline operation. Hermes talks to it through the custom endpoint path — Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 with tool-calling support, and any server implementing /v1/chat/completions works. The #1 integration pitfall is Ollama's default context length (as low as 4,096 tokens), which is too small for an agent whose system prompt + tool schemas alone can fill the window. version v0.11.0 shipped a batch of Ollama improvements: Cloud provider support, GLM continuation, think=false control, surrogate sanitization, and a /v1 hint.

Characteristics

Setup: ollama pull <model> + ollama serve (port 11434); no API key required for local use
Hermes side: custom endpoint — base_url: http://localhost:11434/v1, provider: custom; configured via hermes model → "Custom endpoint" or directly in config.yaml
Context defaults (per VRAM): <24 GB → 4,096 tokens; 24-48 GB → 32,768; 48+ GB → 256,000. For agent use with tools you need at least 16k-32k. Context length cannot be set through the OpenAI-compatible API — it must be set server-side (OLLAMA_CONTEXT_LENGTH) or baked into a Modelfile (PARAMETER num_ctx).
Timeouts: Hermes auto-detects local endpoints (localhost, LAN IPs) and relaxes streaming timeouts — stream read raised 120s → 1800s, stale-stream detection disabled. Manual override: HERMES_STREAM_READ_TIMEOUT=1800 in .env.
Credential pools: custom endpoints get their own pools keyed by the auto-generated endpoint name (stored under a custom: prefix in auth.json)
Ollama Cloud: OLLAMA_API_KEY is a recognized provider key, and an ollama-cloud plugin exists among the 28 provider plugins (version v0.15.0). The exact setup flow for the hosted Ollama Cloud path is not documented in current sources (confidence: medium for that flow; the env var and plugin existence are confirmed).
GPU offloading: automatic — no configuration for most setups

How to Use

# Install and run a model
ollama pull qwen2.5-coder:32b
ollama serve   # Starts on port 11434

# Fix the context window FIRST (pick one):
OLLAMA_CONTEXT_LENGTH=32768 ollama serve              # server-wide
# or bake it into a model:
echo -e "FROM qwen2.5-coder:32b\nPARAMETER num_ctx 32768" > Modelfile
ollama create qwen2.5-coder-32k -f Modelfile

# Verify the CONTEXT column shows your value
ollama ps

Then configure Hermes:

hermes model
# Select "Custom endpoint (self-hosted / VLLM / etc.)"
# Enter URL: http://localhost:11434/v1
# Skip API key (Ollama doesn't need one)
# Enter model name (e.g. qwen2.5-coder:32b)

Or in ~/.hermes/config.yaml:

model:
  default: qwen2.5-coder:32b
  provider: custom
  base_url: http://localhost:11434/v1
  context_length: 32768

Mid-session switching: /model custom:qwen2.5-coder:32b, or bare /model custom to auto-detect when the server has exactly one model loaded. With named custom providers, use the triple syntax: /model custom:local:qwen-2.5.

As an airplane-mode fallback for a cloud primary:

fallback_model:
  provider: custom
  model: qwen2.5-coder:32b
  base_url: http://localhost:11434/v1

Related Entities

provider openrouter, provider nous portal — cloud primaries that Ollama typically backs up
version v0.11.0 — Ollama improvements batch
local models airplane mode — the full local-stack concept
model switching — custom providers, pools, /model syntax
local stack playbook — end-to-end local recipe (see also the llama.cpp/omlx Mac guide it draws on)
backend local — pair a local model with the local backend for a fully offline agent