wikis / Hermes / wiki / entities / provider-ollama-local.md view as markdown
Overview
Ollama runs open-weight models locally with one command and is the lowest-friction local provider for Hermes: full privacy, zero API cost, offline operation. Hermes talks to it through the custom endpoint path — Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 with tool-calling support, and any server implementing /v1/chat/completions works. The #1 integration pitfall is Ollama's default context length (as low as 4,096 tokens), which is too small for an agent whose system prompt + tool schemas alone can fill the window. version v0.11.0 shipped a batch of Ollama improvements: Cloud provider support, GLM continuation, think=false control, surrogate sanitization, and a /v1 hint.
Characteristics
- Setup:
ollama pull <model>+ollama serve(port 11434); no API key required for local use - Hermes side: custom endpoint —
base_url: http://localhost:11434/v1,provider: custom; configured viahermes model→ "Custom endpoint" or directly inconfig.yaml - Context defaults (per VRAM): <24 GB → 4,096 tokens; 24-48 GB → 32,768; 48+ GB → 256,000. For agent use with tools you need at least 16k-32k. Context length cannot be set through the OpenAI-compatible API — it must be set server-side (
OLLAMA_CONTEXT_LENGTH) or baked into a Modelfile (PARAMETER num_ctx). - Timeouts: Hermes auto-detects local endpoints (localhost, LAN IPs) and relaxes streaming timeouts — stream read raised 120s → 1800s, stale-stream detection disabled. Manual override:
HERMES_STREAM_READ_TIMEOUT=1800in.env. - Credential pools: custom endpoints get their own pools keyed by the auto-generated endpoint name (stored under a
custom:prefix inauth.json) - Ollama Cloud:
OLLAMA_API_KEYis a recognized provider key, and anollama-cloudplugin exists among the 28 provider plugins (version v0.15.0). The exact setup flow for the hosted Ollama Cloud path is not documented in current sources (confidence: medium for that flow; the env var and plugin existence are confirmed). - GPU offloading: automatic — no configuration for most setups
How to Use
# Install and run a model
ollama pull qwen2.5-coder:32b
ollama serve # Starts on port 11434
# Fix the context window FIRST (pick one):
OLLAMA_CONTEXT_LENGTH=32768 ollama serve # server-wide
# or bake it into a model:
echo -e "FROM qwen2.5-coder:32b\nPARAMETER num_ctx 32768" > Modelfile
ollama create qwen2.5-coder-32k -f Modelfile
# Verify the CONTEXT column shows your value
ollama ps
Then configure Hermes:
hermes model
# Select "Custom endpoint (self-hosted / VLLM / etc.)"
# Enter URL: http://localhost:11434/v1
# Skip API key (Ollama doesn't need one)
# Enter model name (e.g. qwen2.5-coder:32b)
Or in ~/.hermes/config.yaml:
model:
default: qwen2.5-coder:32b
provider: custom
base_url: http://localhost:11434/v1
context_length: 32768
Mid-session switching: /model custom:qwen2.5-coder:32b, or bare /model custom to auto-detect when the server has exactly one model loaded. With named custom providers, use the triple syntax: /model custom:local:qwen-2.5.
As an airplane-mode fallback for a cloud primary:
fallback_model:
provider: custom
model: qwen2.5-coder:32b
base_url: http://localhost:11434/v1
Related Entities
- provider openrouter, provider nous portal — cloud primaries that Ollama typically backs up
- version v0.11.0 — Ollama improvements batch
- local models airplane mode — the full local-stack concept
- model switching — custom providers, pools,
/modelsyntax - local stack playbook — end-to-end local recipe (see also the llama.cpp/omlx Mac guide it draws on)
- backend local — pair a local model with the local backend for a fully offline agent
