wikis / llama.cpp / wiki / log.md view as markdown

Activity Log

type: logupdated: 2026-06-10

Append-only log of KB changes.

2026-05-30 — KB scaffolded

Triggered by: user creating a llama.cpp knowledge base as the research backbone for a YouTube video on llama.cpp. Modeled on the llm-wiki template (Karpathy's "LLM Wiki" pattern), customized for the llama.cpp domain.

Added:

CLAUDE.md — schema tailored to llama.cpp (directory layout, page format, llama.cpp tagging taxonomy, build-tag version awareness, ingest/query/lint workflows).
README.md
wiki/index.md — empty master catalog with planned starter pages listed.
wiki/log.md — this file.
wiki/journal/template.md — research-session note template.
Empty subdirs: raw/, wiki/summaries/, wiki/concepts/, wiki/entities/, wiki/syntheses/, wiki/presentations/.

Next: drop llama.cpp source material into raw/ (official docs mirrors, video transcripts, discussion/PR dumps) and run "ingest".

2026-05-30 — seeded raw/ with official docs (13 mirrors)

Triggered by: user request to fetch the official llama.cpp docs and seed raw/. Verbatim mirrors pulled from ggml-org/llama.cpp@master (and the GGUF spec from ggml-org/ggml@master) on 2026-05-30.

Added raw sources (13):

llamacpp-readme.md — main repo README (overview, features, supported models/backends)
docs-build.md — full build guide + backend matrix (CPU/CUDA/Metal/Vulkan/ROCm/SYCL/MUSA/CANN)
docs-install.md — package-manager install methods
server-readme.md — llama-server docs + OpenAI-compatible HTTP API (largest source, ~92KB)
cli-readme.md — llama-cli (tools/cli) docs and flags
quantize-readme.md — llama-quantize + quant type table
imatrix-readme.md — importance-matrix generation for quantization
llama-bench-readme.md — llama-bench benchmarking tool
grammars-readme.md — GBNF grammar syntax + structured output
docs-function-calling.md — tool/function calling support
docs-multimodal.md — multimodal overview
mtmd-readme.md — mtmd multimodal CLI/lib
gguf-spec.md — GGUF file-format specification (from ggml repo)

Note: these are immutable mirrors — do not edit. Build tag at fetch time not pinned (master). When ingesting, record an approximate llama_build from the current release tag.

Next: run "ingest" to generate summary + concept/entity pages from these sources.

2026-05-30 — ingested all 13 official-doc sources

Triggered by: user request to ingest all 13 seeded raw sources. Done in two phases (parallel summarizers → parallel page authors) with a shared canonical slug list for consistent cross-links.

Added wiki pages (37 total):

Summaries (13) — one per raw source: llamacpp-readme, docs-build, docs-install, server-readme, cli-readme, quantize-readme, imatrix-readme, llama-bench-readme, grammars-readme, docs-function-calling, docs-multimodal, mtmd-readme, gguf-spec.
Concepts (12) — gguf-format, quantization, imatrix, sampling-parameters, kv-cache-and-context, speculative-decoding, gbnf-grammars, function-calling, embeddings, multimodal-mtmd, server-api, build-and-backends.
Entities (12) — project-llama-cpp; binaries binary-llama-cli, binary-llama-server, binary-llama-quantize, binary-imatrix, binary-llama-bench, binary-mtmd; backends backend-cpu, backend-cuda, backend-metal, backend-vulkan, backend-rocm.

Updated: index.md — fully populated master catalog (page count 0 → 37); this log.

Caveats recorded on pages (per CLAUDE.md version-awareness):

All pages set llama_build: "master (~2026-05)" — sources were master, not a pinned b####.
Doc-vs-code lags flagged inline: GGUF spec general.file_type enum stale; speculative-decoding flags reworked (legacy --draft* removed); function-calling doc self-TODO; multimodal roster "under heavy development."
Real doc inconsistency captured: server /completion repeat_penalty default 1.1 vs CLI --repeat-penalty 1.00.

Not yet created (deferred): backend pages for SYCL/MUSA/CANN/OpenCL/RPC/etc. (present in the build-and-backends matrix); all syntheses (need a community/benchmark source pass); llama-perplexity / llama-simple / conversion-script entities.

Next: add community/benchmark sources, then build the planned syntheses (quant-types-compared, backend-selection-guide, llamacpp-vs-ollama).

2026-05-30 — community sweep: ingested 15 community sources + 5 syntheses

Triggered by: user request to sweep popular community sources. Ran a 4-agent web discovery sweep (~~28 verified-live sources found), user chose breadth "primary + best secondary" (~~15) and synthesis focus = quant selection, vs Ollama/vLLM, deployment/OpenAI API, and a custom "customizable features / tuning" page. Fetched in 5 parallel agents (mirror + summary each), then 3 parallel agents authored the syntheses.

Added raw mirrors (15) under raw/community/ — markdown-converted mirrors with provenance headers (NOT pristine):

quant: community-pr1684-kquants, community-artefact2-quant-table, community-arxiv-quant-eval, community-kaitchup-gguf-guide, community-bartowski-quant-guide, community-mradermacher-imatrix, community-unsloth-dynamic-ggufs
benchmarks: community-bench-apple-silicon (GH#4167), community-bench-nvidia-cuda (GH#15013), community-smcleod-kv-quant, community-dgxspark-kv-quant
comparisons/guides: community-redhat-vllm-vs-llamacpp, community-gh15180-vllm-vs-llamacpp, community-steelphoenix-guide, community-hf-gguf-usage

Added wiki pages (20):

Summaries (15) — one per community source above (all confidence: medium, dated).
Syntheses (5) — quant-types-compared, llamacpp-vs-ollama, llamacpp-vs-vllm, server-deployment, customization-and-tuning.

Updated: index.md (page count 37 → 57; two source tiers; community caveats section); this log.

Honesty/provenance notes recorded on pages:

Reddit unfetchable by tooling — no r/LocalLLaMA threads fabricated.
KV-quant conflict (smcleod vs DGX Spark) captured and reconciled → q8_0 safe default.
Vendor bias flagged: Unsloth (Dynamic 2.0), Red Hat (pro-vLLM).
No dedicated Ollama source mirrored — llamacpp-vs-ollama rests on the wrapper relationship + community consensus (noted on the page).
Partial fetches: SteelPh0enix (long-form prose refused → structured technical extraction); Kaitchup (partial paywall — taxonomy only, no number tables); arXiv tables from HTML extraction (verify vs PDF).
Absolute perplexity scales differ across quant sources (LLaMA-1-7B vs Llama-3.1-8B) — only relative ordering transfers.

Deferred: dedicated benchmark/backend-selection-guide synthesis (data is ingested; user deprioritized); a mirrored Ollama-vs-llama.cpp benchmark source; Reddit threads (user can paste).

2026-05-30 — drafted standalone video outline

Triggered by: user wants a standalone "what is llama.cpp" video — what it is, what makes it different from other local servers, customization/flags, ways to serve, and a head-to-head test vs Ollama (memory/speed). (NOT a quantization module.)

Added (1):

presentations/standalone-llamacpp-explainer-outline.md — 14–20 min, 5 segments matching the brief + cold open/outro. Grounded in project-llama-cpp, llamacpp-vs-ollama, llamacpp-vs-vllm, customization-and-tuning, server-deployment, server-api. Includes on-screen demos with verbatim commands, a fair head-to-head methodology (controls: same GGUF/quant, ctx, -ngl, flash-attn, KV type; metrics table; what to measure with llama-bench vs ollama --verbose), honest expectation-setting (Ollama wraps llama.cpp → expect ~few-% speed gap, story is footprint/control), pull-quotes, and a pre-record flag-verification checklist.

Updated: index.md (Presentations 0 → 1; page count 57 → 58); this log.

Honesty notes baked into the outline: don't oversell a speed gap (same engine); no mirrored Ollama benchmark source exists so the test is "your hardware, this build"; flags are master (~2026-05) → pre-record checklist added.

Next: user records/measures the live test; could then write back real numbers as a community benchmark source + a llamacpp-vs-ollama data update.

2026-05-30 — added ggml page + name clarifier + Ollama nuance

Triggered by: user fact-checked the "wrappers built on llama.cpp" claim (confused Meta's Llama models with llama.cpp/Ollama) and asked what ggml is. Verified the wrapping claim via web search (Ollama/LM Studio/Jan/KoboldCpp/GPT4All/llamafile all build on llama.cpp/ggml; none are Meta) — claim holds, no correction needed.

Added (1 entity):

entities/ggml.md — the C tensor library by Georgi Gerganov that llama.cpp is built on (compute graph, backends, quantization; GGUF is its format; also powers whisper.cpp; ggml.ai joined Hugging Face in 2026, confidence medium). Was a long-standing dangling reference across the KB — now a real page.

Updated:

entities/project-llama-cpp.md — added a "Name clarifier" callout (Meta = the models; llama.cpp = independent engine; Ollama = separate company wrapper) + ggml backlink in Related Entities.
syntheses/llamacpp-vs-ollama.md — added the 2026 nuance: Ollama now has its own model-loading engine for some architectures but still built on ggml; "only a thin shell over llama.cpp" is now slightly overstated.
presentations/standalone-llamacpp-explainer-outline.md — added an on-screen beat in Segment 1 clearing up the three-things-named-llama confusion + ggml name-drop.
index.md — entities 12 → 13, total 58 → 59.

Sources for verification: llama.cpp Wikipedia; ggml-org/llama.cpp README; SitePoint & Starmorph local-LLM tool guides.

2026-05-30 — built slide deck for the explainer video

Triggered by: user wants a slideshow for the informational beats of the standalone outline (install/terminal demos done live), themed to the llama.cpp README head image.

Palette: sampled the actual README head image (downloaded, viewed) — charcoal card #1e2228 / page #16191d, white wordmark, orange flame/C++ accent #f0883e (lighter #f7a85a, deep #e2702a), gray text #9aa4ae, mono for code. Recreated the "LLaMA C++" wordmark in CSS (no external image → fully portable single file).

Added (1 asset):

presentations/standalone-llamacpp-explainer-slides.html — self-contained, dependency-free, keyboard-navigable 15-slide deck (title, hook, what-it-is, name clarifier, the stack, what-is-ggml, why-use-directly, vs vLLM, the six knob families, ways-to-serve, test setup, scoreboard, what-you'll-find, decision, outro). Progress bar, slide counter, click/arrow/F-fullscreen nav, print-to-PDF friendly. Content mirrors the outline's informational segments; install/curl/docker commands intentionally omitted (live terminal).

Updated: index.md (presentations entry); this log. (HTML asset, not counted in the .md page total of 59.)

2026-05-30 — expanded the customization segment in the slide deck

Triggered by: user wants the customization segment (the video's highlight) given real depth — break the single six-card grid into a slide per family.

Changed presentations/standalone-llamacpp-explainer-slides.html: the one customization slide became 8 — a quick "map" grid, then a dedicated deep-dive per family (Sampling / Structured output / Tool calling / Context & memory / Speed / Hardware), then a "change one knob, measure" closer. Each deep-dive uses a new two-column layout: left = the key flags (orange mono) with what each does + defaults; right = a "Why it matters" card + the real trade-off/pitfall. Flags/defaults grounded in the concept pages. Deck grew 15 → 22 slides. Added CSS (.sub, .split, .flaglist, .fl, .explain). Verified render of the Sampling and Tool-calling slides (no overflow).

Updated: index.md (slide count + deep-dive note); the outline's Segment 3 (note pointing to the per-family slides); this log.

2026-05-30 — wrote two separate recording scripts (slides + demos)

Triggered by: user wants distinct scripts for the slide voiceover and the non-slide live terminal demos (recorded in separate passes).

Added (2):

presentations/standalone-llamacpp-explainer-script-slides.md — VO narration for all 22 deck slides, with on-screen cues, rough timings, and → DEMO n cut markers.
presentations/standalone-llamacpp-explainer-script-demos.md — runbook for the 4 live terminal segments: (1) install + first run, (2) customization in action — structured output + KV-cache memory, (3) serve as OpenAI API — server/Web UI/curl/Python/Docker, (4) the fair Ollama head-to-head (same GGUF via Modelfile, llama-bench vs ollama run --verbose/ollama ps, scoreboard). Each demo block has exact commands, a SAY:/POINT: talk track, ⚠︎ gotchas/fallbacks, a prep/shopping list, and transitions.

Both share one running order: S1–S3 → DEMO1 → S4–S9 → S10–S16 → DEMO2 → S17 → DEMO3 → S18 → DEMO4 → S19–S22. Commands grounded in the server/cli/bench docs + HF GGUF usage; flagged master (~2026-05).

Updated: index.md (Presentations 1 → 4 .md pages; both scripts listed); this log.

2026-05-31 — pre-record accuracy audit of the deck + both scripts

Triggered by: user asked for a final accuracy sanity check before recording. Ran 3 parallel verifiers: (A) llama.cpp flags/commands vs the cli/server/bench/grammars mirrors; (B) Ollama commands vs live Ollama docs; (C) conceptual claims vs KB pages + web.

Result: no conceptual errors (license, ggml/whisper.cpp, wrapper relationship, vLLM framing, quant/KV/spec-decoding, Raspberry Pi — all confirmed). One factual bug + several stale-default / over-stated-as-fact tightenings.

Fixes applied (deck HTML + both scripts):

❌→✅ /bye removed from DEMO 1 — it's Ollama's exit, not llama-cli's (confirmed: zero /bye in official cli README). Now Ctrl-C / Ctrl-D, with a note that /bye is Ollama-only.
-fa 1 → -fa on (flag now takes on|off|auto, default auto); softened the "KV quant requires -fa" gotcha to "generally requires (esp. V-cache), verify on your build."
--jinja reframed from "Required. Turns on…" to "Enables tool calling + chat templates (on by default now)" — default changed to enabled (deck slide 12 + script S12).
Speed parity ~2–8% → "usually within a few percent… we'll test it ourselves" (slide 7 note + script S7) — it's community consensus, not a benchmarked source, and DEMO 4 tests it.
"you're running llama.cpp underneath" → "llama.cpp / ggml underneath" (slide 21 + S21) — Ollama now has its own ggml-based engine for some models.
GPT4All card "llama.cpp backend" → "GGUF via llama.cpp" (weakest of the six wrappers); sampling VO "penalties/DRY stop repetition" → "…when you turn them on" (both off by default).
Added DEMO 4 note: KV is f16 on both sides by default (so the bench matches); set OLLAMA_KV_CACHE_TYPE=q8_0 only if quantizing KV.

Verified correct, left as-is: all sampling defaults (temp 0.8 / top-k 40 / top-p 0.95 / min-p 0.05), -c 0=from model, -ctk/-ctv q8_0, --rope-scaling/--yarn-*, -np/-cb, --spec-type strategy names, -sm/-ts/-cmoe, server bind 127.0.0.1:8080, /v1/chat/completions, --api-key/LLAMA_API_KEY, Web UI at /, Docker :server/:server-cuda, llama-bench -p(pp)/-n(tg) excl. tokenization. Ollama side fully confirmed against docs.ollama.com (Modelfile FROM ./*.gguf, num_ctx, ollama create -f, OLLAMA_FLASH_ATTENTION=1, run --verbose rate fields, ps PROCESSOR/GPU%, show quant). Tool-calling specifics (native template list, parallel_tool_calls default-off, q4_0-KV-degrades-tools) trace to raw/docs-function-calling.md — sourced & correct.

For live verification on the recording machine (docs don't pin these): exact -fa value form, whether -ngl even needs setting (default now auto), and current --spec-* flag spellings — all already in the outline's pre-record checklist.

2026-06-10 — removed Obsidian scaffolding from the served wiki

Deleted analytics.md, dashboard.md, flashcards.md (Obsidian plugin pages — Dataview/Charts View/Spaced Repetition markup, unusable when served as plain Markdown to agents) and the journal/ scaffold (template only). The 4 video-production files in presentations/ moved to repo root (not served); index count 59 -> 58. CLAUDE.md directory layout updated: production/planning material lives at repo root, never under wiki/ (everything under wiki/ is served publicly).