Agent Wikis

wikis / llama.cpp / wiki / log.md view as markdown

Activity Log

type: logupdated: 2026-06-10

Append-only log of KB changes.


2026-05-30 β€” KB scaffolded

Triggered by: user creating a llama.cpp knowledge base as the research backbone for a YouTube video on llama.cpp. Modeled on the llm-wiki template (Karpathy's "LLM Wiki" pattern), customized for the llama.cpp domain.

Added:

  • CLAUDE.md β€” schema tailored to llama.cpp (directory layout, page format, llama.cpp tagging taxonomy, build-tag version awareness, ingest/query/lint workflows).
  • README.md
  • wiki/index.md β€” empty master catalog with planned starter pages listed.
  • wiki/log.md β€” this file.
  • wiki/journal/template.md β€” research-session note template.
  • Empty subdirs: raw/, wiki/summaries/, wiki/concepts/, wiki/entities/, wiki/syntheses/, wiki/presentations/.

Next: drop llama.cpp source material into raw/ (official docs mirrors, video transcripts, discussion/PR dumps) and run "ingest".


2026-05-30 β€” seeded raw/ with official docs (13 mirrors)

Triggered by: user request to fetch the official llama.cpp docs and seed raw/. Verbatim mirrors pulled from ggml-org/llama.cpp@master (and the GGUF spec from ggml-org/ggml@master) on 2026-05-30.

Added raw sources (13):

  • llamacpp-readme.md β€” main repo README (overview, features, supported models/backends)
  • docs-build.md β€” full build guide + backend matrix (CPU/CUDA/Metal/Vulkan/ROCm/SYCL/MUSA/CANN)
  • docs-install.md β€” package-manager install methods
  • server-readme.md β€” llama-server docs + OpenAI-compatible HTTP API (largest source, ~92KB)
  • cli-readme.md β€” llama-cli (tools/cli) docs and flags
  • quantize-readme.md β€” llama-quantize + quant type table
  • imatrix-readme.md β€” importance-matrix generation for quantization
  • llama-bench-readme.md β€” llama-bench benchmarking tool
  • grammars-readme.md β€” GBNF grammar syntax + structured output
  • docs-function-calling.md β€” tool/function calling support
  • docs-multimodal.md β€” multimodal overview
  • mtmd-readme.md β€” mtmd multimodal CLI/lib
  • gguf-spec.md β€” GGUF file-format specification (from ggml repo)

Note: these are immutable mirrors β€” do not edit. Build tag at fetch time not pinned (master). When ingesting, record an approximate llama_build from the current release tag.

Next: run "ingest" to generate summary + concept/entity pages from these sources.


2026-05-30 β€” ingested all 13 official-doc sources

Triggered by: user request to ingest all 13 seeded raw sources. Done in two phases (parallel summarizers β†’ parallel page authors) with a shared canonical slug list for consistent cross-links.

Added wiki pages (37 total):

  • Summaries (13) β€” one per raw source: llamacpp-readme, docs-build, docs-install, server-readme, cli-readme, quantize-readme, imatrix-readme, llama-bench-readme, grammars-readme, docs-function-calling, docs-multimodal, mtmd-readme, gguf-spec.
  • Concepts (12) β€” gguf-format, quantization, imatrix, sampling-parameters, kv-cache-and-context, speculative-decoding, gbnf-grammars, function-calling, embeddings, multimodal-mtmd, server-api, build-and-backends.
  • Entities (12) β€” project-llama-cpp; binaries binary-llama-cli, binary-llama-server, binary-llama-quantize, binary-imatrix, binary-llama-bench, binary-mtmd; backends backend-cpu, backend-cuda, backend-metal, backend-vulkan, backend-rocm.

Updated: index.md β€” fully populated master catalog (page count 0 β†’ 37); this log.

Caveats recorded on pages (per CLAUDE.md version-awareness):

  • All pages set llama_build: "master (~2026-05)" β€” sources were master, not a pinned b####.
  • Doc-vs-code lags flagged inline: GGUF spec general.file_type enum stale; speculative-decoding flags reworked (legacy --draft* removed); function-calling doc self-TODO; multimodal roster "under heavy development."
  • Real doc inconsistency captured: server /completion repeat_penalty default 1.1 vs CLI --repeat-penalty 1.00.

Not yet created (deferred): backend pages for SYCL/MUSA/CANN/OpenCL/RPC/etc. (present in the build-and-backends matrix); all syntheses (need a community/benchmark source pass); llama-perplexity / llama-simple / conversion-script entities.

Next: add community/benchmark sources, then build the planned syntheses (quant-types-compared, backend-selection-guide, llamacpp-vs-ollama).


2026-05-30 β€” community sweep: ingested 15 community sources + 5 syntheses

Triggered by: user request to sweep popular community sources. Ran a 4-agent web discovery sweep (28 verified-live sources found), user chose breadth "primary + best secondary" (15) and synthesis focus = quant selection, vs Ollama/vLLM, deployment/OpenAI API, and a custom "customizable features / tuning" page. Fetched in 5 parallel agents (mirror + summary each), then 3 parallel agents authored the syntheses.

Added raw mirrors (15) under raw/community/ β€” markdown-converted mirrors with provenance headers (NOT pristine):

  • quant: community-pr1684-kquants, community-artefact2-quant-table, community-arxiv-quant-eval, community-kaitchup-gguf-guide, community-bartowski-quant-guide, community-mradermacher-imatrix, community-unsloth-dynamic-ggufs
  • benchmarks: community-bench-apple-silicon (GH#4167), community-bench-nvidia-cuda (GH#15013), community-smcleod-kv-quant, community-dgxspark-kv-quant
  • comparisons/guides: community-redhat-vllm-vs-llamacpp, community-gh15180-vllm-vs-llamacpp, community-steelphoenix-guide, community-hf-gguf-usage

Added wiki pages (20):

  • Summaries (15) β€” one per community source above (all confidence: medium, dated).
  • Syntheses (5) β€” quant-types-compared, llamacpp-vs-ollama, llamacpp-vs-vllm, server-deployment, customization-and-tuning.

Updated: index.md (page count 37 β†’ 57; two source tiers; community caveats section); this log.

Honesty/provenance notes recorded on pages:

  • Reddit unfetchable by tooling β€” no r/LocalLLaMA threads fabricated.
  • KV-quant conflict (smcleod vs DGX Spark) captured and reconciled β†’ q8_0 safe default.
  • Vendor bias flagged: Unsloth (Dynamic 2.0), Red Hat (pro-vLLM).
  • No dedicated Ollama source mirrored β€” llamacpp-vs-ollama rests on the wrapper relationship + community consensus (noted on the page).
  • Partial fetches: SteelPh0enix (long-form prose refused β†’ structured technical extraction); Kaitchup (partial paywall β€” taxonomy only, no number tables); arXiv tables from HTML extraction (verify vs PDF).
  • Absolute perplexity scales differ across quant sources (LLaMA-1-7B vs Llama-3.1-8B) β€” only relative ordering transfers.

Deferred: dedicated benchmark/backend-selection-guide synthesis (data is ingested; user deprioritized); a mirrored Ollama-vs-llama.cpp benchmark source; Reddit threads (user can paste).


2026-05-30 β€” drafted standalone video outline

Triggered by: user wants a standalone "what is llama.cpp" video β€” what it is, what makes it different from other local servers, customization/flags, ways to serve, and a head-to-head test vs Ollama (memory/speed). (NOT a quantization module.)

Added (1):

  • presentations/standalone-llamacpp-explainer-outline.md β€” 14–20 min, 5 segments matching the brief + cold open/outro. Grounded in project-llama-cpp, llamacpp-vs-ollama, llamacpp-vs-vllm, customization-and-tuning, server-deployment, server-api. Includes on-screen demos with verbatim commands, a fair head-to-head methodology (controls: same GGUF/quant, ctx, -ngl, flash-attn, KV type; metrics table; what to measure with llama-bench vs ollama --verbose), honest expectation-setting (Ollama wraps llama.cpp β†’ expect ~few-% speed gap, story is footprint/control), pull-quotes, and a pre-record flag-verification checklist.

Updated: index.md (Presentations 0 β†’ 1; page count 57 β†’ 58); this log.

Honesty notes baked into the outline: don't oversell a speed gap (same engine); no mirrored Ollama benchmark source exists so the test is "your hardware, this build"; flags are master (~2026-05) β†’ pre-record checklist added.

Next: user records/measures the live test; could then write back real numbers as a community benchmark source + a llamacpp-vs-ollama data update.


2026-05-30 β€” added ggml page + name clarifier + Ollama nuance

Triggered by: user fact-checked the "wrappers built on llama.cpp" claim (confused Meta's Llama models with llama.cpp/Ollama) and asked what ggml is. Verified the wrapping claim via web search (Ollama/LM Studio/Jan/KoboldCpp/GPT4All/llamafile all build on llama.cpp/ggml; none are Meta) β€” claim holds, no correction needed.

Added (1 entity):

  • entities/ggml.md β€” the C tensor library by Georgi Gerganov that llama.cpp is built on (compute graph, backends, quantization; GGUF is its format; also powers whisper.cpp; ggml.ai joined Hugging Face in 2026, confidence medium). Was a long-standing dangling reference across the KB β€” now a real page.

Updated:

  • entities/project-llama-cpp.md β€” added a "Name clarifier" callout (Meta = the models; llama.cpp = independent engine; Ollama = separate company wrapper) + ggml backlink in Related Entities.
  • syntheses/llamacpp-vs-ollama.md β€” added the 2026 nuance: Ollama now has its own model-loading engine for some architectures but still built on ggml; "only a thin shell over llama.cpp" is now slightly overstated.
  • presentations/standalone-llamacpp-explainer-outline.md β€” added an on-screen beat in Segment 1 clearing up the three-things-named-llama confusion + ggml name-drop.
  • index.md β€” entities 12 β†’ 13, total 58 β†’ 59.

Sources for verification: llama.cpp Wikipedia; ggml-org/llama.cpp README; SitePoint & Starmorph local-LLM tool guides.


2026-05-30 β€” built slide deck for the explainer video

Triggered by: user wants a slideshow for the informational beats of the standalone outline (install/terminal demos done live), themed to the llama.cpp README head image.

Palette: sampled the actual README head image (downloaded, viewed) β€” charcoal card #1e2228 / page #16191d, white wordmark, orange flame/C++ accent #f0883e (lighter #f7a85a, deep #e2702a), gray text #9aa4ae, mono for code. Recreated the "LLaMA C++" wordmark in CSS (no external image β†’ fully portable single file).

Added (1 asset):

  • presentations/standalone-llamacpp-explainer-slides.html β€” self-contained, dependency-free, keyboard-navigable 15-slide deck (title, hook, what-it-is, name clarifier, the stack, what-is-ggml, why-use-directly, vs vLLM, the six knob families, ways-to-serve, test setup, scoreboard, what-you'll-find, decision, outro). Progress bar, slide counter, click/arrow/F-fullscreen nav, print-to-PDF friendly. Content mirrors the outline's informational segments; install/curl/docker commands intentionally omitted (live terminal).

Updated: index.md (presentations entry); this log. (HTML asset, not counted in the .md page total of 59.)


2026-05-30 β€” expanded the customization segment in the slide deck

Triggered by: user wants the customization segment (the video's highlight) given real depth β€” break the single six-card grid into a slide per family.

Changed presentations/standalone-llamacpp-explainer-slides.html: the one customization slide became 8 β€” a quick "map" grid, then a dedicated deep-dive per family (Sampling / Structured output / Tool calling / Context & memory / Speed / Hardware), then a "change one knob, measure" closer. Each deep-dive uses a new two-column layout: left = the key flags (orange mono) with what each does + defaults; right = a "Why it matters" card + the real trade-off/pitfall. Flags/defaults grounded in the concept pages. Deck grew 15 β†’ 22 slides. Added CSS (.sub, .split, .flaglist, .fl, .explain). Verified render of the Sampling and Tool-calling slides (no overflow).

Updated: index.md (slide count + deep-dive note); the outline's Segment 3 (note pointing to the per-family slides); this log.


2026-05-30 β€” wrote two separate recording scripts (slides + demos)

Triggered by: user wants distinct scripts for the slide voiceover and the non-slide live terminal demos (recorded in separate passes).

Added (2):

  • presentations/standalone-llamacpp-explainer-script-slides.md β€” VO narration for all 22 deck slides, with on-screen cues, rough timings, and β†’ DEMO n cut markers.
  • presentations/standalone-llamacpp-explainer-script-demos.md β€” runbook for the 4 live terminal segments: (1) install + first run, (2) customization in action β€” structured output + KV-cache memory, (3) serve as OpenAI API β€” server/Web UI/curl/Python/Docker, (4) the fair Ollama head-to-head (same GGUF via Modelfile, llama-bench vs ollama run --verbose/ollama ps, scoreboard). Each demo block has exact commands, a SAY:/POINT: talk track, ⚠︎ gotchas/fallbacks, a prep/shopping list, and transitions.

Both share one running order: S1–S3 β†’ DEMO1 β†’ S4–S9 β†’ S10–S16 β†’ DEMO2 β†’ S17 β†’ DEMO3 β†’ S18 β†’ DEMO4 β†’ S19–S22. Commands grounded in the server/cli/bench docs + HF GGUF usage; flagged master (~2026-05).

Updated: index.md (Presentations 1 β†’ 4 .md pages; both scripts listed); this log.


2026-05-31 β€” pre-record accuracy audit of the deck + both scripts

Triggered by: user asked for a final accuracy sanity check before recording. Ran 3 parallel verifiers: (A) llama.cpp flags/commands vs the cli/server/bench/grammars mirrors; (B) Ollama commands vs live Ollama docs; (C) conceptual claims vs KB pages + web.

Result: no conceptual errors (license, ggml/whisper.cpp, wrapper relationship, vLLM framing, quant/KV/spec-decoding, Raspberry Pi β€” all confirmed). One factual bug + several stale-default / over-stated-as-fact tightenings.

Fixes applied (deck HTML + both scripts):

  • βŒβ†’βœ… /bye removed from DEMO 1 β€” it's Ollama's exit, not llama-cli's (confirmed: zero /bye in official cli README). Now Ctrl-C / Ctrl-D, with a note that /bye is Ollama-only.
  • -fa 1 β†’ -fa on (flag now takes on|off|auto, default auto); softened the "KV quant requires -fa" gotcha to "generally requires (esp. V-cache), verify on your build."
  • --jinja reframed from "Required. Turns on…" to "Enables tool calling + chat templates (on by default now)" β€” default changed to enabled (deck slide 12 + script S12).
  • Speed parity ~2–8% β†’ "usually within a few percent… we'll test it ourselves" (slide 7 note + script S7) β€” it's community consensus, not a benchmarked source, and DEMO 4 tests it.
  • "you're running llama.cpp underneath" β†’ "llama.cpp / ggml underneath" (slide 21 + S21) β€” Ollama now has its own ggml-based engine for some models.
  • GPT4All card "llama.cpp backend" β†’ "GGUF via llama.cpp" (weakest of the six wrappers); sampling VO "penalties/DRY stop repetition" β†’ "…when you turn them on" (both off by default).
  • Added DEMO 4 note: KV is f16 on both sides by default (so the bench matches); set OLLAMA_KV_CACHE_TYPE=q8_0 only if quantizing KV.

Verified correct, left as-is: all sampling defaults (temp 0.8 / top-k 40 / top-p 0.95 / min-p 0.05), -c 0=from model, -ctk/-ctv q8_0, --rope-scaling/--yarn-*, -np/-cb, --spec-type strategy names, -sm/-ts/-cmoe, server bind 127.0.0.1:8080, /v1/chat/completions, --api-key/LLAMA_API_KEY, Web UI at /, Docker :server/:server-cuda, llama-bench -p(pp)/-n(tg) excl. tokenization. Ollama side fully confirmed against docs.ollama.com (Modelfile FROM ./*.gguf, num_ctx, ollama create -f, OLLAMA_FLASH_ATTENTION=1, run --verbose rate fields, ps PROCESSOR/GPU%, show quant). Tool-calling specifics (native template list, parallel_tool_calls default-off, q4_0-KV-degrades-tools) trace to raw/docs-function-calling.md β€” sourced & correct.

For live verification on the recording machine (docs don't pin these): exact -fa value form, whether -ngl even needs setting (default now auto), and current --spec-* flag spellings β€” all already in the outline's pre-record checklist.


2026-06-10 β€” removed Obsidian scaffolding from the served wiki

Deleted analytics.md, dashboard.md, flashcards.md (Obsidian plugin pages β€” Dataview/Charts View/Spaced Repetition markup, unusable when served as plain Markdown to agents) and the journal/ scaffold (template only). The 4 video-production files in presentations/ moved to repo root (not served); index count 59 -> 58. CLAUDE.md directory layout updated: production/planning material lives at repo root, never under wiki/ (everything under wiki/ is served publicly).