wikis / llama.cpp / wiki / log.md view as markdown
Activity Log
Append-only log of KB changes.
2026-05-30 β KB scaffolded
Triggered by: user creating a llama.cpp knowledge base as the research backbone for a YouTube video on llama.cpp. Modeled on the llm-wiki template (Karpathy's "LLM Wiki" pattern), customized for the llama.cpp domain.
Added:
CLAUDE.mdβ schema tailored to llama.cpp (directory layout, page format, llama.cpp tagging taxonomy, build-tag version awareness, ingest/query/lint workflows).README.mdwiki/index.mdβ empty master catalog with planned starter pages listed.wiki/log.mdβ this file.wiki/journal/template.mdβ research-session note template.- Empty subdirs:
raw/,wiki/summaries/,wiki/concepts/,wiki/entities/,wiki/syntheses/,wiki/presentations/.
Next: drop llama.cpp source material into raw/ (official docs mirrors, video transcripts, discussion/PR dumps) and run "ingest".
2026-05-30 β seeded raw/ with official docs (13 mirrors)
Triggered by: user request to fetch the official llama.cpp docs and seed raw/. Verbatim mirrors pulled from ggml-org/llama.cpp@master (and the GGUF spec from ggml-org/ggml@master) on 2026-05-30.
Added raw sources (13):
llamacpp-readme.mdβ main repo README (overview, features, supported models/backends)docs-build.mdβ full build guide + backend matrix (CPU/CUDA/Metal/Vulkan/ROCm/SYCL/MUSA/CANN)docs-install.mdβ package-manager install methodsserver-readme.mdβllama-serverdocs + OpenAI-compatible HTTP API (largest source, ~92KB)cli-readme.mdβllama-cli(tools/cli) docs and flagsquantize-readme.mdβllama-quantize+ quant type tableimatrix-readme.mdβ importance-matrix generation for quantizationllama-bench-readme.mdβllama-benchbenchmarking toolgrammars-readme.mdβ GBNF grammar syntax + structured outputdocs-function-calling.mdβ tool/function calling supportdocs-multimodal.mdβ multimodal overviewmtmd-readme.mdβmtmdmultimodal CLI/libgguf-spec.mdβ GGUF file-format specification (from ggml repo)
Note: these are immutable mirrors β do not edit. Build tag at fetch time not pinned (master). When ingesting, record an approximate llama_build from the current release tag.
Next: run "ingest" to generate summary + concept/entity pages from these sources.
2026-05-30 β ingested all 13 official-doc sources
Triggered by: user request to ingest all 13 seeded raw sources. Done in two phases (parallel summarizers β parallel page authors) with a shared canonical slug list for consistent cross-links.
Added wiki pages (37 total):
- Summaries (13) β one per raw source:
llamacpp-readme,docs-build,docs-install,server-readme,cli-readme,quantize-readme,imatrix-readme,llama-bench-readme,grammars-readme,docs-function-calling,docs-multimodal,mtmd-readme,gguf-spec. - Concepts (12) β
gguf-format,quantization,imatrix,sampling-parameters,kv-cache-and-context,speculative-decoding,gbnf-grammars,function-calling,embeddings,multimodal-mtmd,server-api,build-and-backends. - Entities (12) β
project-llama-cpp; binariesbinary-llama-cli,binary-llama-server,binary-llama-quantize,binary-imatrix,binary-llama-bench,binary-mtmd; backendsbackend-cpu,backend-cuda,backend-metal,backend-vulkan,backend-rocm.
Updated: index.md β fully populated master catalog (page count 0 β 37); this log.
Caveats recorded on pages (per CLAUDE.md version-awareness):
- All pages set
llama_build: "master (~2026-05)"β sources were master, not a pinnedb####. - Doc-vs-code lags flagged inline: GGUF spec
general.file_typeenum stale; speculative-decoding flags reworked (legacy--draft*removed); function-calling doc self-TODO; multimodal roster "under heavy development." - Real doc inconsistency captured: server
/completionrepeat_penaltydefault 1.1 vs CLI--repeat-penalty1.00.
Not yet created (deferred): backend pages for SYCL/MUSA/CANN/OpenCL/RPC/etc. (present in the build-and-backends matrix); all syntheses (need a community/benchmark source pass); llama-perplexity / llama-simple / conversion-script entities.
Next: add community/benchmark sources, then build the planned syntheses (quant-types-compared, backend-selection-guide, llamacpp-vs-ollama).
2026-05-30 β community sweep: ingested 15 community sources + 5 syntheses
Triggered by: user request to sweep popular community sources. Ran a 4-agent web discovery sweep (28 verified-live sources found), user chose breadth "primary + best secondary" (15) and synthesis focus = quant selection, vs Ollama/vLLM, deployment/OpenAI API, and a custom "customizable features / tuning" page. Fetched in 5 parallel agents (mirror + summary each), then 3 parallel agents authored the syntheses.
Added raw mirrors (15) under raw/community/ β markdown-converted mirrors with provenance headers (NOT pristine):
- quant:
community-pr1684-kquants,community-artefact2-quant-table,community-arxiv-quant-eval,community-kaitchup-gguf-guide,community-bartowski-quant-guide,community-mradermacher-imatrix,community-unsloth-dynamic-ggufs - benchmarks:
community-bench-apple-silicon(GH#4167),community-bench-nvidia-cuda(GH#15013),community-smcleod-kv-quant,community-dgxspark-kv-quant - comparisons/guides:
community-redhat-vllm-vs-llamacpp,community-gh15180-vllm-vs-llamacpp,community-steelphoenix-guide,community-hf-gguf-usage
Added wiki pages (20):
- Summaries (15) β one per community source above (all
confidence: medium, dated). - Syntheses (5) β
quant-types-compared,llamacpp-vs-ollama,llamacpp-vs-vllm,server-deployment,customization-and-tuning.
Updated: index.md (page count 37 β 57; two source tiers; community caveats section); this log.
Honesty/provenance notes recorded on pages:
- Reddit unfetchable by tooling β no r/LocalLLaMA threads fabricated.
- KV-quant conflict (smcleod vs DGX Spark) captured and reconciled β q8_0 safe default.
- Vendor bias flagged: Unsloth (Dynamic 2.0), Red Hat (pro-vLLM).
- No dedicated Ollama source mirrored β
llamacpp-vs-ollamarests on the wrapper relationship + community consensus (noted on the page). - Partial fetches: SteelPh0enix (long-form prose refused β structured technical extraction); Kaitchup (partial paywall β taxonomy only, no number tables); arXiv tables from HTML extraction (verify vs PDF).
- Absolute perplexity scales differ across quant sources (LLaMA-1-7B vs Llama-3.1-8B) β only relative ordering transfers.
Deferred: dedicated benchmark/backend-selection-guide synthesis (data is ingested; user deprioritized); a mirrored Ollama-vs-llama.cpp benchmark source; Reddit threads (user can paste).
2026-05-30 β drafted standalone video outline
Triggered by: user wants a standalone "what is llama.cpp" video β what it is, what makes it different from other local servers, customization/flags, ways to serve, and a head-to-head test vs Ollama (memory/speed). (NOT a quantization module.)
Added (1):
presentations/standalone-llamacpp-explainer-outline.mdβ 14β20 min, 5 segments matching the brief + cold open/outro. Grounded in project-llama-cpp, llamacpp-vs-ollama, llamacpp-vs-vllm, customization-and-tuning, server-deployment, server-api. Includes on-screen demos with verbatim commands, a fair head-to-head methodology (controls: same GGUF/quant, ctx, -ngl, flash-attn, KV type; metrics table; what to measure withllama-benchvsollama --verbose), honest expectation-setting (Ollama wraps llama.cpp β expect ~few-% speed gap, story is footprint/control), pull-quotes, and a pre-record flag-verification checklist.
Updated: index.md (Presentations 0 β 1; page count 57 β 58); this log.
Honesty notes baked into the outline: don't oversell a speed gap (same engine); no mirrored Ollama benchmark source exists so the test is "your hardware, this build"; flags are master (~2026-05) β pre-record checklist added.
Next: user records/measures the live test; could then write back real numbers as a community benchmark source + a llamacpp-vs-ollama data update.
2026-05-30 β added ggml page + name clarifier + Ollama nuance
Triggered by: user fact-checked the "wrappers built on llama.cpp" claim (confused Meta's Llama models with llama.cpp/Ollama) and asked what ggml is. Verified the wrapping claim via web search (Ollama/LM Studio/Jan/KoboldCpp/GPT4All/llamafile all build on llama.cpp/ggml; none are Meta) β claim holds, no correction needed.
Added (1 entity):
entities/ggml.mdβ the C tensor library by Georgi Gerganov that llama.cpp is built on (compute graph, backends, quantization; GGUF is its format; also powers whisper.cpp; ggml.ai joined Hugging Face in 2026, confidence medium). Was a long-standing dangling reference across the KB β now a real page.
Updated:
entities/project-llama-cpp.mdβ added a "Name clarifier" callout (Meta = the models; llama.cpp = independent engine; Ollama = separate company wrapper) + ggml backlink in Related Entities.syntheses/llamacpp-vs-ollama.mdβ added the 2026 nuance: Ollama now has its own model-loading engine for some architectures but still built on ggml; "only a thin shell over llama.cpp" is now slightly overstated.presentations/standalone-llamacpp-explainer-outline.mdβ added an on-screen beat in Segment 1 clearing up the three-things-named-llama confusion + ggml name-drop.index.mdβ entities 12 β 13, total 58 β 59.
Sources for verification: llama.cpp Wikipedia; ggml-org/llama.cpp README; SitePoint & Starmorph local-LLM tool guides.
2026-05-30 β built slide deck for the explainer video
Triggered by: user wants a slideshow for the informational beats of the standalone outline (install/terminal demos done live), themed to the llama.cpp README head image.
Palette: sampled the actual README head image (downloaded, viewed) β charcoal card #1e2228 / page #16191d, white wordmark, orange flame/C++ accent #f0883e (lighter #f7a85a, deep #e2702a), gray text #9aa4ae, mono for code. Recreated the "LLaMA C++" wordmark in CSS (no external image β fully portable single file).
Added (1 asset):
presentations/standalone-llamacpp-explainer-slides.htmlβ self-contained, dependency-free, keyboard-navigable 15-slide deck (title, hook, what-it-is, name clarifier, the stack, what-is-ggml, why-use-directly, vs vLLM, the six knob families, ways-to-serve, test setup, scoreboard, what-you'll-find, decision, outro). Progress bar, slide counter, click/arrow/F-fullscreen nav, print-to-PDF friendly. Content mirrors the outline's informational segments; install/curl/docker commands intentionally omitted (live terminal).
Updated: index.md (presentations entry); this log. (HTML asset, not counted in the .md page total of 59.)
2026-05-30 β expanded the customization segment in the slide deck
Triggered by: user wants the customization segment (the video's highlight) given real depth β break the single six-card grid into a slide per family.
Changed presentations/standalone-llamacpp-explainer-slides.html: the one customization slide became 8 β a quick "map" grid, then a dedicated deep-dive per family (Sampling / Structured output / Tool calling / Context & memory / Speed / Hardware), then a "change one knob, measure" closer. Each deep-dive uses a new two-column layout: left = the key flags (orange mono) with what each does + defaults; right = a "Why it matters" card + the real trade-off/pitfall. Flags/defaults grounded in the concept pages. Deck grew 15 β 22 slides. Added CSS (.sub, .split, .flaglist, .fl, .explain). Verified render of the Sampling and Tool-calling slides (no overflow).
Updated: index.md (slide count + deep-dive note); the outline's Segment 3 (note pointing to the per-family slides); this log.
2026-05-30 β wrote two separate recording scripts (slides + demos)
Triggered by: user wants distinct scripts for the slide voiceover and the non-slide live terminal demos (recorded in separate passes).
Added (2):
presentations/standalone-llamacpp-explainer-script-slides.mdβ VO narration for all 22 deck slides, with on-screen cues, rough timings, andβ DEMO ncut markers.presentations/standalone-llamacpp-explainer-script-demos.mdβ runbook for the 4 live terminal segments: (1) install + first run, (2) customization in action β structured output + KV-cache memory, (3) serve as OpenAI API β server/Web UI/curl/Python/Docker, (4) the fair Ollama head-to-head (same GGUF via Modelfile, llama-bench vsollama run --verbose/ollama ps, scoreboard). Each demo block has exact commands, aSAY:/POINT:talk track,β οΈgotchas/fallbacks, a prep/shopping list, and transitions.
Both share one running order: S1βS3 β DEMO1 β S4βS9 β S10βS16 β DEMO2 β S17 β DEMO3 β S18 β DEMO4 β S19βS22. Commands grounded in the server/cli/bench docs + HF GGUF usage; flagged master (~2026-05).
Updated: index.md (Presentations 1 β 4 .md pages; both scripts listed); this log.
2026-05-31 β pre-record accuracy audit of the deck + both scripts
Triggered by: user asked for a final accuracy sanity check before recording. Ran 3 parallel verifiers: (A) llama.cpp flags/commands vs the cli/server/bench/grammars mirrors; (B) Ollama commands vs live Ollama docs; (C) conceptual claims vs KB pages + web.
Result: no conceptual errors (license, ggml/whisper.cpp, wrapper relationship, vLLM framing, quant/KV/spec-decoding, Raspberry Pi β all confirmed). One factual bug + several stale-default / over-stated-as-fact tightenings.
Fixes applied (deck HTML + both scripts):
- βββ
/byeremoved from DEMO 1 β it's Ollama's exit, not llama-cli's (confirmed: zero/byein official cli README). NowCtrl-C/Ctrl-D, with a note that/byeis Ollama-only. -fa 1β-fa on(flag now takeson|off|auto, defaultauto); softened the "KV quant requires -fa" gotcha to "generally requires (esp. V-cache), verify on your build."--jinjareframed from "Required. Turns onβ¦" to "Enables tool calling + chat templates (on by default now)" β default changed to enabled (deck slide 12 + script S12).- Speed parity ~2β8% β "usually within a few percentβ¦ we'll test it ourselves" (slide 7 note + script S7) β it's community consensus, not a benchmarked source, and DEMO 4 tests it.
- "you're running llama.cpp underneath" β "llama.cpp / ggml underneath" (slide 21 + S21) β Ollama now has its own ggml-based engine for some models.
- GPT4All card "llama.cpp backend" β "GGUF via llama.cpp" (weakest of the six wrappers); sampling VO "penalties/DRY stop repetition" β "β¦when you turn them on" (both off by default).
- Added DEMO 4 note: KV is f16 on both sides by default (so the bench matches); set
OLLAMA_KV_CACHE_TYPE=q8_0only if quantizing KV.
Verified correct, left as-is: all sampling defaults (temp 0.8 / top-k 40 / top-p 0.95 / min-p 0.05), -c 0=from model, -ctk/-ctv q8_0, --rope-scaling/--yarn-*, -np/-cb, --spec-type strategy names, -sm/-ts/-cmoe, server bind 127.0.0.1:8080, /v1/chat/completions, --api-key/LLAMA_API_KEY, Web UI at /, Docker :server/:server-cuda, llama-bench -p(pp)/-n(tg) excl. tokenization. Ollama side fully confirmed against docs.ollama.com (Modelfile FROM ./*.gguf, num_ctx, ollama create -f, OLLAMA_FLASH_ATTENTION=1, run --verbose rate fields, ps PROCESSOR/GPU%, show quant). Tool-calling specifics (native template list, parallel_tool_calls default-off, q4_0-KV-degrades-tools) trace to raw/docs-function-calling.md β sourced & correct.
For live verification on the recording machine (docs don't pin these): exact -fa value form, whether -ngl even needs setting (default now auto), and current --spec-* flag spellings β all already in the outline's pre-record checklist.
2026-06-10 β removed Obsidian scaffolding from the served wiki
Deleted analytics.md, dashboard.md, flashcards.md (Obsidian plugin pages β Dataview/Charts View/Spaced Repetition markup, unusable when served as plain Markdown to agents) and the journal/ scaffold (template only). The 4 video-production files in presentations/ moved to repo root (not served); index count 59 -> 58. CLAUDE.md directory layout updated: production/planning material lives at repo root, never under wiki/ (everything under wiki/ is served publicly).
