Agent Wikis

wikis / Hermes / wiki / concepts / ml-research-pipeline.md view as markdown

ML Research Pipeline

type: conceptconfidence: highupdated: 2026-04-13hermes_version: v0.8.0

Definition

Hermes ships a complete ML and reinforcement-learning research toolchain inside the agent โ€” not as an optional add-on, but as a first-class set of tools, skills, and CLI scripts. This is the same harness Nous Research uses to train the Hermes model family, exposed so that any user can capture trajectories, generate datasets at scale, fine-tune small models, run RL training loops, and write conference-grade papers โ€” all from inside an agent conversation. As one community reviewer put it, Hermes is "built for tinkerers" with "machine learning tools built-in" and "reinforcement learning tools built-in" because Nous "are the people behind Hermes model family. So, they know how models learn because they train them for legitimately a living and that shows in the architecture."

How It Works

The pipeline has four layered components that feed into each other:

1. Trajectory Capture (agent/trajectory.py) โ€” Every agent run records a full trajectory: every tool call, every decision, every result, in order. Most agent frameworks throw this away when the task ends ("most agent frameworks they just throw that away completely. Like the task is done, the memory's gone. Hermes keeps it instead"). Hermes serializes the trajectory to ShareGPT JSONL โ€” the de-facto fine-tuning format with from/value pairs that Axolotl, Unsloth, TRL, and HuggingFace datasets all consume natively. This single decision is what makes the self improvement loop possible and what lets a Hermes user export "training data for your own models" without any extra plumbing.

2. Batch Runner (batch_runner.py, 1287 lines) โ€” A standalone script (separate from the in-agent delegate_task tool โ€” see subagents delegation) for offline, parallel dataset processing. Spawns a multiprocessing Pool, each worker an independent AIAgent, and writes results checkpoint-resumably:

python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run [--resume] [--distribution=image_gen]

Trajectories are saved as ShareGPT-format from/value pairs alongside per-tool success/failure stats normalized via _normalize_tool_stats(). Output is HuggingFace-compatible Arrow/Parquet so the resulting dataset can be dataset.push_to_hub()'d directly, or loaded into Axolotl/Unsloth without conversion. Toolset distributions in /toolset_distributions.py let you preset which tool subset each batch worker is allowed to use (e.g. --distribution=image_gen restricts to image-related tools).

3. Trajectory Compressor (trajectory_compressor.py) โ€” Long agent runs blow past fine-tuning context windows (16Kโ€“128K tokens). The compressor takes a recorded trajectory and folds intermediate tool outputs into summarized states so the conversation still fits, while preserving the gradient signal of the actual decision sequence. This is what makes hour-long agent traces usable as SFT data on a 32K-context base model.

4. RL Pipeline โ€” Tinker + Atropos โ€” Ten dedicated RL tools registered in tools/registry.py, exposed to the agent itself so it can drive its own training:

Tool Purpose
rl_list_environments List available Atropos RL environments
rl_select_environment Pick an environment for a run
rl_get_current_config / rl_edit_config Read/edit the run config
rl_start_training / rl_stop_training Kick off / abort a run
rl_check_status / rl_list_runs / rl_get_results Monitor + retrieve
rl_test_inference Smoke-test the trained policy

Behind the tools is rl_cli.py, the standalone command-line entry point. The default recipe โ€” hardcoded in tools/rl_training_tool.py โ€” is Qwen/Qwen3-8B + LoRA rank 32 + SGLang inference on :8001 + Atropos trajectory API on :8000 + Weights & Biases logging, 2500 training steps, learning rate 4e-5, max_token_length 8192, checkpoint every 25 steps. Tinker is actually Thinking Machines Lab's managed LoRA training API (not Nous's) โ€” Atropos is Nous's environment library, tinker-atropos is the glue. TINKER_API_KEY, WANDB_API_KEY, and RL_API_URL are the three env vars required (see version v0.8.0 env reference).

Recipe A โ€” Tinker + Atropos (default)

Prerequisites: Sign up at https://auth.thinkingmachines.ai/sign-up, generate a key at https://tinker-console.thinkingmachines.ai/keys, put TINKER_API_KEY=tnkr_... and WANDB_API_KEY=... in ~/.hermes/.env. Clone github.com/NousResearch/tinker-atropos and pip install -e . in it.

Step 1 โ€” Generate trajectories. Either use session history already written to trajectory_samples.jsonl, or synthesize a batch:

python batch_runner.py \
  --dataset_file=prompts.jsonl \
  --batch_size=10 --run_name=my_run \
  --distribution=default --num_workers=4

where prompts.jsonl is one {"prompt": "..."} per line. Output lands in data/my_run/trajectories.jsonl as ShareGPT from/value pairs (system/human/gpt/tool roles).

Step 2 โ€” Compress long runs to fit the 8192 token trainer window:

python trajectory_compressor.py --input=data/my_run/trajectories.jsonl --target_max_tokens=8000

Step 3 โ€” Edit config. Start from tinker-atropos/configs/default.yaml, change tokenizer_name and openai[0].model_name to Qwen/Qwen3-8B, confirm tinker.lora_rank: 32, tinker.learning_rate: 0.00004, env.max_token_length: 8192, env.total_steps: 2500, tinker.save_checkpoint_interval: 25.

Step 4 โ€” Launch (three terminals, or use rl_cli.py to auto-orchestrate):

# Terminal 1
run-api                                                              # Atropos API on :8000
# Terminal 2
python launch_training.py --config configs/qwen3_8b.yaml             # trainer + SGLang on :8001
# Terminal 3
python tinker_atropos/environments/gsm8k_tinker.py serve --config configs/qwen3_8b.yaml

Or simply: python rl_cli.py "Train Qwen3-8B on GSM8K with LoRA rank 32" (spawns all three, 5s โ†’ 120s delays).

Step 5 โ€” Monitor. rl_check_status tool (inside Hermes) returns current WandB metrics. Dashboard at wandb.ai/<user>/hermes-rl. rl_test_inference hits the running policy.

Step 6 โ€” Retrieve weights. Training emits a tinker://<hash> path:

python tinker_atropos/utils/download_weights.py   # saves to ./checkpoints/<hash>/

Load with PeftModel.from_pretrained(base_model, "./checkpoints/<hash>/").

Cost/time (inferred, not official): a full 2500-step Qwen3-8B LoRA run takes ~4โ€“12h wall clock and ~tens of dollars at typical Tinker rates. Smoke-test first with quick_test.yaml (10 steps, Llama-3.2-1B, minutes).

Recipe B โ€” No Tinker (Axolotl / Unsloth fallback)

Everything up to compression works without a Tinker account โ€” you still end up with a ShareGPT JSONL. Then fine-tune offline:

Axolotl (Qwen3-8B, QLoRA rank 32, single 24GB GPU):

# qwen3_8b_lora.yml
base_model: Qwen/Qwen3-8B
load_in_4bit: true
datasets:
  - path: data/my_run/trajectories_compressed.jsonl
    type: sharegpt
    conversation: chatml
adapter: lora
lora_r: 32
lora_alpha: 64
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, down_proj, up_proj]
sequence_len: 8192
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 2e-4
output_dir: ./qwen3-8b-hermes-lora
flash_attention: true
accelerate launch -m axolotl.cli.train qwen3_8b_lora.yml
python -m axolotl.cli.merge_lora qwen3_8b_lora.yml

Unsloth (2โ€“5ร— faster on the same GPU) and TRL (most control) are the other supported routes โ€” see the skills/mlops/training/unsloth/ and trl-fine-tuning/ skills.

What you give up: no online RL reward loop. Axolotl/Unsloth is SFT-only on your captured trajectories. To do RL without Tinker, run GRPO locally via skills/mlops/training/grpo-rl-training/ on your own GPU cluster. For "train Hermes on my own history," SFT is the right default.

Full step-by-step commands, verbatim configs, a concrete ShareGPT trajectory example, and confidence assessment live in raw/ml-research-recipe.md (raw source, outside the wiki).

The whole pipeline ties back to the underlying model โ€” the transcript-247 video notes Hermes is "powered by the Hermes 3 model built on Llama 3.1 and fine-tuned using a reinforcement learning framework called Atropos. The training specifically targets tool calling accuracy and long-range planning."

5. MLOps Skills Library โ€” Bundled skills under skills/mlops/ (see skills system) cover the full MLOps lifecycle:

  • training/axolotl, training/unsloth, training/peft, training/pytorch-fsdp, training/trl-fine-tuning, training/grpo-rl-training
  • inference/vllm, inference/llama-cpp, inference/gguf, inference/sglang (via outlines/guidance/obliteratus)
  • cloud/modal, evaluation/, huggingface-hub/, vector-databases/

Plus research/research-paper-writing โ€” a skill for drafting NeurIPS/ICML/ICLR-style papers with proper LaTeX, related-work tables, and reproducibility checklists. research/arxiv handles paper search and metadata.

Key Parameters

Environment variables (set in ~/.hermes/.env):

  • TINKER_API_KEY โ€” Thinking Machines Tinker managed LoRA training API (key from tinker-console.thinkingmachines.ai/keys)
  • WANDB_API_KEY โ€” Weights & Biases experiment tracking
  • RL_API_URL โ€” Atropos trajectory API endpoint (default http://localhost:8000)
  • HF_TOKEN โ€” HuggingFace push/pull access (for base model / tokenizer download)

Batch runner flags:

  • --dataset_file โ€” JSONL input
  • --batch_size โ€” parallel worker count
  • --run_name โ€” checkpoint key
  • --resume โ€” pick up from last checkpoint
  • --distribution โ€” toolset subset (see toolset_distributions.py)

Skill catalog โ€” discoverable via hermes skills browse, installable via hermes skills install official/mlops/<skill>. The agent autonomously calls skill_view("axolotl") etc. when it detects an MLOps task.

When To Use

  • You are training your own models. This is the entire reason Nous built it this way โ€” you can dogfood the harness that trained Hermes 3.
  • You need to generate SFT/RL datasets at scale. Point batch_runner at a JSONL of prompts and walk away; come back to a Parquet shard ready for datasets.load_dataset.
  • You want offline benchmarking of agent behavior. Run a fixed eval set across providers/models and diff the trajectories.
  • You are writing an ML paper. The research-paper-writing skill plus arxiv plus persistent skills system memory of prior literature reviews compounds across drafts.
  • You want to fine-tune on your own usage. Trajectories captured automatically by Hermes during normal use can be exported and used to train a personalized small model โ€” the loop the self improvement loop hints at but stops short of automating.

Risks & Pitfalls

  • GPU dependency. RL training tools require either a Tinker-backed remote GPU pool or a local CUDA box. The 10 RL tools will register but rl_start_training fails fast on machines without configured GPU access.
  • Trajectory bloat. Long-running agents can produce trajectories in the megabytes โ€” always run trajectory_compressor.py before fine-tuning, otherwise context overflow trashes the run.
  • ShareGPT format drift. Some downstream trainers expect slightly different role names (human/gpt vs user/assistant). Confirm your trainer's loader matches the exporter's schema before kicking off a long fine-tune.
  • Batch runner is not the same as delegate_task. The runner is offline/scripted; delegate_task is interactive subagent spawning (subagents delegation). Mixing them up wastes hours.
  • MLOps skills are LLM-generated by default. Like all skills they may have stale package versions โ€” pin the version in your skill before shipping a reproducible paper.
  • WandB is on by default in the recipe โ€” if you're training a sensitive model, swap to wandb offline or strip the integration before launching.
  • First-day output is usually bad. The course transcript warns "be prepared for the first 7 days, it won't be what you want. It won't do anything perfectly. But, the point is it gets better every time."

Related Concepts

Sources

  • raw/04-tools-mcp-cron-subagents.md โ€” tool registry, batch_runner internals, RL tools list, toolset distributions
  • raw/03-skills-system.md โ€” full MLOps skill catalog under skills/mlops/
  • raw/transcript-hermes-full-course-2hr.txt โ€” practical use, Hermes 3 / Atropos lineage, "built for tinkerers" framing
  • raw/transcript-switching-to-hermes.txt โ€” trajectory capture explanation, ShareGPT export, Nous RL pipeline as differentiator
  • raw/02-install-and-setup.md โ€” TINKER_API_KEY, WANDB_API_KEY, RL_API_URL env vars
  • raw/ml-research-recipe.md โ€” end-to-end fine-tune recipe with verbatim configs, both Tinker and non-Tinker paths, confidence assessment