wikis / Hermes / wiki / concepts / ml-research-pipeline.md view as markdown
ML Research Pipeline
Definition
Hermes ships a complete ML and reinforcement-learning research toolchain inside the agent โ not as an optional add-on, but as a first-class set of tools, skills, and CLI scripts. This is the same harness Nous Research uses to train the Hermes model family, exposed so that any user can capture trajectories, generate datasets at scale, fine-tune small models, run RL training loops, and write conference-grade papers โ all from inside an agent conversation. As one community reviewer put it, Hermes is "built for tinkerers" with "machine learning tools built-in" and "reinforcement learning tools built-in" because Nous "are the people behind Hermes model family. So, they know how models learn because they train them for legitimately a living and that shows in the architecture."
How It Works
The pipeline has four layered components that feed into each other:
1. Trajectory Capture (agent/trajectory.py) โ Every agent run records a full trajectory: every tool call, every decision, every result, in order. Most agent frameworks throw this away when the task ends ("most agent frameworks they just throw that away completely. Like the task is done, the memory's gone. Hermes keeps it instead"). Hermes serializes the trajectory to ShareGPT JSONL โ the de-facto fine-tuning format with from/value pairs that Axolotl, Unsloth, TRL, and HuggingFace datasets all consume natively. This single decision is what makes the self improvement loop possible and what lets a Hermes user export "training data for your own models" without any extra plumbing.
2. Batch Runner (batch_runner.py, 1287 lines) โ A standalone script (separate from the in-agent delegate_task tool โ see subagents delegation) for offline, parallel dataset processing. Spawns a multiprocessing Pool, each worker an independent AIAgent, and writes results checkpoint-resumably:
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run [--resume] [--distribution=image_gen]
Trajectories are saved as ShareGPT-format from/value pairs alongside per-tool success/failure stats normalized via _normalize_tool_stats(). Output is HuggingFace-compatible Arrow/Parquet so the resulting dataset can be dataset.push_to_hub()'d directly, or loaded into Axolotl/Unsloth without conversion. Toolset distributions in /toolset_distributions.py let you preset which tool subset each batch worker is allowed to use (e.g. --distribution=image_gen restricts to image-related tools).
3. Trajectory Compressor (trajectory_compressor.py) โ Long agent runs blow past fine-tuning context windows (16Kโ128K tokens). The compressor takes a recorded trajectory and folds intermediate tool outputs into summarized states so the conversation still fits, while preserving the gradient signal of the actual decision sequence. This is what makes hour-long agent traces usable as SFT data on a 32K-context base model.
4. RL Pipeline โ Tinker + Atropos โ Ten dedicated RL tools registered in tools/registry.py, exposed to the agent itself so it can drive its own training:
| Tool | Purpose |
|---|---|
rl_list_environments |
List available Atropos RL environments |
rl_select_environment |
Pick an environment for a run |
rl_get_current_config / rl_edit_config |
Read/edit the run config |
rl_start_training / rl_stop_training |
Kick off / abort a run |
rl_check_status / rl_list_runs / rl_get_results |
Monitor + retrieve |
rl_test_inference |
Smoke-test the trained policy |
Behind the tools is rl_cli.py, the standalone command-line entry point. The default recipe โ hardcoded in tools/rl_training_tool.py โ is Qwen/Qwen3-8B + LoRA rank 32 + SGLang inference on :8001 + Atropos trajectory API on :8000 + Weights & Biases logging, 2500 training steps, learning rate 4e-5, max_token_length 8192, checkpoint every 25 steps. Tinker is actually Thinking Machines Lab's managed LoRA training API (not Nous's) โ Atropos is Nous's environment library, tinker-atropos is the glue. TINKER_API_KEY, WANDB_API_KEY, and RL_API_URL are the three env vars required (see version v0.8.0 env reference).
Recipe A โ Tinker + Atropos (default)
Prerequisites: Sign up at https://auth.thinkingmachines.ai/sign-up, generate a key at https://tinker-console.thinkingmachines.ai/keys, put TINKER_API_KEY=tnkr_... and WANDB_API_KEY=... in ~/.hermes/.env. Clone github.com/NousResearch/tinker-atropos and pip install -e . in it.
Step 1 โ Generate trajectories. Either use session history already written to trajectory_samples.jsonl, or synthesize a batch:
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=10 --run_name=my_run \
--distribution=default --num_workers=4
where prompts.jsonl is one {"prompt": "..."} per line. Output lands in data/my_run/trajectories.jsonl as ShareGPT from/value pairs (system/human/gpt/tool roles).
Step 2 โ Compress long runs to fit the 8192 token trainer window:
python trajectory_compressor.py --input=data/my_run/trajectories.jsonl --target_max_tokens=8000
Step 3 โ Edit config. Start from tinker-atropos/configs/default.yaml, change tokenizer_name and openai[0].model_name to Qwen/Qwen3-8B, confirm tinker.lora_rank: 32, tinker.learning_rate: 0.00004, env.max_token_length: 8192, env.total_steps: 2500, tinker.save_checkpoint_interval: 25.
Step 4 โ Launch (three terminals, or use rl_cli.py to auto-orchestrate):
# Terminal 1
run-api # Atropos API on :8000
# Terminal 2
python launch_training.py --config configs/qwen3_8b.yaml # trainer + SGLang on :8001
# Terminal 3
python tinker_atropos/environments/gsm8k_tinker.py serve --config configs/qwen3_8b.yaml
Or simply: python rl_cli.py "Train Qwen3-8B on GSM8K with LoRA rank 32" (spawns all three, 5s โ 120s delays).
Step 5 โ Monitor. rl_check_status tool (inside Hermes) returns current WandB metrics. Dashboard at wandb.ai/<user>/hermes-rl. rl_test_inference hits the running policy.
Step 6 โ Retrieve weights. Training emits a tinker://<hash> path:
python tinker_atropos/utils/download_weights.py # saves to ./checkpoints/<hash>/
Load with PeftModel.from_pretrained(base_model, "./checkpoints/<hash>/").
Cost/time (inferred, not official): a full 2500-step Qwen3-8B LoRA run takes ~4โ12h wall clock and ~tens of dollars at typical Tinker rates. Smoke-test first with quick_test.yaml (10 steps, Llama-3.2-1B, minutes).
Recipe B โ No Tinker (Axolotl / Unsloth fallback)
Everything up to compression works without a Tinker account โ you still end up with a ShareGPT JSONL. Then fine-tune offline:
Axolotl (Qwen3-8B, QLoRA rank 32, single 24GB GPU):
# qwen3_8b_lora.yml
base_model: Qwen/Qwen3-8B
load_in_4bit: true
datasets:
- path: data/my_run/trajectories_compressed.jsonl
type: sharegpt
conversation: chatml
adapter: lora
lora_r: 32
lora_alpha: 64
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, down_proj, up_proj]
sequence_len: 8192
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 2e-4
output_dir: ./qwen3-8b-hermes-lora
flash_attention: true
accelerate launch -m axolotl.cli.train qwen3_8b_lora.yml
python -m axolotl.cli.merge_lora qwen3_8b_lora.yml
Unsloth (2โ5ร faster on the same GPU) and TRL (most control) are the other supported routes โ see the skills/mlops/training/unsloth/ and trl-fine-tuning/ skills.
What you give up: no online RL reward loop. Axolotl/Unsloth is SFT-only on your captured trajectories. To do RL without Tinker, run GRPO locally via skills/mlops/training/grpo-rl-training/ on your own GPU cluster. For "train Hermes on my own history," SFT is the right default.
Full step-by-step commands, verbatim configs, a concrete ShareGPT trajectory example, and confidence assessment live in raw/ml-research-recipe.md (raw source, outside the wiki).
The whole pipeline ties back to the underlying model โ the transcript-247 video notes Hermes is "powered by the Hermes 3 model built on Llama 3.1 and fine-tuned using a reinforcement learning framework called Atropos. The training specifically targets tool calling accuracy and long-range planning."
5. MLOps Skills Library โ Bundled skills under skills/mlops/ (see skills system) cover the full MLOps lifecycle:
training/axolotl,training/unsloth,training/peft,training/pytorch-fsdp,training/trl-fine-tuning,training/grpo-rl-traininginference/vllm,inference/llama-cpp,inference/gguf,inference/sglang(via outlines/guidance/obliteratus)cloud/modal,evaluation/,huggingface-hub/,vector-databases/
Plus research/research-paper-writing โ a skill for drafting NeurIPS/ICML/ICLR-style papers with proper LaTeX, related-work tables, and reproducibility checklists. research/arxiv handles paper search and metadata.
Key Parameters
Environment variables (set in ~/.hermes/.env):
TINKER_API_KEYโ Thinking Machines Tinker managed LoRA training API (key fromtinker-console.thinkingmachines.ai/keys)WANDB_API_KEYโ Weights & Biases experiment trackingRL_API_URLโ Atropos trajectory API endpoint (defaulthttp://localhost:8000)HF_TOKENโ HuggingFace push/pull access (for base model / tokenizer download)
Batch runner flags:
--dataset_fileโ JSONL input--batch_sizeโ parallel worker count--run_nameโ checkpoint key--resumeโ pick up from last checkpoint--distributionโ toolset subset (seetoolset_distributions.py)
Skill catalog โ discoverable via hermes skills browse, installable via hermes skills install official/mlops/<skill>. The agent autonomously calls skill_view("axolotl") etc. when it detects an MLOps task.
When To Use
- You are training your own models. This is the entire reason Nous built it this way โ you can dogfood the harness that trained Hermes 3.
- You need to generate SFT/RL datasets at scale. Point batch_runner at a JSONL of prompts and walk away; come back to a Parquet shard ready for
datasets.load_dataset. - You want offline benchmarking of agent behavior. Run a fixed eval set across providers/models and diff the trajectories.
- You are writing an ML paper. The
research-paper-writingskill plusarxivplus persistent skills system memory of prior literature reviews compounds across drafts. - You want to fine-tune on your own usage. Trajectories captured automatically by Hermes during normal use can be exported and used to train a personalized small model โ the loop the self improvement loop hints at but stops short of automating.
Risks & Pitfalls
- GPU dependency. RL training tools require either a Tinker-backed remote GPU pool or a local CUDA box. The 10 RL tools will register but
rl_start_trainingfails fast on machines without configured GPU access. - Trajectory bloat. Long-running agents can produce trajectories in the megabytes โ always run
trajectory_compressor.pybefore fine-tuning, otherwise context overflow trashes the run. - ShareGPT format drift. Some downstream trainers expect slightly different role names (
human/gptvsuser/assistant). Confirm your trainer's loader matches the exporter's schema before kicking off a long fine-tune. - Batch runner is not the same as
delegate_task. The runner is offline/scripted;delegate_taskis interactive subagent spawning (subagents delegation). Mixing them up wastes hours. - MLOps skills are LLM-generated by default. Like all skills they may have stale package versions โ pin the version in your skill before shipping a reproducible paper.
- WandB is on by default in the recipe โ if you're training a sensitive model, swap to
wandb offlineor strip the integration before launching. - First-day output is usually bad. The course transcript warns "be prepared for the first 7 days, it won't be what you want. It won't do anything perfectly. But, the point is it gets better every time."
Related Concepts
- self improvement loop โ trajectory capture is the mechanism behind GAPA and skill auto-creation
- skills system โ the MLOps skills library and
research-paper-writinglive here - subagents delegation โ
delegate_task(in-agent) vsbatch_runner.py(offline) - local models airplane mode โ vLLM, SGLang, llama.cpp inference skills double as local serving for trained models
- nous research โ the lab and its model family
- hermes self evolution
- version v0.8.0 โ current verified version
Sources
raw/04-tools-mcp-cron-subagents.mdโ tool registry, batch_runner internals, RL tools list, toolset distributionsraw/03-skills-system.mdโ full MLOps skill catalog underskills/mlops/raw/transcript-hermes-full-course-2hr.txtโ practical use, Hermes 3 / Atropos lineage, "built for tinkerers" framingraw/transcript-switching-to-hermes.txtโ trajectory capture explanation, ShareGPT export, Nous RL pipeline as differentiatorraw/02-install-and-setup.mdโTINKER_API_KEY,WANDB_API_KEY,RL_API_URLenv varsraw/ml-research-recipe.mdโ end-to-end fine-tune recipe with verbatim configs, both Tinker and non-Tinker paths, confidence assessment
