OrchRM is a reward model for multi-agent orchestration systems, trained on preference data across three domains: math reasoning (DeepScaler), multi-hop QA (HotpotQA), and web-based research (BrowseComp+). It is integrated into a GRPO training pipeline to improve a multi-agent system's ability to plan and execute complex tasks.
OrchRM/
├── reward_model/              # RM training, inference, tokenization
│   └── configs/               # training config (YAML)
├── grpo/                      # GRPO training pipeline
│   ├── reward_function.py     # custom reward function (RM server / LLM judge / exact match)
│   ├── make_dataset.py        # build parquet datasets for verl
│   └── scripts/run.sh         # training launch script
├── mas_orchestra/             # multi-agent system (agents, rewards, trainer, prompts)
├── utils/
│   └── prompts/               # prompt templates (system, user, developer, block-level)
│       ├── make_prompt.py     # prompt builder entry point
│       ├── init_archive.py    # agent block archive
│       ├── blocks_harmony/    # CoT, CoT-SC, Reflexion, Debate, WebSearch blocks
│       ├── system_prompts/
│       ├── user_prompts/
│       └── developer_prompts/
└── verl/                      # modified verl fork (GRPO/PPO training backend)
The preference dataset used to train OrchRM is publicly available on Hugging Face:
tsangkingyeung/OrchRM-datasets
| Split | Domain | Rows |
|---|---|---|
| deepscaler_train | Math (DeepScaler) | 4,587 |
| deepscaler_val | Math (DeepScaler) | 531 |
| hotpotqa_train | Multi-hop QA (HotpotQA) | 16,932 |
| hotpotqa_val | Multi-hop QA (HotpotQA) | 1,828 |
| browsecomp_train | Web Research (BrowseComp+) | 3,706 |
| browsecomp_val | Web Research (BrowseComp+) | 387 |
Fields: input (question), chosen (preferred response), rejected (dispreferred response), source (correct-over-incorrect | specialized-over-base)
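As a quick illustration of the schema, the sketch below checks that a record carries the four fields listed above and tallies records by `source`. The helper names (`validate_record`, `count_by_source`) and the sample rows are hypothetical and made up for this example; they are not part of the OrchRM codebase or the dataset. It also assumes the two `source` labels are spelled exactly as in the field list.

```python
from collections import Counter

REQUIRED_FIELDS = ("input", "chosen", "rejected", "source")
# Assumption: the labels are spelled exactly as in the field description.
KNOWN_SOURCES = {"correct-over-incorrect", "specialized-over-base"}


def validate_record(record: dict) -> None:
    """Raise ValueError if a preference record is incomplete or degenerate."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if record["source"] not in KNOWN_SOURCES:
        raise ValueError(f"unknown source: {record['source']!r}")
    if record["chosen"] == record["rejected"]:
        raise ValueError("chosen and rejected responses are identical")


def count_by_source(records) -> Counter:
    """Validate every record, then count how many come from each source."""
    counts = Counter()
    for record in records:
        validate_record(record)
        counts[record["source"]] += 1
    return counts


# Made-up rows for illustration only; real rows come from the dataset.
sample = [
    {"input": "2 + 2 = ?", "chosen": "4", "rejected": "5",
     "source": "correct-over-incorrect"},
    {"input": "Capital of France?", "chosen": "Paris", "rejected": "Lyon",
     "source": "specialized-over-base"},
]
print(count_by_source(sample))
```

The same two functions accept any iterable of dict-like rows, so they could be pointed at a loaded split instead of `sample`.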
from datasets import load_dataset
ds = load_dataset("tsangkingyeung/OrchRM-datasets", split="deepscaler_train")

# 1. Install verl
cd verl
pip install --no-deps -e .
pip install ray==2.49.2 --force-reinstall
pip install protobuf==4.25.8 --force-reinstall
cd ..
# 2. Install OrchRM
pip install -e .
# 3. Install remaining dependencies
pip install -r requirements.txt

For web-search agent support (BrowseComp+), also install:
pip install langchain-core langchain-together langchain-community \
duckduckgo-search tavily-python ddgs langchain_brightdata bs4 \
pyserini faiss-gpu
pip install git+https://github.com/texttron/tevatron.git

Step 1: Tokenize the dataset offline (optional but recommended for speed):
python reward_model/tokenize.py \
--train-path /path/to/train.jsonl \
--val-path /path/to/val.jsonl \
--tokenizer-path Skywork/Skywork-Reward-Llama-3.1-8B \
--out-dir /path/to/tokenized_output \
--max-length 7168

Step 2: Train with LoRA:
# Edit reward_model/configs/train_lora.yaml to set paths, then:
python reward_model/train.py reward_model/configs/train_lora.yaml
# Or pass overrides inline:
python reward_model/train.py reward_model/configs/train_lora.yaml \
model.base_model=Skywork/Skywork-Reward-Llama-3.1-8B \
train.rolling_dir=./outputs/orchrm

Key config fields in reward_model/configs/train_lora.yaml:
- `model.base_model`: base reward model (e.g. `Skywork/Skywork-Reward-Llama-3.1-8B`)
- `data.tokenized_dir`: path to pre-tokenized dataset (or set `data.train_file` / `data.val_file` for raw JSONL)
- `train.rolling_dir`: output directory for checkpoints
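The inline overrides use a dotted `section.key=value` syntax. How `reward_model/train.py` actually parses them is not documented here, so the following is only a minimal sketch of the semantics that syntax suggests: each override replaces one leaf in a nested config. The function name `apply_overrides` and the placeholder base values are hypothetical.

```python
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, sep, value = item.partition("=")
        if not sep or not dotted_key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(name, {})
            if not isinstance(node, dict):
                raise ValueError(f"{name!r} in {dotted_key!r} is not a section")
        node[leaf] = value  # values stay strings in this sketch
    return result


# Stand-in for the parsed YAML file; the placeholder values are made up.
base = {
    "model": {"base_model": "placeholder"},
    "train": {"rolling_dir": "./outputs/default"},
}
merged = apply_overrides(base, [
    "model.base_model=Skywork/Skywork-Reward-Llama-3.1-8B",
    "train.rolling_dir=./outputs/orchrm",
])
print(merged["train"]["rolling_dir"])  # ./outputs/orchrm
```

The sketch copies the base config rather than mutating it, so the values read from the YAML file stay available for comparison or logging.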
Step 3: Run inference:
python reward_model/inference.py \
--input /path/to/rollout/ \
--output /path/to/scores/ \
--model-path /path/to/trained/orchrm \
--qwen-model-path Qwen/Qwen2.5-7B-Instruct \
--dataset-name math \
--gpu-ids 0,1

Step 1: Build training parquets:
python grpo/make_dataset.py \
--output-dir ./datasets/grpo \
--qwen-tokenizer-path Qwen/Qwen2.5-7B-Instruct \
--problem-type harmony_medium

Step 2: Run GRPO:
export MODEL_PATH=/path/to/base/model # actor model (e.g. Qwen2.5-7B-Instruct)
export RM_MODEL_PATH=/path/to/trained/orchrm # trained reward model
bash grpo/scripts/run.sh \
./datasets/grpo/train/train.parquet \
./datasets/grpo/train/val.parquet

See grpo/scripts/run.sh for all configurable env vars (LORA_RANK, GPU_IDS, TRAIN_BATCH_SIZE, TOTAL_EPOCHS, etc.).
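For readers new to GRPO, the reward scores feed into a group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same prompt. The sketch below shows that normalization in plain Python. It is not taken from the verl fork in this repo; the function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize one prompt's rollout rewards to zero mean and roughly unit scale."""
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumption)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(reward - mean) / (std + eps) for reward in rewards]


# Four hypothetical rollouts for the same prompt, already scored.
advantages = group_relative_advantages([0.9, 0.1, 0.5, 0.5])
print([round(a, 3) for a in advantages])
```

Rollouts scored above the group mean get a positive advantage and those below get a negative one; a group with identical rewards yields all zeros and so contributes no learning signal.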
grpo/reward_function.py supports three reward paths, tried in priority order:
- LLM judge: set `OPENAI_API_KEY` and optionally `LLM_JUDGE_MODEL` (default: `gpt-5-mini`)
- RM server: set `reward_router_address` to a running vLLM classify endpoint backed by OrchRM
- Exact match: fallback when neither of the above is available
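The priority order above can be sketched as a small dispatch function. This is an illustration, not the code in `grpo/reward_function.py`: the function names, the returned labels, the way `reward_router_address` is passed in, and the whitespace/case normalization used for exact match are all assumptions.

```python
from typing import Mapping, Optional


def select_reward_path(
    env: Mapping[str, str],
    reward_router_address: Optional[str] = None,
) -> str:
    """Pick a reward path following the documented priority order."""
    if env.get("OPENAI_API_KEY"):
        return "llm_judge"
    if reward_router_address:
        return "rm_server"
    return "exact_match"


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    """Fallback reward: 1.0 when the normalized answers agree, else 0.0."""
    return 1.0 if _normalize(prediction) == _normalize(reference) else 0.0


# In practice `env` would be os.environ; plain dicts keep the example deterministic.
print(select_reward_path({"OPENAI_API_KEY": "dummy-key"}, "http://localhost:8000"))
print(select_reward_path({}, "http://localhost:8000"))
print(select_reward_path({}))
print(exact_match("  Paris ", "paris"))  # 1.0
```

Passing the environment in as an argument, rather than reading `os.environ` inside the function, makes the selection logic easy to test without touching real credentials.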
- Multi-agent orchestration framework adapted from MAS-Orchestra (Salesforce AI Research)
- Reward model training based on Skywork-Reward
- GRPO training backend: verl
This project is licensed under CC BY-NC 4.0 — free for research and non-commercial use. For commercial use, please contact kingyeung.tsang@gmail.com.
Third-party components (verl, Skywork-Reward, MAS-Orchestra) retain their original Apache 2.0 licenses.
For questions or feedback, feel free to open a GitHub Issue or reach out directly:
- King Yeung Tsang — kingyeung.tsang@gmail.com
@article{tsang2026orchrm,
title = {Reward Modeling for Multi-Agent Orchestration},
author = {Tsang, King Yeung and Zhao, Zihao and Venkataramani, Vishal and Shi, Haizhou and Ke, Zixuan and Yavuz, Semih and Joty, Shafiq and Wang, Hao},
journal = {arXiv preprint arXiv:2606.13598},
year = {2026},
archivePrefix = {arXiv},
eprint = {2606.13598},
primaryClass = {cs.AI},
doi = {10.48550/arXiv.2606.13598},
url = {https://arxiv.org/abs/2606.13598}
}