OrchRM

OrchRM is a reward model for multi-agent orchestration systems, trained on preference data across three domains: math reasoning (DeepScaler), multi-hop QA (HotpotQA), and web-based research (BrowseComp+). It is integrated into a GRPO training pipeline to improve a multi-agent system's ability to plan and execute complex tasks.

Repository Structure

OrchRM/
├── reward_model/            # RM training, inference, tokenization
│   └── configs/             # training config (YAML)
├── grpo/                    # GRPO training pipeline
│   ├── reward_function.py   # custom reward function (RM server / LLM judge / exact match)
│   ├── make_dataset.py      # build parquet datasets for verl
│   └── scripts/run.sh       # training launch script
├── mas_orchestra/           # multi-agent system (agents, rewards, trainer, prompts)
├── utils/
│   └── prompts/             # prompt templates (system, user, developer, block-level)
│       ├── make_prompt.py   # prompt builder entry point
│       ├── init_archive.py  # agent block archive
│       ├── blocks_harmony/  # CoT, CoT-SC, Reflexion, Debate, WebSearch blocks
│       ├── system_prompts/
│       ├── user_prompts/
│       └── developer_prompts/
└── verl/                    # modified verl fork (GRPO/PPO training backend)

Dataset

The preference dataset used to train OrchRM is publicly available on Hugging Face:

tsangkingyeung/OrchRM-datasets

Split	Domain	Rows
`deepscaler_train`	Math (DeepScaler)	4,587
`deepscaler_val`	Math (DeepScaler)	531
`hotpotqa_train`	Multi-hop QA (HotpotQA)	16,932
`hotpotqa_val`	Multi-hop QA (HotpotQA)	1,828
`browsecomp_train`	Web Research (BrowseComp+)	3,706
`browsecomp_val`	Web Research (BrowseComp+)	387

Fields: input (question), chosen (preferred response), rejected (dispreferred response), source (correct-over-incorrect | specialized-over-base)

from datasets import load_dataset

ds = load_dataset("tsangkingyeung/OrchRM-datasets", split="deepscaler_train")
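Once a split is loaded, a quick sanity check on its records can catch schema drift early. The sketch below works on plain dictionaries, so it runs without downloading anything. The row counts, field names, and source labels are copied from the table and field list above; the helper names (total_rows, check_record, count_sources) are hypothetical and not part of the repository.

```python
from collections import Counter

# Row counts copied from the split table above.
SPLIT_ROWS = {
    "deepscaler_train": 4587,
    "deepscaler_val": 531,
    "hotpotqa_train": 16932,
    "hotpotqa_val": 1828,
    "browsecomp_train": 3706,
    "browsecomp_val": 387,
}

# Field names and source labels as listed under "Fields".
REQUIRED_FIELDS = ("input", "chosen", "rejected", "source")
KNOWN_SOURCES = {"correct-over-incorrect", "specialized-over-base"}


def total_rows(suffix):
    """Sum the row counts of every split whose name ends with `suffix`."""
    return sum(n for name, n in SPLIT_ROWS.items() if name.endswith(suffix))


def check_record(record):
    """Return a list of problems found in one preference record (empty if none)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    if "source" in record and record["source"] not in KNOWN_SOURCES:
        problems.append(f"unexpected source: {record['source']}")
    return problems


def count_sources(records):
    """Count how many records carry each `source` label."""
    return Counter(r["source"] for r in records)
```

Iterating over a loaded split yields one dictionary per row, so count_sources(ds) and check_record(ds[0]) can be applied directly to the ds object from the snippet above.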

Installation

# 1. Install verl
cd verl
pip install --no-deps -e .
pip install ray==2.49.2 --force-reinstall
pip install protobuf==4.25.8 --force-reinstall
cd ..

# 2. Install OrchRM
pip install -e .

# 3. Install remaining dependencies
pip install -r requirements.txt

For web-search agent support (BrowseComp+), also install:

pip install langchain-core langchain-together langchain-community \
    duckduckgo-search tavily-python ddgs langchain_brightdata bs4 \
    pyserini faiss-gpu
pip install git+https://github.com/texttron/tevatron.git

Training the Reward Model

Step 1 — Tokenize the dataset offline (optional but recommended for speed):

python reward_model/tokenize.py \
    --train-path /path/to/train.jsonl \
    --val-path   /path/to/val.jsonl \
    --tokenizer-path Skywork/Skywork-Reward-Llama-3.1-8B \
    --out-dir    /path/to/tokenized_output \
    --max-length 7168

Step 2 — Train with LoRA:

# Edit reward_model/configs/train_lora.yaml to set paths, then:
python reward_model/train.py reward_model/configs/train_lora.yaml

# Or pass overrides inline:
python reward_model/train.py reward_model/configs/train_lora.yaml \
    model.base_model=Skywork/Skywork-Reward-Llama-3.1-8B \
    train.rolling_dir=./outputs/orchrm
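The inline overrides above use dotted keys that address nested sections of the YAML config. How reward_model/train.py parses them is not shown in this README, so the following is only a minimal sketch of the idea, assuming each override is a key=value string and that values are kept as strings. The function name apply_overrides is hypothetical.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with `dotted.key=value` overrides applied.

    Assumption: values stay as plain strings; a real loader may cast types.
    """
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = result
        # Walk (and create, if missing) each intermediate section.
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result
```

With this sketch, "train.rolling_dir=./outputs/orchrm" sets config["train"]["rolling_dir"] and leaves every other key untouched.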

Key config fields in reward_model/configs/train_lora.yaml:

model.base_model: base reward model (e.g. Skywork/Skywork-Reward-Llama-3.1-8B)
data.tokenized_dir: path to pre-tokenized dataset (or set data.train_file / data.val_file for raw JSONL)
train.rolling_dir: output directory for checkpoints
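For intuition about what training on chosen/rejected pairs optimizes, here is a small sketch of the Bradley-Terry pairwise loss that is commonly used for preference-trained reward models. This is an assumption for illustration: the README does not state which loss reward_model/train.py uses, and the function names below are hypothetical.

```python
import math


def bradley_terry_loss(chosen_score, rejected_score):
    """Pairwise loss -log(sigmoid(chosen - rejected)), computed stably.

    Assumption: a standard Bradley-Terry objective, not necessarily the
    exact loss used by this repository.
    """
    margin = chosen_score - rejected_score
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(score_pairs):
    """Fraction of (chosen, rejected) score pairs where chosen ranks higher."""
    score_pairs = list(score_pairs)
    if not score_pairs:
        raise ValueError("need at least one score pair")
    return sum(c > r for c, r in score_pairs) / len(score_pairs)
```

The loss shrinks as the chosen response is scored further above the rejected one, and pairwise accuracy on a validation split is a simple way to track whether that ordering is being learned.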

Step 3 — Run inference:

python reward_model/inference.py \
    --input  /path/to/rollout/ \
    --output /path/to/scores/ \
    --model-path      /path/to/trained/orchrm \
    --qwen-model-path Qwen/Qwen2.5-7B-Instruct \
    --dataset-name    math \
    --gpu-ids         0,1

GRPO Training with OrchRM

Step 1 — Build training parquets:

python grpo/make_dataset.py \
    --output-dir        ./datasets/grpo \
    --qwen-tokenizer-path Qwen/Qwen2.5-7B-Instruct \
    --problem-type      harmony_medium

Step 2 — Run GRPO:

export MODEL_PATH=/path/to/base/model        # actor model (e.g. Qwen2.5-7B-Instruct)
export RM_MODEL_PATH=/path/to/trained/orchrm # trained reward model

bash grpo/scripts/run.sh \
    ./datasets/grpo/train/train.parquet \
    ./datasets/grpo/train/val.parquet

See grpo/scripts/run.sh for the full list of configurable environment variables (LORA_RANK, GPU_IDS, TRAIN_BATCH_SIZE, TOTAL_EPOCHS, etc.).
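To see how reward scores drive a GRPO update, the sketch below computes group-relative advantages in the standard GRPO formulation: each rollout's reward is normalized against the other rollouts sampled for the same prompt. The function name, the eps value, and the use of the population standard deviation are assumptions; the advantage estimator in the bundled verl fork may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rollout rewards to zero mean and unit scale.

    Assumption: population standard deviation plus a small `eps` for
    numerical safety; a real backend may use a different estimator.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason a graded reward model score can be more informative than a binary exact-match check.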

Reward Function

grpo/reward_function.py supports three reward paths, tried in priority order:

1. LLM judge: set OPENAI_API_KEY, and optionally LLM_JUDGE_MODEL (default: gpt-5-mini).
2. RM server: set reward_router_address to a running vLLM classify endpoint backed by OrchRM.
3. Exact match: the fallback when neither of the above is configured.
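The priority order described above can be sketched as a small dispatcher. This is an illustration of the selection logic only, not the code in grpo/reward_function.py: the function names, the returned labels, and the whitespace and case normalization in exact_match are all assumptions. Only OPENAI_API_KEY, LLM_JUDGE_MODEL, reward_router_address, and the gpt-5-mini default come from the text above.

```python
import os


def select_reward_path(reward_router_address=None, env=None):
    """Pick a reward path in the documented priority order."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "llm_judge"       # highest priority: LLM judge
    if reward_router_address:
        return "rm_server"       # next: OrchRM served behind a classify endpoint
    return "exact_match"         # fallback when neither is configured


def judge_model(env=None):
    """Return the judge model name, using the documented default."""
    env = os.environ if env is None else env
    return env.get("LLM_JUDGE_MODEL", "gpt-5-mini")


def exact_match(prediction, reference):
    """Hypothetical fallback scorer: 1.0 on a normalized string match, else 0.0."""
    same = prediction.strip().lower() == reference.strip().lower()
    return 1.0 if same else 0.0
```

Passing env explicitly keeps the sketch easy to test; calling select_reward_path() with no arguments reads the real process environment.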

Acknowledgements

Multi-agent orchestration framework adapted from MAS-Orchestra (Salesforce AI Research)
Reward model training based on Skywork-Reward
GRPO training backend: verl

License

This project is licensed under CC BY-NC 4.0 — free for research and non-commercial use. For commercial use, please contact kingyeung.tsang@gmail.com.

Third-party components (verl, Skywork-Reward, MAS-Orchestra) retain their original Apache 2.0 licenses.

Contact

For questions or feedback, feel free to open a GitHub Issue or reach out directly:

King Yeung Tsang — kingyeung.tsang@gmail.com

Citation

@article{tsang2026orchrm,
  title   = {Reward Modeling for Multi-Agent Orchestration},
  author  = {Tsang, King Yeung and Zhao, Zihao and Venkataramani, Vishal and Shi, Haizhou and Ke, Zixuan and Yavuz, Semih and Joty, Shafiq and Wang, Hao},
  journal = {arXiv preprint arXiv:2606.13598},
  year    = {2026},
  archivePrefix = {arXiv},
  eprint  = {2606.13598},
  primaryClass = {cs.AI},
  doi      = {10.48550/arXiv.2606.13598},
  url      = {https://arxiv.org/abs/2606.13598}
}
