Seohyun Lee1,* Seoung Choi1,* Dohwan Ko2,* Jongha Kim2 Hyunwoo J. Kim1,†
1 KAIST 2 Korea University (* equal contribution † corresponding author)
TL;DR — VideoSearch-R1 is an agentic framework that unifies inter-video retrieval and intra-video reasoning through multi-turn interaction with a video search engine. We introduce Soft Query Refinement (SQR), which refines query tokens in a continuous latent space instead of rewriting text, and train it with GRPO. VideoSearch-R1 reaches state-of-the-art Video Corpus Moment Retrieval (VCMR) on three benchmarks while using far fewer generated tokens than text-level refinement.
🎉 VideoSearch-R1 is accepted to ECCV 2026.
Code released.
Trained model checkpoints released.
Dataset release coming soon.
Paper preprint coming soon.
As video corpora grow in scale and task complexity, real applications need both inter-video reasoning (retrieving the right video from a large corpus) and intra-video reasoning (fine-grained, query-conditioned tasks such as temporal grounding). Existing pipelines treat retrieval as a one-shot preprocessing step, so a retrieval failure dooms the downstream reasoning; recent video agents often assume the relevant video is already given, bypassing retrieval entirely.
VideoSearch-R1 closes this gap with an iterative retrieve → verify → refine → ground loop:
- Retrieve: query a video search engine (Qwen3-VL-Embedding-2B) and return the top-1 candidate from a large-scale corpus.
- Verify: reason over the retrieved video and decide match / not match, emitting a reasoning trace.
- Soft Query Refinement (SQR): if not matched, generate N = 8 soft query tokens in latent space and append them to the original query, then re-retrieve.
- Temporal Grounding: on a match, predict the precise start/end timestamps of the query-relevant moment.
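The control flow of this loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `iterative_search`, `SearchResult`, the four callables, and `max_turns` are hypothetical names, and the toy stand-ins below replace the real search engine and model. In particular, whether soft tokens from earlier turns are kept or replaced is an assumption here (they are replaced).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Query = List[Vector]  # one embedding per query token (hypothetical layout)


@dataclass
class SearchResult:
    video_id: Optional[str]              # None if no candidate was verified
    span: Optional[Tuple[float, float]]  # (start, end) in seconds
    turns: int                           # retrieval turns used


def iterative_search(
    query: Query,
    retrieve: Callable[[Query], str],
    verify: Callable[[Query, str], bool],
    refine: Callable[[Query, str], List[Vector]],
    ground: Callable[[Query, str], Tuple[float, float]],
    max_turns: int = 3,  # assumed turn budget, not taken from the paper
) -> SearchResult:
    current = list(query)
    for turn in range(1, max_turns + 1):
        video_id = retrieve(current)        # top-1 candidate
        if verify(query, video_id):         # match / not match
            return SearchResult(video_id, ground(query, video_id), turn)
        # Not matched: append fresh soft tokens to the ORIGINAL query.
        current = list(query) + refine(current, video_id)
    return SearchResult(None, None, max_turns)


# Toy stand-ins so the sketch runs end to end.
N_SOFT_TOKENS = 8


def toy_retrieve(q: Query) -> str:
    # Pretend the refined (longer) query retrieves the right video.
    return "video_b" if len(q) > 2 else "video_a"


def toy_verify(q: Query, video_id: str) -> bool:
    return video_id == "video_b"


def toy_refine(q: Query, video_id: str) -> List[Vector]:
    return [[0.0, 0.0] for _ in range(N_SOFT_TOKENS)]


def toy_ground(q: Query, video_id: str) -> Tuple[float, float]:
    return (12.0, 18.5)


result = iterative_search(
    [[1.0, 0.0], [0.0, 1.0]], toy_retrieve, toy_verify, toy_refine, toy_ground
)
print(result)
```

In the toy run the first retrieval fails verification, the query grows by eight soft tokens, and the second retrieval is accepted and grounded.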
Unlike hard query refinement (rewriting the query as text), SQR adjusts the query representation directly. The soft tokens are trained with an InfoNCE retrieval objective for richer discriminative supervision, and the whole loop is optimized with GRPO under format, verification, retrieval, and temporal-grounding rewards — reaching superior retrieval with just 8 latent tokens instead of 26.8 rewritten text tokens.
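For readers unfamiliar with InfoNCE, the following is a minimal pure-Python sketch of the generic loss for one refined query against one positive video and a set of negatives. It is not the training code from this repository: the function names, the cosine-similarity scoring, and the temperature of 0.07 are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Vector) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(
    query: Vector,
    positive: Vector,
    negatives: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Return -log softmax score of the positive among all candidates."""
    q = l2_normalize(query)
    # Index 0 holds the positive; the rest are negatives.
    logits = [dot(q, l2_normalize(v)) / temperature for v in [positive, *negatives]]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]


# A query aligned with its positive yields a much smaller loss than a
# query aligned with a negative.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned, misaligned)
```

Because every negative contributes to the denominator, the loss gives a graded signal about how well the refined query separates the target video from the rest of the batch.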
Results are reported on three metrics: Video Corpus Moment Retrieval (VCMR, reported as IoU/R@1), verification accuracy (VER), and video retrieval recall (VR).
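As a rough guide to how an IoU-based R@1 score for VCMR is commonly computed, the sketch below counts a prediction as a hit only when the top-1 video is correct and the predicted span overlaps the ground truth by at least a threshold. The record layout (`video_id`, `span`) and the 0.5 default threshold are assumptions for illustration; the repository's report script may differ.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def vcmr_recall_at_1(
    predictions: List[Dict],
    references: List[Dict],
    iou_threshold: float = 0.5,  # assumed threshold
) -> float:
    """Fraction of queries with the right video AND a sufficiently overlapping span."""
    hits = 0
    for pred, ref in zip(predictions, references):
        same_video = pred["video_id"] == ref["video_id"]
        if same_video and temporal_iou(pred["span"], ref["span"]) >= iou_threshold:
            hits += 1
    return hits / len(references) if references else 0.0


predictions = [
    {"video_id": "a", "span": (0.0, 10.0)},  # right video, exact span
    {"video_id": "b", "span": (0.0, 10.0)},  # right video, weak overlap
    {"video_id": "x", "span": (0.0, 10.0)},  # wrong video
]
references = [
    {"video_id": "a", "span": (0.0, 10.0)},
    {"video_id": "b", "span": (5.0, 15.0)},
    {"video_id": "c", "span": (0.0, 10.0)},
]
score = vcmr_recall_at_1(predictions, references)
print(score)
```

Only the first query counts as a hit here: the second has the right video but an IoU of 1/3, and the third retrieved the wrong video.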
SQR lifts video retrieval despite using the same search engine, and consistently improves VCMR and verification over zero-shot baselines. See the project page for analyses and qualitative examples.
VideoSearch-R1 provides three click-through paths:
- Quick Start — download prepared data and run inference with released checkpoints.
- Quick Training — download prepared data, run Stage 1 SFT, then Stage 2 GRPO.
- Start From Scratch — rebuild data artifacts from raw annotations/videos.
Supported dataset aliases are didemo, charades, and activitynet.
⚙️ Installation & Environment
conda create -n videosearchr1 python=3.11.14 -y
conda activate videosearchr1
# CUDA 12.8 system install, if needed:
# apt-get install -y cuda-toolkit-12-8
# update-alternatives --set cuda /usr/local/cuda-12.8
export MAX_JOBS=8
pip install -U pip
pip install -r requirements.txt \
--extra-index-url https://download.pytorch.org/whl/cu128 \
--no-build-isolation
pip install -e .
(coming soon) 📦 Prepared Artifacts (Datasets & Checkpoints)
Prepared artifacts are hosted under VideoSearchR1.
Datasets
- hf://buckets/VideoSearchR1/data/datasets/activitynet
- hf://buckets/VideoSearchR1/data/datasets/didemo
- hf://buckets/VideoSearchR1/data/datasets/charades-sta
The bucket shards include released annotations, query/video embeddings, FAISS indices, SFT/GRPO training JSONL files, and video_npy_with_meta tensors.
Checkpoints
- VideoSearchR1/didemo-sft
- VideoSearchR1/didemo-grpo
- VideoSearchR1/charades-sft
- VideoSearchR1/charades-grpo
ActivityNet checkpoints can be added later with the same aliases: activitynet-sft, activitynet-grpo.
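The naming scheme above can be captured in a small lookup, which is handy when scripting over several datasets. The helper names below are hypothetical; only the alias names, bucket paths, and checkpoint ids come from the lists above. Note that the `charades` alias maps to the `charades-sta` bucket directory.

```python
# Dataset alias -> directory name inside the bucket (taken from the lists above).
DATASET_DIRS = {
    "activitynet": "activitynet",
    "didemo": "didemo",
    "charades": "charades-sta",
}
BUCKET_ROOT = "hf://buckets/VideoSearchR1/data/datasets"
STAGES = ("sft", "grpo")


def dataset_uri(alias: str) -> str:
    """Bucket location of the prepared data for a dataset alias."""
    if alias not in DATASET_DIRS:
        raise ValueError(f"unknown dataset alias: {alias!r}")
    return f"{BUCKET_ROOT}/{DATASET_DIRS[alias]}"


def checkpoint_id(alias: str, stage: str) -> str:
    """Checkpoint id following the <alias>-<stage> pattern."""
    if alias not in DATASET_DIRS:
        raise ValueError(f"unknown dataset alias: {alias!r}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"VideoSearchR1/{alias}-{stage}"


print(dataset_uri("charades"))
print(checkpoint_id("didemo", "grpo"))
```

The ActivityNet ids follow the same pattern, but a well-formed id does not imply that the checkpoint has been published yet.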
(coming soon) ⚡ Quick Start: Inference with Released Checkpoints
Download the prepared data for the dataset you want to evaluate:
bash scripts/data_construct/download_preextracted_data.bash didemo
Run inference on GPU 0. The script downloads the released Hugging Face checkpoint automatically.
EVAL_GPUS=0 bash scripts/inference/inference.bash didemo
Charades uses the same command shape:
bash scripts/data_construct/download_preextracted_data.bash charades
EVAL_GPUS=0 bash scripts/inference/inference.bash charades
Use a custom checkpoint from local disk or Hugging Face:
EVAL_GPUS=0 bash scripts/inference/inference.bash didemo --checkpoint /path/to/checkpoint
EVAL_GPUS=0 bash scripts/inference/inference.bash charades --checkpoint VideoSearchR1/charades-sft
The inference command writes .json and .jsonl outputs under the checkpoint log directory. Generate metrics and a compact result JSON with:
bash scripts/inference/report.bash /path/to/external_verified_test_temporal_grounding_checkpoint-XXXX.jsonl
(coming soon) 🏋️ Quick Training: Prepared Data → Stage 1 → Stage 2
Download prepared data:
bash scripts/data_construct/download_preextracted_data.bash didemo
Stage 1 trains the SFT model from the default Qwen3-VL base model:
GPUS=0 bash scripts/training/stage1/train.bash didemo
Stage 2 runs GRPO training starting from the dataset's Stage 1 checkpoint. If MODEL_PATH is omitted, the script uses the released Stage 1 checkpoint alias.
MODEL_PATH=/path/to/sft/checkpoint \
GPUS=0 bash scripts/training/stage2/train.bash didemo
Run inference from the checkpoint you just trained:
EVAL_GPUS=0 bash scripts/inference/inference.bash didemo --checkpoint /path/to/stage2/checkpoint
(coming soon) ⬇️ Download Pre-Extracted Data
Use this when you want to skip preprocessing and train/evaluate directly:
bash scripts/data_construct/download_preextracted_data.bash all
bash scripts/data_construct/download_preextracted_data.bash didemo
bash scripts/data_construct/download_preextracted_data.bash charades
bash scripts/data_construct/download_preextracted_data.bash activitynet
This downloads the released pre-extracted artifacts from the VideoSearch-R1 Hugging Face bucket, including annotations, query/video embeddings, FAISS indices, SFT/GRPO training JSONL files, and video_npy_with_meta tensors used by training and inference.
🛠️ Start From Scratch: Data Construction
The prepared artifacts let most users skip this section. To rebuild everything from raw annotations and raw videos, first create the expected local layout:
bash scripts/data_construct/prepare_raw_layout.bash all
Download the VERIFIED FIG annotations:
bash scripts/data_construct/download_verified_annotations.bash all
Raw videos are not redistributed by VERIFIED or this repository. Download them from the original benchmark sources, then place or symlink them under the printed paths:
- ActivityNet-FIG annotations: activitynet_fig_train.jsonl, activitynet_fig_val_1.jsonl, activitynet_fig_val_2.jsonl
- DiDeMo-FIG annotations: didemo_fig_train.jsonl, didemo_fig_val.jsonl, didemo_fig_test.jsonl
- Charades-FIG annotations: charades_fig_train.jsonl, charades_fig_test.jsonl
- Raw videos: raw_videos/{activitynet,didemo,charades_sta}/videos/<video_id>.mp4
Video sources:
- ActivityNet: use the official ActivityNet / ActivityNet Captions video access. VERIFIED uses ActivityNet ids such as v_vYxBAbbvSxc; save the downloaded video as raw_videos/activitynet/videos/v_vYxBAbbvSxc.mp4.
- DiDeMo: use the official LisaAnne/LocalizingMoments download scripts, preferably download/download_videos_AWS.py, then save files as raw_videos/didemo/videos/<video_id>.mp4.
- Charades-STA: download the official Charades videos, for example Charades_v1_480.zip from the Charades project page, then save files as raw_videos/charades_sta/videos/<video_id>.mp4.
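Once the videos are saved under this layout, matching them against the annotations amounts to a set comparison between annotation ids and file stems. The sketch below illustrates that idea only; it is not the repository's check script, and the `video_id` field name and the helper names are assumptions about the annotation schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List, Set


def annotation_video_ids(jsonl_path: Path, id_field: str = "video_id") -> Set[str]:
    """Collect the video ids referenced by one JSONL annotation file."""
    ids: Set[str] = set()
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                ids.add(json.loads(line)[id_field])
    return ids


def missing_videos(
    annotation_files: Iterable[Path],
    video_dir: Path,
    id_field: str = "video_id",  # hypothetical field name
) -> List[str]:
    """Return annotation ids that have no <video_id>.mp4 in video_dir."""
    wanted: Set[str] = set()
    for path in annotation_files:
        wanted |= annotation_video_ids(path, id_field)
    present = {p.stem for p in Path(video_dir).glob("*.mp4")}
    return sorted(wanted - present)


# Tiny demo in a temporary directory: two annotated ids, one video on disk.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    video_dir = root / "raw_videos" / "didemo" / "videos"
    video_dir.mkdir(parents=True)
    (video_dir / "clip_a.mp4").touch()
    anno = root / "didemo_fig_test.jsonl"
    anno.write_text(
        json.dumps({"video_id": "clip_a"}) + "\n" + json.dumps({"video_id": "clip_b"}) + "\n",
        encoding="utf-8",
    )
    missing = missing_videos([anno], video_dir)

print(missing)
```

If the real annotations store ids with a file extension or under a different key, adjust `id_field` and the stem comparison accordingly.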
Check that the raw videos match the annotation ids:
bash scripts/data_construct/check_raw_videos.bash didemo
bash scripts/data_construct/check_raw_videos.bash charades
bash scripts/data_construct/check_raw_videos.bash activitynet
Then run the full construction pipeline. It generates split annotations, extracts video_npy_with_meta, embeds queries and videos, builds retrieval indices, constructs SFT reasoning data, and builds the GRPO data:
bash scripts/data_construct/start_from_scratch.bash didemo
bash scripts/data_construct/start_from_scratch.bash charades
bash scripts/data_construct/start_from_scratch.bash activitynet
Resume from a later construction step with:
RUN_FROM_STEP=5 bash scripts/data_construct/start_from_scratch.bash didemo
ActivityNet raw example:
ANNO_ROOT=/path/to/activitynet-fig \
VIDEO_BASE=/path/to/activitynet/videos \
bash scripts/data_construct/start_from_scratch.bash activitynet
DiDeMo and Charades-STA follow the same ordered pipeline through their dataset-specific preprocessing scripts.
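The resume behaviour of an ordered pipeline like this one can be pictured as slicing a numbered step list. The sketch below is purely illustrative: the six step names are paraphrased from the pipeline description, while the 1-based numbering and the exact step boundaries are assumptions, so RUN_FROM_STEP=5 in the real script may not map to the same step.

```python
import os
from typing import List, Tuple

# Step names paraphrased from the pipeline description; numbering is assumed.
STEPS = [
    "generate split annotations",
    "extract video_npy_with_meta",
    "embed queries and videos",
    "build retrieval indices",
    "construct SFT reasoning data",
    "build GRPO data",
]


def steps_to_run(run_from_step: int = 1) -> List[Tuple[int, str]]:
    """Return (number, name) pairs from the requested step to the end."""
    if not 1 <= run_from_step <= len(STEPS):
        raise ValueError(f"run_from_step must be in 1..{len(STEPS)}")
    numbered = list(enumerate(STEPS, start=1))
    return numbered[run_from_step - 1:]


# Mirror the environment-variable style used by the shell script.
start = int(os.environ.get("RUN_FROM_STEP", "1"))
for number, name in steps_to_run(start):
    print(f"[{number}/{len(STEPS)}] {name}")
```

Resuming only makes sense when the outputs of the skipped steps already exist on disk, which is why a full run is needed at least once per dataset.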
Toy ActivityNet Full Process
Use this smoke test before launching a full rebuild. It creates a tiny ActivityNet-FIG workspace with about 10 videos per split, then runs data construction, Stage 1 SFT, Stage 2 GRPO, inference, and metric reporting.
TOY_GPUS=1,2,3 \
TOY_ACTIVITYNET_VIDEO_SOURCE=/path/to/ActivityNet/videos_or_split_root \
bash scripts/toy/toy_activitynet_full_process.bash
If you find VideoSearch-R1 useful, please consider citing:
@inproceedings{lee2026videosearchr1,
title = {VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement},
author = {Lee, Seohyun and Choi, Seoung and Ko, Dohwan and Kim, Jongha and Kim, Hyunwoo J.},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}
This project builds upon excellent open-source work including VideoAuto-R1, Qwen-VL, TRL, and lmms-eval. Our evaluation is based on the VERIFIED benchmark and uses ActivityNet Captions, DiDeMo, and Charades-STA. We thank the creators of these codebases, benchmarks, and datasets for providing valuable resources to the research community.

