mlvlab/VideoSearch-R1


VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement

ECCV 2026 arXiv Project Page Hugging Face License

Seohyun Lee1,*   Seoung Choi1,*   Dohwan Ko2,*   Jongha Kim2   Hyunwoo J. Kim1,†

1 KAIST    2 Korea University    (* equal contribution   † corresponding author)


[Figure: VideoSearch-R1 teaser]

TL;DR: VideoSearch-R1 is an agentic framework that unifies inter-video retrieval and intra-video reasoning through multi-turn interaction with a video search engine. We introduce Soft Query Refinement (SQR), which refines query tokens in a continuous latent space instead of rewriting text, and train it with GRPO. VideoSearch-R1 reaches state-of-the-art Video Corpus Moment Retrieval (VCMR) on three benchmarks while using far fewer generated tokens than text-level refinement.


📰 News

  • 2026.06.17 🎉 VideoSearch-R1 is accepted to ECCV 2026.
  • 2026.06.20 Code released.
  • 2026.06.20 Trained model checkpoints released.
  • Dataset release coming soon.
  • Paper preprint coming soon.

🧭 Overview

As video corpora grow in scale and task complexity, real applications need both inter-video reasoning (retrieving the right video from a large corpus) and intra-video reasoning (fine-grained, query-conditioned tasks such as temporal grounding). Existing pipelines treat retrieval as a one-shot preprocessing step, so a retrieval failure dooms the downstream reasoning; recent video agents often assume the relevant video is already given, bypassing retrieval entirely.

VideoSearch-R1 closes this gap with an iterative retrieve → verify → refine → ground loop:

  1. Retrieve — query a video search engine (Qwen3-VL-Embedding-2B) and return the top-1 candidate from a large-scale corpus.
  2. Verify — reason over the retrieved video and decide match / not match, emitting a reasoning trace.
  3. Soft Query Refinement (SQR) — if not matched, generate N = 8 soft query tokens in latent space and append them to the original query, then re-retrieve.
  4. Temporal Grounding — on a match, predict the precise start/end timestamps of the query-relevant moment.

[Figure: VideoSearch-R1 pipeline]
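
The four steps above can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the repository's implementation: `search_top1`, `retrieve_verify_refine_ground`, and `max_turns` are hypothetical names, the verifier, refiner, and grounder are passed in as plain callables, and the query is modeled as a list of token vectors that are mean-pooled into a single embedding.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = List[float]
Span = Tuple[float, float]


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average token vectors into one query embedding (toy assumption)."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def search_top1(tokens: Sequence[Vector], corpus: Dict[str, Vector]) -> str:
    """Retrieve: return the id of the video with the highest dot-product score."""
    query = mean_pool(tokens)

    def score(video_id: str) -> float:
        return sum(q * v for q, v in zip(query, corpus[video_id]))

    return max(corpus, key=score)


def retrieve_verify_refine_ground(
    query_tokens: Sequence[Vector],
    corpus: Dict[str, Vector],
    verify: Callable[[str], bool],
    refine: Callable[[str], Sequence[Vector]],
    ground: Callable[[str], Span],
    max_turns: int = 3,  # hypothetical turn budget
) -> Optional[Tuple[str, Span]]:
    original = list(query_tokens)
    tokens = list(original)
    for _ in range(max_turns):
        video_id = search_top1(tokens, corpus)      # 1. Retrieve
        if verify(video_id):                        # 2. Verify
            return video_id, ground(video_id)       # 4. Temporal grounding
        # 3. Refine: append soft tokens to the ORIGINAL query, then re-retrieve.
        tokens = original + list(refine(video_id))
    return None  # no candidate was verified within the turn budget
```

In this sketch the refiner would return N = 8 vectors per turn. Whether soft tokens accumulate across turns is not stated above, so the sketch always appends to the original query.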

Unlike hard query refinement (rewriting the query as text), SQR adjusts the query representation directly. The soft tokens are trained with an InfoNCE retrieval objective for richer discriminative supervision, and the whole loop is optimized with GRPO under format, verification, retrieval, and temporal-grounding rewards — reaching superior retrieval with just 8 latent tokens instead of 26.8 rewritten text tokens.
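
For readers unfamiliar with the two training signals named above, here is a minimal stdlib-only sketch of an InfoNCE loss for a single query and of the group-relative advantage commonly used in GRPO. The temperature value, the unweighted reward sum, and all function names are assumptions for illustration; the actual objective and reward weighting are defined by the paper and the training code.

```python
import math
from typing import List, Sequence


def info_nce(scores: Sequence[float], positive_index: int,
             temperature: float = 0.07) -> float:
    """InfoNCE for one query: -log softmax(scores / T)[positive_index].

    `scores` are query-video similarities; exactly one entry is the positive.
    The temperature default is an assumption, not a value from the paper.
    """
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive_index]


def total_reward(fmt: float, verification: float,
                 retrieval: float, grounding: float) -> float:
    """Combine the four reward terms; an unweighted sum is an assumption."""
    return fmt + verification + retrieval + grounding


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each rollout's reward by the
    mean and standard deviation of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

A rollout whose reward is above its group's mean gets a positive advantage and one below the mean gets a negative advantage, which is what lets GRPO work without a learned value function.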

📊 Main Results

Video Corpus Moment Retrieval (VCMR, reported as IoU/R@1), verification accuracy (VER), and video retrieval recall (VR).
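
As a reading aid for the IoU/R@1 numbers, the sketch below computes temporal IoU and a top-1 VCMR recall, assuming the common VCMR convention that a prediction counts as a hit only when the retrieved video is correct and the predicted span overlaps the ground truth by at least the IoU threshold. The function names and the default threshold are illustrative; the official metrics come from `scripts/inference/report.bash`.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]            # (start_seconds, end_seconds)
Prediction = Tuple[str, Span]         # (video_id, predicted span)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def vcmr_recall_at_1(predictions: Sequence[Prediction],
                     ground_truth: Sequence[Prediction],
                     iou_threshold: float = 0.5) -> float:
    """Fraction of queries whose top-1 prediction has the right video
    AND a temporal IoU of at least `iou_threshold`."""
    if not ground_truth:
        return 0.0
    hits = 0
    for (pred_vid, pred_span), (gt_vid, gt_span) in zip(predictions, ground_truth):
        if pred_vid == gt_vid and temporal_iou(pred_span, gt_span) >= iou_threshold:
            hits += 1
    return hits / len(ground_truth)
```

Because the video must be correct before the span is scored, a retrieval miss counts against VCMR regardless of how well the moment is localized.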

Main results on Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG

SQR lifts video retrieval despite using the same search engine, and consistently improves VCMR and verification over zero-shot baselines. See the project page for analyses and qualitative examples.


🚀 Getting Started

VideoSearch-R1 provides three click-through paths:

  1. Quick Start — download prepared data and run inference with released checkpoints.
  2. Quick Training — download prepared data, run Stage 1 SFT, then Stage 2 GRPO.
  3. Start From Scratch — rebuild data artifacts from raw annotations/videos.

Supported dataset aliases are didemo, charades, and activitynet.

⚙️  Installation & Environment

conda create -n videosearchr1 python=3.11.14 -y
conda activate videosearchr1

# CUDA 12.8 system install, if needed:
# apt-get install -y cuda-toolkit-12-8
# update-alternatives --set cuda /usr/local/cuda-12.8

export MAX_JOBS=8
pip install -U pip
pip install -r requirements.txt \
  --extra-index-url https://download.pytorch.org/whl/cu128 \
  --no-build-isolation
pip install -e .

(coming soon) 📦  Prepared Artifacts (Datasets & Checkpoints)

Prepared artifacts are hosted on Hugging Face under VideoSearchR1.

Datasets

  • hf://buckets/VideoSearchR1/data/datasets/activitynet
  • hf://buckets/VideoSearchR1/data/datasets/didemo
  • hf://buckets/VideoSearchR1/data/datasets/charades-sta

The bucket shards include released annotations, query/video embeddings, FAISS indices, SFT/GRPO training JSONL files, and video_npy_with_meta tensors.

Checkpoints

  • VideoSearchR1/didemo-sft
  • VideoSearchR1/didemo-grpo
  • VideoSearchR1/charades-sft
  • VideoSearchR1/charades-grpo

ActivityNet checkpoints can be added later with the same aliases: activitynet-sft, activitynet-grpo.
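
The checkpoint names above follow a `<namespace>/<dataset>-<stage>` pattern. The helper below is a hypothetical convenience, not part of the repository. It only builds the name from a dataset alias and a stage and does not check that the checkpoint has actually been published (the ActivityNet ones are not listed yet).

```python
SUPPORTED_DATASETS = ("didemo", "charades", "activitynet")
STAGES = ("sft", "grpo")
HF_NAMESPACE = "VideoSearchR1"


def checkpoint_repo_id(dataset: str, stage: str) -> str:
    """Build a checkpoint name such as 'VideoSearchR1/didemo-sft'.

    Hypothetical helper: it validates the alias and stage only, and makes
    no claim that the resulting checkpoint exists on Hugging Face.
    """
    if dataset not in SUPPORTED_DATASETS:
        raise ValueError(f"unknown dataset alias: {dataset!r}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"{HF_NAMESPACE}/{dataset}-{stage}"
```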

(coming soon) ⚡  Quick Start: Inference with Released Checkpoints

Download the prepared data for the dataset you want to evaluate:

bash scripts/data_construct/download_preextracted_data.bash didemo

Run inference on GPU 0. The script downloads the released Hugging Face checkpoint automatically.

EVAL_GPUS=0 bash scripts/inference/inference.bash didemo

Charades uses the same command shape:

bash scripts/data_construct/download_preextracted_data.bash charades
EVAL_GPUS=0 bash scripts/inference/inference.bash charades

Use a custom checkpoint from local disk or Hugging Face:

EVAL_GPUS=0 bash scripts/inference/inference.bash didemo --checkpoint /path/to/checkpoint
EVAL_GPUS=0 bash scripts/inference/inference.bash charades --checkpoint VideoSearchR1/charades-sft

The inference command writes .json and .jsonl outputs under the checkpoint log directory. Generate metrics and a compact result JSON with:

bash scripts/inference/report.bash /path/to/external_verified_test_temporal_grounding_checkpoint-XXXX.jsonl

(coming soon) 🏋️  Quick Training: Prepared Data → Stage 1 → Stage 2

Download prepared data:

bash scripts/data_construct/download_preextracted_data.bash didemo

Stage 1 trains the SFT model from the default Qwen3-VL base model:

GPUS=0 bash scripts/training/stage1/train.bash didemo

Stage 2 runs GRPO training starting from the dataset's Stage 1 checkpoint. If MODEL_PATH is omitted, the script uses the released Stage 1 checkpoint alias.

MODEL_PATH=/path/to/sft/checkpoint \
GPUS=0 bash scripts/training/stage2/train.bash didemo

Run inference from the checkpoint you just trained:

EVAL_GPUS=0 bash scripts/inference/inference.bash didemo --checkpoint /path/to/stage2/checkpoint

(coming soon) ⬇️  Download Pre-Extracted Data

Use this when you want to skip preprocessing and train/evaluate directly:

bash scripts/data_construct/download_preextracted_data.bash all
bash scripts/data_construct/download_preextracted_data.bash didemo
bash scripts/data_construct/download_preextracted_data.bash charades
bash scripts/data_construct/download_preextracted_data.bash activitynet

This downloads the released pre-extracted artifacts from the VideoSearch-R1 Hugging Face bucket, including annotations, query/video embeddings, FAISS indices, SFT/GRPO training JSONL files, and video_npy_with_meta tensors used by training and inference.

🛠️  Start From Scratch: Data Construction

The prepared artifacts let most users skip this section. To rebuild everything from raw annotations and raw videos, first create the expected local layout:

bash scripts/data_construct/prepare_raw_layout.bash all

Download the VERIFIED FIG annotations:

bash scripts/data_construct/download_verified_annotations.bash all

Raw videos are not redistributed by VERIFIED or this repository. Download them from the original benchmark sources, then place or symlink them under the printed paths. The expected annotation files and video layout are:

  • ActivityNet-FIG annotations: activitynet_fig_train.jsonl, activitynet_fig_val_1.jsonl, activitynet_fig_val_2.jsonl
  • DiDeMo-FIG annotations: didemo_fig_train.jsonl, didemo_fig_val.jsonl, didemo_fig_test.jsonl
  • Charades-FIG annotations: charades_fig_train.jsonl, charades_fig_test.jsonl
  • Raw videos: raw_videos/{activitynet,didemo,charades_sta}/videos/<video_id>.mp4

Video sources:

  • ActivityNet: use the official ActivityNet / ActivityNet Captions video access. VERIFIED uses ActivityNet ids such as v_vYxBAbbvSxc; save the downloaded video as raw_videos/activitynet/videos/v_vYxBAbbvSxc.mp4.
  • DiDeMo: use the official LisaAnne/LocalizingMoments download scripts, preferably download/download_videos_AWS.py, then save files as raw_videos/didemo/videos/<video_id>.mp4.
  • Charades-STA: download the official Charades videos, for example Charades_v1_480.zip from the Charades project page, then save files as raw_videos/charades_sta/videos/<video_id>.mp4.

Check that the raw videos match the annotation ids:

bash scripts/data_construct/check_raw_videos.bash didemo
bash scripts/data_construct/check_raw_videos.bash charades
bash scripts/data_construct/check_raw_videos.bash activitynet
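
For intuition, a check of this kind can be sketched in a few lines of Python: collect the video ids referenced by the annotation JSONL files, then list the ids that have no matching `<video_id>.mp4`. This is not the logic of `check_raw_videos.bash`. The `video_id` field name in particular is an assumption about the annotation schema and may need to be changed.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set, Union

PathLike = Union[str, Path]


def annotation_video_ids(jsonl_paths: Iterable[PathLike],
                         id_field: str = "video_id") -> Set[str]:
    """Collect video ids from JSONL annotation files.

    `id_field` is a hypothetical key name; adjust it to the real schema.
    """
    ids: Set[str] = set()
    for path in jsonl_paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    ids.add(str(json.loads(line)[id_field]))
    return ids


def missing_videos(video_dir: PathLike, video_ids: Iterable[str]) -> List[str]:
    """Return the ids with no <video_id>.mp4 file under `video_dir`, sorted."""
    root = Path(video_dir)
    return sorted(v for v in video_ids if not (root / f"{v}.mp4").is_file())
```

For example, `missing_videos("raw_videos/didemo/videos", annotation_video_ids(["didemo_fig_train.jsonl"]))` would list the DiDeMo training videos that still need to be downloaded, assuming the annotation file is in the working directory.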

Then run the full construction pipeline. It generates split annotations, extracts video_npy_with_meta, embeds queries and videos, builds retrieval indices, constructs SFT reasoning data, and builds the GRPO data:

bash scripts/data_construct/start_from_scratch.bash didemo
bash scripts/data_construct/start_from_scratch.bash charades
bash scripts/data_construct/start_from_scratch.bash activitynet

Resume from a later construction step with:

RUN_FROM_STEP=5 bash scripts/data_construct/start_from_scratch.bash didemo
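
The resume behaviour can be pictured as a step runner that skips everything before the requested step. The sketch below is illustrative only: the step order is taken from the pipeline description above, but the 1-based numbering and the way the real script reads `RUN_FROM_STEP` are assumptions.

```python
import os
from typing import Callable, List, Sequence, Tuple

# Order taken from the pipeline description; the numbering is an assumption.
STEP_NAMES = [
    "generate split annotations",
    "extract video_npy_with_meta",
    "embed queries and videos",
    "build retrieval indices",
    "construct SFT reasoning data",
    "build GRPO data",
]

Step = Tuple[str, Callable[[], None]]


def run_from_step_env(default: int = 1) -> int:
    """Read RUN_FROM_STEP from the environment, defaulting to the first step."""
    return int(os.environ.get("RUN_FROM_STEP", default))


def run_pipeline(steps: Sequence[Step], run_from_step: int = 1) -> List[str]:
    """Run steps in order, skipping those numbered below `run_from_step`.

    Returns the names of the steps that were executed.
    """
    executed: List[str] = []
    for number, (name, action) in enumerate(steps, start=1):
        if number < run_from_step:
            continue  # assumed to be complete from an earlier run
        action()
        executed.append(name)
    return executed
```

Resuming only makes sense when the outputs of the skipped steps are still on disk, since later steps consume them.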

Example with explicit ActivityNet annotation and raw-video paths:

ANNO_ROOT=/path/to/activitynet-fig \
VIDEO_BASE=/path/to/activitynet/videos \
bash scripts/data_construct/start_from_scratch.bash activitynet

DiDeMo and Charades-STA follow the same ordered pipeline through their dataset-specific preprocessing scripts.

Toy ActivityNet Full Process

Use this smoke test before launching a full rebuild. It creates a tiny ActivityNet-FIG workspace with about 10 videos per split, then runs data construction, Stage 1 SFT, Stage 2 GRPO, inference, and metric reporting.

TOY_GPUS=1,2,3 \
TOY_ACTIVITYNET_VIDEO_SOURCE=/path/to/ActivityNet/videos_or_split_root \
bash scripts/toy/toy_activitynet_full_process.bash

📝 Citation

If you find VideoSearch-R1 useful, please consider citing:

@inproceedings{lee2026videosearchr1,
  title     = {VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement},
  author    = {Lee, Seohyun and Choi, Seoung and Ko, Dohwan and Kim, Jongha and Kim, Hyunwoo J.},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

🙏 Acknowledgements

This project builds upon excellent open-source work including VideoAuto-R1, Qwen-VL, TRL, and lmms-eval. Our evaluation is based on the VERIFIED benchmark and uses ActivityNet Captions, DiDeMo, and Charades-STA. We thank the creators of these codebases, benchmarks, and datasets for providing valuable resources to the research community.

About

[ECCV2026] Official Implementation of "VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement"
