LJungang/Awesome-Video-Reasoning-Landscape

The Landscape of Video Reasoning: Tasks, Paradigms and Benchmarks— An Open-Source Survey

🗺️ Overview

This Awesome list systematically curates and tracks the latest progress in Video Reasoning, covering diverse modalities, tasks, and modeling paradigms. Rather than focusing on a single line of research, we organize the landscape from multiple complementary perspectives. Following the emerging taxonomy of the field, current works are grouped into four major paradigms:

  • 🗒️ CoT-based Video Reasoning — language-centric, chain-of-thought reasoning with Video-LMMs
  • 🕹️ CoF-based Video Reasoning — vision-centric reasoning grounded in world models or video generation
  • 🌈 Interleaved Video Reasoning — unified models that integrate multimodal interaction and iterative inference
  • 🔁 Streaming Video Reasoning — continuous, low-latency reasoning over long or unbounded video streams with online perception and incremental state updates.

We additionally maintain a dedicated Benchmark section that summarizes datasets, evaluation settings, and standardized tasks to support fair comparison across paradigms.

Note

This repository aims to provide a structured, up-to-date, and open-source overview of the evolving landscape of video reasoning. Contributions and PRs are warmly welcome — preferably in reverse chronological order (newest first) to keep the list fresh and easy to browse.
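For contributors who want to check the newest-first ordering before opening a PR, here is a minimal sketch. It assumes each row has already been reduced to a `(title, time)` pair, with `time` in the `YYYY-MM` form used in the Time columns. The helper names `parse_month` and `out_of_order` and the sample entries are hypothetical and not part of this repository.

```python
from datetime import date


def parse_month(stamp: str) -> date:
    """Parse a 'YYYY-MM' string (the Time column format) into a date."""
    year, month = stamp.split("-")
    return date(int(year), int(month), 1)


def out_of_order(entries):
    """Return (previous_title, current_title) pairs where a row is newer than the row above it."""
    problems = []
    for (prev_title, prev_time), (cur_title, cur_time) in zip(entries, entries[1:]):
        if parse_month(cur_time) > parse_month(prev_time):
            problems.append((prev_title, cur_title))
    return problems


# Hypothetical sample rows, newest first except for one misplaced entry.
entries = [
    ("Paper A", "2025-12"),
    ("Paper B", "2025-11"),
    ("Paper C", "2025-12"),  # newer than the row above, so it is flagged
    ("Paper D", "2025-06"),
]

print(out_of_order(entries))
```

Rows that share the same month are treated as correctly ordered, since the Time column has no day-level resolution.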

📖 Contents

📑 Task Definition

TBD

😎 Paradigms

🗒️ CoT-based Video Reasoning

| Title | Model & Code | Checkpoint | Input Modalities | Time | Venue |
| --- | --- | --- | --- | --- | --- |
| Rethinking Chain-of-Thought Reasoning for Videos | GitHub | N/A | Text, Video | 2025-12 | Arxiv |
| 1+1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning | GitHub | N/A | Text, Video | 2025-12 | Arxiv |
| TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning | N/A | N/A | Text, Video | 2025-12 | Arxiv |
| OneThinker: All-in-one Reasoning Model for Image and Video | GitHub | Hugging Face | Text, Video | 2025-12 | Arxiv |
| WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning | GitHub | N/A | Text, Video | 2025-12 | Arxiv |
| Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding | N/A | N/A | Text, Video | 2025-12 | Arxiv |
| Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models | GitHub | N/A | Text, Video | 2025-11 | Arxiv |
| Video-CoM: Interactive Video Reasoning via Chain of Manipulations | GitHub | N/A | Text, Video | 2025-11 | Arxiv |
| VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning | GitHub | N/A | Text, Video | 2025-11 | Arxiv |
| AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning | N/A | N/A | Audio, Video | 2025-11 | Arxiv |
| Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding | N/A | N/A | Text, Video | 2025-11 | Arxiv |
| Video Spatial Reasoning with Object-Centric 3D Rollout | N/A | N/A | Text, Video | 2025-11 | Arxiv |
| ViSS-R1: Self-Supervised Reinforcement Video Reasoning | N/A | N/A | Text, Video | 2025-11 | Arxiv |
| Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning | GitHub | Hugging Face | Text, Video | 2025-10 | Arxiv |
| Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence | GitHub | Hugging Face | Text, Video | 2025-10 | Arxiv |
| VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception | GitHub | Hugging Face | Text, Video | 2025-09 | Arxiv |
| MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning | N/A | N/A | Text, Video | 2025-09 | Arxiv |
| Kwai Keye-VL 1.5 Technical Report | GitHub | Hugging Face | Text, Video | 2025-09 | Arxiv |
| Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data | GitHub | Google_Drive | Text, Video | 2025-09 | Arxiv |
| Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding | N/A | N/A | Text, Video | 2025-08 | Arxiv |
| Ovis2.5 Technical Report | GitHub | Hugging Face | Text, Video | 2025-08 | Arxiv |
| Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments | N/A | N/A | Text, Video | 2025-08 | Arxiv |
| ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | GitHub | N/A | Text, Video | 2025-08 | Arxiv |
| TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding | N/A | N/A | Text, Video | 2025-08 | Arxiv |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | GitHub | Hugging Face | Text, Video | 2025-08 | Arxiv |
| AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video | GitHub | Hugging Face | Audio, Video, Text | 2025-08 | Arxiv |
| ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models | N/A | N/A | Text, Video | 2025-08 | Arxiv |
| VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering | N/A | N/A | Text, Video | 2025-08 | ACM-MM 2025 |
| ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts | GitHub | Hugging Face | Text, Audio, Video | 2025-07 | Arxiv |
| METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | N/A | N/A | Text, Audio, Video | 2025-07 | Arxiv |
| CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks | N/A | N/A | Text, Video | 2025-07 | Arxiv |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | N/A | N/A | Text, Video | 2025-07 | Arxiv |
| Scaling RL to Long Videos | GitHub | Hugging Face | Text, Video | 2025-07 | NeurIPS 2025 |
| Kwai Keye-VL Technical Report | GitHub | N/A | Text, Video | 2025-07 | Arxiv |
| ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | GitHub | N/A | Text, Video | 2025-07 | ACM-MM 2025 |
| Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning | GitHub | Hugging Face | Text, Video | 2025-07 | EMNLP 2025 |
| Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames | N/A | N/A | Text, Video | 2025-07 | Arxiv |
| VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | N/A | N/A | Text, Video | 2025-06 | Arxiv |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | GitHub | N/A | Text, Video | 2025-06 | Arxiv |
| DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning | N/A | N/A | Text, Video | 2025-06 | Arxiv |
| VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks | GitHub | Hugging Face | Text, Video | 2025-06 | Arxiv |
| HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context | GitHub | N/A | Audio, Video, Text | 2025-06 | Arxiv |
| MiMo-VL Technical Report | GitHub | Hugging Face | Text, Video | 2025-06 | Arxiv |
| Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning | GitHub | N/A | Text, Video | 2025-06 | EMNLP 2025 (Findings) |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | GitHub | Hugging Face | Text, Video | 2025-06 | Arxiv |
| Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | GitHub | Hugging Face | Text, Video | 2025-06 | Arxiv |
| VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking | N/A | N/A | Text, Video | 2025-06 | Arxiv |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | GitHub | N/A | Text, Video | 2025-06 | NeurIPS 2025 |
| ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding | N/A | N/A | Text, Video | 2025-06 | Arxiv |
| DIVE: Deep-search Iterative Video Exploration | GitHub | N/A | Text, Video | 2025-06 | CVPR 2025 |
| VideoDeepResearch: Long Video Understanding With Agentic Tool Using | GitHub | N/A | Text, Video | 2025-06 | Arxiv |
| Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency | N/A | N/A | Text, Video | 2025-06 | Arxiv |
| DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO | GitHub | N/A | Text, Video | 2025-06 | NeurIPS 2025 |
| Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | N/A | Project_Page | Text, Video | 2025-06 | Arxiv |
| VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning | N/A | N/A | Text, Video | 2025-06 | Arxiv |
| Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning | GitHub | Hugging Face | Text, Video | 2025-06 | Arxiv |
| Reinforcing Video Reasoning with Focused Thinking | GitHub | Hugging Face | Text, Video | 2025-05 | Arxiv |
| A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding | GitHub | N/A | Text, Video | 2025-05 | Arxiv |
| Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration | GitHub | Hugging Face | Text, Audio, Video | 2025-05 | NeurIPS 2025 |
| Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought | GitHub | N/A | Text, Video | 2025-05 | NeurIPS 2025 |
| VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization | GitHub | Hugging Face | Text, Video | 2025-05 | Arxiv |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | GitHub | N/A | Text, Speech, Video | 2025-05 | NeurIPS 2025 |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | GitHub | Hugging Face | Text, Video | 2025-05 | NeurIPS 2025 |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | GitHub | Hugging Face | Text, Video | 2025-05 | Arxiv |
| VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | GitHub | Hugging Face | Text, Video | 2025-05 | NeurIPS 2025 |
| Seed1.5-VL Technical Report | N/A | N/A | Text, Video | 2025-05 | Arxiv |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | GitHub | Hugging Face | Text, Video | 2025-05 | Arxiv |
| Fostering Video Reasoning via Next-Event Prediction | GitHub | N/A | Text, Video | 2025-05 | Arxiv |
| SiLVR: A Simple Language-based Video Reasoning Framework | GitHub | N/A | Text, Video | 2025-05 | Arxiv |
| RVTBench: A Benchmark for Visual Reasoning Tasks | GitHub | Hugging Face | Text, Video | 2025-05 | Arxiv |
| CoT-Vid: Dynamic Chain-of-Thought Routing with Self-Verification for Training-Free Video Reasoning | N/A | N/A | Text, Video | 2025-05 | Arxiv |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | GitHub | N/A | Text, Video | 2025-05 | Arxiv |
| AVA: Towards Agentic Video Analytics with Vision Language Models | GitHub | N/A | Text, Video | 2025-05 | NSDI 2026 |
| TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | GitHub | Hugging Face | Text, Video | 2025-04 | Arxiv |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | GitHub | Hugging Face | Text, Video | 2025-04 | Arxiv |
| Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning | GitHub | Hugging Face | Text, Video | 2025-04 | Arxiv |
| Improved Visual-Spatial Reasoning via R1-Zero-Like Training | GitHub | Hugging Face | Text, Video | 2025-04 | Arxiv |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | GitHub | N/A | Text, Video | 2025-04 | Arxiv |
| LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding | N/A | N/A | Text, Video | 2025-04 | Arxiv |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | N/A | Hugging Face | Text, Video | 2025-04 | Arxiv |
| MR. Video: "MapReduce" is the Principle for Long Video Understanding | GitHub | N/A | Text, Video | 2025-04 | Arxiv |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | GitHub | Hugging Face | Text, Video | 2025-04 | Arxiv |
| WikiVideo: Article Generation from Multiple Videos | GitHub | N/A | Text, Video | 2025-04 | Arxiv |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | GitHub | Hugging Face | Text, Video | 2025-03 | Arxiv |
| Video-R1: Reinforcing Video Reasoning in MLLMs | GitHub | Hugging Face | Text, Video | 2025-03 | NeurIPS 2025 |
| TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM | GitHub | Hugging Face | Text, Video | 2025-03 | NeurIPS 2025 |
| ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos | N/A | N/A | Text, Video | 2025-03 | NeurIPS 2025 |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | GitHub | Hugging Face | Text, Video | 2025-03 | Arxiv |
| Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs | GitHub | N/A | Audio, Video, Text | 2025-03 | ICCV 2025 |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | GitHub | Hugging Face | Audio, Video, Text | 2025-02 | Arxiv |
| TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding | GitHub | N/A | Text, Video | 2025-02 | ACL 2025 (Oral) |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | GitHub | N/A | Text, Video | 2025-02 | Arxiv |
| Temporal Preference Optimization for Long-Form Video Understanding | GitHub | Hugging Face | Text, Video | 2025-01 | Arxiv |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | GitHub | Hugging Face | Text, Video | 2025-01 | ACL 2025 (Findings) |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | GitHub | Hugging Face | Text, Video | 2025-01 | IEEE TPAMI |
| Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs | N/A | N/A | Text, Video | 2025-01 | Arxiv |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | GitHub | N/A | Text, Video | 2025-01 | ICML 2024 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | GitHub | Hugging Face | Text, Video | 2024-12 | Arxiv |
| STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | N/A | N/A | Text, Video | 2024-12 | CVPR 2025 |
| VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection | GitHub | Hugging Face | Text, Video | 2024-11 | CVPR 2025 |
| Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning | N/A | N/A | Text, Video | 2024-10 | NeurIPS 2024 (Workshop) |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | GitHub | N/A | Text, Video | 2024-09 | EMNLP 2024 (Findings) |
| MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning | GitHub | Hugging Face | Text, Video | 2024-09 | NeurIPS 2024 (Spotlight) |

🕹️ CoF-based Video Reasoning

| Title | Code | Checkpoint | Time | Venue |
| --- | --- | --- | --- | --- |
| CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation | GitHub | N/A | 2026-01 | Arxiv |
| Unified Video Editing with Temporal Reasoner | GitHub | Hugging Face | 2025-12 | Arxiv |
| Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven’ Matrices | GitHub | N/A | 2025-12 | Arxiv |
| McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning | GitHub | N/A | 2025-11 | Arxiv |
| In-Video Instructions: Visual Signals as Generative Control | GitHub | N/A | 2025-11 | Arxiv |
| Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO | GitHub | Hugging Face | 2025-11 | Arxiv |
| Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks | GitHub | Hugging Face | 2025-11 | Arxiv |
| Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm | GitHub | N/A | 2025-11 | Arxiv |
| Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark | GitHub | Hugging Face | 2025-10 | Arxiv |
| VChain: Chain-of-Visual-Thought for Reasoning in Video Generation | GitHub | N/A | 2025-10 | Arxiv |
| Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning | GitHub | N/A | 2025-06 | Arxiv |

🌈 Interleaved Video Reasoning

| Title | Code | Checkpoint | Time | Venue |
| --- | --- | --- | --- | --- |
| LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling | GitHub | Hugging Face | 2025-11 | Arxiv |
| JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation | GitHub | N/A | 2025-11 | NeurIPS 2025 (Spotlight) |
| Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination | GitHub | N/A | 2025-11 | Arxiv |
| Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution | N/A | N/A | 2025-11 | Arxiv |
| Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning | N/A | N/A | 2025-10 | Arxiv |
| ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | GitHub | Hugging Face | 2025-10 | ACM-MM 2025 |
| FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning | N/A | N/A | 2025-09 | Arxiv |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | GitHub | Hugging Face | 2025-08 | Arxiv |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | GitHub | Hugging Face | 2024-09 | ICLR 2025 |

🔁 Streaming Video Reasoning

| Title | Code | Checkpoint | Time | Venue |
| --- | --- | --- | --- | --- |
| Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously | GitHub | N/A | 2026-03 | Arxiv |
| LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | GitHub | Hugging Face | 2025-11 | NeurIPS 2025 |
| StreamingVLM: Real-Time Understanding for Infinite Video Streams | GitHub | N/A | 2025-10 | Arxiv |
| StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding | N/A | N/A | 2025-10 | Arxiv |
| StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | GitHub | N/A | 2025-10 | ACM-MM 2025 |
| StreamForest: Efficient Online Video Understanding with Persistent Event Memory | GitHub | Hugging Face | 2025-09 | NeurIPS 2025 (Spotlight) |
| StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling | GitHub | Hugging Face | 2025-07 | Arxiv |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | GitHub | Hugging Face | 2025-06 | ICCV 2025 |
| StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | GitHub | N/A | 2025-05 | NeurIPS 2025 |
| LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | N/A | N/A | 2025-05 | Arxiv |
| TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | GitHub | Hugging Face | 2025-04 | ACM-MM 2025 |
| LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | GitHub | N/A | 2025-04 | Arxiv |
| ViSpeak: Visual Instruction Feedback in Streaming Videos | GitHub | Model_Zoo | 2025-03 | ICCV 2025 |
| StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition | GitHub | N/A | 2025-03 | ICCV 2025 |
| Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | GitHub | N/A | 2025-03 | ICLR 2025 |
| SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | GitHub | Hugging Face | 2025-02 | ICLR 2025 (Spotlight) |
| Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction | GitHub | Hugging Face | 2025-01 | CVPR 2025 |
| Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | GitHub | N/A | 2025-01 | ICLR 2025 |
| Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method | GitHub | Hugging Face | 2025-01 | CVPR 2025 |
| StreamChat: Chatting with Streaming Video | N/A | N/A | 2024-11 | Arxiv |

✨️ Benchmarks

| Name | Paper | Link | Task | Time | Venue |
| --- | --- | --- | --- | --- | --- |
| GameplayQA | GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents | GitHub, Hugging Face | Language, Vision | 2026-03 | ACL 2026 |
| MMGR | MMGR: Multi-Modal Generative Reasoning | GitHub, Hugging Face | Vision | 2025-12 | Arxiv |
| MM-CoT | MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models | N/A | Language | 2025-12 | Arxiv |
| RULER-Bench | RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence | GitHub, Hugging Face | Vision | 2025-12 | Arxiv |
| AV-SpeakerBench | See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models | GitHub | Language | 2025-12 | Arxiv |
| PAI-Bench | PAI-Bench: A Comprehensive Benchmark For Physical AI | GitHub | Language, Vision | 2025-12 | Arxiv |
| Envision | Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights | GitHub | Vision | 2025-12 | Arxiv |
| STREAMGAZE | STREAMGAZE: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos | GitHub, Hugging Face | Streaming, Language | 2025-12 | Arxiv |
| V-ReasonBench | V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models | GitHub | Vision | 2025-11 | Arxiv |
| VR-Bench | Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks | GitHub, Hugging Face | Vision | 2025-11 | Arxiv |
| Gen-ViRe | Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark | GitHub | Vision | 2025-11 | Arxiv |
| TiViBench | TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models | GitHub | Vision | 2025-11 | Arxiv |
| VideoThinkBench | Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm | GitHub | Vision | 2025-11 | Arxiv |
| MME-CoF | Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark | Hugging Face | Vision | 2025-10 | Arxiv |
| SciVideoBench | SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models | GitHub | Language | 2025-10 | Arxiv |
| ReasoningTrack | ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | GitHub | Language | 2025-08 | Arxiv |
| METER | METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | N/A | Language | 2025-07 | Arxiv |
| Video-TT | Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | Hugging Face | Language | 2025-07 | ICCV 2025 |
| ImplicitQA | ImplicitQA: Going beyond frames towards Implicit Video Reasoning | Hugging Face | Language | 2025-06 | Arxiv |
| Video-CoT | Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | Hugging Face | Language | 2025-06 | Arxiv |
| Implicit-VideoQA | Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | GitHub | Language | 2025-06 | Arxiv |
| MORSE-500 | MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | GitHub, Hugging Face | Language | 2025-06 | Arxiv |
| SpookyBench | Time Blindness: Why Video-Language Models Can't See What Humans Can | GitHub, Hugging Face | Language | 2025-05 | Arxiv |
| VideoReasonBench | VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? | GitHub, Hugging Face | Language | 2025-05 | Arxiv |
| Video-Holmes | Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? | GitHub | Language | 2025-05 | Arxiv |
| VideoEval-Pro | VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | GitHub, Hugging Face | Language | 2025-05 | Arxiv |
| VBenchComp | Breaking Down Video LLM Benchmarks | N/A | Language | 2025-05 | Arxiv |
| RVTBench | RVTBench: A Benchmark for Visual Reasoning Tasks | GitHub, Hugging Face | Language | 2025-05 | Arxiv |
| VCRBench | VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | GitHub | Language | 2025-05 | Arxiv |
| RTV-Bench | RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | GitHub, Hugging Face | Streaming, Language | 2025-05 | NeurIPS 2025 (D&B) |
| MINERVA | MINERVA: Evaluating Complex Video Reasoning | GitHub | Language | 2025-05 | Arxiv |
| VCR-Bench | VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning | GitHub, Hugging Face | Language | 2025-04 | Arxiv |
| SEED-Bench-R1 | Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | GitHub, Hugging Face | Language | 2025-03 | Arxiv |
| H2VU-Benchmark | H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding | GitHub | Streaming, Language | 2025-03 | Arxiv |
| OmniMMI | OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | GitHub, Hugging Face | Streaming, Language | 2025-03 | CVPR 2025 |
| HAVEN | Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | GitHub, Hugging Face | Language | 2025-03 | Arxiv |
| V-STaR | V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | GitHub, Hugging Face | Language | 2025-03 | Arxiv |
| COVER | Reasoning is All You Need for Video Generalization | GitHub | Language | 2025-03 | ACL 2025 (Findings) |
| MOMA-QA | Towards Fine-Grained Video Question Answering | N/A | Language | 2025-03 | Arxiv |
| SVBench | SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | GitHub | Streaming, Language | 2025-02 | ICLR 2025 (Spotlight) |
| StreamBench | Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | GitHub, Hugging Face | Streaming, Language | 2025-01 | ICLR 2025 |
| MMVU | MMVU: Measuring Expert-Level Multi-Discipline Video Understanding | GitHub, Hugging Face | Language | 2025-01 | Arxiv |
| OVO-Bench | OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | GitHub, Hugging Face | Streaming, Language | 2025-01 | CVPR 2025 |
| HLV-1K | HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding | GitHub | Language | 2025-01 | ICME 2025 |
| OVBench | Online Video Understanding: OVBench and VideoChat-Online | GitHub, Hugging Face | Streaming, Language | 2025-01 | CVPR 2025 |
| VSI-Bench | Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | GitHub | Language | 2024-12 | CVPR 2025 (Oral) |
| 3DSRBench | 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark | Hugging Face | Language | 2024-12 | ICCV 2025 |
| BlackSwanSuite | Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events | GitHub, Hugging Face | Language | 2024-12 | CVPR 2025 |
| TOMATO | TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | GitHub | Language | 2024-10 | CVPR 2025 |
| OmnixR | OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities | N/A | Language | 2024-10 | ICLR 2025 |
| VideoVista | VideoVista: A Versatile Benchmark for Video Understanding and Reasoning | GitHub | Language | 2024-06 | Arxiv |
| SOK-Bench | SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge | GitHub | Language | 2024-05 | CVPR 2024 |
| CVRR-ES | How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs | GitHub | Language | 2024-05 | Arxiv |

✈ Related Surveys

In addition, several recent and concurrent surveys have discussed multimodal or video reasoning. The works listed below offer complementary perspectives to ours, reflecting the field’s rapid and parallel development:


