GitHub - LJungang/Awesome-Video-Reasoning-Landscape: 🔥An open-source survey of the latest video reasoning tasks, paradigms, and benchmarks.

The Landscape of Video Reasoning: Tasks, Paradigms and Benchmarks— An Open-Source Survey

🗺️ Overview

This Awesome list systematically curates and tracks the latest progress in Video Reasoning, covering diverse modalities, tasks, and modeling paradigms. Rather than focusing on a single line of research, we organize the landscape from multiple complementary perspectives. Following the emerging taxonomy of the field, current works are grouped into four major paradigms:

🗒️ CoT-based Video Reasoning — language-centric, chain-of-thought reasoning with Video-LMMs
🕹️ CoF-based Video Reasoning — vision-centric reasoning grounded in world models or video generation
🌈 Interleaved Video Reasoning — unified models that integrate multimodal interaction and iterative inference
🔁 Streaming Video Reasoning — continuous, low-latency reasoning over long or unbounded video streams with online perception and incremental state updates.

We additionally maintain a dedicated Benchmark section that summarizes datasets, evaluation settings, and standardized tasks to support fair comparison across paradigms.

Note

This repository aims to provide a structured, up-to-date, and open-source overview of the evolving landscape of video reasoning. Contributions and PRs are warmly welcome — preferably in reverse chronological order (newest first) to keep the list fresh and easy to browse.
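For contributors who want to check ordering before opening a PR, here is a minimal sketch of a newest-first check. It assumes each table row has already been split into a list of cells and that the `Time` cell uses the `YYYY-MM` format seen in the tables of this list. The function names and the `time_col` parameter are hypothetical helpers for illustration, not part of this repository.

```python
from typing import List, Tuple

Row = List[str]


def find_out_of_order(rows: List[Row], time_col: int = 3) -> List[Tuple[int, str, str]]:
    """Return (row_index, previous_time, current_time) for every row
    that is newer than the row directly above it."""
    problems = []
    for i in range(1, len(rows)):
        prev_time = rows[i - 1][time_col]
        cur_time = rows[i][time_col]
        # Zero-padded "YYYY-MM" strings order correctly under plain string comparison.
        if cur_time > prev_time:
            problems.append((i, prev_time, cur_time))
    return problems


def sort_newest_first(rows: List[Row], time_col: int = 3) -> List[Row]:
    """Return a new list sorted newest first; rows sharing a month keep
    their original relative order because sorted() is stable."""
    return sorted(rows, key=lambda row: row[time_col], reverse=True)


if __name__ == "__main__":
    # Illustrative rows only: Title, Code, Checkpoint, Time, Venue.
    table = [
        ["Paper A", "GitHub", "N/A", "2025-11", "Arxiv"],
        ["Paper B", "GitHub", "N/A", "2025-12", "Arxiv"],
        ["Paper C", "N/A", "N/A", "2025-10", "Arxiv"],
    ]
    print(find_out_of_order(table))
    print([row[0] for row in sort_newest_first(table)])
```

The default `time_col=3` matches the position of the `Time` column in the paradigm and benchmark tables; adjust it if a table uses a different layout.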

📖 Contents

Awesome-Video-Reasoning-Landscape

📑 Task Definition

TBD

😎 Paradigms

🗒️ CoT-based Video Reasoning

Title	Model & Code	Checkpoint	Time	Venue
Rethinking Chain-of-Thought Reasoning for Videos	GitHub	`N/A`	2025-12	`Arxiv`
1+1 > 2 : Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning	GitHub	`N/A`	2025-12	`Arxiv`
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning	`N/A`	`N/A`	2025-12	`Arxiv`
OneThinker: All-in-one Reasoning Model for Image and Video	GitHub	Hugging Face	2025-12	`Arxiv`
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning	GitHub	`N/A`	2025-12	`Arxiv`
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding	`N/A`	`N/A`	2025-12	`Arxiv`
Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models	GitHub	`N/A`	2025-11	`Arxiv`
Video-CoM: Interactive Video Reasoning via Chain of Manipulations	GitHub	`N/A`	2025-11	`Arxiv`
VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning	GitHub	`N/A`	2025-11	`Arxiv`
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning	`N/A`	`N/A`	2025-11	`Arxiv`
Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding	`N/A`	`N/A`	2025-11	`Arxiv`
Video Spatial Reasoning with Object-Centric 3D Rollout	`N/A`	`N/A`	2025-11	`Arxiv`
ViSS-R1: Self-Supervised Reinforcement Video Reasoning	`N/A`	`N/A`	2025-11	`Arxiv`
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning	GitHub	Hugging Face	2025-10	`Arxiv`
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence	GitHub	Hugging Face	2025-10	`Arxiv`
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception	GitHub	Hugging Face	2025-09	`Arxiv`
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning	`N/A`	`N/A`	2025-09	`Arxiv`
Kwai Keye-VL 1.5 Technical Report	GitHub	Hugging Face	2025-09	`Arxiv`
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data	GitHub	Google_Drive	2025-09	`Arxiv`
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding	`N/A`	`N/A`	2025-08	`Arxiv`
Ovis2.5 Technical Report	GitHub	Hugging Face	2025-08	`Arxiv`
Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments	`N/A`	`N/A`	2025-08	`Arxiv`
ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking	GitHub	`N/A`	2025-08	`Arxiv`
TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding	`N/A`	`N/A`	2025-08	`Arxiv`
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning	GitHub	Hugging Face	2025-08	`Arxiv`
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video	GitHub	Hugging Face	2025-08	`Arxiv`
ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models	`N/A`	`N/A`	2025-08	`Arxiv`
VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering	`N/A`	`N/A`	2025-08	`ACM-MM 2025`
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts	GitHub	Hugging Face	2025-07	`Arxiv`
METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark	`N/A`	`N/A`	2025-07	`Arxiv`
CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks	`N/A`	`N/A`	2025-07	`Arxiv`
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments	`N/A`	`N/A`	2025-07	`Arxiv`
Scaling RL to Long Videos	GitHub	Hugging Face	2025-07	`NeurIPS 2025`
Kwai Keye-VL Technical Report	GitHub	`N/A`	2025-07	`Arxiv`
ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models	GitHub	`N/A`	2025-07	`ACM-MM 2025`
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning	GitHub	Hugging Face	2025-07	`EMNLP 2025`
Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames	`N/A`	`N/A`	2025-07	`Arxiv`
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning	`N/A`	`N/A`	2025-06	`Arxiv`
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning	GitHub	`N/A`	2025-06	`Arxiv`
DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning	`N/A`	`N/A`	2025-06	`Arxiv`
VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks	GitHub	Hugging Face	2025-06	`Arxiv`
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context	GitHub	`N/A`	2025-06	`Arxiv`
MiMo-VL Technical Report	GitHub	Hugging Face	2025-06	`Arxiv`
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning	GitHub	`N/A`	2025-06	`EMNLP 2025 (Findings)`
EgoVLM: Policy Optimization for Egocentric Video Understanding	GitHub	Hugging Face	2025-06	`Arxiv`
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency	GitHub	Hugging Face	2025-06	`Arxiv`
VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking	`N/A`	`N/A`	2025-06	`Arxiv`
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding	GitHub	`N/A`	2025-06	`NeurIPS 2025`
ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding	`N/A`	`N/A`	2025-06	`Arxiv`
DIVE: Deep-search Iterative Video Exploration	GitHub	`N/A`	2025-06	`CVPR 2025`
VideoDeepResearch: Long Video Understanding With Agentic Tool Using	GitHub	`N/A`	2025-06	`Arxiv`
Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency	`N/A`	`N/A`	2025-06	`Arxiv`
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO	GitHub	`N/A`	2025-06	`NeurIPS 2025`
Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought	`N/A`	Project_Page	2025-06	`Arxiv`
VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning	`N/A`	`N/A`	2025-06	`Arxiv`
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning	GitHub	Hugging Face	2025-06	`Arxiv`
Reinforcing Video Reasoning with Focused Thinking	GitHub	Hugging Face	2025-05	`Arxiv`
A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding	GitHub	`N/A`	2025-05	`Arxiv`
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration	GitHub	Hugging Face	2025-05	`NeurIPS 2025`
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought	GitHub	`N/A`	2025-05	`NeurIPS 2025`
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization	GitHub	Hugging Face	2025-05	`Arxiv`
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning	GitHub	`N/A`	2025-05	`NeurIPS 2025`
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning	GitHub	Hugging Face	2025-05	`NeurIPS 2025`
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning	GitHub	Hugging Face	2025-05	`Arxiv`
VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning	GitHub	Hugging Face	2025-05	`NeurIPS 2025`
Seed1.5-VL Technical Report	`N/A`	`N/A`	2025-05	`Arxiv`
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action	GitHub	Hugging Face	2025-05	`Arxiv`
Fostering Video Reasoning via Next-Event Prediction	GitHub	`N/A`	2025-05	`Arxiv`
SiLVR: A Simple Language-based Video Reasoning Framework	GitHub	`N/A`	2025-05	`Arxiv`
RVTBench: A Benchmark for Visual Reasoning Tasks	GitHub	Hugging Face	2025-05	`Arxiv`
CoT-Vid: Dynamic Chain-of-Thought Routing with Self-Verification for Training-Free Video Reasoning	`N/A`	`N/A`	2025-05	`Arxiv`
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models	GitHub	`N/A`	2025-05	`Arxiv`
AVA: Towards Agentic Video Analytics with Vision Language Models	GitHub	`N/A`	2025-05	`NSDI 2026`
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning	GitHub	Hugging Face	2025-04	`Arxiv`
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning	GitHub	Hugging Face	2025-04	`Arxiv`
Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning	GitHub	Hugging Face	2025-04	`Arxiv`
Improved Visual-Spatial Reasoning via R1-Zero-Like Training	GitHub	Hugging Face	2025-04	`Arxiv`
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models	GitHub	`N/A`	2025-04	`Arxiv`
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding	`N/A`	`N/A`	2025-04	`Arxiv`
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models	`N/A`	Hugging Face	2025-04	`Arxiv`
MR. Video: "MapReduce" is the Principle for Long Video Understanding	GitHub	`N/A`	2025-04	`Arxiv`
Multimodal Long Video Modeling Based on Temporal Dynamic Context	GitHub	Hugging Face	2025-04	`Arxiv`
WikiVideo: Article Generation from Multiple Videos	GitHub	`N/A`	2025-04	`Arxiv`
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1	GitHub	Hugging Face	2025-03	`Arxiv`
Video-R1: Reinforcing Video Reasoning in MLLMs	GitHub	Hugging Face	2025-03	`NeurIPS 2025`
TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM	GitHub	Hugging Face	2025-03	`NeurIPS 2025`
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos	`N/A`	`N/A`	2025-03	`NeurIPS 2025`
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning	GitHub	Hugging Face	2025-03	`Arxiv`
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs	GitHub	`N/A`	2025-03	`ICCV 2025`
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model	GitHub	Hugging Face	2025-02	`Arxiv`
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding	GitHub	`N/A`	2025-02	`ACL 2025 (Oral)`
CoS: Chain-of-Shot Prompting for Long Video Understanding	GitHub	`N/A`	2025-02	`Arxiv`
Temporal Preference Optimization for Long-Form Video Understanding	GitHub	Hugging Face	2025-01	`Arxiv`
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model	GitHub	Hugging Face	2025-01	`ACL 2025 (Findings)`
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning	GitHub	Hugging Face	2025-01	`IEEE TPAMI`
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs	`N/A`	`N/A`	2025-01	`Arxiv`
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition	GitHub	`N/A`	2025-01	`ICML 2024`
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	GitHub	Hugging Face	2024-12	`Arxiv`
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training	`N/A`	`N/A`	2024-12	`CVPR 2025`
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection	GitHub	Hugging Face	2024-11	`CVPR 2025`
Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning	`N/A`	`N/A`	2024-10	`NeurIPS 2024 (Workshop)`
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs	GitHub	`N/A`	2024-09	`EMNLP 2024 (Findings)`
MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning	GitHub	Hugging Face	2024-09	`NeurIPS 2024 (Spotlight)`

🕹️ CoF-based Video Reasoning

Title	Code	Checkpoint	Time	Venue
CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation	GitHub	`N/A`	2026-01	`Arxiv`
Unified Video Editing with Temporal Reasoner	GitHub	Hugging Face	2025-12	`Arxiv`
Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven’ Matrices	GitHub	`N/A`	2025-12	`Arxiv`
McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning	GitHub	`N/A`	2025-11	`Arxiv`
In-Video Instructions: Visual Signals as Generative Control	GitHub	`N/A`	2025-11	`Arxiv`
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO	GitHub	Hugging Face	2025-11	`Arxiv`
Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks	GitHub	Hugging Face	2025-11	`Arxiv`
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm	GitHub	`N/A`	2025-11	`Arxiv`
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark	GitHub	Hugging Face	2025-10	`Arxiv`
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation	GitHub	`N/A`	2025-10	`Arxiv`
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning	GitHub	`N/A`	2025-06	`Arxiv`

🌈 Interleaved Video Reasoning

Title	Code	Checkpoint	Time	Venue
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling	GitHub	Hugging Face	2025-11	`Arxiv`
JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation	GitHub	`N/A`	2025-11	`NeurIPS 2025 (Spotlight)`
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination	GitHub	`N/A`	2025-11	`Arxiv`
Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution	`N/A`	`N/A`	2025-11	`Arxiv`
Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning	`N/A`	`N/A`	2025-10	`Arxiv`
ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models	GitHub	Hugging Face	2025-10	`ACM-MM 2025`
FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning	`N/A`	`N/A`	2025-09	`Arxiv`
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning	GitHub	Hugging Face	2025-08	`Arxiv`
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation	GitHub	Hugging Face	2024-09	`ICLR 2025`

🔁 Streaming Video Reasoning

Title	Code	Checkpoint	Time	Venue
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously	GitHub	`N/A`	2026-03	`Arxiv`
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding	GitHub	Hugging Face	2025-11	`NeurIPS 2025`
StreamingVLM: Real-Time Understanding for Infinite Video Streams	GitHub	`N/A`	2025-10	`Arxiv`
StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding	`N/A`	`N/A`	2025-10	`Arxiv`
StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA	GitHub	`N/A`	2025-10	`ACM-MM 2025`
StreamForest: Efficient Online Video Understanding with Persistent Event Memory	GitHub	Hugging Face	2025-09	`NeurIPS 2025 (Spotlight)`
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling	GitHub	Hugging Face	2025-07	`Arxiv`
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams	GitHub	Hugging Face	2025-06	`ICCV 2025`
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant	GitHub	`N/A`	2025-05	`NeurIPS 2025`
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval	`N/A`	`N/A`	2025-05	`Arxiv`
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos	GitHub	Hugging Face	2025-04	`ACM-MM 2025`
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	GitHub	`N/A`	2025-04	`Arxiv`
ViSpeak: Visual Instruction Feedback in Streaming Videos	GitHub	Model_Zoo	2025-03	`ICCV 2025`
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition	GitHub	`N/A`	2025-03	`ICCV 2025`
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval	GitHub	`N/A`	2025-03	`ICLR 2025`
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding	GitHub	Hugging Face	2025-02	`ICLR 2025 (Spotlight)`
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction	GitHub	Hugging Face	2025-01	`CVPR 2025`
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge	GitHub	`N/A`	2025-01	`ICLR 2025`
Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method	GitHub	Hugging Face	2025-01	`CVPR 2025`
StreamChat: Chatting with Streaming Video	`N/A`	`N/A`	2024-11	`Arxiv`

✨️ Benchmarks

Name	Paper	Link	Time	Venue
GameplayQA	GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents	GitHub `<br>`Hugging Face	2026-03	`ACL 2026`
MMGR	MMGR: Multi-Modal Generative Reasoning	GitHub `<br>`Hugging Face	2025-12	`Arxiv`
MM-CoT	MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models	`N/A`	2025-12	`Arxiv`
RULER-Bench	RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence	GitHub `<br>`Hugging Face	2025-12	`Arxiv`
AV-SpeakerBench	See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models	GitHub	2025-12	`Arxiv`
PAI-Bench	PAI-Bench: A Comprehensive Benchmark For Physical AI	GitHub	2025-12	`Arxiv`
Envision	Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights	GitHub	2025-12	`Arxiv`
STREAMGAZE	STREAMGAZE: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos	GitHub `<br>`Hugging Face	2025-12	`Arxiv`
V-ReasonBench	V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models	GitHub	2025-11	`Arxiv`
VR-Bench	Reasoning via Video: The First Evaluation of Video Models’ Reasoning Abilities through Maze-Solving Tasks	GitHub `<br>`Hugging Face	2025-11	`Arxiv`
Gen-ViRe	Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark	GitHub	2025-11	`Arxiv`
TiViBench	TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models	GitHub	2025-11	`Arxiv`
VideoThinkBench	Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm	GitHub	2025-11	`Arxiv`
MME-CoF	Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark	Hugging Face	2025-10	`Arxiv`
SciVideoBench	SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models	GitHub	2025-10	`Arxiv`
ReasoningTrack	ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking	GitHub	2025-08	`Arxiv`
METER	METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark	`N/A`	2025-07	`Arxiv`
Video-TT	Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding	Hugging Face	2025-07	`ICCV 2025`
ImplicitQA	ImplicitQA: Going beyond frames towards Implicit Video Reasoning	Hugging Face	2025-06	`Arxiv`
Video-CoT	Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought	Hugging Face	2025-06	`Arxiv`
Implicit-VideoQA	Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning	GitHub	2025-06	`Arxiv`
MORSE-500	MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning	GitHub `<br>`Hugging Face	2025-06	`Arxiv`
SpookyBench	Time Blindness: Why Video-Language Models Can't See What Humans Can	GitHub `<br>`Hugging Face	2025-05	`Arxiv`
VideoReasonBench	VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?	GitHub `<br>`Hugging Face	2025-05	`Arxiv`
Video-Holmes	Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?	GitHub	2025-05	`Arxiv`
VideoEval-Pro	VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation	GitHub `<br>`Hugging Face	2025-05	`Arxiv`
VBenchComp	Breaking Down Video LLM Benchmarks	`N/A`	2025-05	`Arxiv`
RVTBench	RVTBench: A Benchmark for Visual Reasoning Tasks	GitHub `<br>`Hugging Face	2025-05	`Arxiv`
VCRBench	VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models	GitHub	2025-05	`Arxiv`
RTV-Bench	RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video	GitHub `<br>`Hugging Face	2025-05	`NeurIPS 2025 (D&B)`
MINERVA	MINERVA: Evaluating Complex Video Reasoning	GitHub	2025-05	`Arxiv`
VCR-Bench	VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning	GitHub `<br>`Hugging Face	2025-04	`Arxiv`
SEED-Bench-R1	Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1	GitHub `<br>`Hugging Face	2025-03	`Arxiv`
H2VU-Benchmark	H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding	GitHub	2025-03	`Arxiv`
OmniMMI	OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts	GitHub `<br>`Hugging Face	2025-03	`CVPR 2025`
HAVEN	Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation	GitHub `<br>`Hugging Face	2025-03	`Arxiv`
V-STaR	V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning	GitHub `<br>`Hugging Face	2025-03	`Arxiv`
COVER	Reasoning is All You Need for Video Generalization	GitHub	2025-03	`ACL 2025 (Findings)`
MOMA-QA	Towards Fine-Grained Video Question Answering	`N/A`	2025-03	`Arxiv`
SVBench	SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding	GitHub	2025-02	`ICLR 2025 (Spotlight)`
StreamBench	Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge	GitHub `<br>`Hugging Face	2025-01	`ICLR 2025`
MMVU	MMVU: Measuring Expert-Level Multi-Discipline Video Understanding	GitHub `<br>`Hugging Face	2025-01	`Arxiv`
OVO-Bench	OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?	GitHub Hugging Face	2025-01	`CVPR 2025`
HLV-1K	HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding	GitHub	2025-01	`ICME 2025`
OVBench	Online Video Understanding: OVBench and VideoChat-Online	GitHub `<br>`Hugging Face	2025-01	`CVPR 2025`
VSI-Bench	Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces	GitHub	2024-12	`CVPR 2025 (Oral)`
3DSRBench	3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark	Hugging Face	2024-12	`ICCV 2025`
BlackSwanSuite	Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events	GitHub `<br>`Hugging Face	2024-12	`CVPR 2025`
TOMATO	TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models	GitHub	2024-10	`CVPR 2025`
OmnixR	OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities	`N/A`	2024-10	`ICLR 2025`
VideoVista	VideoVista: A Versatile Benchmark for Video Understanding and Reasoning	GitHub	2024-06	`Arxiv`
SOK-Bench	SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge	GitHub	2024-05	`CVPR 2024`
CVRR-ES	How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs	GitHub	2024-05	`Arxiv`

✈ Related Survey

In addition, several recent and concurrent surveys have discussed multimodal or video reasoning. The works listed below offer complementary perspectives to ours, reflecting the field’s rapid and parallel development:
