| Rethinking Chain-of-Thought Reasoning for Videos | GitHub | N/A |  | 2025-12 | Arxiv |
| 1+1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning | GitHub | N/A |  | 2025-12 | Arxiv |
| TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning | N/A | N/A |  | 2025-12 | Arxiv |
| OneThinker: All-in-one Reasoning Model for Image and Video | GitHub | Hugging Face |  | 2025-12 | Arxiv |
| WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning | GitHub | N/A |  | 2025-12 | Arxiv |
| Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding | N/A | N/A |  | 2025-12 | Arxiv |
| Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models | GitHub | N/A |  | 2025-11 | Arxiv |
| Video-CoM: Interactive Video Reasoning via Chain of Manipulations | GitHub | N/A |  | 2025-11 | Arxiv |
| VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning | GitHub | N/A |  | 2025-11 | Arxiv |
| AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning | N/A | N/A |  | 2025-11 | Arxiv |
| Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding | N/A | N/A |  | 2025-11 | Arxiv |
| Video Spatial Reasoning with Object-Centric 3D Rollout | N/A | N/A |  | 2025-11 | Arxiv |
| ViSS-R1: Self-Supervised Reinforcement Video Reasoning | N/A | N/A |  | 2025-11 | Arxiv |
| Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning | GitHub | Hugging Face |  | 2025-10 | Arxiv |
| Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence | GitHub | Hugging Face |  | 2025-10 | Arxiv |
| VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception | GitHub | Hugging Face |  | 2025-09 | Arxiv |
| MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning | N/A | N/A |  | 2025-09 | Arxiv |
| Kwai Keye-VL 1.5 Technical Report | GitHub | Hugging Face |  | 2025-09 | Arxiv |
| Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data | GitHub | Google_Drive |  | 2025-09 | Arxiv |
| Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding | N/A | N/A |  | 2025-08 | Arxiv |
| Ovis2.5 Technical Report | GitHub | Hugging Face |  | 2025-08 | Arxiv |
| Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments | N/A | N/A |  | 2025-08 | Arxiv |
| ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | GitHub | N/A |  | 2025-08 | Arxiv |
| TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding | N/A | N/A |  | 2025-08 | Arxiv |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | GitHub | Hugging Face |  | 2025-08 | Arxiv |
| AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video | GitHub | Hugging Face |  | 2025-08 | Arxiv |
| ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models | N/A | N/A |  | 2025-08 | Arxiv |
| VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering | N/A | N/A |  | 2025-08 | ACM-MM 2025 |
| ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts | GitHub | Hugging Face |  | 2025-07 | Arxiv |
| METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | N/A | N/A |  | 2025-07 | Arxiv |
| CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks | N/A | N/A |  | 2025-07 | Arxiv |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | N/A | N/A |  | 2025-07 | Arxiv |
| Scaling RL to Long Videos | GitHub | Hugging Face |  | 2025-07 | NeurIPS 2025 |
| Kwai Keye-VL Technical Report | GitHub | N/A |  | 2025-07 | Arxiv |
| ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | GitHub | N/A |  | 2025-07 | ACM-MM 2025 |
| Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning | GitHub | Hugging Face |  | 2025-07 | EMNLP 2025 |
| Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames | N/A | N/A |  | 2025-07 | Arxiv |
| VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | N/A | N/A |  | 2025-06 | Arxiv |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | GitHub | N/A |  | 2025-06 | Arxiv |
| DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning | N/A | N/A |  | 2025-06 | Arxiv |
| VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks | GitHub | Hugging Face |  | 2025-06 | Arxiv |
| HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context | GitHub | N/A |  | 2025-06 | Arxiv |
| MiMo-VL Technical Report | GitHub | Hugging Face |  | 2025-06 | Arxiv |
| Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning | GitHub | N/A |  | 2025-06 | EMNLP 2025 (Findings) |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | GitHub | Hugging Face |  | 2025-06 | Arxiv |
| Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | GitHub | Hugging Face |  | 2025-06 | Arxiv |
| VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking | N/A | N/A |  | 2025-06 | Arxiv |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | GitHub | N/A |  | 2025-06 | NeurIPS 2025 |
| ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding | N/A | N/A |  | 2025-06 | Arxiv |
| DIVE: Deep-search Iterative Video Exploration | GitHub | N/A |  | 2025-06 | CVPR 2025 |
| VideoDeepResearch: Long Video Understanding With Agentic Tool Using | GitHub | N/A |  | 2025-06 | Arxiv |
| Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency | N/A | N/A |  | 2025-06 | Arxiv |
| DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO | GitHub | N/A |  | 2025-06 | NeurIPS 2025 |
| Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | N/A | Project_Page |  | 2025-06 | Arxiv |
| VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning | N/A | N/A |  | 2025-06 | Arxiv |
| Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning | GitHub | Hugging Face |  | 2025-06 | Arxiv |
| Reinforcing Video Reasoning with Focused Thinking | GitHub | Hugging Face |  | 2025-05 | Arxiv |
| A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding | GitHub | N/A |  | 2025-05 | Arxiv |
| Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration | GitHub | Hugging Face |  | 2025-05 | NeurIPS 2025 |
| Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought | GitHub | N/A |  | 2025-05 | NeurIPS 2025 |
| VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization | GitHub | Hugging Face |  | 2025-05 | Arxiv |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | GitHub | N/A |  | 2025-05 | NeurIPS 2025 |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | GitHub | Hugging Face |  | 2025-05 | NeurIPS 2025 |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | GitHub | Hugging Face |  | 2025-05 | Arxiv |
| VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | GitHub | Hugging Face |  | 2025-05 | NeurIPS 2025 |
| Seed1.5-VL Technical Report | N/A | N/A |  | 2025-05 | Arxiv |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | GitHub | Hugging Face |  | 2025-05 | Arxiv |
| Fostering Video Reasoning via Next-Event Prediction | GitHub | N/A |  | 2025-05 | Arxiv |
| SiLVR: A Simple Language-based Video Reasoning Framework | GitHub | N/A |  | 2025-05 | Arxiv |
| RVTBench: A Benchmark for Visual Reasoning Tasks | GitHub | Hugging Face |  | 2025-05 | Arxiv |
| CoT-Vid: Dynamic Chain-of-Thought Routing with Self-Verification for Training-Free Video Reasoning | N/A | N/A |  | 2025-05 | Arxiv |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | GitHub | N/A |  | 2025-05 | Arxiv |
| AVA: Towards Agentic Video Analytics with Vision Language Models | GitHub | N/A |  | 2025-05 | NSDI 2026 |
| TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | GitHub | Hugging Face |  | 2025-04 | Arxiv |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | GitHub | Hugging Face |  | 2025-04 | Arxiv |
| Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning | GitHub | Hugging Face |  | 2025-04 | Arxiv |
| Improved Visual-Spatial Reasoning via R1-Zero-Like Training | GitHub | Hugging Face |  | 2025-04 | Arxiv |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | GitHub | N/A |  | 2025-04 | Arxiv |
| LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding | N/A | N/A |  | 2025-04 | Arxiv |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | N/A | Hugging Face |  | 2025-04 | Arxiv |
| MR. Video: "MapReduce" is the Principle for Long Video Understanding | GitHub | N/A |  | 2025-04 | Arxiv |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | GitHub | Hugging Face |  | 2025-04 | Arxiv |
| WikiVideo: Article Generation from Multiple Videos | GitHub | N/A |  | 2025-04 | Arxiv |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | GitHub | Hugging Face |  | 2025-03 | Arxiv |
| Video-R1: Reinforcing Video Reasoning in MLLMs | GitHub | Hugging Face |  | 2025-03 | NeurIPS 2025 |
| TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM | GitHub | Hugging Face |  | 2025-03 | NeurIPS 2025 |
| ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos | N/A | N/A |  | 2025-03 | NeurIPS 2025 |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | GitHub | Hugging Face |  | 2025-03 | Arxiv |
| Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs | GitHub | N/A |  | 2025-03 | ICCV 2025 |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | GitHub | Hugging Face |  | 2025-02 | Arxiv |
| TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding | GitHub | N/A |  | 2025-02 | ACL 2025 (Oral) |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | GitHub | N/A |  | 2025-02 | Arxiv |
| Temporal Preference Optimization for Long-Form Video Understanding | GitHub | Hugging Face |  | 2025-01 | Arxiv |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | GitHub | Hugging Face |  | 2025-01 | ACL 2025 (Findings) |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | GitHub | Hugging Face |  | 2025-01 | IEEE TPAMI |
| Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs | N/A | N/A |  | 2025-01 | Arxiv |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | GitHub | N/A |  | 2025-01 | ICML 2024 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | GitHub | Hugging Face |  | 2024-12 | Arxiv |
| STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | N/A | N/A |  | 2024-12 | CVPR 2025 |
| VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection | GitHub | Hugging Face |  | 2024-11 | CVPR 2025 |
| Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning | N/A | N/A |  | 2024-10 | NeurIPS 2024 (Workshop) |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | GitHub | N/A |  | 2024-09 | EMNLP 2024 (Findings) |
| MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning | GitHub | Hugging Face |  | 2024-09 | NeurIPS 2024 (Spotlight) |