Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
Woven by Toyota • The University of Tokyo
* Joint first authors
CaST-Bench requires models to identify both cause and effect evidence in video — localized to specific spatial regions and temporal segments — to construct a grounded causal chain before answering.
Cause-and-effect reasoning in video remains a significant challenge for Vision-Language Models, as it requires going beyond surface-level perception to understand causal mechanisms. Existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. We introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning, where models must identify and localize a chain of spatio-temporal evidences — each consisting of a temporal segment, spatial bounding boxes, and a rationale — to support their answers. Our evaluation suite measures not only answer correctness but also the fidelity of the grounded causal chain. Experiments across a wide range of proprietary and open-source VLMs show that current models struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains — highlighting an important direction for future VLM development.
Given a video and a causal question, the model must output a grounded causal chain — a sequence of spatio-temporal evidences — that leads to the final answer.
Task overview figure. Inputs: a video and a causal question (Why / How / What if…). Output 1 and Output 2 are evaluated jointly.
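To make the expected output concrete, here is a minimal sketch of how a grounded causal chain could be represented in code. The three parts of each evidence (temporal segment, spatial bounding boxes, rationale) come from the task description above. The class names, field names, seconds as the time unit, and the `(x1, y1, x2, y2)` box convention are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed convention: (x1, y1, x2, y2) corner coordinates in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Evidence:
    """One spatio-temporal evidence: when, where, and why it matters."""
    start_s: float               # segment start in seconds (assumed unit)
    end_s: float                 # segment end in seconds
    boxes: Dict[int, List[Box]]  # frame index -> boxes visible in that frame
    rationale: str               # short text explaining the causal role


@dataclass
class GroundedAnswer:
    """A causal chain plus the final answer it supports."""
    chain: List[Evidence]        # ordered from cause to effect
    answer: str                  # e.g. the selected multiple-choice option


def validate(pred: GroundedAnswer) -> None:
    """Raise ValueError if the prediction is structurally malformed."""
    if not pred.chain:
        raise ValueError("causal chain must contain at least one evidence")
    for ev in pred.chain:
        if ev.end_s <= ev.start_s:
            raise ValueError("segment end must be after segment start")
        for frame_boxes in ev.boxes.values():
            for x1, y1, x2, y2 in frame_boxes:
                if x2 <= x1 or y2 <= y1:
                    raise ValueError("box has non-positive width or height")
```

A prediction in this hypothetical format would then be built as `GroundedAnswer(chain=[cause, effect], answer="B")` and passed through `validate` before scoring, so that malformed segments or boxes are caught separately from grounding errors.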
| Model | MCQ Acc ↑ | Temporal R@0.5 ↑ | Temporal IM-tIoU ↑ | Spatio-Temporal R@0.1 ↑ | Spatio-Temporal IM-vIoU ↑ | Combined Faithful ↑ | Combined Spurious ↓ |
|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | |
| Gemini-2.5-Pro | 50.34 | 23.29 | 21.53 | 8.14 | 2.46 | 7.60 | 42.26 |
| Gemini-2.5-Flash | 45.60 | 29.45 | 27.63 | 13.29 | 3.52 | 9.97 | 33.35 |
| GPT-5 | 46.32 | 27.83 | 26.61 | 16.19 | 4.31 | 12.68 | 32.91 |
| GPT-5 mini | 37.22 | 21.28 | 19.89 | 8.31 | 2.41 | 6.29 | 30.83 |
| Open-Source Models | | | | | | | |
| GLM-4.1V-9B-Thinking | 39.55 | 17.50 | 16.58 | 6.18 | 1.93 | 4.74 | 34.32 |
| InternVL-3.5-30B-A3B | 44.53 | 9.04 | 8.33 | 0.42 | 0.25 | 0.48 | 44.00 |
| InternVL-3.5-14B | 43.27 | 9.24 | 8.77 | 0.56 | 0.33 | 0.73 | 42.55 |
| InternVL-3.5-8B | 40.95 | 9.40 | 8.59 | 0.61 | 0.28 | 0.58 | 40.37 |
| Qwen3-VL-8B-Thinking | 39.16 | 10.44 | 10.23 | 2.69 | 0.87 | 2.27 | 36.74 |
| Qwen3-VL-8B-Instruct | 43.13 | 10.63 | 10.51 | 3.41 | 1.07 | 2.76 | 39.84 |
| Qwen3-VL-4B-Instruct | 45.30 | 11.65 | 11.21 | 2.94 | 0.93 | 2.76 | 42.30 |
| Qwen2.5-VL-7B-Instruct | 41.09 | 3.80 | 3.72 | 0.13 | 0.09 | 0.29 | 40.80 |
| MiMo-VL-7B-RL-2508 | 32.24 | 10.42 | 9.63 | 2.74 | 0.81 | 2.81 | 31.43 |
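For readers unfamiliar with the grounding columns, the sketch below shows the conventional definitions these metric names usually build on: temporal IoU between two segments, a video-level IoU that averages per-frame box IoU over shared frames and divides by the number of frames in either track, and recall at a threshold (0.5 for the temporal column, 0.1 for the spatio-temporal column). This follows common practice in spatio-temporal video grounding and is an assumption. The page does not define the IM- variants or the Faithful and Spurious scores, so they are not reproduced here, and all function names are illustrative.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]            # (start, end)
Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def video_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Sum of box IoU over shared frames, divided by frames in either track."""
    all_frames = pred.keys() | gt.keys()
    if not all_frames:
        return 0.0
    shared = pred.keys() & gt.keys()
    return sum(box_iou(pred[f], gt[f]) for f in shared) / len(all_frames)


def recall_at(scores: List[float], threshold: float) -> float:
    """Percentage of samples whose IoU score reaches the threshold."""
    if not scores:
        return 0.0
    return 100.0 * sum(s >= threshold for s in scores) / len(scores)
```

Under these definitions, a predicted segment of 0 to 10 s against a ground-truth segment of 5 to 15 s overlaps for 5 s out of a 15 s union, so it would not count toward R@0.5. The much lower spatio-temporal scores in the table are consistent with the video-level measure being stricter, since a prediction must match both the frames and the boxes within them.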
Questions asking why an event occurred or how a mechanism unfolded.
Questions inferring what would happen if a key causal element were altered.
Questions predicting the most probable immediate outcome of an ongoing event.
Questions inferring implicit attributes or states such as roles, intentions, or emotions.
Distribution of question types and subcategories
Distribution of video scene categories
Overview of the four-step data collection pipeline: (1) spatio-temporal video object detection and tracking, (2) fine-grained per-instance description via VLMs and human annotation, (3) causal QA and causal chain generation with explicit spatio-temporal evidences, and (4) mask-based QA filtering for causal chain validation.
@inproceedings{zhang2026castbench,
title = {CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering},
author = {Zhang, Mingfang and Pan, Jingjing and Kumar, Ashutosh and Saini, Rajat and Erdogan, Mustafa and Yang, Hsuan-Kung and Kang, Caixin and Huang, Yifei and Sato, Yoichi and Kong, Quan},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}