CaST-Bench

Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

Mingfang Zhang*, Jingjing Pan*, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan,
Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong

Woven by Toyota  •  The University of Tokyo
* Joint first authors

CaST-Bench overview

CaST-Bench requires models to identify both cause and effect evidence in video — localized to specific spatial regions and temporal segments — to construct a grounded causal chain before answering.

Abstract

Cause-and-effect reasoning in video remains a significant challenge for Vision-Language Models, as it requires going beyond surface-level perception to understand causal mechanisms. Existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. We introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning, where models must identify and localize a chain of spatio-temporal evidences — each consisting of a temporal segment, spatial bounding boxes, and a rationale — to support their answers. Our evaluation suite measures not only answer correctness but also the fidelity of the grounded causal chain. Experiments across a wide range of proprietary and open-source VLMs show that current models struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains — highlighting an important direction for future VLM development.

Task Definition

Given a video and a causal question, the model must output a grounded causal chain — a sequence of spatio-temporal evidences — that leads to the final answer.

Input
  • Video
  • Causal Question: Why / How / What if…

Output 1: Causal Chain
  • Evidence 1: 00:02 – 00:05, bounding box per second, "The subject begins to crouch down."
  • Evidence N: 00:06 – 00:09, bounding box per second, "The surrounding object shifts position."
  • ···

Output 2: Final Answer
  • Answer
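
The output format above can be pictured as a small data structure. The following is only an illustrative sketch: the class names, field names, inclusive per-second box keys, normalized (x1, y1, x2, y2) box layout, placeholder coordinates, and the answer letter "A" are all assumptions made for this example, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """One link of the causal chain (hypothetical schema)."""
    start_s: int   # segment start, in seconds
    end_s: int     # segment end, in seconds
    boxes: dict[int, tuple[float, float, float, float]]  # second -> (x1, y1, x2, y2)
    rationale: str


@dataclass
class Prediction:
    """Output 1 (causal chain) plus Output 2 (final answer)."""
    chain: list[Evidence]
    answer: str


def parse_timestamp(ts: str) -> int:
    """Convert 'MM:SS' to whole seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)


def parse_segment(segment: str) -> tuple[int, int]:
    """Convert 'MM:SS – MM:SS' (en dash separator) to (start, end) seconds."""
    start, end = (part.strip() for part in segment.split("–"))
    return parse_timestamp(start), parse_timestamp(end)


def make_evidence(segment: str, box, rationale: str) -> Evidence:
    """Build an Evidence with the same placeholder box on every second."""
    start, end = parse_segment(segment)
    # Assumption: one box per second, both endpoints included.
    boxes = {t: box for t in range(start, end + 1)}
    return Evidence(start, end, boxes, rationale)


example = Prediction(
    chain=[
        make_evidence("00:02 – 00:05", (0.10, 0.20, 0.40, 0.90),
                      "The subject begins to crouch down."),
        make_evidence("00:06 – 00:09", (0.50, 0.30, 0.80, 0.70),
                      "The surrounding object shifts position."),
    ],
    answer="A",  # placeholder multiple-choice option
)
```

Keeping the rationale next to its segment and boxes means each link of the chain can be checked on its own, both temporally and spatially.
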
Causal Chain Metrics
  • IM-tIoU ↑: instance-matched temporal IoU of the predicted causal chain
  • IM-vIoU ↑: instance-matched spatio-temporal IoU (temporal × spatial)
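
To make the two chain metrics concrete, here is a minimal sketch of how they might be computed. The page does not spell out the exact formulas, so several choices are assumptions: the vIoU follows the convention common in spatio-temporal video grounding (sum of per-second box IoU over shared seconds, divided by the number of seconds in the union), the matching is a brute-force one-to-one assignment, and the score is normalized by the longer chain. All function names are hypothetical.

```python
from itertools import permutations


def tiou(a, b):
    """Temporal IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(p, q):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(p[2], q[2]) - max(p[0], q[0]))
    ih = max(0.0, min(p[3], q[3]) - max(p[1], q[1]))
    inter = iw * ih
    union = (p[2] - p[0]) * (p[3] - p[1]) + (q[2] - q[0]) * (q[3] - q[1]) - inter
    return inter / union if union > 0 else 0.0


def viou(pred_boxes, gt_boxes):
    """Spatio-temporal IoU over per-second boxes (dict: second -> box).

    Assumed definition: box IoU summed over seconds present in both
    tracks, divided by the number of seconds present in either track.
    """
    shared = pred_boxes.keys() & gt_boxes.keys()
    union = pred_boxes.keys() | gt_boxes.keys()
    if not union:
        return 0.0
    return sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in shared) / len(union)


def instance_matched_mean(pred, gt, score):
    """Best one-to-one matching between two chains under a symmetric score.

    Brute force over assignments, which is fine for short chains.
    Assumption: dividing by the longer chain penalizes missing or
    extra evidences.
    """
    if not pred or not gt:
        return 0.0
    short, long_ = (pred, gt) if len(pred) <= len(gt) else (gt, pred)
    best = 0.0
    for perm in permutations(range(len(long_)), len(short)):
        total = sum(score(short[i], long_[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(long_)
```

For example, `instance_matched_mean([(0, 2), (4, 6)], [(4, 6), (0, 2)], tiou)` scores a chain whose segments are correct but listed in a different order; under the matching assumed here, order does not affect the result.
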
Answer Metrics
  • MCQ Accuracy ↑: multiple-choice answer correctness
Combined Metrics

Jointly evaluate Output 1 & 2

  • Faithful Rate ↑: correct answers supported by well-grounded evidence
  • Spurious Rate ↓: correct answers with no adequate evidence grounding
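
The combined metrics can be sketched as simple counting over per-sample results. This is a hedged illustration only: the page does not state the thresholds or the denominator, so the `faithful_thr` and `spurious_thr` values, the `correct` and `grounding` field names, and the choice to divide by the total number of samples are all assumptions. Here `grounding` stands for any per-sample grounding score in [0, 1].

```python
def combined_rates(samples, faithful_thr=0.5, spurious_thr=0.1):
    """Compute accuracy, faithful rate, and spurious rate.

    samples: list of dicts with
      "correct"   - bool, whether the multiple-choice answer is right
      "grounding" - float in [0, 1], grounding quality of the chain
    Thresholds and the all-samples denominator are assumptions.
    """
    n = len(samples)
    if n == 0:
        return {"accuracy": 0.0, "faithful_rate": 0.0, "spurious_rate": 0.0}
    correct = [s for s in samples if s["correct"]]
    # Faithful: right answer and evidence grounded at or above the threshold.
    faithful = sum(1 for s in correct if s["grounding"] >= faithful_thr)
    # Spurious: right answer but grounding below the lower threshold.
    spurious = sum(1 for s in correct if s["grounding"] < spurious_thr)
    return {
        "accuracy": len(correct) / n,
        "faithful_rate": faithful / n,
        "spurious_rate": spurious / n,
    }


results = combined_rates([
    {"correct": True, "grounding": 0.8},   # faithful
    {"correct": True, "grounding": 0.0},   # spurious
    {"correct": True, "grounding": 0.3},   # correct, but in neither bucket
    {"correct": False, "grounding": 0.9},  # wrong answer, not counted
])
```

With two separate thresholds, a correct answer with middling grounding falls into neither bucket, so the two rates need not sum to the accuracy.
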
Model                     MCQ Acc ↑   Temporal                Spatio-Temporal         Combined
                                      R@0.5 ↑     IM-tIoU ↑   R@0.1 ↑     IM-vIoU ↑   Faithful ↑  Spurious ↓

Proprietary Models
Gemini-2.5-Pro            50.34       23.29       21.53       8.14        2.46        7.60        42.26
Gemini-2.5-Flash          45.60       29.45       27.63       13.29       3.52        9.97        33.35
GPT-5                     46.32       27.83       26.61       16.19       4.31        12.68       32.91
GPT-5 mini                37.22       21.28       19.89       8.31        2.41        6.29        30.83

Open-Source Models
GLM-4.1V-9B-Thinking      39.55       17.50       16.58       6.18        1.93        4.74        34.32
InternVL-3.5-30B-A3B      44.53       9.04        8.33        0.42        0.25        0.48        44.00
InternVL-3.5-14B          43.27       9.24        8.77        0.56        0.33        0.73        42.55
InternVL-3.5-8B           40.95       9.40        8.59        0.61        0.28        0.58        40.37
Qwen3-VL-8B-Thinking      39.16       10.44       10.23       2.69        0.87        2.27        36.74
Qwen3-VL-8B-Instruct      43.13       10.63       10.51       3.41        1.07        2.76        39.84
Qwen3-VL-4B-Instruct      45.30       11.65       11.21       2.94        0.93        2.76        42.30
Qwen2.5-VL-7B-Instruct    41.09       3.80        3.72        0.13        0.09        0.29        40.80
MiMo-VL-7B-RL-2508        32.24       10.42       9.63        2.74        0.81        2.81        31.43

Question & Video Taxonomy

Causal Explanation

Questions asking why an event occurred or how a mechanism unfolded.

Counterfactual Reasoning

Questions inferring what would happen if a key causal element were altered.

Predictive Anticipation

Questions predicting the most probable immediate outcome of an ongoing event.

Inferential Description

Questions inferring implicit attributes or states such as roles, intentions, or emotions.

Distribution of question types and subcategories

Distribution of video scene categories

Dataset Construction

Dataset construction pipeline

Overview of the four-step data collection pipeline: (1) spatio-temporal video object detection and tracking, (2) fine-grained per-instance description via VLMs and human annotation, (3) causal QA and causal chain generation with explicit spatio-temporal evidences, and (4) mask-based QA filtering for causal chain validation.

BibTeX

@inproceedings{zhang2026castbench,
  title = {CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering},
  author = {Zhang, Mingfang and Pan, Jingjing and Kumar, Ashutosh and Saini, Rajat and Erdogan, Mustafa and Yang, Hsuan-Kung and Kang, Caixin and Huang, Yifei and Sato, Yoichi and Kong, Quan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2026}
}