This is the official repository of EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs.
EgoExoBench is a large-scale benchmark designed to evaluate cross-view video understanding in multimodal large language models (MLLMs). It contains paired egocentric–exocentric videos and over 7,300 multiple-choice questions across 11 subtasks, covering three key dimensions of ego–exo reasoning:
- Ego-Exo Relation
- Ego-Exo View Transition
- Ego-Exo Temporal Reasoning
To get started with EgoExoBench, follow the steps below to prepare the data.
We provide pre-processed videos, frames, and multiple-choice question (MCQ) files on Hugging Face. You can download them directly without additional preprocessing.
- MCQs: The MCQs are provided in .tsv format, following the VLMEvalKit data structure (see the loading sketch after this list).
- Processed Videos and Frames:
  - The processed videos directory contains video clips corresponding to each MCQ. These files are suitable for models that accept video input (e.g., Qwen2.5-VL).
  - The processed frames directory contains frame sequences extracted from the videos. These are used for models that take image sequences as input (e.g., InternVL).
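As a quick sanity check after downloading, the MCQ file can be inspected with pandas. This is only a minimal sketch: the file path and column names below are assumptions, so check the actual .tsv header.

```python
import pandas as pd

# Minimal sketch: peek at the EgoExoBench MCQ file (VLMEvalKit-style .tsv).
# The path and column names are assumptions -- check the actual file header.
mcq = pd.read_csv("MCQ/EgoExoBench_MCQ.tsv", sep="\t")

print(mcq.shape)             # (number of questions, number of columns)
print(mcq.columns.tolist())  # inspect which fields the file actually provides
print(mcq.iloc[0])           # first multiple-choice question
```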
Alternatively, you can build the benchmark yourself by downloading the original datasets.
EgoExoBench builds upon six publicly available ego–exo datasets: CVMHAT, Ego-Exo4D, EgoExoLearn, EgoMe, LEMMA, and TF2023. Please download the videos from their official sources.
Place all datasets under the data/ directory. The dataset structure is as follows:
```
EgoExoBench/
└── data/
    ├── CVMHAT/
    │   └── data/
    ├── Ego-Exo4D/
    │   └── takes/
    ├── EgoExoLearn/
    ├── EgoMe/
    ├── LEMMA/
    └── TF2023/
        └── data/
```
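Before moving on, it can help to verify that this layout is in place. The snippet below is a minimal sketch that only checks for the directories shown in the tree above; adjust the paths if your setup differs.

```python
from pathlib import Path

# Minimal sketch: confirm the dataset directories from the layout above exist
# under data/ before building the benchmark yourself.
expected = [
    "data/CVMHAT/data",
    "data/Ego-Exo4D/takes",
    "data/EgoExoLearn",
    "data/EgoMe",
    "data/LEMMA",
    "data/TF2023/data",
]

missing = [p for p in expected if not Path(p).is_dir()]
print("All dataset directories found." if not missing else f"Missing: {missing}")
```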
For the CVMHAT and TF2023 datasets, we use the bounding box annotations to augment the original frames by overlaying bounding boxes that indicate the target person. To generate these bounding-box overlays, run the following commands:

```bash
python data/CVMHAT/tools/process_bbox.py
python data/TF2023/tools/process_bbox.py
```
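For intuition, the overlay step amounts to drawing the annotated box of the target person onto each frame. The snippet below is an illustration only, with a hypothetical frame path and box coordinates; the actual annotation parsing and output paths live in the process_bbox.py scripts above.

```python
import cv2

# Illustration only: draw a target-person bounding box onto a frame.
# The frame path and (x, y, w, h) values are hypothetical; the real scripts
# read them from the CVMHAT / TF2023 annotations.
frame = cv2.imread("example_frame.jpg")
x, y, w, h = 100, 80, 60, 160

cv2.rectangle(frame, (x, y), (x + w, y + h), color=(0, 0, 255), thickness=2)
cv2.imwrite("example_frame_bbox.jpg", frame)
```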
Download the EgoExoBench multiple-choice questions (MCQs) file (link) and place it in the MCQ/ directory.

To set up the codebase, clone the repository:

```bash
git clone https://github.com/ayiyayi/EgoExoBench.git
cd EgoExoBench
```

Please note that different VLMs require specific environment configurations (e.g., different versions of transformers). We recommend consulting the official documentation of each VLM (Qwen2.5-VL, InternVL3, LLaVA-OneVision, LLaVA-NeXT-Video) to ensure proper setup and an accurate evaluation.
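A common pitfall is running the evaluation in an environment that was set up for a different VLM. The optional check below prints the versions of packages that most often differ between setups; the package list is only an example.

```python
from importlib.metadata import version, PackageNotFoundError

# Optional sanity check: print versions of packages that commonly differ
# between VLM environments (the list below is only an example).
for pkg in ("transformers", "torch"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```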
Evaluation is built upon VLMEvalKit.

```bash
# for VLMs that consume small amounts of GPU memory
torchrun --nproc-per-node=1 run.py --data EgoExoBench_MCQ --model Qwen2.5-VL-7B-Instruct-ForVideo

# for very large VLMs
python run.py --data EgoExoBench_MCQ --model Qwen2.5-VL-72B-Instruct-ForVideo
```
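To evaluate several models in sequence, the same flags can be wrapped in a small driver script. This is a sketch only: use torchrun instead of python where a model calls for it, and adjust the model list to the VLMs you actually have set up.

```python
import subprocess

# Sketch: run the EgoExoBench MCQ evaluation for several models in sequence,
# reusing the flags from the commands above. Swap in torchrun where needed.
models = [
    "Qwen2.5-VL-7B-Instruct-ForVideo",
    "Qwen2.5-VL-72B-Instruct-ForVideo",
]

for model in models:
    subprocess.run(
        ["python", "run.py", "--data", "EgoExoBench_MCQ", "--model", model],
        check=True,
    )
```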
This codebase is based on VLMEvalKit. EgoExoBench builds upon publicly available ego–exo datasets: Ego-Exo4D, LEMMA, EgoExoLearn, TF2023, EgoMe, and CVMHAT. Thanks for open-sourcing!