EgoExoBench

📄 Report | 📊 Data

This is the official repository of EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs.

📊 Benchmark Overview

EgoExoBench is a large-scale benchmark designed to evaluate cross-view video understanding in multimodal large language models (MLLMs). It contains paired egocentric–exocentric videos and over 7,300 multiple-choice questions across 11 subtasks, covering three key dimensions of ego–exo reasoning:

  • Ego-Exo Relation
  • Ego-Exo View Transition
  • Ego-Exo Temporal Reasoning

📝 Data Preparation

To get started with EgoExoBench, follow the steps below to prepare the data:

Method 1: Direct Download (Recommended)

We provide pre-processed videos, frames, and multiple-choice question (MCQ) files on Hugging Face. You can download them directly without additional preprocessing.

  • MCQs: The MCQs are provided in .tsv format, following the VLMEvalKit data structure (a quick way to inspect the file is sketched after this list).

  • Processed Videos and Frames:

    • The processed videos directory contains video clips corresponding to each MCQ. These files are suitable for models that accept video input (e.g., Qwen2.5-VL).
    • The processed frames directory contains frame sequences extracted from videos. These are used for models that take image sequences as input (e.g., InternVL).
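Since the MCQ file follows the VLMEvalKit TSV layout, a quick first step is to load it with pandas and inspect the columns. A minimal sketch; the filename is an assumption based on the --data EgoExoBench_MCQ identifier used during evaluation, so adjust the path to your download:

# Minimal sketch: inspect the MCQ file with pandas.
# The filename is an assumption (based on --data EgoExoBench_MCQ);
# point it at wherever you saved the Hugging Face download.
import pandas as pd

mcqs = pd.read_csv("MCQ/EgoExoBench_MCQ.tsv", sep="\t")
print(len(mcqs))               # expected: 7,300+ questions
print(mcqs.columns.tolist())   # columns follow the VLMEvalKit TSV schema
print(mcqs.head())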

Method 2: Build from Source Datasets

Alternatively, you can build the benchmark yourself by downloading the original datasets.

Dataset Collection

EgoExoBench builds upon six publicly available ego–exo datasets: Ego-Exo4D, LEMMA, EgoExoLearn, TF2023, EgoMe, and CVMHAT. Please download the videos from their official sources.

Place all datasets under the data/ directory. The dataset structure is as follows:

EgoExoBench/
└── data/
    ├── CVMHAT/
    │   └── data/
    ├── Ego-Exo4D/
    │   └── takes/
    ├── EgoExoLearn/
    ├── EgoMe/
    ├── LEMMA/
    └── TF2023/
        └── data/
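Before running the preprocessing scripts, it can help to verify that the layout matches the tree above. A minimal sketch that checks only the directories shown (anything deeper is dataset-specific):

# Minimal sketch: verify the expected dataset layout under data/.
# The paths mirror the tree above; run from the EgoExoBench/ root.
from pathlib import Path

expected = [
    "data/CVMHAT/data",
    "data/Ego-Exo4D/takes",
    "data/EgoExoLearn",
    "data/EgoMe",
    "data/LEMMA",
    "data/TF2023/data",
]
for rel in expected:
    status = "ok" if Path(rel).is_dir() else "MISSING"
    print(f"{status:7s} {rel}")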

Data Preparation

For the CVMHAT and TF2023 datasets, we utilize the bounding box annotations to augment the original frames by overlaying boxes that indicate the target person. To generate these annotated frames, run the following commands:

python data/CVMHAT/tools/process_bbox.py
python data/TF2023/tools/process_bbox.py
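For intuition, the augmentation step amounts to drawing the annotated box of the target person onto each frame. A minimal sketch with OpenCV; the (x, y, w, h) tuple and file names are placeholders, not the datasets' actual annotation format, which the scripts above handle:

# Minimal sketch of the bbox overlay step, assuming an (x, y, w, h) box in
# pixel coordinates. The real scripts above parse the datasets' own
# annotation formats; this is only for intuition.
import cv2

def overlay_bbox(frame_path, bbox, out_path):
    x, y, w, h = bbox
    img = cv2.imread(frame_path)
    # Draw a red rectangle marking the target person.
    cv2.rectangle(img, (x, y), (x + w, y + h), color=(0, 0, 255), thickness=3)
    cv2.imwrite(out_path, img)

overlay_bbox("frame_0001.jpg", (120, 80, 60, 150), "frame_0001_bbox.jpg")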

Download Multiple-Choice Questions (MCQs)

Download the EgoExoBench multiple-choice questions (MCQs) file (link) and place it in the MCQ/ directory.

Installation

git clone https://github.com/ayiyayi/EgoExoBench.git
cd EgoExoBench

Please note that different VLMs require specific environment configurations (e.g., different versions of transformers). To ensure proper setup and an accurate evaluation, we recommend consulting the official documentation of each VLM: Qwen2.5VL, InternVL3, LLaVA-OneVision, LLaVA-NeXT-Video.

🚀 Model Evaluation

Evaluation is built upon VLMEvalKit.

# for VLMs that consume small amounts of GPU memory
torchrun --nproc-per-node=1 run.py --data EgoExoBench_MCQ --model Qwen2.5-VL-7B-Instruct-ForVideo

# for very large VLMs
python run.py --data EgoExoBench_MCQ --model Qwen2.5-VL-72B-Instruct-ForVideo
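VLMEvalKit computes and reports MCQ accuracy itself. If you want to recompute it from a saved predictions file, a hypothetical sketch follows; the output path and the prediction/answer column names are assumptions rather than VLMEvalKit's guaranteed schema, so check the actual files your run produces:

# Hypothetical sketch: recompute multiple-choice accuracy from a predictions
# TSV. The file path and the 'prediction'/'answer' column names are
# assumptions; adapt them to the actual VLMEvalKit output of your run.
import pandas as pd

df = pd.read_csv("outputs/EgoExoBench_MCQ_predictions.tsv", sep="\t")
correct = (df["prediction"].str.strip().str.upper()
           == df["answer"].str.strip().str.upper())
print(f"accuracy: {correct.mean():.4f} over {len(df)} questions")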

🙏 Acknowledgements

This codebase is built upon VLMEvalKit. EgoExoBench builds upon the publicly available ego–exo datasets Ego-Exo4D, LEMMA, EgoExoLearn, TF2023, EgoMe, and CVMHAT. Thanks to their authors for open-sourcing!
