[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Topics: benchmark, action-recognition, video-understanding, video-data, self-supervised, multimodal, video-dataset, open-set-recognition, video-retrieval, video-question-answering, masked-autoencoder, temporal-action-localization, contrastive-learning, spatio-temporal-action-localization, zero-shot-retrieval, video-clip, vision-transformer, zero-shot-classification, foundation-models, instruction-tuning
Updated: Dec 15, 2025. Language: Python