SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
April 8, 2025
Fida Mohammad Thoker
Letian Jiang
Chen Zhao
Piyush Bagad
Hazel Doughty
Bernard Ghanem
Cees G. M. Snoek

Abstract
Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by eliminating the need for manual annotations. Despite strong performance on standard action recognition benchmarks, existing video self-supervised learning methods are predominantly evaluated within narrow protocols—typically pre-training on Kinetics-400 and finetuning on similar datasets—limiting our understanding of their generalization capabilities in real-world settings. In this work, we present a comprehensive evaluation of modern video self-supervised learning models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover current state-of-the-art transformer-based video-only and video-text representation models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them against 10 CNN-based methods, resulting in over 1100 experiments across 8 datasets and 7 downstream tasks. Our analysis reveals that, despite architectural advancements, transformer-based models remain sensitive to downstream conditions. No single method generalizes consistently across all factors; for instance, video-only transformers are more robust to domain shift, CNN-based models perform better on tasks requiring fine-grained temporal reasoning, and video-text transformers underperform both in several downstream settings despite large-scale pretraining. We also observe that recent transformer-based approaches do not universally outperform earlier methods. These findings provide a detailed understanding of the capabilities and limitations of current video self-supervised learning approaches and establish an extended benchmark for evaluating generalization in video representation learning. Our benchmark offers a unified protocol for future research aimed at developing robust, transferable video models.
Type
Publication
arXiv preprint arXiv:2504.05706, 2025
Video Representation Learning
Video Understanding
Self-Supervised Learning
Vision-Language Pretraining