Deep Learning

OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection

Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field …

Shuming Liu

Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos

CLIP is a powerful spatial feature extractor trained on a large dataset of image-text pairs. It exhibits strong generalization when extended to other domains and modalities. …

Fatimah Zohra

Towards Automated Movie Trailer Generation

Movie trailers are an essential tool for promoting films and attracting audiences. However, the process of creating trailers can be time-consuming and expensive. To streamline this …

Dawit Mureja Argaw

Dr²Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning, which is highly …

Chen Zhao

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited …

Shuming Liu

Re²TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization

Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content. Given limited GPU memory, training TAL end to end …

Chen Zhao

FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model

Recently, conditional diffusion models have gained popularity in numerous applications due to their exceptional generation ability. However, many existing methods are …

Jiwen Yu

EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries

With the recent advances in video and 3D understanding, novel 4D spatio-temporal methods fusing both concepts have emerged. Towards this direction, the Ego4D Episodic Memory …

Jinjie Mai

A Unified Continual Learning Framework with General Parameter-Efficient Tuning

The 'pre-training → downstream adaptation' paradigm presents both new opportunities and challenges for Continual Learning (CL). Although the recent state-of-the-art in CL is achieved …

Qiankun Gao

Large-capacity and Flexible Video Steganography via Invertible Neural Network

Video steganography is the art of unobtrusively concealing secret data in a cover video and then recovering the secret data through a decoding protocol at the receiver end. …

Chong Mou