CLIP

Beta-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment

CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, …

Fatimah Zohra

Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos

CLIP is a powerful spatial feature extractor trained on a large dataset of image-text pairs. It exhibits strong generalization when extended to other domains and modalities. …

Fatimah Zohra