arxiv:2404.19394

CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation

Published on Apr 30, 2024

Authors:

Yifan Yang

Abstract

State space models and Mamba-based models have been increasingly applied across various domains, achieving state-of-the-art performance. This technical report introduces the first attempt to train a transferable Mamba model utilizing contrastive language-image pretraining (CLIP). We have trained Mamba models of varying sizes and undertaken comprehensive evaluations of these models on 26 zero-shot classification datasets and 16 out-of-distribution (OOD) datasets. Our findings reveal that a Mamba model with 67 million parameters is on par with a 307 million-parameter Vision Transformer (ViT) model in zero-shot classification tasks, highlighting the parameter efficiency of Mamba models. In tests of OOD generalization, Mamba-based models exhibit exceptional performance in conditions of OOD image contrast or when subjected to high-pass filtering. However, a Hessian analysis indicates that Mamba models feature a sharper and more non-convex landscape compared to ViT-based models, making them more challenging to train. The source code is available at https://github.com/raytrun/mamba-clip.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

Your need to confirm your account before you can post a new comment.

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2404.19394 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2404.19394 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2404.19394 in a Space README.md to link it from this page.