arXiv:2310.15308

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

Published on Oct 23, 2023
Featured in Daily Papers on Oct 25, 2023

Abstract

The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that assimilates their expertise. Our proposed method integrates multi-task learning, continual learning techniques, and teacher-student distillation. This strategy entails significantly less computational cost compared to traditional multi-task training from scratch. Additionally, it only demands a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that amalgamates the strengths of SAM and CLIP into a single backbone, making it apt for edge device applications. We show that SAM-CLIP learns richer visual representations, equipped with both localization and semantic features, suitable for a broad range of vision tasks. SAM-CLIP obtains improved performance on several head probing tasks when compared with SAM and CLIP. We further show that SAM-CLIP not only retains the foundational strengths of its precursor models but also introduces synergistic functionalities, most notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
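At training time, the recipe named in the abstract reduces to a weighted multi-task distillation objective: a cosine distillation loss against the frozen CLIP teacher plus SAM's focal-and-dice loss against the frozen SAM teacher (details in the community summary below). A minimal PyTorch-style sketch follows; the module names (`backbone`, `clip_head`, `sam_head`), the tensor shapes, and the single weighting factor `lam` are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss


def dice_loss(mask_logits, target_masks, eps=1e-6):
    """Soft Dice loss on mask logits (standard formulation, used here as a stand-in)."""
    probs = mask_logits.sigmoid().flatten(1)
    targets = target_masks.flatten(1)
    inter = (probs * targets).sum(-1)
    union = probs.sum(-1) + targets.sum(-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()


def multitask_distill_loss(backbone, clip_head, sam_head,
                           clip_images, clip_teacher_emb,
                           sam_images, sam_prompt_emb, sam_teacher_masks,
                           lam=1.0):
    """Linear combination of CLIP cosine distillation and SAM focal+dice distillation."""
    # CLIP branch: match the frozen CLIP teacher's image embedding (cosine distillation).
    student_emb = clip_head(backbone(clip_images))                 # (B, D) image-level embedding
    l_clip = (1.0 - F.cosine_similarity(student_emb, clip_teacher_emb, dim=-1)).mean()

    # SAM branch: reproduce the frozen SAM teacher's masks (focal + dice, the "FD" loss).
    mask_logits = sam_head(backbone(sam_images), sam_prompt_emb)   # (B, 1, H, W) mask logits
    l_sam = (sigmoid_focal_loss(mask_logits, sam_teacher_masks, reduction="mean")
             + dice_loss(mask_logits, sam_teacher_masks))

    # lam trades off preserving SAM's spatial skills against acquiring CLIP's semantics.
    return l_clip + lam * l_sam
```

In training, the shared backbone and SAM head would sit in a separate optimizer parameter group with a lower learning rate than the freshly initialized CLIP head, matching the forgetting-avoidance note in the summary below.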

Community

Proposes a method to merge vision foundation models (VFMs) into a unified model using multi-task learning, continual learning techniques, and teacher-student distillation. Introduces SAM-CLIP (merging SAM and CLIP), which combines strong localization (spatial) and semantic features, is suitable for edge-device deployment, and enables zero-shot semantic segmentation. Related to knowledge distillation (from two teachers into a single student, with the student gaining zero-shot capabilities), continual learning based on memory replay, and data-dependent model merging.

Method: SAM is the base VFM (its ViT-Det backbone can efficiently parse high-resolution images) and CLIP the auxiliary VFM. SAM's and CLIP's image encoders are merged into one backbone (initialized from the SAM encoder), with separate heads for SAM (SAM's mask decoder) and CLIP (randomly initialized); SAM's prompt encoder and CLIP's text encoder stay frozen. The baseline is knowledge distillation on CLIP's (proxy) training data with a cosine distillation loss (the frozen CLIP encoder as teacher), which causes catastrophic forgetting of SAM's abilities. Instead, the CLIP head is first trained via head probing, then multi-task distillation is applied: a linear combination of the CLIP loss and SAM's FD loss (focal and dice) trains both heads and the shared SAM-CLIP encoder backbone.

Training setup: Merged-41M combines partial CLIP data (CC3M, CC12M, YFCC-15M, and ImageNet-21k) with a 5.7% subset of SA-1B from SAM. The backbone is ViT-B/16; the CLIP head is a three-layer transformer with max-pooling to obtain image-level embeddings. Resolution adaptation fine-tunes the CLIP head on high-resolution images (1024 instead of 224/336/448) for a few epochs. The shared backbone encoder and SAM head use a lower learning rate (to avoid forgetting), while the randomly initialized CLIP head uses a higher one.

Results: the resulting SAM-CLIP model (whose inference pipeline uses the frozen CLIP text encoder and SAM geometric prompt encoder) retains SAM's zero-shot instance segmentation and CLIP's zero-shot classification, while gaining emergent zero-shot semantic segmentation (Pascal VOC, ADE20K, COCO). Head probing (learning a task-specific head on a frozen image backbone) with linear, DeepLab-v3, and PSPNet segmentation heads on Pascal VOC and ADE20K shows that SAM-CLIP aligns with CLIP's semantics for semantic segmentation and image classification (ImageNet and Places365). The SAM head can consume embeddings from the SAM-CLIP image backbone together with the geometric prompt encoder, which receives point prompts derived from the CLIP head's coarse mask predictions (sketched below), yielding better mIoU for semantic segmentation and high-resolution image processing. In short, the spatial understanding of SAM and the semantic scene understanding of CLIP are merged into a single model.

The appendix covers software details (the CVNets framework), training details, an ablation on image resolution, additional inference experiments, weight-averaging experiments with Wise-FT (varying alpha), and limitations. From Apple and the University of Illinois Urbana-Champaign.
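For the emergent zero-shot semantic segmentation pipeline described above (coarse per-class masks from the CLIP head and frozen text encoder, refined by the SAM head via point prompts through the frozen geometric prompt encoder), a rough sketch might look like the following. All module names (`backbone`, `clip_head`, `sam_head`, `text_encoder`, `prompt_encoder`), the dense `pool=False` call, and the top-k point-sampling heuristic are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_semantic_seg(image, class_prompts,
                           backbone, clip_head, sam_head,
                           text_encoder, prompt_encoder, points_per_class=3):
    """Sketch of SAM-CLIP-style two-stage zero-shot semantic segmentation."""
    feats = backbone(image)                                   # shared image features (1, C, H', W')

    # Stage 1: coarse per-class masks from the CLIP head and frozen text encoder.
    patch_emb = clip_head(feats, pool=False)                  # (1, D, h, w) dense embeddings (assumed API)
    text_emb = text_encoder(class_prompts)                    # (K, D) class-name embeddings
    coarse = torch.einsum("kd,bdhw->bkhw",
                          F.normalize(text_emb, dim=-1),
                          F.normalize(patch_emb, dim=1))      # cosine-similarity logits per class

    # Stage 2: sample high-confidence points per class and refine with the SAM head.
    refined = []
    for k in range(coarse.shape[1]):
        scores = coarse[0, k]                                  # (h, w) coarse mask for class k
        idx = scores.flatten().topk(points_per_class).indices
        # (x, y) coordinates on the coarse grid; in practice these would be rescaled
        # to the input image resolution before prompting.
        points = torch.stack([idx % scores.shape[1], idx // scores.shape[1]], dim=-1)
        prompt_emb = prompt_encoder(points.float())            # frozen SAM geometric prompt encoder
        refined.append(sam_head(feats, prompt_emb))            # high-res mask logits for class k
    masks = torch.cat(refined, dim=1)                          # (1, K, H, W)
    return masks.argmax(dim=1)                                 # per-pixel class prediction
```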
