22 SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding · 9 authors 3