Vision Language Models (VLMs)
Paper • 2407.07726 • Published • 68
Note: Code and model are available on GitHub and Hugging Face. PaliGemma is an open Vision-Language Model (VLM) built on the SigLIP-So400m vision encoder and the Gemma-2B language model. It shows strong performance on a wide variety of open-world tasks: the authors evaluate PaliGemma on almost 40 diverse tasks, covering standard VLM benchmarks as well as more specialized tasks such as remote sensing and segmentation.
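A minimal inference sketch, assuming the transformers PaliGemma integration (PaliGemmaForConditionalGeneration, available since transformers 4.41) and access to the gated google/paligemma-3b-pt-224 checkpoint; the image URL and prompt are placeholders:

```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests

model_id = "google/paligemma-3b-pt-224"  # base checkpoint; "-mix" fine-tuned variants also exist
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image: any RGB image works here.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "caption en"  # PaliGemma uses short task-prefix prompts
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))
```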
Vision language models are blind
Paper • 2407.06581 • Published • 82
Note: VLMs struggle with tasks that require precise spatial information and counting (from 0 to 10). It sometimes feels as if VLMs are near-sighted: they cannot see fine details and can only guess.
CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging
Paper • 2407.07315 • Published • 6
Note: CLIP for astronomy, fine-tuned from a pretrained CLIP.
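A minimal sketch of the recipe the note describes (fine-tuning a pretrained CLIP on astronomy image-caption pairs); the base checkpoint, learning rate, and data handling are illustrative assumptions, not details from the paper:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed base checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(images, captions):
    # images: list of PIL images; captions: matching astronomy descriptions
    batch = processor(text=captions, images=images, return_tensors="pt",
                      padding=True, truncation=True)
    out = model(**batch, return_loss=True)  # symmetric contrastive (InfoNCE) loss over the batch
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```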
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Paper • 2407.06189 • Published • 25
Note: Instruction tuning? [Q] The first video self-training approach: Video-STaR allows any labeled video dataset to be used for video instruction tuning.
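A hedged sketch of a STaR-style self-training round under the setup the note describes (turning any labeled video dataset into instruction-tuning data). generate_answer, answer_is_consistent_with_label, and finetune are hypothetical helpers standing in for the model call, the label-verification filter, and the training step:

```python
from typing import Callable, List, Tuple

def self_training_round(
    model,
    videos_with_labels: List[Tuple[str, str]],   # (video_path, existing weak label, e.g. a class name)
    question: str,                               # instruction posed to the model
    generate_answer: Callable,                   # (model, video, question) -> free-form answer
    answer_is_consistent_with_label: Callable,   # cheap check, e.g. does the label appear in the answer?
    finetune: Callable,                          # instruction-tunes the model on (video, question, answer) triples
):
    kept = []
    for video, label in videos_with_labels:
        answer = generate_answer(model, video, question)
        # Keep only generations that agree with the existing supervision.
        if answer_is_consistent_with_label(answer, label):
            kept.append((video, question, answer))
    # Tune on the verified self-generated answers, then repeat the round.
    return finetune(model, kept)
```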
Unveiling Encoder-Free Vision-Language Models
Paper • 2406.11832 • Published • 50
Note: Developing a pure decoder-only architecture across modalities. Why can language models work without an encoder, and can vision models do the same? How? A round-up of the different transformer architectures used by large models; [2R]
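A minimal sketch of the encoder-free idea: raw image patches are embedded with a lightweight layer and fed straight into a decoder-only transformer together with text tokens, with no separate pretrained vision encoder. The layer sizes and the plain patch-embedding layer are illustrative assumptions; a real system would reuse a pretrained LLM as the backbone:

```python
import torch
import torch.nn as nn

class EncoderFreeVLM(nn.Module):
    """Decoder-only sketch: image patches and text share one causal transformer."""
    def __init__(self, vocab_size=32000, d_model=512, patch=14, n_layers=4, n_heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # A single conv layer maps raw RGB patches to token embeddings (no ViT encoder).
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        # Stand-in for a causal decoder-only LLM backbone.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, pixels, input_ids):
        img_tokens = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, num_patches, d_model)
        txt_tokens = self.text_embed(input_ids)                           # (B, T, d_model)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)                  # one interleaved token sequence
        # Standard additive causal mask: -inf strictly above the diagonal.
        causal = torch.full((seq.size(1), seq.size(1)), float("-inf")).triu(1)
        hidden = self.backbone(seq, mask=causal)
        return self.lm_head(hidden)                                       # next-token logits
```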