Merve Noyan PRO

merve

AI & ML interests

VLMs, vision & co

Articles

PaliGemma – Google's Cutting-Edge Open Vision Language Model

Vision Language Models Explained

Introduction to Quantization cooked in 🤗 with 💗🧑‍🍳

Deploy MusicGen in no time with Inference Endpoints

Open-Source Text Generation & LLM Ecosystem at Hugging Face

Jupyter X Hugging Face

Using Machine Learning to Aid Survivors and Race through Time

Introducing Skops

Announcing the Hugging Face Fellowship Program

Showcase Your Projects in Spaces using Gradio

Hosting your Models and Datasets on Hugging Face Spaces using Streamlit

Organizations

Posts 31

Post

1553

Do we fully leverage ViT encoders in vision language models?

A new paper by @HuanjinYao et al. proposes a dense connector that makes better use of them!
Model: HuanjinYao/DenseConnector-v1.5-8B
Collection: HuanjinYao/denseconnector-66500e173fc8c9f05dc98dea

VLMs consist of an image encoder, a projection layer that maps image embeddings into the text embedding space, and a text decoder, connected sequentially 📖
This paper explores using the intermediate states of the image encoder rather than only its final output 🤩
The authors explore three ways of instantiating the dense connector: sparse token integration, sparse channel integration, and dense channel integration (see the paper for how each one works: Dense Connector for MLLMs (2405.13800)).

They integrate all three into LLaVA 1.5 and find that each of the resulting models outperforms the original LLaVA 1.5 🥹 I tried the model and it seems to work very well. As part of the release, the authors have published various checkpoints based on different decoders (Vicuna 7/13B and Llama 3-8B), which you can find in the collection 🤗
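
To make the idea of using intermediate encoder states concrete, here is a minimal NumPy sketch of one possible connector: it concatenates the hidden states of a few selected encoder layers along the channel axis and projects the result into the text embedding space. This is only an illustration and not the authors' implementation. The function name, layer indices, shapes, and the single linear projection are all assumptions; see the paper for the actual three variants.

```python
import numpy as np

def dense_connector_sketch(hidden_states, layer_ids, proj_weight):
    """Hypothetical connector: fuse several encoder layers, then project.

    hidden_states: list of (num_tokens, vision_dim) arrays, one per encoder layer
    layer_ids:     indices of the layers to fuse (assumed values, for illustration)
    proj_weight:   (len(layer_ids) * vision_dim, text_dim) projection matrix
    """
    selected = [hidden_states[i] for i in layer_ids]
    # Concatenate along the channel axis: (num_tokens, len(layer_ids) * vision_dim)
    fused = np.concatenate(selected, axis=-1)
    # A single linear projection into the text embedding space: (num_tokens, text_dim)
    return fused @ proj_weight

# Toy sizes (assumptions, far smaller than a real ViT or LLM)
rng = np.random.default_rng(0)
num_layers, num_tokens, vision_dim, text_dim = 24, 16, 8, 32
hidden_states = [rng.normal(size=(num_tokens, vision_dim)) for _ in range(num_layers)]

layer_ids = [7, 15, 23]  # two intermediate layers plus the final one
proj_weight = rng.normal(size=(len(layer_ids) * vision_dim, text_dim))

visual_tokens = dense_connector_sketch(hidden_states, layer_ids, proj_weight)
print(visual_tokens.shape)
```

Passing only the last layer index (`[23]` here) with a matching projection matrix reduces this to the usual setup, where the decoder sees just the encoder's final output.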

Post

952

We will be providing ZeroGPU grants (for Spaces inference) to those who want to fine-tune PaliGemma and build a Space 🔥

You can pick any dataset you like!

Example code: https://colab.research.google.com/drive/1x_OEphRK0H97DqqxEyiMewqsTiLD_Xmi?usp=sharing (with QLoRA, you can use a smaller GPU)

Datasets:
https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=trending
https://huggingface.co/datasets?task_categories=task_categories:image-to-text&sort=trending

Collections 21

Spaces 97

Paligemma Tracking

Running on Zero

Owl Tracking

Powerful foundation model for zero-shot object tracking

Running on Zero

Paligemma Doc

Try PaliGemma on document understanding tasks

Running on Zero

BLIP2 with transformers

BLIP2 (cutting edge image captioning) in 🤗transformers

Running on Zero

Compare VLMs

Running on Zero

GroundingDINO ⚔ OWL

Models 78

merve/paligemma_vqav2

Updated 6 days ago • 113 • 2

merve/pg-vqav2

Updated 10 days ago

merve/checkpoint

Updated 18 days ago

merve/output8

Updated 22 days ago

merve/output4

merve/VeCap-DFN-h14

Zero-Shot Image Classification • Updated Mar 26 • 3

merve/VeCap-DFN-l14

merve/VeCap-DFN-b16

Zero-Shot Image Classification • Updated Mar 26 • 3

merve/VeCLIP-b16-100m

Zero-Shot Image Classification • Updated Mar 26 • 2

merve/VeCLIP-b16-200m

Zero-Shot Image Classification • Updated Mar 26 • 13 • 1

Datasets 22

merve/YouCook2

Viewer • Updated 4 days ago

merve/faiss_embeddings

merve/pokemon-ds-embeddings

Viewer • Updated Jan 10 • 3

merve/tr-h4-norobots

Updated Jan 7 • 10

merve/lego_sets_latest

Viewer • Updated Jan 6 • 5 • 1

merve/ai-tube-dummy

Updated Dec 1, 2023

merve/my-blog-images

Viewer • Updated Aug 25, 2023 • 1

merve/turkish_instructions

Viewer • Updated Apr 27, 2023 • 650 • 32

merve/ner-flags

Updated Feb 13, 2023

merve/xlm-roberta-large-df

Viewer • Updated Feb 7, 2023