This is a new checkpoint trained from llava-v1.6-mistral-7b-hf with an enhanced training setup (LoRA tuning, a batch size of 2048, and a maximum sub-dataset size of 100k). It shows significantly improved performance on MMEB and Flickr30K compared to the previous Phi-3.5-based model. A hedged sketch of such a LoRA setup is shown below.
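The exact fine-tuning hyperparameters beyond those listed above are not documented here. The snippet below is only a minimal sketch of how a LoRA setup of this kind might be declared with the peft library; the rank, alpha, dropout, and target modules are assumed placeholder values, not the values used for this checkpoint.

```python
# Hypothetical LoRA configuration sketch using the peft library.
# Rank, alpha, dropout, and target modules are placeholder assumptions;
# only the use of LoRA tuning itself comes from the description above.
from peft import LoraConfig, get_peft_model
from transformers import LlavaNextForConditionalGeneration

base = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf"
)
lora_config = LoraConfig(
    r=16,                                  # assumed rank
    lora_alpha=32,                         # assumed scaling factor
    lora_dropout=0.05,                     # assumed dropout
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```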

This repo contains the code and data for VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks. In this paper, we focus on building a unified multimodal embedding model suitable for a wide range of tasks. Our approach transforms an existing, well-trained Vision-Language Model (VLM) into an embedding model. The core idea is to append an [EOS] token at the end of the input sequence; the hidden state of this token serves as the embedding of the combined multimodal input.
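The sketch below illustrates this last-token pooling idea with the standard transformers LLaVA-NeXT classes. It is an assumption-laden example, not the official inference path: the prompt format and the use of LlavaNextForConditionalGeneration for this checkpoint are guesses, and the VLM2Vec codebase provides its own loading and pooling utilities.

```python
# Hypothetical sketch of last-token ([EOS]-style) pooling with a LLaVA-NeXT checkpoint.
# Model class, prompt format, and pooling details are assumptions; see the official
# VLM2Vec repository for the supported inference code.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "parasail-ai/VLM2Vec-LLaVa-Next-vllm"  # assumed to load with the standard classes
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
prompt = "[INST] <image>\nRepresent the given image for retrieval. [/INST]"  # assumed instruction
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, return_dict=True)

# Take the final-layer hidden state at the last token position as the embedding,
# then L2-normalize it for cosine-similarity retrieval.
last_hidden = outputs.hidden_states[-1]      # (batch, seq_len, hidden_dim)
embedding = last_hidden[:, -1, :]            # last-token pooling
embedding = torch.nn.functional.normalize(embedding, p=2, dim=-1)
print(embedding.shape)
```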
