FloSmolV

A vision-language model for image-text-to-text generation, built by combining HuggingFaceTB/SmolLM-360M-Instruct and microsoft/Florence-2-base.

Florence-2-base generates text (captions) from input images quickly. That text can then be passed to a language model as context for answering questions. SmolLM-360M is a small model from the Hugging Face team that produces fast text output for input queries. FloSmolV combines the two into a Visual Question Answering model that answers questions about images.
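The caption-then-answer flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's actual implementation: the names build_prompt, answer_question, fake_captioner and fake_llm are hypothetical, and the prompt layout is an assumption. The captioner and the language model are passed in as plain callables, so the sketch runs without downloading either model.

```python
from typing import Any, Callable


def build_prompt(caption: str, question: str) -> str:
    """Combine an image caption and a user question into one text prompt.

    The layout below is an assumed format for illustration only.
    """
    return (
        f"Image description: {caption.strip()}\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any], str],
    llm: Callable[[str], str],
) -> str:
    """Two-stage VQA: caption the image, then let a text model answer."""
    caption = captioner(image)                # stage 1: image -> caption
    prompt = build_prompt(caption, question)  # merge caption and question
    return llm(prompt).strip()                # stage 2: prompt -> answer


# Hypothetical stand-ins for the two models, so the sketch runs offline.
def fake_captioner(image: Any) -> str:
    return "A red cable car hanging over a snowy valley."


def fake_llm(prompt: str) -> str:
    return " A cable car." if "cable car" in prompt else " Unknown."


print(answer_question(None, "what is the object in the image?",
                      fake_captioner, fake_llm))
```

In a real setup the two callables would wrap a Florence-2 captioning call and a SmolLM generation call. Keeping them as parameters makes the control flow easy to test on its own.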

Usage

Make sure to install the necessary dependencies.

pip install -qU transformers accelerate einops bitsandbytes flash_attn timm

# load a free image from Pixabay
import requests
from PIL import Image

url = "https://cdn.pixabay.com/photo/2023/11/01/11/15/cable-car-8357178_640.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# download the model; trust_remote_code=True runs the custom code from the model repository
from transformers import AutoModelForCausalLM

# .cuda() moves the model to the GPU, so a CUDA-capable device is required
model = AutoModelForCausalLM.from_pretrained("dmedhi/flosmolv", trust_remote_code=True).cuda()

# ask a question about the image
model(img, "what is the object in the image?")

You can find more about the model and its configuration script here: https://huggingface.co/dmedhi/flosmolv/tree/main