
We are creating a spatially aware vision-language (VL) model.

This model was trained on COCO images augmented with extra annotations describing the spatial relationships between the entities in each image.

This is a sequence-to-sequence model for visual question answering. The architecture is BLIP (BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation).
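
As a quick sanity check before downloading the full weights, you can inspect the published configuration (a minimal sketch; it assumes the hosted config.json exposes these standard fields):

from transformers import AutoConfig

# Load only the configuration, not the weights
config = AutoConfig.from_pretrained("voxeality/rgb-language_vqa")
print(config.model_type)                          # expected: "blip"
print(getattr(config, "architectures", None))     # expected to list "BlipForQuestionAnswering"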

Requirements:
- 4 GB GPU RAM
- CUDA-enabled Docker
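
A minimal sketch for checking these requirements from Python (assuming torch is installed inside the CUDA-enabled container; the 4 GB figure comes from the requirements above):

import torch

# Verify a CUDA device is visible inside the container
assert torch.cuda.is_available(), "CUDA device not found - run inside a CUDA-enabled Docker image"

# Verify the GPU has roughly 4 GB of memory or more, as required above
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU memory: {total_gb:.1f} GB")
assert total_gb >= 4, "The model needs roughly 4 GB of GPU RAM"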

How to download and run the model:

from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
from PIL import Image
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Hugging Face Hub model id (or local path to the saved model)
model_path = "voxeality/rgb-language_vqa"
# Load the model
model = BlipForQuestionAnswering.from_pretrained(model_path).to(device, torch.float16)
# Any question of the form "Where is <object>?" or
# "What is to the left/right/above/below/in front of/behind <object>?"
question = "Where is the chair?"
image_path = "path/to/file"
image = Image.open(image_path).convert("RGB")

# Load the processor used during training for consistent preprocessing
processor = BlipProcessor.from_pretrained(model_path)
# prepare inputs
encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)

out = model.generate(**encoding, max_new_tokens=200)
generated_text = processor.decode(out[0], skip_special_tokens=True)
print(generated_text)
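
The snippet above assumes a CUDA device and half precision. If only a CPU is available, a full-precision fallback sketch (same model and processor, just without float16) would look like this:

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

model_path = "voxeality/rgb-language_vqa"

# Full-precision CPU fallback: float16 is only used on GPU above
model = BlipForQuestionAnswering.from_pretrained(model_path)
processor = BlipProcessor.from_pretrained(model_path)

image = Image.open("path/to/file").convert("RGB")  # replace with your image path
question = "Where is the chair?"                   # example spatial question

encoding = processor(image, question, return_tensors="pt")
out = model.generate(**encoding, max_new_tokens=200)
print(processor.decode(out[0], skip_special_tokens=True))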

The model is trained to produce a spatial answer to any question regarding spatial relationships between objects in the image.

The output of such a dialogue takes one of the following forms:

Q. Where is "Object1"? A. To the "left/right/etc." of another "Object2".

OR

Q. What is below "Object1"? A. An "Object2".
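
For example, a small helper that asks both supported question forms for one image (a sketch reusing the model, processor, device, and image loaded above; the object names are placeholders):

def ask(image, question):
    # Run one spatial VQA query and return the decoded answer
    encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**encoding, max_new_tokens=200)
    return processor.decode(out[0], skip_special_tokens=True)

# First form: Where is <object>?  -> answer like: to the <left/right/...> of <object>
print(ask(image, "Where is the chair?"))
# Second form: What is <relation> <object>?  -> answer like: an <object>
print(ask(image, "What is below the table?"))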

Model size: 385M params · Tensor type: F32 · Format: Safetensors