We are creating a spatially aware vision-language (VL) model.
The model is trained on COCO dataset images, augmented with extra information about the spatial relationships between the entities in each image.
This is a sequence-to-sequence model for visual question answering. The architecture is BLIP (BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation).
Requirements!
- 4GB GPU RAM.
- CUDA enabled docker.

The way to download and run this:
```python
from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
from PIL import Image

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Use half precision on the GPU; fall back to full precision on the CPU
dtype = torch.float16 if device.type == "cuda" else torch.float32

# Specify the path to the directory where the model was saved
model_path = "voxeality/rgb-language_vqa"

# Load the model
model = BlipForQuestionAnswering.from_pretrained(model_path).to(device, dtype)

# Load the processor used during training for consistent preprocessing
processor = BlipProcessor.from_pretrained(model_path)

question = "any question in the form of where is an object or what is to the left/right/above/below/in front/behind the object"
image_path = 'path/to/file'
image = Image.open(image_path).convert("RGB")

# Prepare inputs on the same device and dtype as the model
encoding = processor(image, question, return_tensors="pt").to(device, dtype)

# Generate and decode the answer
out = model.generate(**encoding, max_new_tokens=200)
generated_text = processor.decode(out[0], skip_special_tokens=True)
print(generated_text)
```

# Welcome to the VOXReality Horizon Europe Project
Below you'll find the necessary instructions in order to run our provided code. The instructions refer to the building of the rgb-language_vqa service which exposes 1 endpoint and utilizes the VOXReality vision-language spatial visual question answering (open type) model.
The model is trained to produce a spatial answer to any question regarding spatial relationships between objects in the image.
The output of this dialogue takes the following form:
Q. Where is "Object1"? A. To the "Left/Right etc." of another "Object2".

## 1. Requirements
- CUDA compatible GPU.
- We recommend at least 4GB of GPU memory.
- The code was tested on Nvidia proprietary driver 515 and 525.
- For Linux (tested on Ubuntu 20.04):
  - Make sure Docker is installed on your system.
  - Make sure you have the NVIDIA Container Toolkit installed. More info and instructions can be found in the official installation guide.
- For Windows (tested on Windows 10 and 11):
  - Make sure Docker is installed on your system.
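
The requirements above can be sanity-checked before building the service. The following is a minimal sketch, not part of the provided code: the helper names `has_enough_memory` and `check_environment` are hypothetical, the 4 GB threshold is read as decimal gigabytes (an assumption), and the check only reports whether a `docker` executable is on the `PATH` and whether PyTorch can see a CUDA device. It does not verify the driver version or the NVIDIA Container Toolkit.

```python
import shutil

# Recommended minimum from the requirements list, interpreted as
# decimal gigabytes (an assumption made for this sketch).
MIN_GPU_MEMORY_GB = 4


def has_enough_memory(total_bytes, min_gb=MIN_GPU_MEMORY_GB):
    """Return True if total_bytes is at least min_gb decimal gigabytes."""
    return total_bytes >= min_gb * 10 ** 9


def check_environment():
    """Collect a few best-effort facts about the local setup."""
    report = {"docker_on_path": shutil.which("docker") is not None}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so CUDA visibility cannot be determined.
        report["cuda_available"] = None
        return report
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Total memory of the first visible GPU, in bytes.
        total = torch.cuda.get_device_properties(0).total_memory
        report["gpu_memory_ok"] = has_enough_memory(total)
    return report


if __name__ == "__main__":
    print(check_environment())
```

Running the script on the host prints a small dictionary. A `False` under `docker_on_path` or `cuda_available` suggests revisiting the corresponding item in the list above.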