|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
metrics: |
|
- code_eval |
|
library_name: transformers |
|
pipeline_tag: image-to-text |
|
tags: |
|
- text-generation-inference |
|
--- |
|
<u><b>We are creating a spatially aware vision-language (VL) model.</b></u>
|
|
|
This model was trained on COCO dataset images enriched with extra information about the spatial relationships between the entities in each image.
|
|
|
This is a sequence-to-sequence model for visual question answering. The architecture is <u><b>BLIP (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation)</b></u>.
|
|
|
<details> |
|
<summary>Requirements!</summary> |
|
- 4 GB of GPU RAM.
|
- CUDA-enabled Docker
|
</details> |
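
If you want to verify that your machine meets these requirements before pulling the model, a quick check along the following lines can help. This is a minimal sketch using PyTorch; the 4 GB threshold simply mirrors the recommendation above.

```python
# Quick sanity check for a CUDA-capable GPU with roughly 4 GB of memory.
# Minimal sketch; the 4 GB threshold mirrors the recommendation above.
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"CUDA GPU detected with {total_gb:.1f} GB of memory")
    if total_gb < 4:
        print("Warning: less than the recommended 4 GB of GPU memory")
else:
    print("No CUDA GPU detected; inference will fall back to CPU")
```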
|
|
|
How to download and run the model:
|
```python |
|
from transformers import BlipProcessor, BlipForQuestionAnswering |
|
import torch |
|
from PIL import Image |
|
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") |
|
# Model identifier on the Hugging Face Hub (or a local directory where the model was saved)
|
model_path = "voxeality/rgb-language_vqa" |
|
# Load the model |
|
model = BlipForQuestionAnswering.from_pretrained(model_path).to(device, torch.float16) |
|
question = "any question in the form of where is an object or what is to the left/right/above/below/in front/behind the object" |
|
image_path= 'path/to/file' |
|
image = Image.open(image_path).convert("RGB") |
|
|
|
# Load the processor used during training for consistent preprocessing |
|
processor = BlipProcessor.from_pretrained(model_path) |
|
# prepare inputs |
|
encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)
|
|
|
|
out = model.generate(**encoding, max_new_tokens=200) |
|
generated_text = processor.decode(out[0], skip_special_tokens=True) |
|
print(generated_text) |
|
``` |
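
As a follow-up usage example, the loaded model and processor can be reused for several spatial questions about the same image. The snippet below assumes the variables from the code above (`model`, `processor`, `image`, `device`); the example questions are illustrative only.

```python
# Reuse the objects loaded above (model, processor, image, device) to ask
# several spatial questions about the same image. Example questions only.
questions = [
    "where is the chair?",
    "what is to the left of the table?",
    "what is behind the person?",
]

for q in questions:
    enc = processor(image, q, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**enc, max_new_tokens=200)
    answer = processor.decode(out[0], skip_special_tokens=True)
    print(f"Q: {q} -> A: {answer}")
```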
|
Below you will find the instructions needed to run our provided code. They cover building the rgb-language_vqa service, which exposes one endpoint and uses the VOXReality vision-language spatial visual question answering (open-type) model.
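
Once the service container is built and running (see the requirements below), querying its endpoint might look like the sketch that follows. The host, port, endpoint path, and payload field names here are assumptions for illustration only, not the documented API of the service.

```python
# Hypothetical request to the running rgb-language_vqa service.
# Host, port, path ("/vqa") and field names are illustrative assumptions.
import requests

url = "http://localhost:8000/vqa"  # assumed address of the service endpoint

with open("path/to/file", "rb") as f:
    response = requests.post(
        url,
        files={"image": f},
        data={"question": "where is the chair?"},
        timeout=60,
    )

print(response.json())  # e.g. {"answer": "to the left of the table"}
```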
|
|
|
|
|
|
|
The model is trained to produce a spatial answer to any question about spatial relationships between objects in the image.
|
|
|
<i>The output of this dialogue takes the following form:</i>
|
|
|
Q: Where is "Object1"? A: To the "left/right etc." of another "Object2".
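
If a downstream component needs the relation and the reference object separately, the answer string can be split along the pattern described above. This is a rough sketch assuming the answer follows the "to the ... of ..." wording; the exact phrasing may vary.

```python
# Rough sketch: split a generated answer of the form
# "to the <relation> of (another) <object>" into its parts.
# Assumes the wording shown above; real answers may vary.
import re

answer = "to the left of the table"  # illustrative example output
match = re.match(r"to the (.+?) of (?:another )?(.+)", answer)
if match:
    relation, reference_object = match.groups()
    print(relation)          # left
    print(reference_object)  # the table
```
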
|
## 1. Requirements |
|
--- |
|
1. CUDA-compatible GPU.
   1. We recommend at least 4 GB of GPU memory.
   2. The code was tested on NVIDIA proprietary drivers 515 and 525.
2. For Linux (tested on Ubuntu 20.04).
   1. Make sure Docker is installed on your system.
   2. Make sure you have the NVIDIA Container Toolkit installed. More info and instructions can be found in the [official installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
3. For Windows (tested on Windows 10 and 11).
   1. Make sure Docker is installed on your system.
|
|