|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
metrics: |
|
- code_eval |
|
library_name: transformers |
|
pipeline_tag: image-to-text |
|
tags: |
|
- text-generation-inference |
|
--- |
|
<u><b>We are creating a spatially aware vision-language (VL) model.</b></u>
|
|
|
This model was trained on COCO dataset images enriched with extra information about the spatial relationships between the entities in each image.
|
|
|
This is a sequence-to-sequence model for visual question answering. The architecture is <u><b>BLIP (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation)</b></u>.
|
|
|
<details> |
|
<summary>Requirements!</summary> |
|
- 4 GB of GPU RAM.
|
- CUDA-enabled Docker
|
</details> |
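
If you want to verify that your machine meets these requirements before pulling the model, a quick check along the following lines can help. This is a minimal sketch using PyTorch; the 4 GB threshold simply mirrors the recommendation above.

```python
# Quick sanity check for a CUDA-capable GPU with roughly 4 GB of memory.
# Minimal sketch; the 4 GB threshold mirrors the recommendation above.
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"CUDA GPU detected with {total_gb:.1f} GB of memory")
    if total_gb < 4:
        print("Warning: less than the recommended 4 GB of GPU memory")
else:
    print("No CUDA GPU detected; inference will fall back to CPU")
```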
|
|
|
How to download and run the model:
|
```python |
|
from transformers import BlipProcessor, BlipForQuestionAnswering |
|
import torch |
|
from PIL import Image |
|
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") |
|
# Model identifier on the Hugging Face Hub (or a local directory where the model was saved)
|
model_path = "voxeality/rgb-language_vqa" |
|
# Load the model |
|
model = BlipForQuestionAnswering.from_pretrained(model_path).to(device, torch.float16) |
|
question = "any question in the form of where is an object or what is to the left/right/above/below/in front/behind the object" |
|
image_path= 'path/to/file' |
|
image = Image.open(image_path).convert("RGB") |
|
|
|
# Load the processor used during training for consistent preprocessing |
|
processor = BlipProcessor.from_pretrained(model_path) |
|
# prepare inputs |
|
encoding = processor(image, question, return_tensors="pt").to(device, torch.float16)
|
|
|
|
out = model.generate(**encoding, max_new_tokens=200) |
|
generated_text = processor.decode(out[0], skip_special_tokens=True) |
|
print(generated_text) |
|
``` |
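
As a follow-up usage example, the loaded model and processor can be reused for several spatial questions about the same image. The snippet below assumes the variables from the code above (`model`, `processor`, `image`, `device`); the example questions are illustrative only.

```python
# Reuse the objects loaded above (model, processor, image, device) to ask
# several spatial questions about the same image. Example questions only.
questions = [
    "where is the chair?",
    "what is to the left of the table?",
    "what is behind the person?",
]

for q in questions:
    enc = processor(image, q, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**enc, max_new_tokens=200)
    answer = processor.decode(out[0], skip_special_tokens=True)
    print(f"Q: {q} -> A: {answer}")
```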
|
Below you will find the instructions needed to run our provided code. They cover building the rgb-language_vqa service, which exposes one endpoint and uses the VOXReality vision-language spatial visual question answering (open-type) model.
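
Once the service container is built and running (see the requirements below), querying its endpoint might look like the sketch that follows. The host, port, endpoint path, and payload field names here are assumptions for illustration only, not the documented API of the service.

```python
# Hypothetical request to the running rgb-language_vqa service.
# Host, port, path ("/vqa") and field names are illustrative assumptions.
import requests

url = "http://localhost:8000/vqa"  # assumed address of the service endpoint

with open("path/to/file", "rb") as f:
    response = requests.post(
        url,
        files={"image": f},
        data={"question": "where is the chair?"},
        timeout=60,
    )

print(response.json())  # e.g. {"answer": "to the left of the table"}
```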
|
|
|
|
|
|
|
The model is trained to produce a spatial answer to any question about spatial relationships between objects in the image.
|
|
|
<i>The output of this dialogue takes the following form:</i>
|
|
|
Q: Where is "Object1"? A: To the "left/right etc." of another "Object2".
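
If a downstream component needs the relation and the reference object separately, the answer string can be split along the pattern described above. This is a rough sketch assuming the answer follows the "to the ... of ..." wording; the exact phrasing may vary.

```python
# Rough sketch: split a generated answer of the form
# "to the <relation> of (another) <object>" into its parts.
# Assumes the wording shown above; real answers may vary.
import re

answer = "to the left of the table"  # illustrative example output
match = re.match(r"to the (.+?) of (?:another )?(.+)", answer)
if match:
    relation, reference_object = match.groups()
    print(relation)          # left
    print(reference_object)  # the table
```
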
|
## 1. Requirements |
|
--- |
|
1. CUDA-compatible GPU.
   1. We recommend at least 4 GB of GPU memory.
   2. The code was tested on NVIDIA proprietary drivers 515 and 525.
2. For Linux (tested on Ubuntu 20.04).
   1. Make sure Docker is installed on your system.
   2. Make sure you have the NVIDIA Container Toolkit installed. More info and instructions can be found in the [official installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
3. For Windows (tested on Windows 10 and 11).
   1. Make sure Docker is installed on your system.
|
|