---
license: mit
language:
- en
metrics:
- code_eval
library_name: transformers
pipeline_tag: image-to-text
tags:
- text-generation-inference
---

We are building a spatially aware vision-language (VL) model. It is trained on COCO images enriched with extra information about the spatial relationships between the entities in each image. The model is a sequence-to-sequence image-captioning model with a ViT encoder and a GPT-2 decoder.
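This architecture maps directly onto the `VisionEncoderDecoderModel` API in `transformers`. The snippet below is a minimal sketch of loading the checkpoint that way, as an alternative to the `pipeline` call shown later; it assumes the repository ships the usual image-processor and tokenizer files alongside the weights.

```python
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Assumes the checkpoint follows the standard ViT-encoder / GPT-2-decoder layout.
model = VisionEncoderDecoderModel.from_pretrained("sadassa17/rgb-language_cap").to(device)
processor = ViTImageProcessor.from_pretrained("sadassa17/rgb-language_cap")
tokenizer = AutoTokenizer.from_pretrained("sadassa17/rgb-language_cap")

image = Image.open("path/to/file").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
output_ids = model.generate(pixel_values, max_new_tokens=200)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```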
## Requirements

- 4 GB of GPU RAM
- A CUDA-enabled Docker setup
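A quick way to check both requirements from inside the container is sketched below; the `--gpus all` hint in the error message assumes the standard NVIDIA Container Toolkit setup.

```python
import torch

# Sanity check: a CUDA device must be visible and should have at least
# ~4 GB of memory (the figure from the requirements above).
assert torch.cuda.is_available(), "No CUDA device visible - did you start Docker with --gpus all?"
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU: {torch.cuda.get_device_name(0)} ({total_gb:.1f} GB)")
assert total_gb >= 4, "The model expects roughly 4 GB of GPU RAM."
```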
To download and run the model:

```python
import torch
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
image_captioner = pipeline("image-to-text", model="sadassa17/rgb-language_cap", max_new_tokens=200, device=device)

filename = 'path/to/file'
generated_captions = image_captioner(filename)
print(generated_captions)
```

The model is trained to generate as much text as it can within a limit of 200 tokens, which translates to roughly five sentences; the sixth sentence is usually cut off. The output always has the form: "Object1" is to the "left/right/..." of "Object2".

## IF YOU WANT TO PRODUCE A SPECIFIC NUMBER OF CAPTIONS (UP TO 5)

```python
def print_up_to_n_sentences(captions, n):
    # Keep only the first n sentences of each generated caption.
    result = ''
    for caption in captions:
        generated_text = caption.get('generated_text', '')
        sentences = generated_text.split('.')
        result = '.'.join(sentences[:n])
        # print(result)
    return result

filename = 'path/to/file'
generated_captions = image_captioner(filename)
captions = print_up_to_n_sentences(generated_captions, 5)
print(captions)
```
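Because every sentence follows the fixed pattern described above, the captions can also be parsed into (object, relation, object) triples. The sketch below, which reuses `generated_captions` from the previous snippet, is only an illustration: the `parse_relations` helper and its regular expression are not part of the model and assume the exact wording quoted above, so they may need adjusting.

```python
import re

def parse_relations(generated_text):
    # Hypothetical helper: extract (object1, relation, object2) triples from
    # sentences shaped like '<object1> is to the <relation> of the <object2>.'
    pattern = r'(.+?) is to the (.+?) of the (.+?)\.'
    return re.findall(pattern, generated_text)

for caption in generated_captions:
    for obj1, relation, obj2 in parse_relations(caption.get('generated_text', '')):
        print(f"{obj1.strip()} -[{relation.strip()}]-> {obj2.strip()}")
```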