
Example

This model is by no means state-of-the-art, but it produces reasonable image captioning results. It was fine-tuned mainly as a proof of concept for the 🤗 FlaxVisionEncoderDecoder framework.

The model can be used as follows:


import requests
from PIL import Image
from transformers import ViTFeatureExtractor, AutoTokenizer, FlaxVisionEncoderDecoderModel

loc = "ydshieh/vit-gpt2-coco-en"

feature_extractor = ViTFeatureExtractor.from_pretrained(loc)
tokenizer = AutoTokenizer.from_pretrained(loc)
model = FlaxVisionEncoderDecoderModel.from_pretrained(loc)

# We will verify our results on an image of cute cats
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
with Image.open(requests.get(url, stream=True).raw) as img:
    pixel_values = feature_extractor(images=img, return_tensors="np").pixel_values

def generate_step(pixel_values):
    # Generate caption token ids with beam search, then decode them to text
    output_ids = model.generate(pixel_values, max_length=16, num_beams=4).sequences
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]

    return preds

preds = generate_step(pixel_values)
print(preds)

# should produce
# ['a cat laying on top of a couch next to another cat']
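
If you prefer PyTorch over Flax, the same checkpoint can usually be loaded through VisionEncoderDecoderModel. The sketch below is an assumption-based variant, not part of the original example: it assumes the repository exposes Flax weights that from_pretrained(..., from_flax=True) can convert on the fly (if PyTorch weights are already included in the repo, the from_flax argument can be dropped).

import requests
import torch
from PIL import Image
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

loc = "ydshieh/vit-gpt2-coco-en"

feature_extractor = ViTFeatureExtractor.from_pretrained(loc)
tokenizer = AutoTokenizer.from_pretrained(loc)
# from_flax=True converts the Flax checkpoint to PyTorch weights at load time (assumption: only Flax weights are present)
model = VisionEncoderDecoderModel.from_pretrained(loc, from_flax=True)
model.eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
with Image.open(requests.get(url, stream=True).raw) as img:
    pixel_values = feature_extractor(images=img, return_tensors="pt").pixel_values

with torch.no_grad():
    # PyTorch generate() returns the token ids directly
    output_ids = model.generate(pixel_values, max_length=16, num_beams=4)

preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print([pred.strip() for pred in preds])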