
Model Card for movie-picture-captioning

The model generates a description for any photo in the style of a movie synopsis. It was trained on movie posters and their descriptions. ... just for fun

Model Details

Model Description

This is an encoder-decoder model based on VisionEncoderDecoderModel. google/vit-base-patch16-224-in21k was used as the encoder and DeepPavlov/rubert-base-cased as the decoder.

We fine-tuned the model on a dataset of movie posters and descriptions from Kinopoisk, a Russian movie service. Now the model generates descriptions in the jargon of blockbusters =).
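For reference, this is roughly how such an encoder-decoder pair can be assembled with transformers before fine-tuning. It is a minimal sketch, not the exact training script; the special-token setup at the end is an assumption.

from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Combine a ViT image encoder with a ruBERT text decoder (illustrative sketch).
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision encoder
    "DeepPavlov/rubert-base-cased",       # Russian text decoder
)
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")

# Generation needs these ids on the model config (assumed, typical setup).
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id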

Model Sources

How to use

from PIL import Image
import torch
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("dumperize/movie-picture-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("dumperize/movie-picture-captioning")
model = VisionEncoderDecoderModel.from_pretrained("dumperize/movie-picture-captioning")

# Generation settings: beam search, captions capped at 128 tokens.
max_length = 128
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # move the model to the same device as the inputs

image_path = 'path/to/image.jpg'
image = Image.open(image_path)
image = image.resize((224, 224))
if image.mode != "RGB":
    image = image.convert(mode="RGB")

pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

output_ids = model.generate(pixel_values, **gen_kwargs)

preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print([pred.strip() for pred in preds])

Bias, Risks, and Limitations

Even though the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions.

Training Details

Training Data

We compiled the dataset from Kinopoisk, an open catalogue of Russian-language films, as of October 2022. Films with very short or very long descriptions were excluded, as were films with missing or very small poster images.
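The exact filtering thresholds are not published; the sketch below only illustrates the kind of filtering described above, with hypothetical column names, file names, and cut-offs.

import pandas as pd
from PIL import Image

# Hypothetical raw dump with 'description' and 'poster_path' columns.
df = pd.read_csv("kinopoisk_movies.csv")

# Drop films with very short or very long descriptions (thresholds are illustrative).
desc_len = df["description"].fillna("").str.len()
df = df[(desc_len >= 100) & (desc_len <= 1000)]

# Drop films whose posters are missing or very small (again, an illustrative cut-off).
def poster_ok(path, min_side=128):
    try:
        with Image.open(path) as im:
            return min(im.size) >= min_side
    except (FileNotFoundError, OSError):
        return False

df = df[df["poster_path"].apply(poster_ok)]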

Preprocessing

The model was trained for about 24 hours on a single GTX 1080 Ti (11 GB).
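The training script and hyperparameters are not published; the following is only a rough sketch of how a VisionEncoderDecoderModel is commonly fine-tuned with Seq2SeqTrainer, and every hyperparameter here is an assumption.

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, default_data_collator

# Placeholder: a torch Dataset whose items contain 'pixel_values' and 'labels'.
train_dataset = ...

training_args = Seq2SeqTrainingArguments(
    output_dir="movie-picture-captioning",
    per_device_train_batch_size=8,   # assumed; limited by the 11 GB GPU
    num_train_epochs=3,              # assumed
    learning_rate=5e-5,              # assumed
    fp16=True,                       # mixed precision to fit an 11 GB card
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,                     # the VisionEncoderDecoderModel from above
    args=training_args,
    train_dataset=train_dataset,
    data_collator=default_data_collator,
)
trainer.train()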

Evaluation

This model achieved the following result: sacreBLEU 6.84.

Metrics

We used the sacreBLEU metric.
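The evaluation script is not published; below is a minimal sketch of how sacreBLEU can be computed for generated captions with the evaluate library. The example strings are placeholders, not real model output.

import evaluate

sacrebleu = evaluate.load("sacrebleu")

# predictions: generated captions; references: one or more gold descriptions per image.
predictions = ["generated caption for the first poster"]
references = [["reference description for the first poster"]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])  # corpus-level score, comparable to the 6.84 reported above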
