---
license: apache-2.0
language:
- ru
metrics:
- bleu
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/dumperize/movie-picture-captioning/resolve/main/vertical_15x.jpeg
  example_title: Custom Image Sample 1
---

# Model Card for movie-picture-captioning

The model generates a description of any photo in the style of a movie synopsis. It was trained on movie posters and their descriptions... just for fun.

# Model Details

#### Model Description

This is an encoder-decoder model based on [VisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder). [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) was used as the encoder and [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased) as the decoder. We fine-tuned the model on a dataset of movie posters and descriptions from the Russian service Kinopoisk. Now the model describes any picture in the jargon of blockbuster blurbs =).

#### Model Sources

- **Repository:** [github.com/slivka83](https://github.com/slivka83/)
- **Demo:** [@MPC_project_bot](https://t.me/MPC_project_bot)

# How to use

Load the model, preprocess an image, and generate a caption with beam search:

```python
from PIL import Image
import torch
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("dumperize/movie-picture-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("dumperize/movie-picture-captioning")
model = VisionEncoderDecoderModel.from_pretrained("dumperize/movie-picture-captioning")

max_length = 128
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load the image and make sure it matches the encoder's expected input
image_path = "path/to/image.jpg"
image = Image.open(image_path)
image = image.resize((224, 224))
if image.mode != "RGB":
    image = image.convert(mode="RGB")

pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

output_ids = model.generate(pixel_values, **gen_kwargs)

preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print([pred.strip() for pred in preds])
```

# Bias, Risks, and Limitations

Even though the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions.

# Training Details

#### Training Data

We compiled a dataset from the publicly available data on [Kinopoisk](https://www.kinopoisk.ru/), covering all Russian-language films as of October 2022. Films with very short or very long descriptions were not included in the dataset; films with blank or very small poster images were excluded as well.

#### Training Procedure

The model was trained for about 24 hours on a single GTX 1080 Ti (11 GB).

# Evaluation

The model achieves a sacreBLEU score of 6.84.

#### Metrics

We used the [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric.
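As a rough illustration of how such a score can be computed, here is a minimal sketch using the `evaluate` library; the captions below are hypothetical placeholders, not samples from the actual test set.

```python
import evaluate

# Load the sacreBLEU implementation from the `evaluate` library
sacrebleu = evaluate.load("sacrebleu")

# Hypothetical model outputs and reference descriptions
# (one list of reference texts per prediction)
predictions = ["Фантастический боевик о команде супергероев."]
references = [["Фантастический боевик о команде супергероев, которые спасают мир."]]

results = sacrebleu.compute(predictions=predictions, references=references)
print(f"sacreBLEU: {results['score']:.2f}")
```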