---
license: apache-2.0
language:
- ru
metrics:
- bleu
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/dumperize/movie-picture-captioning/resolve/main/vertical_15x.jpeg
  example_title: Custom Image Sample 1
---
# Model Card for movie-picture-captioning
This model generates descriptions for movie posters and, in principle, for any photo.
# Model Details
#### Model Description
This is an encoder-decoder model based on [VisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder).
[google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) serves as the encoder and [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased) as the decoder.
We fine-tuned the model on a dataset of movie posters and their descriptions from the Russian service Kinopoisk. As a result, the model generates descriptions in the jargon of blockbusters =).
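
The encoder and decoder pairing described above can be assembled with `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`. The snippet below is a minimal sketch of how such a model might be initialized before fine-tuning, not the training script used for this model. The helper name `build_captioning_model` and the special-token settings are assumptions made for illustration.

```python
ENCODER_CHECKPOINT = "google/vit-base-patch16-224-in21k"
DECODER_CHECKPOINT = "DeepPavlov/rubert-base-cased"


def build_captioning_model(encoder_name=ENCODER_CHECKPOINT, decoder_name=DECODER_CHECKPOINT):
    """Pair a pretrained ViT encoder with a pretrained BERT-style decoder.

    Hypothetical helper: the actual training setup is not published in this card.
    """
    # Imported inside the function so the heavy dependency is only needed
    # when a model is actually built.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(decoder_name)

    # The decoder gains cross-attention layers that start out randomly
    # initialized, so the combined model must be fine-tuned before use.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_name, decoder_name
    )

    # BERT-style tokenizers have no dedicated BOS token, so [CLS] is used
    # to start generation (assumed setting).
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling `build_captioning_model()` downloads both checkpoints from the Hub and returns an untrained pairing that would still need fine-tuning on image and caption pairs.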
#### Model Sources
- **Repository:** [github.com/slivka83](https://github.com/slivka83/)
- **Demo:** [@MPC_project_bot](https://t.me/MPC_project_bot)
# How to use
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer, image preprocessor, and captioning model from the Hub
tokenizer = AutoTokenizer.from_pretrained("dumperize/movie-picture-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("dumperize/movie-picture-captioning")
model = VisionEncoderDecoderModel.from_pretrained("dumperize/movie-picture-captioning")
model = model.to(device)  # the model must be on the same device as its inputs

# Beam-search generation settings
max_length = 128
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

# Load the image, resize it to the encoder's input size, and force RGB
image_path = "path/to/image.jpg"
image = Image.open(image_path)
image = image.resize((224, 224))
if image.mode != "RGB":
    image = image.convert(mode="RGB")

pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

# Generate token ids and decode them into caption strings
output_ids = model.generate(pixel_values, **gen_kwargs)
preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print([pred.strip() for pred in preds])
```
# Bias, Risks, and Limitations
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions.
# Training Details
#### Training Data
We compiled a dataset from a public source, [kinopoisk](https://www.kinopoisk.ru/), covering all Russian-language films as of October 2022. Films with very short or very long descriptions were excluded, as were films with blank or very small images.
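
The filtering rules above could be expressed roughly as follows. This is a sketch only: the card does not state the actual cut-offs, so the thresholds, the `FilmRecord` type, and the convention that a blank poster has zero width and height are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the card does not state the real cut-offs.
MIN_DESCRIPTION_CHARS = 50
MAX_DESCRIPTION_CHARS = 1000
MIN_IMAGE_SIDE_PX = 100


@dataclass
class FilmRecord:
    title: str
    description: str
    image_width: int   # assumed to be 0 when the poster is missing or blank
    image_height: int


def keep_record(record: FilmRecord) -> bool:
    """Return True if the film passes both the text and the image filter."""
    text_length = len(record.description.strip())
    if not MIN_DESCRIPTION_CHARS <= text_length <= MAX_DESCRIPTION_CHARS:
        return False
    # Reject posters whose shorter side is below the minimum size.
    return min(record.image_width, record.image_height) >= MIN_IMAGE_SIDE_PX


def filter_records(records):
    """Keep only the records that satisfy every filter."""
    return [record for record in records if keep_record(record)]
```

Keeping the thresholds as named constants makes it easy to tune them and to report the exact values alongside the dataset.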
#### Preprocessing
The model was trained on a single 1080 Ti GPU (11 GB) for about 24 hours.
# Evaluation
This model achieved a sacreBLEU score of 6.84.
#### Metrics
We used the [sacrebleu](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric.
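
For readers unfamiliar with the metric, sacreBLEU reports BLEU on a 0 to 100 scale, where higher means more n-gram overlap with the reference text. The sketch below is a simplified pure-Python BLEU meant only to illustrate the idea. It is not the sacrebleu implementation: it splits on whitespace, assumes one reference per hypothesis, and applies no smoothing, so its numbers would differ from the score reported here. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypotheses, references, max_order=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_order
    totals = [0] * max_order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_order + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    # The brevity penalty punishes hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

For reported results, the sacrebleu metric itself should be used, since its standardized tokenization is what makes scores comparable across papers and models.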