---
license: apache-2.0
language:
- ru
metrics:
- bleu
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/dumperize/movie-picture-captioning/resolve/main/vertical_15x.jpeg
example_title: Custom Image Sample 1
---
# Model Card for movie-picture-captioning
The model generates a description of any photo in the style of a movie synopsis. It was trained on movie posters and their descriptions... just for fun.
# Model Details
#### Model Description
This is an encoder-decoder model based on [VisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder).
[Google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) was used as the encoder, [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased) as the decoder.
We fine-tuned the model on a dataset of movie posters and descriptions from the Russian service Kinopoisk. Now the model describes any picture in the jargon of blockbusters =).
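The fine-tuning code itself is not part of this card, but for reference, a model of this shape can be composed from the two base checkpoints roughly like this (a minimal sketch, not the exact setup used here):

```python
from transformers import VisionEncoderDecoderModel, AutoTokenizer

# Compose a ViT encoder with a ruBERT decoder; cross-attention is added to the decoder.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",
    "DeepPavlov/rubert-base-cased",
)
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")

# Wire the special token ids needed for generation to the decoder's tokenizer.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```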
#### Model Sources
- **Repository:** [github.com/slivka83](https://github.com/slivka83/)
- **Demo:** [@MPC_project_bot](https://t.me/MPC_project_bot)
# How to use
```python
from PIL import Image
import torch
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

# Load the tokenizer, feature extractor and model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("dumperize/movie-picture-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("dumperize/movie-picture-captioning")
model = VisionEncoderDecoderModel.from_pretrained("dumperize/movie-picture-captioning")

# Generation settings: beam search, captions of up to 128 tokens.
max_length = 128
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

# Move the model to GPU if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load and prepare the image: resize to the ViT input size and make sure it is RGB.
image_path = 'path/to/image.jpg'
image = Image.open(image_path)
image = image.resize([224, 224])
if image.mode != "RGB":
    image = image.convert(mode="RGB")

# Extract pixel values and generate a caption.
pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)
output_ids = model.generate(pixel_values, **gen_kwargs)

preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print([pred.strip() for pred in preds])
```
# Bias, Risks, and Limitations
Even though the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions.
# Training Details
#### Training Data
We compiled a dataset from the open catalog of all Russian-language films as of October 2022 on [Kinopoisk](https://www.kinopoisk.ru/). Films with very short or very long descriptions were not included in the dataset; films with blank or very small posters were excluded too.
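The exact filtering thresholds are not published; as a rough sketch, with hypothetical column names and cut-offs, the cleaning step could look like this:

```python
import pandas as pd
from PIL import Image

# Hypothetical file, column names and thresholds -- the real ones are not published.
df = pd.read_csv("kinopoisk_films.csv")

def poster_is_usable(path, min_side=200):
    """Reject missing or very small poster images."""
    try:
        with Image.open(path) as img:
            return min(img.size) >= min_side
    except (FileNotFoundError, OSError):
        return False

desc_len = df["description"].fillna("").str.len()
df = df[(desc_len >= 50) & (desc_len <= 1000)]       # drop very short / very long descriptions
df = df[df["poster_path"].map(poster_is_usable)]     # drop blank or tiny posters
```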
#### Training Procedure
The model was trained on a single GTX 1080 Ti (11 GB) GPU for about 24 hours.
# Evaluation
This model achieved the following result: sacreBLEU 6.84.
#### Metrics
We used the [sacrebleu](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric.
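The score can be recomputed with the 🤗 `evaluate` library; the caption and reference below are placeholders:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# Generated captions and reference descriptions (placeholders, not real data).
predictions = ["Сгенерированное описание фильма."]
references = [["Оригинальное описание фильма с Кинопоиска."]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 2))
```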