---
license: apache-2.0
language:
- ru
metrics:
- bleu
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/dumperize/movie-picture-captioning/resolve/main/vertical_15x.jpeg
  example_title: Custom Image Sample 1
---

# Model Card for movie-picture-captioning

The model generates a description of any photo in the style of a movie synopsis. It was trained on movie posters and their descriptions ... just for fun.

# Model Details

#### Model Description

This is an encoder-decoder model based on [VisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder).

[google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) was used as the encoder and [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased) as the decoder.

We fine-tuned the model on a dataset of movie posters and descriptions from the Russian service Kinopoisk. As a result, it now describes any image in the jargon of blockbusters =)
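
As a sketch, the pairing described above can be expressed with the `transformers` config classes, without downloading any weights. The default `ViTConfig` and `BertConfig` below stand in for the two base checkpoints; the actual fine-tuned weights are the ones published in this repository.

```python
from transformers import BertConfig, ViTConfig, VisionEncoderDecoderConfig

encoder_config = ViTConfig()   # vision encoder (ViT-Base expects 224x224 inputs)
decoder_config = BertConfig()  # text decoder (RuBERT is a BERT-base variant)

# Combine the two into a single encoder-decoder configuration.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)

# The helper marks the decoder for causal generation with cross-attention.
print(config.decoder.is_decoder, config.decoder.add_cross_attention)
```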

#### Model Sources

- **Repository:** [github.com/slivka83](https://github.com/slivka83/)
- **Demo:** [@MPC_project_bot](https://t.me/MPC_project_bot)

# How to use

```python
from PIL import Image
import torch
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("dumperize/movie-picture-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("dumperize/movie-picture-captioning")
model = VisionEncoderDecoderModel.from_pretrained("dumperize/movie-picture-captioning")

max_length = 128
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

# Move the model to the GPU when one is available; pixel_values must live
# on the same device as the model.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load the image, force RGB mode and resize to the ViT input resolution.
image_path = "path/to/image.jpg"
image = Image.open(image_path)
if image.mode != "RGB":
    image = image.convert(mode="RGB")
image = image.resize([224, 224])

pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

output_ids = model.generate(pixel_values, **gen_kwargs)

preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print([pred.strip() for pred in preds])
```

# Bias, Risks, and Limitations

Even though the training data could be characterized as fairly neutral, the model can still produce biased predictions.

# Training Details

#### Training Data

We compiled the dataset from [Kinopoisk](https://www.kinopoisk.ru/), an open catalog covering Russian-language films as of October 2022. Films with very short or very long descriptions were excluded, as were films with blank or very small images.
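
The filtering step can be sketched as a simple predicate. The exact thresholds used for the real dataset are not published; the values below are illustrative placeholders.

```python
def keep_film(description, image_size,
              min_words=10, max_words=300, min_side=100):
    """Return True if a film passes the description-length and image-size filters."""
    if not description or not image_size:
        return False  # blank description or missing image
    n_words = len(description.split())
    if n_words < min_words or n_words > max_words:
        return False  # description too short or too long
    width, height = image_size
    return width >= min_side and height >= min_side  # drop tiny images
```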

#### Compute Infrastructure

The model was trained for about 24 hours on a single GTX 1080 Ti (11 GB).

# Evaluation

The model achieves a sacreBLEU score of 6.84.

#### Metrics

We used the [sacrebleu](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric.