---
license: apache-2.0
language:
- ru
metrics:
- bleu
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/dumperize/movie-picture-captioning/resolve/main/vertical_15x.jpeg
  example_title: Custom Image Sample 1
---
# Model Card for movie-picture-captioning
This model generates descriptions for movie posters and, in principle, for any photo.

# Model Details

#### Model Description

This is an encoder-decoder model based on [VisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder). 
[google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) serves as the encoder and [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased) as the decoder. 

We fine-tuned the model on a dataset of movie posters and their descriptions from the Russian service Kinopoisk, so it now generates descriptions in the jargon of blockbuster blurbs =).
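
For reference, here is a minimal sketch of how such an encoder-decoder pair can be assembled from the two base checkpoints with the standard `transformers` API; this reproduces the architecture only, not the fine-tuned weights:

```python
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

# Pair a ViT image encoder with a ruBERT text decoder; the decoder's
# cross-attention layers are initialized randomly and learned during fine-tuning.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",
    "DeepPavlov/rubert-base-cased",
)
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

# Generation needs to know how sequences start and how padding is handled.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```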

#### Model Sources

- **Repository:** [github.com/slivka83](https://github.com/slivka83/)
- **Demo:** [@MPC_project_bot](https://t.me/MPC_project_bot)

# How to use

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("dumperize/movie-picture-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("dumperize/movie-picture-captioning")
model = VisionEncoderDecoderModel.from_pretrained("dumperize/movie-picture-captioning")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Beam search with a generous length cap for full-sentence captions.
gen_kwargs = {"max_length": 128, "num_beams": 4}

# Load the image and normalize it to the encoder's expected RGB input.
image_path = "path/to/image.jpg"
image = Image.open(image_path)
image = image.resize([224, 224])
if image.mode != "RGB":
    image = image.convert(mode="RGB")

pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

output_ids = model.generate(pixel_values, **gen_kwargs)

preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print([pred.strip() for pred in preds])
```

# Bias, Risks, and Limitations

Even though the training data could be characterized as fairly neutral, the model can still produce biased predictions. 

# Training Details

#### Training Data

We compiled a dataset from [Kinopoisk](https://www.kinopoisk.ru/), an open catalog covering all Russian-language films as of October 2022. Films with very short or very long descriptions were excluded from the dataset, as were films with blank or very small poster images.
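
The exact filtering thresholds are not published; a minimal sketch of this kind of filtering, with hypothetical bounds (`MIN_WORDS`, `MAX_WORDS`, and `MIN_SIDE` are illustrative, not the actual values), might look like:

```python
from PIL import Image

# Hypothetical bounds; the actual values used for the dataset are not published.
MIN_WORDS, MAX_WORDS = 10, 200
MIN_SIDE = 100  # minimum poster width/height in pixels

def keep_example(description: str, poster_path: str) -> bool:
    """Return True if a (description, poster) pair passes the filters."""
    n_words = len(description.split())
    if not (MIN_WORDS <= n_words <= MAX_WORDS):
        return False  # drop very short or very long descriptions
    try:
        with Image.open(poster_path) as img:
            width, height = img.size
    except OSError:
        return False  # drop blank or unreadable image files
    return min(width, height) >= MIN_SIDE  # drop very small images
```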

#### Training Procedure

The model was trained on eight 16 GB V100 GPUs for 90 hours. 
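
The full training configuration is not published. As a rough illustration only, fine-tuning a `VisionEncoderDecoderModel` is typically driven by `Seq2SeqTrainer`; everything below (batch size, epochs, and the `train_dataset`/`eval_dataset` objects) is a hypothetical sketch, not the actual setup:

```python
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
)

# train_dataset / eval_dataset are assumed to yield dicts with
# "pixel_values" (preprocessed poster tensors) and "labels" (tokenized captions).
training_args = Seq2SeqTrainingArguments(
    output_dir="./movie-picture-captioning",
    per_device_train_batch_size=8,  # hypothetical; per GPU across eight V100s
    num_train_epochs=10,            # hypothetical
    fp16=True,                      # mixed precision fits 16 GB V100 memory
    predict_with_generate=True,     # generate captions during evaluation
)

trainer = Seq2SeqTrainer(
    model=model,                    # the VisionEncoderDecoderModel from above
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
)
trainer.train()
```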

# Evaluation

The model achieves a sacreBLEU score of 6.84.

#### Metrics

We used the [sacrebleu](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric.
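
For reference, a minimal sketch of computing the score with the `evaluate` library; the caption strings below are illustrative:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# Illustrative data: generated captions vs. reference Kinopoisk descriptions.
predictions = ["Сгенерированное описание фильма."]
references = [["Оригинальное описание фильма с Кинопоиска."]]  # one or more references per prediction

results = sacrebleu.compute(predictions=predictions, references=references)
print(round(results["score"], 2))
```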