license: apache-2.0
language:
- ru
metrics:
- bleu
pipeline_tag: image-to-text
widget:
- src: >-
https://huggingface.co/dumperize/movie-picture-captioning/resolve/main/vertical_15x.jpeg
example_title: Custom Image Sample 1
Model Card for movie-picture-captioning
This model generates descriptions for movie posters and, in principle, for any photo.
Model Details:
Model Description
This is an encoder-decoder model based on VisionEncoderDecoderModel. google/vit-base-patch16-224-in21k was used as the encoder and DeepPavlov/rubert-base-cased as the decoder.
We fine-tuned the model on a dataset of movie posters and their descriptions from the Russian service Kinopoisk. As a result, the model generates descriptions in blockbuster jargon =).
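The encoder-decoder pairing described above can be sketched with tiny, randomly initialised configurations, which avoids downloading any weights. This is only an illustration of how a ViT encoder and a BERT-style decoder are wired together by VisionEncoderDecoderModel; the layer sizes, vocabulary size, and image size below are made-up assumptions, not the values of the released checkpoint.

```python
import torch
from transformers import (
    BertConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Hypothetical toy sizes chosen so the model builds instantly on CPU.
encoder_config = ViTConfig(
    hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
    intermediate_size=64, image_size=32, patch_size=16,
)
decoder_config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
)

# Marks the BERT config as a decoder and enables cross-attention over
# the image patch embeddings produced by the ViT encoder.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)
model.eval()

# One random "image" and three already-generated caption token ids.
pixel_values = torch.randn(1, 3, 32, 32)
decoder_input_ids = torch.tensor([[1, 2, 3]])
with torch.no_grad():
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

# One score per vocabulary entry for each caption position.
logits = outputs.logits
print(logits.shape)
```

With the real checkpoints, the same wiring would normally be obtained from pretrained weights rather than random ones, followed by fine-tuning on image and caption pairs.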
Model Sources
- Repository: github.com/slivka83
- Demo: @MPC_project_bot
How to use
import torch
from PIL import Image
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

# Load the tokenizer, image preprocessor, and captioning model from the Hub
tokenizer = AutoTokenizer.from_pretrained("dumperize/movie-picture-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("dumperize/movie-picture-captioning")
model = VisionEncoderDecoderModel.from_pretrained("dumperize/movie-picture-captioning")

# Beam-search generation settings
max_length = 128
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

# Run on GPU when available; the model and its inputs must be on the same device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

image_path = "path/to/image.jpg"
image = Image.open(image_path)
image = image.resize((224, 224))
if image.mode != "RGB":
    image = image.convert(mode="RGB")

pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

output_ids = model.generate(pixel_values, **gen_kwargs)
preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print([pred.strip() for pred in preds])
Bias, Risks, and Limitations
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions.
Training Details
Training Data
We compiled the dataset from an open source, Kinopoisk, covering all Russian-language films as of October 2022. Films with very short or very long descriptions were excluded, as were films with blank or very small images.
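The filtering rules above could look roughly like the following sketch. The card does not state the actual cut-offs, so the thresholds, the `Film` record, and the function names are all hypothetical and serve only to make the rules concrete.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the real cut-offs are not given in this card.
MIN_DESCRIPTION_WORDS = 10
MAX_DESCRIPTION_WORDS = 200
MIN_IMAGE_SIDE_PX = 100


@dataclass
class Film:
    title: str
    description: str
    poster_width: int   # 0 when the poster is missing or blank
    poster_height: int  # 0 when the poster is missing or blank


def keep_film(film: Film) -> bool:
    """Return True when the film passes both the text and the image filter."""
    n_words = len(film.description.split())
    if not MIN_DESCRIPTION_WORDS <= n_words <= MAX_DESCRIPTION_WORDS:
        return False  # description too short or too long
    if min(film.poster_width, film.poster_height) < MIN_IMAGE_SIDE_PX:
        return False  # blank or very small poster
    return True


def filter_films(films: list[Film]) -> list[Film]:
    """Keep only the films that satisfy keep_film, preserving order."""
    return [film for film in films if keep_film(film)]
```

Counting words rather than characters is an arbitrary choice here; a character-based or token-based limit would work the same way.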
Training Procedure
The model was trained on eight 16 GB V100 GPUs for 90 hours.
Evaluation
This model achieved a SacreBLEU score of 6.84.
Metrics
We used the SacreBLEU metric.
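For readers unfamiliar with the metric, here is a simplified pure-Python sketch of corpus-level BLEU on the 0 to 100 scale that SacreBLEU reports. It is not the sacrebleu library and will not reproduce its numbers exactly: it assumes whitespace tokenization, a single reference per caption, and no smoothing, whereas SacreBLEU applies its own standardized tokenization. The function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU: clipped n-gram precision times brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a single n-gram order with no match gives zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu(predicted_captions, reference_descriptions)` on two equally long lists of strings returns a single score for the whole evaluation set.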