---
license: apache-2.0
language:
- ru
metrics:
- bleu
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/dumperize/movie-picture-captioning/resolve/main/vertical_15x.jpeg
  example_title: Custom Image Sample 1
---

# Model Card for movie-picture-captioning

The model generates a description of any photo in the style of a movie synopsis. It was trained on movie posters and their descriptions... just for fun.

# Model Details

#### Model Description

This is an encoder-decoder model based on [VisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder). [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) was used as the encoder and [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased) as the decoder. We fine-tuned the model on a dataset of movie posters and descriptions from the Russian service Kinopoisk. Now the model describes any picture in the jargon of blockbuster blurbs =).

#### Model Sources

- **Repository:** [github.com/slivka83](https://github.com/slivka83/)
- **Demo:** [@MPC_project_bot](https://t.me/MPC_project_bot)

# How to use

Load the model, preprocess an image, and generate a caption with beam search:

```python
from PIL import Image
import torch
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("dumperize/movie-picture-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("dumperize/movie-picture-captioning")
model = VisionEncoderDecoderModel.from_pretrained("dumperize/movie-picture-captioning")

max_length = 128
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load the image and make sure it matches the encoder's expected input
image_path = "path/to/image.jpg"
image = Image.open(image_path)
image = image.resize((224, 224))
if image.mode != "RGB":
    image = image.convert(mode="RGB")

pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

output_ids = model.generate(pixel_values, **gen_kwargs)

preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print([pred.strip() for pred in preds])
```

# Bias, Risks, and Limitations

Even though the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions.

# Training Details

#### Training Data

We compiled a dataset from the publicly available data on [Kinopoisk](https://www.kinopoisk.ru/), covering all Russian-language films as of October 2022. Films with very short or very long descriptions were not included in the dataset; films with blank or very small poster images were excluded as well.

#### Training Procedure

The model was trained for about 24 hours on a single GTX 1080 Ti (11 GB).

# Evaluation

The model achieves a sacreBLEU score of 6.84.

#### Metrics

We used the [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric.
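As a rough illustration of how such a score can be computed, here is a minimal sketch using the `evaluate` library; the captions below are hypothetical placeholders, not samples from the actual test set.

```python
import evaluate

# Load the sacreBLEU implementation from the `evaluate` library
sacrebleu = evaluate.load("sacrebleu")

# Hypothetical model outputs and reference descriptions
# (one list of reference texts per prediction)
predictions = ["Фантастический боевик о команде супергероев."]
references = [["Фантастический боевик о команде супергероев, которые спасают мир."]]

results = sacrebleu.compute(predictions=predictions, references=references)
print(f"sacreBLEU: {results['score']:.2f}")
```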