---
license: apache-2.0
language:
- ru
metrics:
- bleu
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/dumperize/movie-picture-captioning/resolve/main/vertical_15x.jpeg
  example_title: Custom Image Sample 1
---

# Model Card for movie-picture-captioning

The model generates a description of any photo in the style of a movie synopsis. It was trained on movie posters and their descriptions ... just for fun.

# Model Details

#### Model Description

This is an encoder-decoder model based on [VisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder).

[google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) was used as the encoder and [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased) as the decoder.

We fine-tuned the model on a dataset of movie posters and descriptions from the Russian service Kinopoisk. As a result, it now describes any image in the jargon of blockbusters =)
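
As a sketch, the pairing described above can be expressed with the `transformers` config classes, without downloading any weights. The default `ViTConfig` and `BertConfig` below stand in for the two base checkpoints; the actual fine-tuned weights are the ones published in this repository.

```python
from transformers import BertConfig, ViTConfig, VisionEncoderDecoderConfig

encoder_config = ViTConfig()   # vision encoder (ViT-Base expects 224x224 inputs)
decoder_config = BertConfig()  # text decoder (RuBERT is a BERT-base variant)

# Combine the two into a single encoder-decoder configuration.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)

# The helper marks the decoder for causal generation with cross-attention.
print(config.decoder.is_decoder, config.decoder.add_cross_attention)
```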

#### Model Sources

- **Repository:** [github.com/slivka83](https://github.com/slivka83/)
- **Demo:** [@MPC_project_bot](https://t.me/MPC_project_bot)

# How to use

```python
from PIL import Image
import torch
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("dumperize/movie-picture-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("dumperize/movie-picture-captioning")
model = VisionEncoderDecoderModel.from_pretrained("dumperize/movie-picture-captioning")

max_length = 128
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

# Move the model to the GPU when one is available; pixel_values must live
# on the same device as the model.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load the image, force RGB mode and resize to the ViT input resolution.
image_path = "path/to/image.jpg"
image = Image.open(image_path)
if image.mode != "RGB":
    image = image.convert(mode="RGB")
image = image.resize([224, 224])

pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

output_ids = model.generate(pixel_values, **gen_kwargs)

preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print([pred.strip() for pred in preds])
```

# Bias, Risks, and Limitations

Even though the training data could be characterized as fairly neutral, the model can still produce biased predictions.

# Training Details

#### Training Data

We compiled the dataset from [Kinopoisk](https://www.kinopoisk.ru/), an open catalog covering Russian-language films as of October 2022. Films with very short or very long descriptions were excluded, as were films with blank or very small images.
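
The filtering step can be sketched as a simple predicate. The exact thresholds used for the real dataset are not published; the values below are illustrative placeholders.

```python
def keep_film(description, image_size,
              min_words=10, max_words=300, min_side=100):
    """Return True if a film passes the description-length and image-size filters."""
    if not description or not image_size:
        return False  # blank description or missing image
    n_words = len(description.split())
    if n_words < min_words or n_words > max_words:
        return False  # description too short or too long
    width, height = image_size
    return width >= min_side and height >= min_side  # drop tiny images
```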

#### Compute Infrastructure

The model was trained for about 24 hours on a single GTX 1080 Ti (11 GB).

# Evaluation

The model achieves a sacreBLEU score of 6.84.

#### Metrics

We used the [sacrebleu](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric.