---
library_name: transformers
tags:
- image-to-text
- image-captioning
license: apache-2.0
datasets:
- atasoglu/flickr8k-turkish
language:
- tr
metrics:
- rouge
pipeline_tag: image-to-text
---

# vit-tiny-patch16-224-turkish-small-bert-uncased

This vision encoder-decoder model uses [WinKawaks/vit-tiny-patch16-224](https://huggingface.co/WinKawaks/vit-tiny-patch16-224) as the encoder and [ytu-ce-cosmos/turkish-small-bert-uncased](https://huggingface.co/ytu-ce-cosmos/turkish-small-bert-uncased) as the decoder. It was fine-tuned on the [flickr8k-turkish](https://huggingface.co/datasets/atasoglu/flickr8k-turkish) dataset to generate image captions in Turkish.

## Usage

```py
import torch
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "atasoglu/vit-tiny-patch16-224-turkish-small-bert-uncased"

# Load the image to caption.
img = Image.open("example.jpg")

# Load the image processor, tokenizer, and model from the Hub.
feature_extractor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)
model.to(device)

# Preprocess the image into pixel values for the ViT encoder.
features = feature_extractor(images=[img], return_tensors="pt")
pixel_values = features.pixel_values.to(device)

# Generate token IDs with the BERT decoder and decode them to text.
generated_captions = tokenizer.batch_decode(
    model.generate(pixel_values, max_new_tokens=20),
    skip_special_tokens=True,
)
print(generated_captions)
```
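
Alternatively, the same checkpoint can be run through the high-level `pipeline` API. This is a minimal sketch: the task string matches the `pipeline_tag` declared in the metadata above, and `example.jpg` is a placeholder path.

```py
from transformers import pipeline

# Build an image-to-text pipeline directly from the checkpoint.
captioner = pipeline(
    "image-to-text",
    model="atasoglu/vit-tiny-patch16-224-turkish-small-bert-uncased",
)

# Returns a list of dicts, e.g. [{"generated_text": "..."}].
print(captioner("example.jpg", max_new_tokens=20))
```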
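
For reference, a ViT encoder and a BERT decoder are typically paired into a single `VisionEncoderDecoderModel` via `from_encoder_decoder_pretrained`. The sketch below illustrates how such a model is assembled before fine-tuning; it is not the exact script used to train this checkpoint, and the token-ID wiring shown is a common convention assumed here.

```py
from transformers import VisionEncoderDecoderModel, AutoTokenizer

# Pair the pretrained ViT encoder with the pretrained BERT decoder;
# the decoder's cross-attention layers are freshly initialized and
# learned during fine-tuning.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "WinKawaks/vit-tiny-patch16-224",
    "ytu-ce-cosmos/turkish-small-bert-uncased",
)

# Tell generation which tokens start and pad sequences (assumed
# convention: the decoder tokenizer's CLS and PAD tokens).
tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/turkish-small-bert-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```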