---
license: apache-2.0
datasets:
- atasoglu/flickr8k-turkish
language:
- tr
metrics:
- rouge
library_name: transformers
pipeline_tag: image-to-text
tags:
- image-to-text
- image-captioning
base_model:
- google/vit-base-patch16-224
- ytu-ce-cosmos/turkish-gpt2
---

# vit-base-patch16-224-turkish-gpt2

This vision encoder-decoder model uses [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) as the encoder and [ytu-ce-cosmos/turkish-gpt2](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2) as the decoder. It was fine-tuned on the [flickr8k-turkish](https://huggingface.co/datasets/atasoglu/flickr8k-turkish) dataset to generate image captions in Turkish.
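
As a rough illustration of how an encoder-decoder pairing like this is wired together, the sketch below builds a tiny, randomly initialized ViT + GPT-2 model from configs. The layer sizes and vocabulary size are arbitrary assumptions chosen so that it runs without downloading any weights. They are not this model's real hyperparameters, and the sketch does not claim to reproduce how this checkpoint was created.

```python
from transformers import (
    GPT2Config,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Assumption: toy sizes for illustration only, not the real model's settings
encoder_config = ViTConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    image_size=224,
    patch_size=16,
)
decoder_config = GPT2Config(n_embd=32, n_layer=1, n_head=2, vocab_size=100)

# Combining the configs marks the GPT-2 side as a decoder with cross-attention,
# which is what lets it attend to the ViT image embeddings
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)

# Randomly initialized model: a ViT encoder feeding a GPT-2 language-model head
model = VisionEncoderDecoderModel(config=config)

print(type(model.encoder).__name__, type(model.decoder).__name__)
print(config.decoder.add_cross_attention)
```

With pretrained weights, the same pairing can be created with `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`, passing the encoder and decoder checkpoint names.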

## Usage

```py
import torch
from PIL import Image
from transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTImageProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "atasoglu/vit-base-patch16-224-turkish-gpt2"

# Load the image and force RGB, since the ViT processor expects 3 channels
img = Image.open("example.jpg").convert("RGB")

# Load the image processor, tokenizer, and model from the Hub
feature_extractor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)
model.to(device)

# Preprocess the image into a batch of pixel tensors
features = feature_extractor(images=[img], return_tensors="pt")
pixel_values = features.pixel_values.to(device)

# Generate token ids and decode them into caption strings
generated_captions = tokenizer.batch_decode(
    model.generate(pixel_values, max_new_tokens=20),
    skip_special_tokens=True,
)

print(generated_captions)
```
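
To caption several images, the same preprocess, generate, and decode steps can be wrapped in a small batching helper. The function below is a minimal sketch. The name `caption_images` and the `batch_size` default are hypothetical, and it assumes the processor, tokenizer, model, and device have already been loaded as in the snippet above.

```python
def caption_images(images, processor, tokenizer, model, device,
                   batch_size=8, max_new_tokens=20):
    """Caption a list of PIL images in fixed-size batches (hypothetical helper)."""
    captions = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        # Preprocess the batch into a single pixel tensor on the target device
        features = processor(images=batch, return_tensors="pt")
        pixel_values = features.pixel_values.to(device)
        # Generate token ids, then decode them into plain-text captions
        output_ids = model.generate(pixel_values, max_new_tokens=max_new_tokens)
        decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        # Strip stray surrounding whitespace from each decoded caption
        captions.extend(text.strip() for text in decoded)
    return captions
```

With the objects from the usage snippet, a call would look like `caption_images([img], feature_extractor, tokenizer, model, device)`. Smaller batches reduce peak memory use at the cost of more `generate` calls.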