metadata
language:
  - fa
metrics:
  - wer
pipeline_tag: image-to-text

A Persian image captioning model that pairs a ViT encoder with a RoBERTa decoder, trained on flickr30k-fa. The encoder (ViT) was initialized from https://huggingface.co/google/vit-base-patch16-224 and the decoder (RoBERTa) from https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base.
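As a rough illustration of what the decoder attends to, the sketch below computes how many tokens a ViT encoder produces per image. It assumes the standard ViT layout implied by the checkpoint name `vit-base-patch16-224` (224x224 input, non-overlapping 16x16 patches, plus one `[CLS]` token); the function name is hypothetical and not part of any library.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of encoder tokens for a square image split into square patches.

    Assumes non-overlapping patches and a single prepended [CLS] token,
    as in the standard ViT design.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1  # +1 for the [CLS] token


# 224 / 16 = 14 patches per side, 14 * 14 = 196 patches, plus [CLS]
print(vit_sequence_length())  # 197
```

Under these assumptions, each image is represented by 197 encoder states that the decoder can cross-attend to while generating the caption.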

Usage

Install the Hezar package:

pip install hezar

Then load the model and caption an image:

from hezar import Model

model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")
# Generate a caption for a local image file
captions = model.predict("example_image.jpg")
print(captions)
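The metadata lists `wer` (word error rate) as the metric. The following is a minimal sketch of how WER could be computed between a reference caption and a predicted one, using whitespace tokenization and word-level edit distance. It is not the evaluation code used for this model; the function name and the two Persian sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example captions (not taken from flickr30k-fa)
reference = "یک سگ در پارک می دود"
hypothesis = "یک سگ در خیابان می دود"
print(word_error_rate(reference, hypothesis))
```

Lower is better: 0.0 means the predicted caption matches the reference word for word, and values can exceed 1.0 when the prediction is much longer than the reference.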