---
language:
- fa
metrics:
- wer
pipeline_tag: image-to-text
---

A Persian image captioning model constructed from a ViT + RoBERTa architecture trained on flickr30k-fa. The encoder (ViT) was initialized from https://huggingface.co/google/vit-base-patch16-224 and the decoder (RoBERTa) was initialized from https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base.

## Usage

```
pip install hezar
```

```python
from hezar import Model

model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")
captions = model.predict("example_image.jpg")
print(captions)
```
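
The metadata lists WER (word error rate) as the metric. The following is a minimal sketch of how a WER score between a reference caption and a generated caption could be computed, assuming simple whitespace tokenization. The function name `word_error_rate` and the example captions are hypothetical and chosen for illustration only; this is not necessarily how the reported evaluation was run.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: captions are tokenized by splitting on whitespace, so a
    Persian word containing a zero-width non-joiner stays a single token.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # Hypothetical reference and generated captions, for illustration only.
    reference_caption = "یک سگ در پارک می‌دود"
    generated_caption = "یک سگ در پارک"
    print(word_error_rate(reference_caption, generated_caption))
```

Lower values are better, and a score of 0.0 means the generated caption matches the reference word for word. Because the score is normalized by the reference length, it can exceed 1.0 when the hypothesis contains many extra words.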