# Hindi Image Captioning Model

This is an encoder-decoder image captioning model with a ViT encoder and GPT2-Hindi as the decoder. It is a first attempt at using ViT + GPT2-Hindi for the image captioning task. The model was trained on the Flickr8k Hindi Dataset available on Kaggle.
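
This is not part of the released code, but for orientation: an encoder-decoder of this kind can be assembled from the two checkpoints named below with `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`, which adds cross-attention layers to the decoder. A minimal sketch:

```python
from transformers import VisionEncoderDecoderModel

# Sketch: combine the ViT encoder and GPT2-Hindi decoder into one model.
# The new cross-attention layers are randomly initialised and must be trained.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    'google/vit-base-patch16-224',  # image encoder
    'surajp/gpt2-hindi',            # Hindi text decoder
)
```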

This model was trained during the Hugging Face course community week, organized by Hugging Face.

## How to use

Here is how to use this model to caption an image from the Flickr8k dataset:

```python
import torch
import requests
from PIL import Image
from transformers import ViTFeatureExtractor, AutoTokenizer, VisionEncoderDecoderModel

# Use a GPU if one is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load a sample image from the web.
url = 'https://shorturl.at/fvxEQ'
image = Image.open(requests.get(url, stream=True).raw)

encoder_checkpoint = 'google/vit-base-patch16-224'
decoder_checkpoint = 'surajp/gpt2-hindi'

feature_extractor = ViTFeatureExtractor.from_pretrained(encoder_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(decoder_checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(
    'team-indain-image-caption/hindi-image-captioning').to(device)

# Inference: preprocess the image, then generate and clean up the caption.
sample = feature_extractor(image, return_tensors="pt").pixel_values.to(device)

# Strip the end-of-text token and keep only the first line of the decoded output.
clean_text = lambda x: x.replace('<|endoftext|>', '').split('\n')[0]

caption_ids = model.generate(sample, max_length=50)[0]
caption_text = clean_text(tokenizer.decode(caption_ids))
print(caption_text)
```
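
The snippet above captions one downloaded image. Reusing the `feature_extractor`, `tokenizer`, `model`, `device`, and `clean_text` objects defined there, several local images can be captioned in a single `generate` call; the file paths below are hypothetical:

```python
# Hypothetical local files; any RGB images work.
paths = ['photo1.jpg', 'photo2.jpg']
images = [Image.open(p).convert('RGB') for p in paths]

# The feature extractor accepts a list of images and returns a batched tensor.
pixel_values = feature_extractor(images, return_tensors='pt').pixel_values.to(device)
caption_ids = model.generate(pixel_values, max_length=50)
for ids in caption_ids:
    print(clean_text(tokenizer.decode(ids)))
```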

## Training data

We used the Flickr8k Hindi Dataset, a Hindi translation of the original Flickr8k dataset available on Kaggle, to train the model.
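
This card does not reproduce the dataset's exact file layout, but each training example is an (image, Hindi caption) pair. A hedged sketch of converting one such pair into encoder inputs and decoder targets, with a hypothetical file name and an illustrative caption:

```python
from PIL import Image
from transformers import ViTFeatureExtractor, AutoTokenizer

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
tokenizer = AutoTokenizer.from_pretrained('surajp/gpt2-hindi')

image = Image.open('example.jpg').convert('RGB')  # hypothetical Flickr8k image file
caption = 'एक कुत्ता घास में दौड़ रहा है'  # illustrative Hindi caption ("a dog is running in the grass")

pixel_values = feature_extractor(image, return_tensors='pt').pixel_values  # encoder input
labels = tokenizer(caption, return_tensors='pt').input_ids                 # decoder target
```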

## Training procedure

This model was trained during the Hugging Face course community week, organized by Hugging Face. Training was done on a Kaggle GPU.
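
The exact hyperparameters of that run are not documented in this card. Purely as an illustration, and continuing from the data-preparation sketch above, a single supervised update on one pair could look like this (the learning rate is an assumption):

```python
import torch
from transformers import VisionEncoderDecoderModel

# Assemble the untrained encoder-decoder, as in the sketch at the top of this card.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    'google/vit-base-patch16-224', 'surajp/gpt2-hindi')

# The decoder needs start/pad token ids before a loss can be computed;
# `tokenizer` is the GPT2-Hindi tokenizer from the sketch above.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.eos_token_id  # GPT-2 tokenizers have no pad token

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # assumed learning rate

# `pixel_values` and `labels` as prepared in the Training data sketch above.
outputs = model(pixel_values=pixel_values, labels=labels)  # cross-entropy over the caption
outputs.loss.backward()
optimizer.step()
```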