krypticmouse committed
Commit 0e8c75e
1 parent: 5fcaacd

Update README.md

Files changed (1): README.md (+14 -5)
README.md CHANGED
@@ -1,8 +1,8 @@
 # Hindi Image Captioning Model
 
-This is an encoder-decoder image captioning model made with VIT encoder and GPT2-Hindi as a decoder. This is a first attempt at using ViT + GPT2-Hindi for image captioning task. We used the Flickr8k Hindi Dataset, which is the translated version of the original Flickr8k Dataset, available on kaggle to train the model.
+This is an encoder-decoder image captioning model with a ViT encoder and GPT2-Hindi as the decoder, a first attempt at using ViT + GPT2-Hindi for the image captioning task. We used the Flickr8k Hindi Dataset, available on Kaggle, to train the model.
 
-This model was trained using HuggingFace course community week, organized by Huggingface. Training were done on Kaggle Notebooks.
+This model was trained during the HuggingFace course community week, organized by HuggingFace.
 
 ## How to use
 
@@ -21,8 +21,11 @@ else:
 url = 'https://shorturl.at/fvxEQ'
 image = Image.open(requests.get(url, stream=True).raw)
 
-feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
-tokenizer = AutoTokenizer.from_pretrained('surajp/gpt2-hindi')
+encoder_checkpoint = 'google/vit-base-patch16-224'
+decoder_checkpoint = 'surajp/gpt2-hindi'
+
+feature_extractor = ViTFeatureExtractor.from_pretrained(encoder_checkpoint)
+tokenizer = AutoTokenizer.from_pretrained(decoder_checkpoint)
 model = VisionEncoderDecoderModel.from_pretrained('team-indain-image-caption/hindi-image-captioning').to(device)
 
 #Inference
@@ -32,4 +35,10 @@ clean_text = lambda x: x.replace('<|endoftext|>','').split('\n')[0]
 caption_ids = model.generate(sample, max_length = 50)[0]
 caption_text = clean_text(tokenizer.decode(caption_ids))
 print(caption_text)
 ```
+
+## Training data
+We used the Flickr8k Hindi Dataset, a translated version of the original Flickr8k Dataset, available on Kaggle, to train the model.
+
+## Training procedure
+This model was trained during the HuggingFace course community week, organized by HuggingFace. Training was done on a Kaggle GPU.
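The "How to use" snippet appears only as scattered diff fragments above, without its imports. A self-contained sketch of the same inference flow might look like the following — the checkpoint IDs come from the README itself, while the function names, device handling, and the placement of heavy imports inside the function are assumptions for illustration:

```python
def clean_text(text):
    """Strip GPT-2's end-of-text marker and keep only the first line,
    mirroring the clean_text lambda in the README snippet."""
    return text.replace('<|endoftext|>', '').split('\n')[0]


def caption_image(url):
    """Hypothetical wrapper around the README's inference steps.
    Heavy imports live inside the function so clean_text stays usable
    without torch/transformers installed."""
    import requests
    import torch
    from PIL import Image
    from transformers import (AutoTokenizer, ViTFeatureExtractor,
                              VisionEncoderDecoderModel)

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Checkpoint IDs as given in the README diff.
    feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
    tokenizer = AutoTokenizer.from_pretrained('surajp/gpt2-hindi')
    model = VisionEncoderDecoderModel.from_pretrained(
        'team-indain-image-caption/hindi-image-captioning').to(device)

    # Fetch the image and turn it into pixel values for the ViT encoder.
    image = Image.open(requests.get(url, stream=True).raw)
    sample = feature_extractor(image, return_tensors='pt').pixel_values.to(device)

    # Generate caption ids with the decoder, then decode and clean them.
    caption_ids = model.generate(sample, max_length=50)[0]
    return clean_text(tokenizer.decode(caption_ids))
```

Keeping the post-processing in a named `clean_text` function (rather than the README's lambda) makes it easy to test in isolation, since the model-dependent part requires downloading several hundred megabytes of weights.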