bipin committed
Commit b8403b2
1 Parent(s): fb824de

Update README.md

Files changed (1):
  1. README.md +44 -8
README.md CHANGED
@@ -10,9 +10,10 @@ model-index:
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->

- # image-caption-generator

- This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
  It achieves the following results on the evaluation set:
  - eval_loss: 0.2536
  - eval_runtime: 25.369
@@ -21,19 +22,54 @@ It achieves the following results on the evaluation set:
  - epoch: 4.0
  - step: 3236

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

  ## Training procedure

  ### Training hyperparameters
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->

+ # Image-caption-generator
+
+ This model is trained on the [Flickr8k](https://www.kaggle.com/datasets/nunenuh/flickr8k) dataset to generate captions given an image.

  It achieves the following results on the evaluation set:
  - eval_loss: 0.2536
  - eval_runtime: 25.369
  - epoch: 4.0
  - step: 3236

+ # Running the model using transformers library
+
+ 1. Load the pre-trained model from the model hub
+ ```python
+ from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
+ import torch
+ from PIL import Image
+
+ model_name = "bipin/image-caption-generator"
+
+ # load model
+ model = VisionEncoderDecoderModel.from_pretrained(model_name)
+ feature_extractor = ViTFeatureExtractor.from_pretrained(model_name)
+ tokenizer = AutoTokenizer.from_pretrained("gpt2")
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+ ```

+ 2. Load the image for which the caption is to be generated
+ ```python
+ img_name = "flickr_data.jpg"
+ img = Image.open(img_name)
+ if img.mode != 'RGB':
+     img = img.convert(mode="RGB")
+ ```

+ 3. Pre-process the image
+ ```python
+ pixel_values = feature_extractor(images=[img], return_tensors="pt").pixel_values
+ pixel_values = pixel_values.to(device)
+ ```

+ 4. Generate the caption
+ ```python
+ max_length = 128
+ num_beams = 4
+
+ # get model prediction
+ output_ids = model.generate(pixel_values, num_beams=num_beams, max_length=max_length)
+
+ # decode the generated prediction
+ preds = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+ print(preds)
+ ```
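
Taken together, the four steps above amount to a single image-to-caption call. Below is a minimal sketch that wraps them into one helper, reusing the `model`, `feature_extractor`, `tokenizer`, and `device` objects loaded in step 1; the function name `generate_caption` is illustrative and not part of the model card.

```python
def generate_caption(image_path, model, feature_extractor, tokenizer, device,
                     max_length=128, num_beams=4):
    # steps 2-3: load the image, force RGB, and turn it into pixel values
    img = Image.open(image_path)
    if img.mode != "RGB":
        img = img.convert(mode="RGB")
    pixel_values = feature_extractor(images=[img], return_tensors="pt").pixel_values.to(device)

    # step 4: beam-search decode a caption and strip special tokens
    output_ids = model.generate(pixel_values, num_beams=num_beams, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate_caption("flickr_data.jpg", model, feature_extractor, tokenizer, device))
```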
 
  ## Training procedure
+ The procedure used to train this model can be found [here](https://bipinkrishnan.github.io/ml-recipe-book/image_captioning.html).
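
As background, the inference code above implies a `VisionEncoderDecoderModel` that pairs a ViT image encoder with a GPT-2 text decoder. The snippet below is a minimal sketch of how such a model is typically assembled before fine-tuning on caption data; the base checkpoints (`google/vit-base-patch16-224-in21k` and `gpt2`) are assumptions here, not details taken from the linked write-up.

```python
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer

# assumed base checkpoints; the linked recipe documents the actual training setup
encoder_ckpt = "google/vit-base-patch16-224-in21k"
decoder_ckpt = "gpt2"

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_ckpt, decoder_ckpt)
feature_extractor = ViTFeatureExtractor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)

# GPT-2 has no pad token, so reuse EOS and wire the special tokens into the config
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```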
 
  ### Training hyperparameters