michelecafagna26 committed
Commit fcf525b
1 Parent(s): 54cb4a1

Update README.md

Files changed (1)
  1. README.md +99 -0
README.md CHANGED
@@ -1,3 +1,102 @@
  ---
  license: apache-2.0
+ datasets:
+ - coco
+ - conceptual-caption
+ - sbu
+ - flickr30k
+ - vqa
+ - gqa
+ - vg-qa
+ - open-images
+
+ library_name: pytorch
+ tags:
+ - pytorch
+ - image-to-text
  ---
+
+ # Model Card: VinVL for Captioning 🖼️
+
+ [Microsoft's VinVL](https://github.com/microsoft/Oscar) base model, pretrained for the **image caption generation** downstream task.
+
+
+ # COCO Test set metrics 📈
+
+ Table from the authors (Table 7, cross-entropy optimization)
+
+ | BLEU-4 | METEOR | CIDEr | SPICE |
+ |--------|--------|-------|-------|
+ | 0.38 | 0.30 | 1.29 | 0.23 |
+
+
+ # Usage and Installation:
+
+ More information on installing and using this model is available at [michelecafagna26/VinVL](https://github.com/michelecafagna26/VinVL).
+
+ # Feature extraction ⛏️
+
+ This model relies on a separate visual backbone to extract the image features it takes as input.
+ More info about:
+ - the model: [michelecafagna26/vinvl_vg_x152c4](https://huggingface.co/michelecafagna26/vinvl_vg_x152c4)
+ - its usage: [michelecafagna26/vinvl-visualbackbone](https://github.com/michelecafagna26/vinvl-visualbackbone)
+
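The Quick start in this card expects features of size 2054 per box. A minimal sketch of how such a vector could be assembled, assuming the usual layout of a 2048-d region feature followed by six normalized box-geometry values (corners, width, height). The function names `box_geometry` and `append_box_geometry` are hypothetical and not part of the backbone's API; plain Python lists keep the sketch dependency-free.

```python
def box_geometry(box, img_w, img_h):
    """Return six normalized values for a box given as (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return [
        x1 / img_w, y1 / img_h,                # top-left corner
        x2 / img_w, y2 / img_h,                # bottom-right corner
        (x2 - x1) / img_w, (y2 - y1) / img_h,  # box width and height
    ]


def append_box_geometry(region_feat, box, img_w, img_h):
    """Concatenate one region's feature vector with its box geometry."""
    return list(region_feat) + box_geometry(box, img_w, img_h)


# Dummy 2048-d region feature and one box in a 200x400 image (assumed values).
feat = append_box_geometry([0.0] * 2048, (10, 20, 110, 220), img_w=200, img_h=400)
print(len(feat))  # 2054
```

In practice the visual backbone linked above is expected to return features already in this form, so a helper like this would only be needed when building them by hand.
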
+ # Quick start: 🚀
+
+ ```python
+ import torch
+ from transformers.pytorch_transformers import BertConfig, BertTokenizer
+ from oscar.modeling.modeling_bert import BertForImageCaptioning
+ from oscar.wrappers import OscarTensorizer
+
+ ckpt = "path/to/the/checkpoint"
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Load the config, tokenizer and model (original Oscar code)
+ config = BertConfig.from_pretrained(ckpt)
+ tokenizer = BertTokenizer.from_pretrained(ckpt)
+ model = BertForImageCaptioning.from_pretrained(ckpt, config=config).to(device)
+
+ # This takes care of the preprocessing
+ tensorizer = OscarTensorizer(tokenizer=tokenizer, device=device)
+
+ # feat_obj is a NumPy array of shape (num_boxes, feat_size) produced by the visual backbone;
+ # unsqueeze(0) adds the batch dimension, giving shape (1, num_boxes, feat_size)
+ # feat_size is 2054 by default in VinVL
+ visual_features = torch.from_numpy(feat_obj).to(device).unsqueeze(0)
+
+ # labels are usually extracted by the features extractor
+ labels = [['boat', 'boat', 'boat', 'bottom', 'bush', 'coat', 'deck', 'deck', 'deck', 'dock', 'hair', 'jacket']]
+
+ inputs = tensorizer.encode(visual_features, labels=labels)
+ outputs = model(**inputs)
+
+ pred = tensorizer.decode(outputs)
+
+ # the output looks like this:
+ # pred = {0: [{'caption': 'a red and white boat traveling down a river next to a small boat.', 'conf': 0.7070220112800598}]}
+ ```
+
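The `pred` dictionary shown in the Quick start maps each image index to a list of caption candidates with a confidence score. A small sketch of how the top caption per image could be selected, assuming the list may hold more than one candidate; `best_captions` is a hypothetical helper, not part of the Oscar or VinVL code.

```python
def best_captions(pred):
    """Return {image_index: caption}, keeping the highest-confidence candidate."""
    return {
        idx: max(candidates, key=lambda c: c["conf"])["caption"]
        for idx, candidates in pred.items()
        if candidates  # skip images with no candidates
    }


# Same structure as the Quick start output.
pred = {
    0: [
        {
            "caption": "a red and white boat traveling down a river next to a small boat.",
            "conf": 0.7070220112800598,
        }
    ]
}
print(best_captions(pred))
```
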
+ # Citations 🧾
+
+ Please consider citing the original project and the VinVL paper:
+
+ ```BibTeX
+
+ @misc{han2021image,
+ title={Image Scene Graph Generation (SGG) Benchmark},
+ author={Xiaotian Han and Jianwei Yang and Houdong Hu and Lei Zhang and Jianfeng Gao and Pengchuan Zhang},
+ year={2021},
+ eprint={2107.12604},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+ }
+
+ @inproceedings{zhang2021vinvl,
+ title={VinVL: Revisiting visual representations in vision-language models},
+ author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ pages={5579--5588},
+ year={2021}
+ }
+
+ ```