---
license: apache-2.0
datasets:
- hl-scenes
library_name: pytorch
tags:
- pytorch
- image-to-text
---

# Model Card: VinVL for Captioning 🖼️

[Microsoft's VinVL](https://github.com/microsoft/Oscar) base fine-tuned on the [HL-scenes]() dataset for the **scene description generation** downstream task.

# Model fine-tuning 🏋️

The model has been fine-tuned for 10 epochs on the [HL-scenes]() dataset.

# Test set metrics 📈

Obtained with beam size 5 and max length 20:

| Bleu-1 | Bleu-2 | Bleu-3 | Bleu-4 | METEOR | ROUGE-L | CIDEr | SPICE |
|--------|--------|--------|--------|--------|---------|-------|-------|
| 0.68   | 0.55   | 0.45   | 0.36   | 0.36   | 0.63    | 1.42  | 0.40  |
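
For reference, BLEU-1 is modified unigram precision with a brevity penalty. A minimal stand-alone sketch of that computation (illustrative only, not the evaluation code used to produce the table above):

```python
from collections import Counter
import math

def bleu1(reference: str, hypothesis: str) -> float:
    """BLEU-1: modified unigram precision times a brevity penalty."""
    ref = reference.split()
    hyp = hypothesis.split()
    ref_counts = Counter(ref)
    # Clip each hypothesis unigram count by its count in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(hyp).items())
    precision = clipped / len(hyp)
    # Penalize hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * precision

print(bleu1("a man is sailing on a boat", "a man on a boat"))
```

Real evaluations average clipped counts over the whole corpus and multiple references; the single-sentence version above is only meant to show where the number comes from.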

# Usage and Installation:

More info about how to install and use this model can be found here: [michelecafagna26/VinVL](https://github.com/michelecafagna26/VinVL)

# Feature extraction ⛏️

This model uses a separate visual backbone to extract the region features.
More info about:
- the model here: [michelecafagna26/vinvl_vg_x152c4](https://huggingface.co/michelecafagna26/vinvl_vg_x152c4)
- the usage here: [michelecafagna26/vinvl-visualbackbone](https://github.com/michelecafagna26/vinvl-visualbackbone)
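
As a rough sketch of what the extractor produces: the 2054-dimensional region features are commonly the 2048-d region embedding concatenated with 6 box-geometry values (scaled corner coordinates plus relative width and height). The exact normalization below is an assumption, not the extractor's verified code:

```python
def box_geometry(box, width, height):
    """Six geometry values appended to each region embedding.
    box = (x1, y1, x2, y2) in pixels; width/height are the image size.
    (Assumed layout: scaled corners + relative box width/height.)"""
    x1, y1, x2, y2 = box
    return [x1 / width, y1 / height, x2 / width, y2 / height,
            (x2 - x1) / width, (y2 - y1) / height]

# A 2048-d region embedding (zeros as a stand-in) + 6 geometry values
# gives the 2054-d feat_size used in the quick-start snippet below.
region_embedding = [0.0] * 2048
feature = region_embedding + box_geometry((10, 20, 110, 220), width=640, height=480)
print(len(feature))  # 2054
```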

# Quick start: 🚀

```python
import torch

from transformers.pytorch_transformers import BertConfig, BertTokenizer
from oscar.modeling.modeling_bert import BertForImageCaptioning
from oscar.wrappers import OscarTensorizer

ckpt = "path/to/the/checkpoint"
device = "cuda" if torch.cuda.is_available() else "cpu"

# original code
config = BertConfig.from_pretrained(ckpt)
tokenizer = BertTokenizer.from_pretrained(ckpt)
model = BertForImageCaptioning.from_pretrained(ckpt, config=config).to(device)

# This takes care of the preprocessing
tensorizer = OscarTensorizer(tokenizer=tokenizer, device=device)

# feat_obj is a numpy array produced by the feature extractor;
# after unsqueeze the shape is (1, num_boxes, feat_size),
# where feat_size is 2054 by default in VinVL
visual_features = torch.from_numpy(feat_obj).to(device).unsqueeze(0)

# labels are usually extracted by the feature extractor
labels = [['boat', 'boat', 'boat', 'bottom', 'bush', 'coat', 'deck', 'deck', 'deck', 'dock', 'hair', 'jacket']]

inputs = tensorizer.encode(visual_features, labels=labels)
outputs = model(**inputs)

pred = tensorizer.decode(outputs)

# the output looks like this:
# pred = {0: [{'caption': 'in a library', 'conf': 0.7070220112800598}]}
```
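
Assuming the decoded structure shown in the comment above (a dict mapping image index to a list of scored candidates), the best caption can be pulled out like this (a sketch, not part of the original API):

```python
# Example of the decoded structure from the snippet above
pred = {0: [{"caption": "in a library", "conf": 0.7070220112800598}]}

# Pick the highest-confidence candidate for image 0
best = max(pred[0], key=lambda c: c["conf"])
print(best["caption"])  # in a library
```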

# Citations 🧾

This is the model used in:

```BibTeX
@misc{cafagna2022understanding,
  author = {Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
  title = {Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions},
  doi = {10.48550/ARXIV.2211.04971},
  url = {https://arxiv.org/abs/2211.04971},
  keywords = {Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```

Please consider citing the original project and the VinVL paper:

```BibTeX
@misc{han2021image,
  title={Image Scene Graph Generation (SGG) Benchmark},
  author={Xiaotian Han and Jianwei Yang and Houdong Hu and Lei Zhang and Jianfeng Gao and Pengchuan Zhang},
  year={2021},
  eprint={2107.12604},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@inproceedings{zhang2021vinvl,
  title={VinVL: Revisiting visual representations in vision-language models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5579--5588},
  year={2021}
}
```