nazneen committed
Commit f0c2d97
1 Parent(s): e3da2db

model documentation

Files changed (1): README.md added (+203, -0)

---
license: apache-2.0
tags:
- question-answering
- visual_bert
---

# Model Card for VisualBERT VQA

# Model Details

## Model Description

VisualBERT is a simple and flexible framework for modeling a broad range of vision-and-language (V&L) tasks on image-caption data.

- **Developed by:** UCLA NLP
- **Shared by [Optional]:** UCLA NLP
- **Model type:** Question Answering
- **Language(s) (NLP):** More information needed
- **License:** Apache 2.0
- **Parent Model:** [BERT](https://huggingface.co/bert-base-uncased)
- **Resources for more information:**
  - [GitHub Repo](https://github.com/uclanlp/visualbert)
  - [Associated Paper](https://arxiv.org/abs/1908.03557)
  - [HF hub docs](https://huggingface.co/docs/transformers/model_doc/visual_bert)

# Uses

## Direct Use

This model can be used for the task of question answering.

## Downstream Use [Optional]

More information needed.

## Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

# Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

# Training Details

## Training Data

The model creators note in the [associated paper](https://arxiv.org/pdf/1908.03557.pdf):

> We evaluate VisualBERT on four different types of vision-and-language applications:
>
> (1) Visual Question Answering (VQA 2.0) (Goyal et al., 2017). Given an image and a question, the task is to correctly answer the question. We use the VQA 2.0 (Goyal et al., 2017), consisting of over 1 million questions about images from COCO. We train the model to predict the 3,129 most frequent answers and use image features from a ResNeXt-based Faster R-CNN pre-trained on Visual Genome.
>
> (2) Visual Commonsense Reasoning (VCR) (Zellers et al., 2019). VCR consists of 290k questions derived from 110k movie scenes, where the questions focus on visual commonsense.
>
> (3) Natural Language for Visual Reasoning (NLVR2) (Suhr et al., 2019). NLVR2 is a dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The task is to determine whether a natural language caption is true about a pair of images. The dataset consists of over 100k examples of English sentences paired with web images. We modify the segment embedding mechanism in VisualBERT and assign features from different images with different segment embeddings.
>
> (4) Region-to-Phrase Grounding (Flickr30K) (Plummer et al., 2015). The Flickr30K Entities dataset tests the ability of systems to ground phrases in captions to bounding regions in the image. The task is, given spans from a sentence, selecting the bounding regions they correspond to. The dataset consists of 30k images and nearly 250k annotations.

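The VQA setup quoted above casts answering as classification over the 3,129 most frequent answers. The sketch below is a rough illustration of that head using the `transformers` VisualBERT classes, not the authors' training code; the `uclanlp/visualbert-vqa-coco-pre` checkpoint name and the `num_labels` wiring are illustrative assumptions.

```python
from transformers import VisualBertConfig, VisualBertForQuestionAnswering

# Sketch: VQA as classification over the 3,129 most frequent answers.
# The COCO-pretrained backbone is loaded and a fresh QA head (num_labels=3129)
# is initialized on top; that head would then be fine-tuned on VQA 2.0.
config = VisualBertConfig.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
config.num_labels = 3129
model = VisualBertForQuestionAnswering.from_pretrained(
    "uclanlp/visualbert-vqa-coco-pre", config=config
)
```
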
## Training Procedure

### Preprocessing

The model creators note in the [associated paper](https://arxiv.org/pdf/1908.03557.pdf):

> The parameters are initialized from the pre-trained BERT-Base parameters.

### Speeds, Sizes, Times

The model creators note in the [associated paper](https://arxiv.org/pdf/1908.03557.pdf):

> The Transformer encoder in all models has the same configuration as BERT-Base: 12 layers, a hidden size of 768, and 12 self-attention heads. The parameters are initialized from the pre-trained BERT-Base parameters.
>
> Batch sizes are chosen to meet hardware constraints and text sequences whose lengths are longer than 128 are capped.

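For reference, the encoder size quoted above (12 layers, hidden size 768, 12 self-attention heads) maps directly onto `VisualBertConfig` in the `transformers` library. The sketch below is illustrative only; the 2048-dimensional `visual_embedding_dim` is an assumption and must match the detector features actually used.

```python
from transformers import VisualBertConfig, VisualBertModel

# Illustrative BERT-Base-sized VisualBERT encoder matching the description above.
config = VisualBertConfig(
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
    intermediate_size=3072,
    visual_embedding_dim=2048,  # assumption: 2048-d detector region features
)
model = VisualBertModel(config)  # randomly initialized encoder with this configuration
print(model.config)
```
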
# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

More information needed

### Factors

More information needed

### Metrics

More information needed

## Results

More information needed

# Model Examination

More information needed

# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Tesla V100s and GTX 1080Tis
- **Hours used:** The model creators note in the [associated paper](https://arxiv.org/pdf/1908.03557.pdf): "Pre-training on COCO generally takes less than a day on 4 cards while task-specific pre-training and fine-tuning usually takes less."
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed

# Technical Specifications [optional]

## Model Architecture and Objective

More information needed

## Compute Infrastructure

More information needed

### Hardware

More information needed

### Software

More information needed.

# Citation

**BibTeX:**

```bibtex
@inproceedings{li2019visualbert,
  author    = {Li, Liunian Harold and Yatskar, Mark and Yin, Da and Hsieh, Cho-Jui and Chang, Kai-Wei},
  title     = {VisualBERT: A Simple and Performant Baseline for Vision and Language},
  booktitle = {Arxiv},
  year      = {2019}
}
```

**APA:**

More information needed

# Glossary [optional]

More information needed

# More Information [optional]

More information needed

# Model Card Authors [optional]

UCLA NLP in collaboration with Ezi Ozoani and the Hugging Face team

# Model Card Contact

More information needed

# How to Get Started with the Model

Use the code below to get started with the model.

<details>
<summary> Click to expand </summary>

```python
from transformers import BertTokenizer, VisualBertForQuestionAnswering

# VisualBERT uses the standard BERT text tokenizer (bert-base-uncased).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa")
```
</details>
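
Beyond loading the weights, VisualBERT expects precomputed region features from an object detector alongside the tokenized question. The following is a minimal inference sketch rather than the authors' full pipeline: the random `visual_embeds` stand in for real detector output, and the 2048-dimensional feature size is an assumption that must match the checkpoint's `visual_embedding_dim`.

```python
import torch
from transformers import BertTokenizer, VisualBertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa")

# Tokenize the question text.
inputs = tokenizer("What is the man eating?", return_tensors="pt")

# Region features normally come from an object detector (e.g. a Faster R-CNN);
# random placeholders are used here. Shape: (batch, num_regions, feature_dim),
# where feature_dim is assumed to be 2048 for this checkpoint.
visual_embeds = torch.randn(1, 36, 2048)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

with torch.no_grad():
    outputs = model(
        **inputs,
        visual_embeds=visual_embeds,
        visual_token_type_ids=visual_token_type_ids,
        visual_attention_mask=visual_attention_mask,
    )

logits = outputs.logits              # scores over the VQA answer vocabulary
predicted_answer_id = logits.argmax(-1)
```

Mapping the predicted index back to an answer string requires the VQA answer vocabulary used during fine-tuning.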