tufa15nik committed on
Commit 171b061
1 Parent(s): 3554cc7

Upload 5 files

Files changed (5)
  1. README.md +79 -0
  2. special_tokens_map.json +1 -0
  3. tokenizer.json +0 -0
  4. tokenizer_config.json +1 -0
  5. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,79 @@
+ ---
+ tags:
+ - visual-question-answering
+ license: apache-2.0
+ widget:
+ - text: "What's the animal doing?"
+   src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
+ - text: "What is on top of the building?"
+   src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
+ ---
+
+ # Vision-and-Language Transformer (ViLT), fine-tuned on VQAv2
+
+ Vision-and-Language Transformer (ViLT) model fine-tuned on [VQAv2](https://visualqa.org/). It was introduced in the paper [ViLT: Vision-and-Language Transformer
+ Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).
+
+ Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.
+
+ ## Intended uses & limitations
+
+ You can use the raw model for visual question answering.
+
+ ### How to use
+
+ Here is how to use this model in PyTorch:
+
+ ```python
+ from transformers import ViltProcessor, ViltForQuestionAnswering
+ import requests
+ from PIL import Image
+
+ # prepare image + question
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ text = "How many cats are there?"
+
+ processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
+ model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
+
+ # prepare inputs
+ encoding = processor(image, text, return_tensors="pt")
+
+ # forward pass
+ outputs = model(**encoding)
+ logits = outputs.logits
+ idx = logits.argmax(-1).item()
+ print("Predicted answer:", model.config.id2label[idx])
+ ```
+
+ ## Training data
+
+ (to do)
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ (to do)
+
+ ### Pretraining
+
+ (to do)
+
+ ## Evaluation results
+
+ (to do)
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @misc{kim2021vilt,
+       title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
+       author={Wonjae Kim and Bokyung Son and Ildoo Kim},
+       year={2021},
+       eprint={2102.03334},
+       archivePrefix={arXiv},
+       primaryClass={stat.ML}
+ }
+ ```
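
The widget examples declared in the README front matter can be exercised against the same checkpoint through the `transformers` pipeline API. The following is a minimal sketch, not part of the uploaded card, assuming a `transformers` release that ships the `visual-question-answering` pipeline:

```python
from transformers import pipeline

# Build a VQA pipeline from the checkpoint referenced in the README example.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# Reuse one of the widget examples from the YAML front matter above.
preds = vqa(
    image="https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg",
    question="What's the animal doing?",
)
print(preds[0]["answer"], preds[0]["score"])
```

Under the hood this wraps the same ViltProcessor/ViltForQuestionAnswering pair shown in the README and returns candidate answers sorted by score.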
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 40, "special_tokens_map_file": null, "name_or_path": "bert-base-uncased", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff
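
Taken together, special_tokens_map.json, tokenizer_config.json, tokenizer.json, and vocab.txt describe a bert-base-uncased style `BertTokenizer` (lower-cased input, `model_max_length` of 40). A minimal sketch of loading these files, assuming the repository has been cloned locally (the `./` path is only a placeholder):

```python
from transformers import AutoTokenizer

# "./" stands in for the local checkout containing the files listed above.
tokenizer = AutoTokenizer.from_pretrained("./")

# Special tokens come from special_tokens_map.json; the 40-token cap and
# lower-casing come from tokenizer_config.json.
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)  # [CLS] [SEP] [PAD]
print(tokenizer.model_max_length)  # 40

# Questions are padded/truncated to the 40-token maximum set in tokenizer_config.json.
encoding = tokenizer("How many cats are there?", padding="max_length", truncation=True)
print(len(encoding["input_ids"]))  # 40
```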