# FaVQA - Fashion-related Visual Question Answering

### Summary

FaVQA fine-tunes a Vision-and-Language Pre-training (VLP) model for a fashion-related downstream task: Visual Question Answering (VQA). The underlying model, ViLT, was proposed in [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334); it incorporates text embeddings directly into a Vision Transformer (ViT), which gives it a minimal design for VLP.
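
To make the single-stream design concrete, here is a minimal sketch (the image path is a placeholder) showing how the ViLT processor packs a question and an image into one set of tensors for a single transformer to consume:

```
from transformers import ViltProcessor
from PIL import Image

processor = ViltProcessor.from_pretrained("yanka9/vilt_finetuned_deepfashionVQA_v2")

# text and image become one joint encoding for a single transformer
encoding = processor(
    Image.open("path/to/outfit.jpg").convert("RGB"),  # placeholder path
    "how long is the sleeve?",
    return_tensors="pt",
)
print(list(encoding.keys()))  # text tensors (input_ids, ...) alongside pixel_values
```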

### Model Description

- **Developed by:** yanka9
- **Model type:** Visual Question Answering, ViLT
- **License:** MIT
- **Finetuned from model:** [dandelin/vilt-b32-finetuned-vqa](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa)
- **Dataset for finetuning:** [yanka9/deepfashion-for-VQA](https://huggingface.co/datasets/yanka9/deepfashion-for-VQA), derived from [DeepFashion-MultiModal](https://github.com/yumingj/DeepFashion-MultiModal?tab=readme-ov-file)

### Model Sources

- **Repository:** [GitHub](https://github.com/yrribeiro/fashion-vqa)
- **Demo:** [🤗 Space](https://huggingface.co/spaces/yanka9/fashion-vqa)

## How to Get Started with the Model

Use the code below to get started with the model; usage is the same as for the base ViLT VQA model.

```
from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image

# prepare an image + question pair
image = Image.open(YOUR_IMAGE).convert("RGB")  # YOUR_IMAGE: path to your image file
text = "how long is the sleeve?"

processor = ViltProcessor.from_pretrained("yanka9/vilt_finetuned_deepfashionVQA_v2")
model = ViltForQuestionAnswering.from_pretrained("yanka9/vilt_finetuned_deepfashionVQA_v2")

# prepare inputs
encoding = processor(image, text, return_tensors="pt")

# forward pass over the answer vocabulary
outputs = model(**encoding)
logits = outputs.logits
idx = logits.argmax(-1).item()
print("Answer:", model.config.id2label[idx])
```
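
The `argmax` above keeps only the single most likely answer. As a small follow-up sketch (assuming `torch` is available, which `transformers` already requires for this model), you can instead rank the top candidate answers with their softmax probabilities:

```
import torch

# turn the logits into probabilities and list the five most likely answers
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5, dim=-1)
for p, i in zip(top.values[0], top.indices[0]):
    print(f"{model.config.id2label[i.item()]}: {p.item():.3f}")
```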

## Training Details

### Training Data

A custom dataset was built to fine-tune the ViLT classifier. It was derived from DeepFashion-MultiModal, a large-scale, high-quality human dataset with rich multi-modal annotations: 44,096 high-resolution human images, including 12,701 full-body images, for each of which the authors manually annotated human parsing labels across 24 classes.

The source dataset has several other annotation types, but within the scope of this project only the full-body images and their labels were used to generate the training data. Each label covers at least one of the following categories: fabric, color, and shape. In total, 209,481 questions were generated for 44,096 images; the categories used for training are listed below.

```
'Color.LOWER_CLOTH',
'Color.OUTER_CLOTH',
'Color.UPPER_CLOTH',
'Fabric.OUTER_CLOTH',
'Fabric.UPPER_CLOTH',
'Gender',
'Shape.CARDIGAN',
'Shape.COVERED_NAVEL',
'Shape.HAT',
'Shape.LOWER_CLOTHING_LENGTH',
'Shape.NECKWEAR',
'Shape.RING',
'Shape.SLEEVE',
'Shape.WRISTWEAR'
```
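
The generation step itself is not shown in this card. As an illustrative sketch only (the `TEMPLATES` mapping and `generate_qa_pairs` helper below are hypothetical, not the project's actual code), pairing each annotated category with a fixed question template is one straightforward way to produce such question-answer pairs:

```
# Hypothetical sketch of template-based question generation; the real
# templates behind yanka9/deepfashion-for-VQA may differ.
TEMPLATES = {
    "Shape.SLEEVE": "how long is the sleeve?",
    "Shape.LOWER_CLOTHING_LENGTH": "what is the length of the lower clothing?",
    "Color.UPPER_CLOTH": "how would you describe the color of the upper cloth?",
    "Shape.HAT": "is there a hat worn?",
}

def generate_qa_pairs(labels):
    """Turn {category: answer} annotations into (question, answer) pairs."""
    return [(TEMPLATES[cat], ans) for cat, ans in labels.items() if cat in TEMPLATES]

print(generate_qa_pairs({"Shape.SLEEVE": "long sleeves", "Shape.HAT": "no"}))
```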

### Question Types

The model supports both open-ended and close-ended (yes/no) questions. Below are examples of the questions generated for the training phase.

```
'how long is the sleeve?',
'what is the length of the lower clothing?',
'how would you describe the color of the upper cloth?',
'what is the color of the lower cloth?',
'what fabric is the upper cloth made of?',
'who is the target audience for this garment?',
'is there a hat worn?',
'is the navel covered?',
'does the lower clothing cover the navel?'
```
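
For quick experiments with questions like these, the model can also be loaded through the 🤗 `visual-question-answering` pipeline, a thin wrapper over the processor and model shown earlier (the image path below is a placeholder):

```
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="yanka9/vilt_finetuned_deepfashionVQA_v2")

# open-ended and yes/no questions go through the same call
print(vqa(image="path/to/outfit.jpg", question="how long is the sleeve?"))
print(vqa(image="path/to/outfit.jpg", question="is there a hat worn?"))
```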

<i>This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.</i>