wiusdy committed on
Commit 5dde576
1 Parent(s): 8c94c5c

updating the number of models and using a new pretrained BLIP model for fine-tuning

Files changed (3)
  1. README.md +8 -4
  2. app.py +1 -1
  3. inference.py +1 -1
README.md CHANGED
@@ -8,17 +8,21 @@ widget:
   src: "617.jpg"
 ---
 
-# This is a simple VQA system using Hugging Face, PyTorch and Vision-and-Language Transformer (ViLT)
+# This is a simple VQA system using Hugging Face, PyTorch and VQA models
 -------------
 
 In this repository we created a simple VQA system capable of recognizing spatial and contextual information in fashion images (e.g. clothing color and details).
 
-The project is based on the paper **FashionVQA: A Domain-Specific Visual Question Answering System** [[1]](#1).
-
+The project is based on the paper **FashionVQA: A Domain-Specific Visual Question Answering System** [[1]](#1). We also used the VQA pre-trained model from **BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation** [[2]](#2) as the starting point for fine-tuning the two new models.
 
+We used the **Deep Fashion with Masks** dataset, available at <https://huggingface.co/datasets/SaffalPoosh/deepFashion-with-masks>, and the **DeepFashion ControlNet** dataset, available at <https://huggingface.co/datasets/ldhnam/deepfashion_controlnet>.
 
 
 ## References
 <a id="1">[1]</a>
 Min Wang and Ata Mahjoubfar and Anupama Joshi, 2022
-FashionVQA: A Domain-Specific Visual Question Answering System
+FashionVQA: A Domain-Specific Visual Question Answering System
+
+<a id="2">[2]</a>
+Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi, 2022
+BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
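
To make the fine-tuning described in the README concrete, here is a minimal sketch of loading the BLIP VQA checkpoint and one of the fashion datasets for a single training step. It is not the repository's training script: the `Salesforce/blip-vqa-base` checkpoint, the `train` split, the dataset column name, and the example question/answer pair are all assumptions.

```python
# Minimal fine-tuning sketch, NOT the repository's training code.
# Assumptions: the "Salesforce/blip-vqa-base" checkpoint, the "train" split,
# the dataset column name "images", and the example question/answer pair.
from datasets import load_dataset
from transformers import BlipForQuestionAnswering, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# One of the two datasets mentioned in the README.
dataset = load_dataset("SaffalPoosh/deepFashion-with-masks", split="train")
sample = dataset[0]

# Encode an (image, question) pair and a target answer.
inputs = processor(
    images=sample["images"],          # column name is an assumption
    text="What color is the dress?",  # example question
    return_tensors="pt",
)
inputs["labels"] = processor(text="blue", return_tensors="pt").input_ids

# Single forward/backward pass; a real run wraps this in an optimizer loop.
loss = model(**inputs).loss
loss.backward()
```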
app.py CHANGED
@@ -7,7 +7,7 @@ inference = Inference()
 
 
 with gr.Blocks() as block:
-    options = gr.Dropdown(choices=["Model 1", "Model 2", "Model 3"], label="Models", info="Select the model to use..", )
+    options = gr.Dropdown(choices=["Model 1", "Model 2"], label="Models", info="Select the model to use..", )
     # need to improve this one...
 
     txt = gr.Textbox(label="Insert a question..", lines=2)
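
For context, a hedged sketch of how the dropdown and textbox in this hunk might be wired into a complete Gradio app. The `gr.Image`/`gr.Button` components, the output textbox, and the `inference.inference(...)` call signature are assumptions about the surrounding code, not lines taken from app.py.

```python
# Hedged sketch of a complete Blocks UI around the fragment shown in the diff.
# The gr.Image/gr.Button components and the inference.inference(...) signature
# are assumptions, not the repository's actual code.
import gradio as gr

from inference import Inference

inference = Inference()

with gr.Blocks() as block:
    options = gr.Dropdown(
        choices=["Model 1", "Model 2"],
        label="Models",
        info="Select the model to use..",
    )
    txt = gr.Textbox(label="Insert a question..", lines=2)
    image = gr.Image(type="pil", label="Fashion image")  # assumed input component
    answer = gr.Textbox(label="Answer")                  # assumed output component
    ask = gr.Button("Ask")

    # Assumed handler: route the selected model, question and image to Inference.
    ask.click(fn=inference.inference, inputs=[options, txt, image], outputs=answer)

block.launch()
```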
inference.py CHANGED
@@ -1,4 +1,4 @@
-from transformers import ViltProcessor, ViltForQuestionAnswering, Pix2StructProcessor, Pix2StructForConditionalGeneration, Blip2Processor, Blip2ForConditionalGeneration
+from transformers import ViltProcessor, ViltForQuestionAnswering, Pix2StructProcessor, Pix2StructForConditionalGeneration
 from transformers.utils import logging
 
 class Inference:
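
The imports that remain after this change cover ViLT and Pix2Struct. As a reference point, here is a minimal ViLT VQA inference sketch using the public `dandelin/vilt-b32-finetuned-vqa` checkpoint; the checkpoint name and the example image/question are assumptions, not necessarily what this Inference class loads.

```python
# Minimal ViLT VQA sketch matching the imports kept above.
# The checkpoint name and example inputs are assumptions.
from PIL import Image
from transformers import ViltForQuestionAnswering, ViltProcessor

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("617.jpg")           # example image from the model card widget
question = "What color is the jacket?"

# ViLT treats VQA as classification over a fixed answer vocabulary.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```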