gchhablani committed
Commit
2931d01
1 Parent(s): 4871c6f

Update intro

Files changed (1)
  1. sections/intro/intro.md +1 -1
sections/intro/intro.md CHANGED
@@ -7,7 +7,7 @@ In addition, even recent **approaches that have been proposed for VQA generally
  - the boxes are selected using NMS (Non-max suppression),
  - and then the features for selected boxes are used as visual features.

- A major **advantage that comes from using transformers is their simplicity and their accessibility** - thanks to HuggingFace team, ViT and Transformers authors. For ViT models, for example, all one needs to do is pass the normalized images to the transformer.
+ A major **advantage that comes from using transformers is their simplicity and their accessibility** - thanks to HuggingFace, ViT and Transformers. For ViT models, for example, all one needs to do is pass the normalized images to the transformer.

  While building a low-resource non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it will help create a generic approach/model that works decently well across several languages. **With the aim of democratizing such a challenging yet interesting task, in this project, we focus on Multilingual Visual Question Answering (MVQA)**. Our intention here is to provide a Proof-of-Concept with our simple CLIP-Vision-BERT baseline, which leverages a multilingual checkpoint with pre-trained image encoders. Our model currently supports four languages - **English, French, German and Spanish**.
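The line added by this commit points at how little glue code ViT needs through the HuggingFace transformers library: the feature extractor normalizes the pixels and the model consumes them directly. A minimal sketch of that flow, assuming the PyTorch `ViTModel`/`ViTFeatureExtractor` classes and the `google/vit-base-patch16-224-in21k` checkpoint (neither is specified in this commit):

```python
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel

# Checkpoint chosen only for illustration; the commit does not name one.
ckpt = "google/vit-base-patch16-224-in21k"
feature_extractor = ViTFeatureExtractor.from_pretrained(ckpt)
model = ViTModel.from_pretrained(ckpt)

image = Image.open("example.jpg").convert("RGB")

# The feature extractor resizes and normalizes the raw image into pixel values.
inputs = feature_extractor(images=image, return_tensors="pt")

# Passing the normalized images to the transformer is all that is needed.
outputs = model(**inputs)
patch_embeddings = outputs.last_hidden_state  # (1, num_patches + 1, hidden_size)
```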
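The unchanged paragraph also names a CLIP-Vision-BERT baseline that pairs a pre-trained image encoder with a multilingual text checkpoint. The fusion code is not part of this commit; the sketch below only shows how the two pre-trained pieces might be loaded and their token embeddings obtained with transformers. The checkpoint names and the simple concatenation at the end are assumptions for illustration, not the project's actual architecture:

```python
import torch
from PIL import Image
from transformers import (BertModel, BertTokenizer,
                          CLIPFeatureExtractor, CLIPVisionModel)

# Assumed checkpoints, chosen only for illustration.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
image_processor = CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = BertModel.from_pretrained("bert-base-multilingual-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-uncased")

image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
question = tokenizer("Quelle est la couleur du chat ?", return_tensors="pt")

with torch.no_grad():
    visual_tokens = vision_encoder(pixel_values=pixel_values).last_hidden_state  # (1, 50, 768)
    text_tokens = text_encoder(**question).last_hidden_state                     # (1, seq_len, 768)

# A VQA head (not shown in this commit) would fuse the two sequences, e.g. by
# concatenating them before further encoding and classification over answers.
fused_sequence = torch.cat([text_tokens, visual_tokens], dim=1)
```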