aps committed on
Commit 57949b6
1 Parent(s): 883af9b

Update README.md

Files changed (1)
  1. README.md +4 -5
README.md CHANGED
@@ -5,7 +5,7 @@ license: bsd-3-clause
 
 ## Model Details
 
- FLAVA model was developed by the researchers at FAIR to understand if a single model can work across different modalities with a unified architectures. The model was developed solely using publicly available multimodal datasets containing 70M image-text pairs in total and thus fully reproducible. The model (i) similar to CLIP can be used for arbitrary image classification tasks in a zero-shot manner (ii) used for image or text retrieval in a zero-shot manner (iii) can also be fine-tuned for natural language understanding (NLU) tasks such as GLUE and vision-and-language reasoning tasks such as VQA v2. In the original paper, the authors evaluate FLAVA on 32 tasks from computer vision, NLU and vision-and-language domains and show impressive performance across the board scoring higher micro-average than CLIP while being open.
+ The FLAVA model was developed by researchers at FAIR to understand whether a single model can work across different modalities with a unified architecture. The model was pretrained solely on publicly available multimodal datasets containing 70M image-text pairs in total and is thus fully reproducible. The unimodal datasets ImageNet and BookCorpus + CCNews were also used to provide unimodal data to the model. The model (i) can, similarly to CLIP, be used for arbitrary image classification tasks in a zero-shot manner, (ii) can be used for image or text retrieval in a zero-shot manner, and (iii) can be fine-tuned for natural language understanding (NLU) tasks such as GLUE and for vision-and-language reasoning tasks such as VQA v2. The model can consume data available as images, text corpora and image-text pairs. In the original paper, the authors evaluate FLAVA on 32 tasks from the computer vision, NLU and vision-and-language domains and show impressive performance across the board, scoring a higher micro-average than CLIP while being open.
 
 ## Model Date
 Model was originally released in November 2021.
@@ -22,7 +22,7 @@ The FLAVA model uses a ViT-B/32 transformer for both image encoder and text enco
 
 ### FlavaModel
 
- FLAVA model supports vision, language and multimodal inputs. You can pass corresponding inputs to the modality to get losses and outputs related to that domain.
+ FLAVA model supports vision, language and multimodal inputs. You can pass inputs corresponding to the domain you are concerned with to get losses and outputs related to that domain.
 
 ```py
 from PIL import Image
@@ -183,11 +183,10 @@ text_embeddings = outputs.last_hidden_state
 ## Model Use
 
 ## Intended Use
- The model is intended to serve as a reproducible research artifact for research communities in light of models whose exact reproduction details are never released such as [CLIP](https://github.com/openai/CLIP) and [SimVLM](https://arxiv.org/abs/2108.10904). FLAVA model performs equivalently to these models on most task while being trained on less (70M pairs compared to CLIP's 400M and SimVLM's 1.8B pairs respectively) but public data. We hope that this model enable communities to better understand, and explore zero-shot and arbitrary image classification, multi-domain pretraining, generic architectures while also providing a chance to develop on top.
+ The model is intended to serve as a reproducible research artifact for research communities, in light of models such as [CLIP](https://github.com/openai/CLIP) and [SimVLM](https://arxiv.org/abs/2108.10904) whose exact reproduction details have never been released. FLAVA performs comparably to these models on most tasks while being trained on less data (70M pairs, compared to CLIP's 400M and SimVLM's 1.8B pairs respectively), all of which is public. We hope that this model enables communities to better understand and explore zero-shot and arbitrary image classification, multi-domain pretraining, and modality-agnostic generic architectures, while also providing a chance to build on top of it.
 
 ## Primary Intended Uses
 
-
 The primary intended users of these models are AI researchers.
 
 We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of foundation models which work across domains which in this case are vision, language and combined multimodal vision-and-language domain.
@@ -205,7 +204,7 @@ Since the model has not been purposefully trained in or evaluated on any languag
 FLAVA was pretrained on public available 70M image and text pairs. This includes datasets such as COCO, Visual Genome, Localized Narratives, RedCaps, a custom filtered subset of YFCC100M, SBUCaptions, Conceptual Captions and Wikipedia Image-Text datasets. A larger portion of this dataset comes from internet and thus can have bias towards people most connected to internet such as those from developed countries and younger, male users.
 
 ## Data Mission Statement
- Our goal with building this dataset called PMD (Public Multimodal Datasets) was two-fold (i) allow reproducibility of vision-language foundation models with publicly available data and (ii) test robustness and generalizability of FLAVA across the domains. The data was collected from already existing public dataset sources which have already been filtered out by original dataset curators to not contain adult and excessively violent contain. We will make the URLs of the images public for further research reproducibility but will not be hosting them.
+ Our goal in building this dataset, called PMD (Public Multimodal Datasets), was two-fold: (i) to allow reproducibility of vision-language foundation models with publicly available data and (ii) to test the robustness and generalizability of FLAVA across domains. The data was collected from existing public dataset sources that have already been filtered by the original dataset curators to not contain adult or excessively violent content. We will make the URLs of the images public for further research reproducibility.
 
 ## Performance and Limitations
 ## Performance
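The `FlavaModel` snippet in the second hunk above is cut off by the diff context. For reference, below is a minimal sketch of the multimodal usage that paragraph describes. The checkpoint name `facebook/flava-full`, the example image URL, and the output fields `image_embeddings`, `text_embeddings` and `multimodal_embeddings` are assumptions based on the `transformers` FLAVA integration, not text taken from this commit.

```py
# Minimal sketch (assumed API): pass image and/or text inputs to FlavaModel
# and read back the per-modality outputs. Checkpoint and field names are
# assumptions, not taken from this README.
import requests
import torch
from PIL import Image
from transformers import FlavaModel, FlavaProcessor

model = FlavaModel.from_pretrained("facebook/flava-full")
processor = FlavaProcessor.from_pretrained("facebook/flava-full")

# Example COCO image commonly used in transformers documentation.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Passing both modalities produces image, text and multimodal outputs;
# passing only one modality produces only that modality's outputs.
inputs = processor(text=["a photo of two cats"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.image_embeddings.shape)       # (batch, image patches + 1, hidden size)
print(outputs.text_embeddings.shape)        # (batch, text tokens, hidden size)
print(outputs.multimodal_embeddings.shape)  # (batch, multimodal tokens, hidden size)
```

Under the same assumptions, calling the processor with only `text=` or only `images=` should return only the corresponding modality's outputs.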
 