VictorSanh committed on
Commit 50966bf
1 Parent(s): 65edf33
Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -32,7 +32,7 @@ Read more about some of the technical challenges encountered during training IDEFICS
  - **Developed by:** Hugging Face
  - **Model type:** Multi-modal model (image+text)
  - **Language(s) (NLP):** en
- - **License:** other
+ - **License:** see [License section](#license)
  - **Parent Model:** [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)
  - **Resources for more information:**
    - [GitHub Repo](https://github.com/huggingface/m4/)
@@ -158,7 +158,9 @@ We perform checkpoint selection based on validation sets of VQAv2, TextVQA, OKVQA
 
  As opposed to Flamingo, we did not train IDEFICS on video-text pairs datasets, and as such, we did not evaluate the model on video-text benchmarks like Flamingo did. We leave that evaluation for a future iteration.
 
- <!-- <img src="./assets/Figure_Evals_IDEFIX.png" width="55%"> <img width=120/> -->
+ <img src="./assets/Figure_Evals_IDEFIX.png" width="55%">
+
+ We note that since IDEFICS was trained on PMD (which contains COCO), the evaluation numbers on COCO are not directly comparable with Flamingo and OpenFlamingo since they did not explicitly have this dataset in the training mixture. Additionally, Flamingo is trained with images of resolution 320 x 320 while IDEFICS and OpenFlamingo were trained with images of 224 x 224 resolution.
 
  | Model | Shots | VQAv2<br>OE VQA acc.<br> | OKVQA<br>OE VQA acc.<br> | TextVQA<br>OE VQA acc.<br> | VizWiz<br>OE VQA acc.<br> | TextCaps<br>CIDEr<br> | Coco<br>CIDEr<br> | NoCaps<br>CIDEr | Flickr<br>CIDEr | VisDial<br>NDCG | HatefulMemes<br>ROC AUC | ScienceQA<br>acc. | RenderedSST2<br>acc. | Winoground<br>group (text/image) |
  |:-----------|--------:|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
@@ -174,8 +176,6 @@ As opposed to Flamingo, we did not train IDEFICS on video-text pairs datasets, and as such, we did not evaluate the model on video-text benchmarks like Flamingo did.
  | | 16 | 57.0 | 48.4 | 27.9 | 42.6 | 67.4 | 99.7 | 89.4 | 64.5 | - | 50.9 | - | 67.8 | - |
  | | 32 | 57.9 | 49.6 | 28.3 | 43.7 | 68.1 | 98.0 | 90.5 | 64.4 | - | 49.8 | - | 67.0 | - |
 
- We note that since we trained on PMD which contains COCO, the evaluation numbers on COCO are not directly comparable with Flamingo and OpenFlamingo since they did not explicitly have this dataset in the training mixture.
-
  For ImageNet-1k, we also report results where the priming samples are selected to be similar (i.e. close in a vector space) to the queried instance.
 
  ImageNet-1k Evaluation:
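The similarity-based priming described above (picking in-context examples that are close to the queried instance in a vector space) can be sketched as a nearest-neighbour lookup over embeddings. This is a minimal illustration only: the model card does not specify the embedding model, the similarity metric, or the number of priming samples, so the function name, cosine similarity, and `k` below are all assumptions.

```python
import numpy as np

def select_priming_samples(query_emb, support_embs, k=4):
    """Return indices of the k support examples closest to the query
    under cosine similarity (an assumed metric) in a shared embedding space."""
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    s = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    sims = s @ q
    # Indices of the most similar support examples, best first.
    return np.argsort(-sims)[:k].tolist()

# Toy usage: 3-dimensional embeddings for 5 candidate support images.
rng = np.random.default_rng(0)
support = rng.normal(size=(5, 3))
query = support[2] + 0.01 * rng.normal(size=3)  # query nearly identical to support[2]
print(select_priming_samples(query, support, k=2))
```

The selected indices would then be used to fetch the corresponding image-label pairs and place them before the query in the few-shot prompt.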