Update README.md
`XGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.

In the v1.5 (08/2024) release, we present a series of XGen-MM models including:
- [🤗 xGen-MM-base](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1.5): `xgen-mm-phi3-mini-base-r-v1.5`
- [🤗 xGen-MM-instruct](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1.5): `xgen-mm-phi3-mini-instruct-r-v1.5`
- [🤗 xGen-MM-instruct-interleave](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-multi-r-v1.5): `xgen-mm-phi3-mini-instruct-multi-r-v1.5`
- [🤗 xGen-MM-instruct-dpo](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-dpo-r-v1.5): `xgen-mm-phi3-mini-instruct-dpo-r-v1.5`

In addition to the models, we are also releasing a series of datasets for multi-modal pre-training, including:
- [MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens](https://arxiv.org/abs/2406.11271)
- [🤗 BLIP3-OCR-200M](https://huggingface.co/datasets/Salesforce/blip3-ocr-200m): a dataset with dense OCR annotations.
- [🤗 BLIP3-GROUNDING-50M](https://huggingface.co/datasets/Salesforce/blip3-grounding-50m): a dataset for enhancing the ability to ground semantic concepts in images.
- BLIP3-KALE-300M (stay tuned): a large-scale curated high-quality caption dataset.

For more details, check out our [tech report]() and project page (coming soon).
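The OCR and grounding datasets above are hosted as Hugging Face datasets; as a quick orientation, a minimal sketch of streaming a few records with the `datasets` library is shown below. The split name and record schema are assumptions here, so refer to each dataset card for the authoritative fields.

```python
# Hedged sketch: stream a handful of records from one of the released pre-training datasets.
# Assumes the `datasets` library and a default "train" split; the exact column names
# (image, caption, OCR annotations, ...) are defined on the dataset card, not here.
from datasets import load_dataset

ocr_ds = load_dataset("Salesforce/blip3-ocr-200m", split="train", streaming=True)

for i, sample in enumerate(ocr_ds):
    print(sample.keys())  # inspect which fields each record exposes
    if i == 2:
        break
```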
# Data
The base model is pre-trained on a mixture of data sources described above, with around 100 billion image-text tokens in total.
# Results
### Few-shot Evaluation of the Base Model (without instruction tuning)
| Model        | Shot | VQAv2    | TextVQA  | OKVQA    | COCO      | NoCaps    | TextCaps |
|:-------------|:-----|:---------|:---------|:---------|:----------|:----------|:---------|
| Flamingo-3B  | 0    | 49.2     | 30.1     | 41.2     | 73.0      | -         | -        |
|              | 4    | 53.2     | 32.7     | 43.3     | 85.0      | -         | -        |
|              | 8    | 55.4     | 32.4     | 44.6     | 90.6      | -         | -        |
| MM1-3B       | 0    | 46.2     | 29.4     | 26.1     | 73.5      | 55.6      | 63.3     |
|              | 4    | 57.9     | 45.3     | 44.6     | **112.3** | 99.7      | 84.1     |
|              | 8    | 63.6     | 44.6     | 48.4     | **114.6** | **104.7** | 88.8     |
| xGen-MM-base | 0    | 43.1     | 34.0     | 28.0     | 67.2      | 82.6      | 69.5     |
|              | 4    | **66.3** | **54.2** | **48.9** | 107.6     | **100.8** | **89.9** |
|              | 8    | **66.9** | **55.3** | **50.1** | 109.8     | 104.6     | **94.0** |
### Showcases of In-Context Learning

Below are some qualitative examples of the multi-modal in-context learning capacity of our base model; a schematic sketch of how such interleaved prompts are assembled follows the examples.

<img src="icl_examples/art.png" alt="Art" width=500>

<img src="icl_examples/animal.png" alt="Animal" width=500>

<img src="icl_examples/street.png" alt="Street" width=500>
# How to use
Please check out our [inference notebook](demo.ipynb) for example code to use our models.
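The notebook is the authoritative reference; as a rough sketch, loading one of the released checkpoints with Hugging Face `transformers` typically follows the `trust_remote_code` pattern below. The specific Auto classes and preprocessing steps are assumptions to verify against [demo.ipynb](demo.ipynb).

```python
# Hedged sketch only: the exact classes and preprocessing are defined by the model repo
# (see demo.ipynb); this just shows the usual trust_remote_code loading pattern.
from transformers import AutoImageProcessor, AutoModelForVision2Seq, AutoTokenizer

model_id = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1.5"

model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# From here, image tensors from image_processor and tokenized text would be passed to
# model.generate(...); the exact argument names depend on the repo's custom code.
```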
# Reproducibility
The pretraining evaluation is implemented based on [OpenFlamingo](https://github.com/mlfoundations/open_flamingo), an open-source framework for training large multimodal models. Few-shot examples are drawn randomly, so there will be some variance across different random seeds.
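To make runs more comparable, the usual approach is to pin the random seeds before the few-shot demonstrations are sampled; a minimal sketch follows, where the seed value and the way demonstration indices are drawn are illustrative rather than the evaluation harness's actual code.

```python
# Minimal sketch: pin the common RNGs before sampling few-shot demonstrations so that
# repeated runs draw the same in-context examples.
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


seed_everything(42)
few_shot_ids = random.sample(range(10_000), k=8)  # e.g. pick 8 demonstration indices
```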
# Bias, Risks, Limitations, and Ethical Considerations
The main data sources are from the internet, including webpages,

We thank the authors for their open-source implementations.
# Citation
```
@misc{xgen_mm_phi3_mini,
  title={xgen-mm-phi3-mini-base Model Card},
  url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1},
  author={Salesforce AI Research},
  month={May},
  year={2024}
}
```