xurantju committed
Commit: d2fc333
Parent: 40af279

update latest MM1 results, model path fix

Files changed (1): README.md (+4, -4)
README.md CHANGED
```diff
@@ -9,7 +9,7 @@ pipeline_tag: image-text-to-text
 # Model description
 We are excited to announce the continuation and rebranding of our **BLIP series** into **XGen-MM**, aligning with Salesforce's unified XGen initiative for large foundation models! This rebranding marks a significant step in our ongoing development of cutting-edge multimodal technologies.
 
-'XGen-mm' is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. \
+`XGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. \
 These models have been trained at scale on high-quality image caption datasets and interleaved image-text data. XGen-MM highlights a few features below,
 
 * The **pretrained** foundation model, `xgen-mm-phi3-mini-base-r-v1`, achieves state-of-the-art performance under 5b parameters and demonstrates strong in-context learning capabilities.
@@ -43,11 +43,11 @@ More technical details will come with a technical report soon.
 ### Instruct (after instruction tuning)
 | Model | SEED-IMG | MMBench(dev) | MME-total | MME-P | MME-C | MMStar | MMMU (val) | MMVet | MathVista (mini) | ScienceQA (test) | POPE | AI2D | |
 |----------------------------|----------|--------------|-----------|----------|---------|----------|------------|----------|------------------|------------------|----------|----------|---|
-| MM1-3B-Chat | 68.8 | **75.9** | 1761 | **1482** | 279 | - | 33.9 | 43.7 | - | - | **87.4** | - | |
+| MM1-3B-Chat | 68.8 | 67.8 | 1761 | **1482** | 279 | - | 33.9 | 43.7 | - | - | **87.4** | - | |
 | openbmb/MiniCPM-V-2 | 67.1 | 69.6 | 1808 | - | - | - | 38.2 | - | 38.7 | - | - | - | |
 | VILA1.5-3B | 67.9 | 63.4 | - | 1442 | - | - | 33.3 | 35.4 | - | 69.0 | 85.9 | - | |
 | xtuner/llava-phi-3-mini-hf | 70.0 | 69.2 | 1790 | 1477 | 313 | 43.7 | **41.4** | - | - | 73.7 | 87.3 | 69.3 | |
-| **xgen-mm-phi3-mini-instruct-r-v1 (Ours)** | **72.1** | 74.1 | **1827** | 1467 | **360** | **44.6** | 39.8 | **45.1** | **39.3** | **74.2** | 87.2 | **75.8** | |
+| **xgen-mm-phi3-mini-instruct-r-v1 (Ours)** | **72.1** | **74.1** | **1827** | 1467 | **360** | **44.6** | 39.8 | **45.1** | **39.3** | **74.2** | 87.2 | **75.8** | |
 
 
 # How to use
@@ -77,7 +77,7 @@ class EosListStoppingCriteria(StoppingCriteria):
         return self.eos_sequence in last_ids
 
 # load models
-model_name_or_path = "Salesforce/blip3-phi3-mini-instruct-r-v1"
+model_name_or_path = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"
 model = AutoModelForVision2Seq.from_pretrained(model_name_or_path, trust_remote_code=True)
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, use_fast=False, legacy=False)
 image_processor = AutoImageProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
```
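The diff context above shows only fragments of the README's usage example. For orientation, here is a minimal self-contained sketch of how those pieces fit together, completing the `EosListStoppingCriteria` fragment visible in the last hunk. The `[32007]` default (Phi-3's `<|end|>` token id), the image URL, the prompt format, and the `pixel_values` keyword passed to `generate` are all assumptions: the model's actual input handling is defined by its `trust_remote_code` implementation, so consult the full model card for the authoritative version.

```python
import requests
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoModelForVision2Seq,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

class EosListStoppingCriteria(StoppingCriteria):
    """Stop generation once a given token sequence appears at the end of the output.

    Completes the fragment shown in the diff context; the default [32007]
    is an assumption (Phi-3's "<|end|>" token id).
    """

    def __init__(self, eos_sequence=[32007]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Compare the trailing tokens of each sequence in the batch
        # against the stop sequence.
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids

# Load model, tokenizer, and image processor as in the updated README.
model_name_or_path = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"
model = AutoModelForVision2Seq.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, use_fast=False, legacy=False)
image_processor = AutoImageProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)

# Hypothetical single-image query; the real model card wraps the question
# in a chat-style prompt template, so treat this prompt as a placeholder.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw).convert("RGB")
pixel_values = image_processor([image], return_tensors="pt")["pixel_values"]
inputs = tokenizer("<image>\nWhat is in this picture?", return_tensors="pt")

generated = model.generate(
    **inputs,
    pixel_values=pixel_values,  # assumed keyword; defined by the remote code
    max_new_tokens=256,
    stopping_criteria=StoppingCriteriaList([EosListStoppingCriteria()]),
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

A custom stopping criterion like this is useful when the chat format's end-of-turn marker differs from the tokenizer's default EOS token, which is why the README defines one instead of relying on `eos_token_id` alone.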