---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
---

# Model description
We are excited to announce the continuation and rebranding of our **BLIP series** into **XGen-MM**, to be better aligned with Salesforce's unified XGen initiative for large foundation models! This rebranding marks a significant step in our ongoing development of cutting-edge multimodal technologies.

`XGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series builds on the successful designs of the `BLIP` series, incorporating fundamental enhancements that provide a more robust foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.

XGen-MM highlights:

* The **pretrained** foundation model, `xgen-mm-phi3-mini-base-r-v1`, achieves state-of-the-art performance under 5B parameters and demonstrates strong in-context learning capabilities.
* The **instruct** fine-tuned model, `xgen-mm-phi3-mini-instruct-r-v1`, achieves state-of-the-art performance among open-source and closed-source VLMs under 5B parameters.
* `xgen-mm-phi3-mini-instruct-r-v1` supports flexible high-resolution image encoding with efficient visual token sampling.

The model is for research purposes. More technical details will follow in a forthcoming technical report.

# Datasets
| Dataset Type | Dataset(s) Used |
|--------|------------------------------------------|
| Pretrain | high-quality image caption datasets and interleaved datasets |
| Instruction Tuning | a mixture of VQA data and caption datasets including OCR/Document/Chart-focused tasks, publicly available text-only instruction data |

# Results

### Pretrain (base model without instruction tuning)
| Model | Shot | COCO (val) | NoCaps (val) | TextCaps (val) | OKVQA (val) | TextVQA (val) | VizWiz (testdev) | VQAv2 (testdev) |
|-------------|------|------------|--------------|----------------|--------------|---------------|------------------|-----------------|
| Flamingo-3B | 4 | 85.0 | - | - | 43.3 | 32.7 | 34 | 53.2 |
| | 8 | 90.6 | - | - | 44.6 | 32.4 | 38.4 | 55.4 |
| MM1-3B | 0 | 73.5 | 55.6 | 63.3 | 26.1 | 29.4 | 15.6 | 46.2 |
| | 4 | 112.3 | 99.7 | 84.1 | 48.6 | 45.3 | 38.0 | 57.9 |
| | 8 | 114.6 | 104.7 | 88.8 | 48.4 | 44.6 | 46.4 | 63.6 |
| **xgen-mm-phi3-mini-base-r-v1 (Ours)** | 0 | **81.7** | **80.2** | 60.7 | **26.5** | **36.0** | **21.2** | **48.1** |
| | 4 | 110.5 | **101.7** | **84.6** | **49.2** | **46.1** | **38.4** | **63.9** |
| | 8 | 112.1 | 104.4 | 87.7 | **49.1** | **46.4** | 44.3 | **63.8** |

### Instruct (after instruction tuning)
| Model | SEED-IMG | MMBench(dev) | MME-total | MME-P | MME-C | MMStar | MMMU (val) | MMVet | MathVista (mini) | ScienceQA (test) | POPE | AI2D |
|----------------------------|----------|--------------|-----------|----------|---------|----------|------------|----------|------------------|------------------|----------|----------|
| MM1-3B-Chat | 68.8 | 67.8 | 1761 | **1482** | 279 | - | 33.9 | 43.7 | - | - | **87.4** | - |
| openbmb/MiniCPM-V-2 | 67.1 | 69.6 | 1808 | - | - | - | 38.2 | - | 38.7 | - | - | - |
| VILA1.5-3B | 67.9 | 63.4 | - | 1442 | - | - | 33.3 | 35.4 | - | 69.0 | 85.9 | - |
| xtuner/llava-phi-3-mini-hf | 70.0 | 69.2 | 1790 | 1477 | 313 | 43.7 | **41.4** | - | - | 73.7 | 87.3 | 69.3 |
| **xgen-mm-phi3-mini-instruct-r-v1 (Ours)** | **72.1** | **74.1** | **1827** | 1467 | **360** | **44.6** | 39.8 | **45.1** | **39.3** | **74.2** | 87.2 | **75.8** |

# How to use

~~> We require the use of the development version (`"4.41.0.dev0"`) of the `transformers` library. To get it, as of 05/07/2024, one can use `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`.~~

```python
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor
import requests
from PIL import Image
import IPython.display as display
import torch

model_name_or_path = "Salesforce/xgen-mm-phi3-mini-base-r-v1"

# Load the model, tokenizer, and image processor; the custom model code is fetched from the model repo.
model = AutoModelForVision2Seq.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, use_fast=True, legacy=False)
image_processor = AutoImageProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = model.update_special_tokens(tokenizer)

model = model.to('cuda')
tokenizer.padding_side = "left"

def apply_prompt_template(prompt, num_images=1, num_tokens_per_vis = 128, in_context=False, output=None):
    """
    num_tokens_per_vis: model.vlm.num_tokens_per_vis
    """
    placeholder_image_tokens = "" * (num_tokens_per_vis - 1)
    if in_context:
        # Few-shot demonstration: image tokens, prompt, expected output, end-of-chunk marker.
        formatted_prompt = f"{placeholder_image_tokens}" + f"{prompt}" + f"{output}" + "<|endofchunk|>"
    else:
        # Query: image tokens (repeated once per image) followed by the prompt.
        formatted_prompt = f"{placeholder_image_tokens}"*num_images + f"{prompt}"
    return formatted_prompt

############ Zero shot inference ##########
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
instruction = "Describe what is the dog doing in this image in one sentence:"
print("==> Instruction: ", instruction)
print("==> Image: ")
display.display(raw_image.resize((int(raw_image.width*0.3), int(raw_image.height*0.3))))

# Merge the image features and the tokenized prompt into one input dict on the GPU.
inputs = image_processor([raw_image], return_tensors="pt")
prompt = apply_prompt_template(instruction)
language_inputs = tokenizer([prompt], return_tensors="pt")
inputs.update(language_inputs)
inputs = {name: tensor.cuda() for name, tensor in inputs.items()}

# Greedy decoding under bfloat16 autocast.
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    generated_text = model.generate(**inputs,
                                    pad_token_id=tokenizer.pad_token_id,
                                    do_sample=False, max_new_tokens=64, top_p=None, num_beams=1,
                                    length_penalty=1.0, repetition_penalty=3.0)
prediction = tokenizer.decode(generated_text[0], skip_special_tokens=True)
print("==> prediction: ", prediction)
print("-"*120)
# ==> prediction:  The dog is sitting on the beach and waving at his owner.
```

More comprehensive examples, including one zero-shot and one few-shot example, can be found in the [notebook](demo.ipynb).

# Reproducibility

Our SFT evaluation is based on VLMEvalKit, in which we fixed some inconsistencies with the official benchmarks (e.g., the LLM judge API). During development, we noticed that the raw resolution of the input image can noticeably affect the model output in some cases.

# Bias, Risks, Limitations, and Ethical Considerations
The training data comes mainly from the internet, including webpages, image stock sites, and curated datasets released by the research community. We have excluded certain data, such as LAION, due to known CSAM concerns.
The model may be subject to bias from the original data sources, as well as bias from LLMs and commercial APIs.
We strongly recommend that users assess safety and fairness before applying the model to downstream applications.

# License

Our code and weights are released under the Apache-2.0 license. The copyright of the training data remains with the original data owner.

# Code acknowledgement

[LAVIS](https://github.com/salesforce/LAVIS) \
[openflamingo](https://github.com/mlfoundations/open_flamingo) \
[VLMEvalKit](https://github.com/open-compass/VLMEvalKit/tree/main)

# Citation
```
@misc{xgen_mm_phi3_mini,
    title={xgen-mm-phi3-mini-base Model Card},
    url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1},
    author={Salesforce AI Research},
    month={May},
    year={2024}
}
```

# Troubleshoot

1. If any packages are missing, consider installing the following:

```
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
pip install open_clip_torch==2.24.0
pip install einops
pip install einops-exts
pip install transformers==4.41.1
```

# Changelog

* 05/24/2024
  * update codebase to be compatible with `transformers==4.41.1`.
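
Before working through the Troubleshoot commands, it may help to see which of the pinned packages are already installed. The following is a minimal sketch, not part of the released code: the `PINNED` table and the `check_environment` helper are hypothetical names introduced here, with the version numbers copied from the `pip install` lines in the Troubleshoot section. It relies only on the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical table: pins copied from the Troubleshoot section.
# None means that section did not pin a version for the package.
PINNED = {
    "torch": "2.2.1",
    "torchvision": "0.17.1",
    "torchaudio": "2.2.1",
    "open_clip_torch": "2.24.0",
    "einops": None,
    "einops-exts": None,
    "transformers": "4.41.1",
}


def check_environment(pinned=PINNED):
    """Return (package, expected, found) for every missing or mismatched package."""
    problems = []
    for package, expected in pinned.items():
        try:
            found = version(package)
        except PackageNotFoundError:
            # Package is not installed at all.
            problems.append((package, expected, None))
            continue
        # Ignore local build tags such as "+cu121" when comparing versions.
        if expected is not None and found.split("+")[0] != expected:
            problems.append((package, expected, found))
    return problems


if __name__ == "__main__":
    for package, expected, found in check_environment():
        print(f"{package}: expected {expected or 'any version'}, "
              f"found {found or 'not installed'}")
```

An empty result means every listed package is present at the pinned version. Other versions may also work; the pins are simply the ones listed above.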