Update README.md #14
by Criztov - opened

README.md CHANGED
@@ -19,7 +19,7 @@ widget:
 ---
 ## Model Summary
 
-Phi-3
+The Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model built upon datasets which include - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data both on text and vision. The model belongs to the Phi-3 model family, and the multimodal version comes with 128K context length (in tokens) it can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.
 
 Resources and Technical Documentation:
 
@@ -59,7 +59,7 @@ Nothing contained in this Model Card should be interpreted as or deemed a restri
 
 ## How to Use
 
-Phi-3-
+Phi-3-Vision-128K-Instruct has been integrated in the development version (4.40.2) of `transformers`. Until the official version is released through `pip`, ensure that you are doing one of the following:
 * When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function.
 
 * Update your local `transformers` to the development version: `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`. The previous command is an alternative to cloning and installing from the source.
 
@@ -190,7 +190,7 @@ More details can be found in the [Phi-3 Technical Report](https://aka.ms/phi3-te
 
 ## Benchmarks
 
-To understand the capabilities, we compare Phi-3
+To understand the capabilities, we compare Phi-3-Vision-128K-Instruct with a set of models over a variety of zero-shot benchmarks using our internal benchmark platform.
 
 |Benchmark|Phi-3 Vision-128K-In|LlaVA-1.6 Vicuna-7B|QWEN-VL Chat|Llama3-Llava-Next-8B|Claude-3 Haiku|Gemini 1.0 Pro V|GPT-4V-Turbo|
 |---------|---------------------|------------------|------------|--------------------|--------------|----------------|------------|
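The "How to Use" instructions above (passing `trust_remote_code=True` to `from_pretrained()`, with `transformers` at the development version) can be sketched in Python. This is a minimal sketch, not the model card's official snippet: the Hub repository id `microsoft/Phi-3-vision-128k-instruct` is an assumption inferred from the model's name and is not confirmed by this diff, and actually calling the loader downloads the full model weights.

```python
# Minimal sketch of loading Phi-3-Vision-128K-Instruct with transformers.
# The repository id below is an assumption, not taken from this diff.
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Phi-3-vision-128k-instruct"  # assumed Hub id


def load_phi3_vision(model_id: str = MODEL_ID):
    """Load the model and its processor.

    trust_remote_code=True is required because the checkpoint ships custom
    modeling code, per the bullet in the "How to Use" section.
    """
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,  # mandatory per the model card
        torch_dtype="auto",      # pick the checkpoint's native dtype
    )
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor


# load_phi3_vision() is not invoked at import time: calling it downloads
# several gigabytes of weights.
```

The `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers` command from the diff would be run beforehand so that the development version of the library is available.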