VictorSanh committed
Commit a05b1dc · Parent(s): 30a9a75
updates
README.md
CHANGED
@@ -30,6 +30,7 @@ tags:
 <img src="https://huggingface.co/HuggingFaceM4/idefics-80b/resolve/main/assets/IDEFICS.png" alt="Idefics-Obelics logo" width="200" height="100">
 </p>

+***As of April 18th, 2024**, Idefics2 is part of the `4.40.0` Transformers pypi release. Please upgrade your Transformers version (`pip install transformers --upgrade`).*

 # Idefics2

@@ -40,7 +41,6 @@ We release under the Apache 2.0 license 2 checkpoints:
 - [idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b): the base model fine-tuned on a mixture of supervised and instruction datasets (text-only and multimodal datasets)
 - idefics2-8b-chatty (coming soon): `idefics2-8b` further fine-tuned on long conversation

-
 # Model Summary

 - **Developed by:** Hugging Face
@@ -217,11 +217,22 @@ print(generated_texts)

 # Model optimizations

+If your GPU allows, we first recommend loading (and running inference) in half precision (`torch.float16` or `torch.bfloat16`).
+
+```diff
+model = AutoModelForVision2Seq.from_pretrained(
+    "HuggingFaceM4/idefics2-8b",
++    torch_dtype=torch.float16,
+).to(DEVICE)
+```
+
 **Vision encoder efficiency**

 Given the high resolution supported, the vision part of the model can be memory hungry depending on your configuration. If you are GPU-memory-constrained, you can:
 - **deactivate the image splitting.** To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the sft model has been trained with image splitting.
-- **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need. We recommend using values that are multiples of 14. There are no changes required on the model side.
+- **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need (the default value is `980`). We recommend using values that are multiples of 14. There are no changes required on the model side.
+
+`do_image_splitting=True` is especially needed to boost performance on OCR tasks where a very large image is used as input. For the regular VQA or captioning tasks, this argument can be safely set to `False` with minimal impact on performance (see the evaluation table above).

 **Using Flash-attention 2 to speed up generation**

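As a minimal sketch of the two processor-side options described in the hunk above (the keyword arguments and default values are the ones named in the README; combining both in one call is only an illustration):

```python
from transformers import AutoProcessor

# Both options are processor-side only; no change is needed on the model.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,                          # skip splitting each image into sub-images
    size={"longest_edge": 448, "shortest_edge": 378},  # cap resolution (default longest_edge is 980)
)
```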
@@ -232,7 +243,7 @@ First, make sure to install `flash-attn`. Refer to the [original repository of F
 ```diff
 model = AutoModelForVision2Seq.from_pretrained(
     "HuggingFaceM4/idefics2-8b",
-+    torch_dtype=torch.
++    torch_dtype=torch.float16,
 +    _attn_implementation="flash_attention_2",
 ).to(DEVICE)
 ```
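Spelled out as a self-contained snippet rather than a diff, the Flash Attention 2 loading above might look as follows (`DEVICE` is assumed to be the same CUDA device variable used in the README's earlier generation example):

```python
import torch
from transformers import AutoModelForVision2Seq

DEVICE = "cuda:0"  # assumption: same device as in the generation snippet earlier in the README

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    _attn_implementation="flash_attention_2",  # requires flash-attn to be installed
).to(DEVICE)
```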
@@ -241,11 +252,11 @@ Flash attention 2 support is available both for `idefics2-8b-base` and `idefics2

 </details>

-**4 bit quantization
+**4 bit quantization with AWQ**

 <details><summary>Click to expand.</summary>

-4-bit AWQ-quantized versions of the checkpoints are also available and allow module fusing for accelerated inference. First make sure you install the Auto-AWQ library with `pip install autoawq`. Also make sure that this [fix] is integrated into your installation.
+4-bit AWQ-quantized versions of the checkpoints are also available and allow module fusing for accelerated inference. First make sure you install the Auto-AWQ library with `pip install autoawq`. Also make sure that this [fix](https://github.com/casper-hansen/AutoAWQ/pull/444) is integrated into your installation.

 ```diff
 + from transformers import AwqConfig
@@ -266,12 +277,63 @@ Flash attention 2 support is available both for `idefics2-8b-base` and `idefics2
 model = AutoModelForVision2Seq.from_pretrained(
-    "HuggingFaceM4/idefics2-8b",
+    "HuggingFaceM4/idefics2-8b-AWQ",
++    torch_dtype=torch.float16,
 +    quantization_config=quantization_config,
 ).to(DEVICE)
 ```

+Fusing can be de-activated by removing `quantization_config` in the call to `from_pretrained`.
 </details>

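The hunk above only shows the lines that changed; a possible end-to-end version is sketched below. The `AwqConfig` fusing values are illustrative assumptions, not the ones from the README (those lines are outside this diff):

```python
import torch
from transformers import AutoModelForVision2Seq, AwqConfig

DEVICE = "cuda:0"  # assumption: same device variable as in the earlier snippets

# Illustrative fusing configuration; for some architectures, modules_to_fuse
# may also need to be specified explicitly.
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=4096,
    do_fuse=True,
)

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b-AWQ",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
).to(DEVICE)
```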
+**4 bit quantization with bitsandbytes**
+
+<details><summary>Click to expand.</summary>
+It is also possible to load Idefics2 in 4bits with `bitsandbytes`. To do so, make sure that you have `accelerate` and `bitsandbytes` installed.
+
+```diff
++ from transformers import BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_compute_dtype=torch.float16
+)
+model = AutoModelForVision2Seq.from_pretrained(
+    "HuggingFaceM4/idefics2-8b",
++    torch_dtype=torch.float16,
++    quantization_config=quantization_config,
+).to(DEVICE)
+```
+
+</details>
+
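For convenience, the bitsandbytes variant above as a self-contained snippet. Note that moving a 4-bit bitsandbytes model with `.to()` may not be supported by recent `transformers` versions, so this sketch uses `device_map="auto"` instead; that substitution is an assumption, not part of the README:

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map="auto",  # places the quantized weights on the available GPU(s)
)
```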
+These optimizations can be combined to suit variable trade-offs between GPU memory, inference speed and performance. We provide the following comparison as anchor points to guide the user in choosing necessary optimizations. All of these benchmarks were computed with the example code snippet described above on a H100 (see [colab](https://colab.research.google.com/drive/1USsnssoFm1UTYuwUOw0XiGeBspLHzvso?usp=sharing)). As one can see, there are a few setups that require less than 24GB of GPU memory.
+
+| Flash attention 2 | Image splitting | Float type | 4 bits quantization         | Peak GPU memory (GB) | Time for 20 generations (secs) |
+|-------------------|-----------------|------------|-----------------------------|----------------------|--------------------------------|
+| No                | Yes             | fp32       | No                          | 54.9                 | 55.6                           |
+| No                | Yes             | bf16       | No                          | 41.3                 | 34.3                           |
+| No                | Yes             | fp16       | No                          | 36.7                 | 33.3                           |
+| Yes               | Yes             | fp16       | No                          | 21.0                 | 13.3                           |
+| Yes               | Yes             | fp16       | bitsandbytes (entire model) | 8.9                  | 19.9                           |
+| No                | Yes             | fp16       | bitsandbytes (entire model) | 24.7                 | 40.4                           |
+| No                | Yes             | fp16       | AWQ (LLM only)              | 26.4                 | 37.1                           |
+| Yes               | Yes             | fp16       | AWQ (LLM only)              | 10.7                 | 16.3                           |
+| No                | Yes             | fp16       | AWQ + fusing (LLM only)     | 26.0                 | 38.4                           |
+|                   |                 |            |                             |                      |                                |
+| No                | No              | fp32       | No                          | 38.8                 | 17.5                           |
+| No                | No              | bf16       | No                          | 22.2                 | 14.4                           |
+| No                | No              | fp16       | No                          | 21.3                 | 13.9                           |
+| Yes               | No              | fp16       | No                          | 18.1                 | 10.4                           |
+| Yes               | No              | fp16       | bitsandbytes (entire model) | 6.0                  | 17.3                           |
+| No                | No              | fp16       | bitsandbytes (entire model) | 9.2                  | 20.9                           |
+| No                | No              | fp16       | AWQ (LLM only)              | 10.9                 | 15.9                           |
+| Yes               | No              | fp16       | AWQ (LLM only)              | 7.8                  | 12.3                           |
+| No                | No              | fp16       | AWQ + fusing (LLM only)     | 10.5                 | 19.5                           |
+
+To learn more about quantization schemes and fusing, we refer to the [documentation](https://huggingface.co/docs/transformers/quantization).
+
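The peak-memory and timing columns above can be reproduced approximately with standard PyTorch utilities; a rough sketch (assuming `model` and `inputs` come from the generation example earlier in the README and a single CUDA device; `max_new_tokens` is an assumption) is:

```python
import time
import torch

# Rough measurement sketch; the linked colab is the reference for the numbers in the table.
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
for _ in range(20):
    generated_ids = model.generate(**inputs, max_new_tokens=500)
elapsed = time.perf_counter() - start

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.1f} GB | Time for 20 generations: {elapsed:.1f} s")
```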
 # Bias, Risks, and Limitations

 Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).