VictorSanh committed
Commit a05b1dc
1 Parent(s): 30a9a75
Files changed (1)
  1. README.md +67 -5
README.md CHANGED
@@ -30,6 +30,7 @@ tags:
  <img src="https://huggingface.co/HuggingFaceM4/idefics-80b/resolve/main/assets/IDEFICS.png" alt="Idefics-Obelics logo" width="200" height="100">
  </p>
  
+ ***As of April 18th, 2024**, Idefics2 is part of the `4.40.0` Transformers pypi release. Please upgrade your Transformers version (`pip install transformers --upgrade`).*
  
  # Idefics2
  
@@ -40,7 +41,6 @@ We release under the Apache 2.0 license 2 checkpoints:
  - [idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b): the base model fine-tuned on a mixture of supervised and instruction datasets (text-only and multimodal datasets)
  - idefics2-8b-chatty (coming soon): `idefics2-8b` further fine-tuned on long conversations
  
- 
  # Model Summary
  
  - **Developed by:** Hugging Face
@@ -217,11 +217,22 @@ print(generated_texts)
  
  # Model optimizations
  
+ If your GPU allows, we first recommend loading (and running inference) in half precision (`torch.float16` or `torch.bfloat16`).
+ 
+ ```diff
+ model = AutoModelForVision2Seq.from_pretrained(
+     "HuggingFaceM4/idefics2-8b",
+ +    torch_dtype=torch.float16,
+ ).to(DEVICE)
+ ```
+ 
  **Vision encoder efficiency**
  
  Given the high resolution supported, the vision part of the model can be memory hungry depending on your configuration. If you are GPU-memory-constrained, you can:
  - **deactivate the image splitting.** To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the sft model has been trained with image splitting.
- - **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need. We recommend using values that are multiples of 14. There are no changes required on the model side.
+ - **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need (the default value is `980`). We recommend using values that are multiples of 14. There are no changes required on the model side.
+ 
+ `do_image_splitting=True` is especially needed to boost performance on OCR tasks where a very large image is used as input. For the regular VQA or captioning tasks, this argument can be safely set to `False` with minimal impact on performance (see the evaluation table above); both processor-side options are sketched together below.
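Put together, a minimal sketch of the two processor-side options above; the model id and the values are the ones quoted in this section, everything else is left at its default:

```python
from transformers import AutoProcessor

# Minimal sketch of the two memory-saving processor options described above.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,  # skip splitting each input image into sub-images
    size={"longest_edge": 448, "shortest_edge": 378},  # default longest_edge is 980
)
```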
 
 
  
  **Using Flash-attention 2 to speed up generation**
  
@@ -232,7 +243,7 @@ First, make sure to install `flash-attn`. Refer to the [original repository of F
  ```diff
  model = AutoModelForVision2Seq.from_pretrained(
      "HuggingFaceM4/idefics2-8b",
- +    torch_dtype=torch.bfloat16,
+ +    torch_dtype=torch.float16,
  +    _attn_implementation="flash_attention_2",
  ).to(DEVICE)
  ```
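For convenience, a runnable (non-diff) version of the snippet above; a sketch assuming `flash-attn` is installed and a single CUDA device (`DEVICE` is our assumption):

```python
import torch
from transformers import AutoModelForVision2Seq

DEVICE = "cuda:0"  # assumption: adjust to your setup

# Half precision plus Flash Attention 2, as in the diff above.
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    _attn_implementation="flash_attention_2",
).to(DEVICE)
```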
@@ -241,11 +252,11 @@ Flash attention 2 support is available both for `idefics2-8b-base` and `idefics2
  
  </details>
  
- **4 bit quantization and module fusing**
+ **4 bit quantization with AWQ**
  
  <details><summary>Click to expand.</summary>
  
- 4-bit AWQ-quantized versions of the checkpoints are also available and allow module fusing for accelerated inference. First make sure you install the Auto-AWQ library with `pip install autoawq`. Also make sure that this [fix] is integrated into your installation.
+ 4-bit AWQ-quantized versions of the checkpoints are also available and allow module fusing for accelerated inference. First make sure you install the Auto-AWQ library with `pip install autoawq`. Also make sure that this [fix](https://github.com/casper-hansen/AutoAWQ/pull/444) is integrated into your installation.
  
  ```diff
  + from transformers import AwqConfig
@@ -266,12 +277,63 @@ Flash attention 2 support is available both for `idefics2-8b-base` and `idefics2
  model = AutoModelForVision2Seq.from_pretrained(
  -     "HuggingFaceM4/idefics2-8b",
  +     "HuggingFaceM4/idefics2-8b-AWQ",
+ +     torch_dtype=torch.float16,
+ +     quantization_config=quantization_config,
+ ).to(DEVICE)
+ ```
+ 
+ Fusing can be deactivated by removing `quantization_config` in the call to `from_pretrained` (a sketch without fusing follows these sections).
+ </details>
+ 
+ **4 bit quantization with bitsandbytes**
+ 
+ <details><summary>Click to expand.</summary>
+ It is also possible to load Idefics2 in 4 bits with `bitsandbytes`. To do so, make sure that you have `accelerate` and `bitsandbytes` installed.
+ 
+ ```diff
+ + from transformers import BitsAndBytesConfig
+ 
+ quantization_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_use_double_quant=True,
+     bnb_4bit_compute_dtype=torch.float16
+ )
+ model = AutoModelForVision2Seq.from_pretrained(
+     "HuggingFaceM4/idefics2-8b",
+ +     torch_dtype=torch.float16,
  +     quantization_config=quantization_config,
  ).to(DEVICE)
  ```
  
  </details>
  
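As noted in the AWQ section, fusing is deactivated by simply omitting `quantization_config`; a minimal sketch of loading the AWQ checkpoint without fusing (`DEVICE` is our assumption):

```python
import torch
from transformers import AutoModelForVision2Seq

DEVICE = "cuda:0"  # assumption: adjust to your setup

# Loading the 4-bit AWQ checkpoint without module fusing: per the note above,
# omit quantization_config from the call to from_pretrained.
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b-AWQ",
    torch_dtype=torch.float16,
).to(DEVICE)
```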
 
+ These optimizations can be combined to suit different trade-offs between GPU memory, inference speed and performance. We provide the following comparison as anchor points to guide the user in choosing necessary optimizations. All of these benchmarks were computed with the example code snippet described above on an H100 (see [colab](https://colab.research.google.com/drive/1USsnssoFm1UTYuwUOw0XiGeBspLHzvso?usp=sharing)). As one can see, there are a few setups that require less than 24GB of GPU memory (a combined sketch closes this section).
+ 
+ | Flash attention 2 | Image splitting | Float type | 4 bits quantization | Peak GPU memory (GB) | Time for 20 generations (secs) |
+ |-------------------|-----------------|------------|-----------------------------|----------------------|--------------------------------|
+ | No | Yes | fp32 | No | 54.9 | 55.6 |
+ | No | Yes | bf16 | No | 41.3 | 34.3 |
+ | No | Yes | fp16 | No | 36.7 | 33.3 |
+ | Yes | Yes | fp16 | No | 21.0 | 13.3 |
+ | Yes | Yes | fp16 | bitsandbytes (entire model) | 8.9 | 19.9 |
+ | No | Yes | fp16 | bitsandbytes (entire model) | 24.7 | 40.4 |
+ | No | Yes | fp16 | AWQ (LLM only) | 26.4 | 37.1 |
+ | Yes | Yes | fp16 | AWQ (LLM only) | 10.7 | 16.3 |
+ | No | Yes | fp16 | AWQ + fusing (LLM only) | 26.0 | 38.4 |
+ | | | | | | |
+ | No | No | fp32 | No | 38.8 | 17.5 |
+ | No | No | bf16 | No | 22.2 | 14.4 |
+ | No | No | fp16 | No | 21.3 | 13.9 |
+ | Yes | No | fp16 | No | 18.1 | 10.4 |
+ | Yes | No | fp16 | bitsandbytes (entire model) | 6.0 | 17.3 |
+ | No | No | fp16 | bitsandbytes (entire model) | 9.2 | 20.9 |
+ | No | No | fp16 | AWQ (LLM only) | 10.9 | 15.9 |
+ | Yes | No | fp16 | AWQ (LLM only) | 7.8 | 12.3 |
+ | No | No | fp16 | AWQ + fusing (LLM only) | 10.5 | 19.5 |
+ 
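The table's two metrics can be reproduced with a small harness; a hypothetical sketch in the spirit of the linked colab (the actual benchmark code is not shown here; `model` and `inputs` are assumed to come from the card's earlier inference snippet, and `max_new_tokens=500` is an assumption):

```python
import time
import torch

# Hypothetical harness for the two columns above; assumes `model` and `inputs`
# were prepared as in the inference snippet earlier in this card.
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
for _ in range(20):  # "Time for 20 generations (secs)"
    _ = model.generate(**inputs, max_new_tokens=500)  # max_new_tokens is an assumption
elapsed = time.perf_counter() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024**3  # "Peak GPU memory (GB)"
print(f"peak memory: {peak_gb:.1f} GB | 20 generations: {elapsed:.1f} s")
```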
+ To learn more about quantization schemes and fusing, we refer to the [documentation](https://huggingface.co/docs/transformers/quantization).
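Combining rows of the table, here is one low-memory configuration; a sketch rather than the card's official recipe (`device_map="auto"` is our assumption, used instead of `.to(DEVICE)` because `transformers` typically refuses `.to()` on bitsandbytes-quantized models):

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# One of the lower-memory rows above: Flash Attention 2 + fp16 + 4-bit
# bitsandbytes, with image splitting disabled (6.0 GB peak in the table).
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    _attn_implementation="flash_attention_2",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",  # assumption: let accelerate place the quantized model
)
```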
+ 
  # Bias, Risks, and Limitations
  
  Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).