HugoLaurencon (HF staff) committed on
Commit d6e40c5
1 Parent(s): 59e3081

Comment on image splitting

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -223,6 +223,8 @@ Given the high resolution supported, the vision part of the model can be memory
  - **deactivate the image splitting.** To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the sft model has been trained with image splitting.
  - **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need. We recommend using values that are multiples of 14. There are no changes required on the model side.
 
+ `do_image_splitting=True` is especially needed to boost performance on OCR tasks where a very large image is used as input. For the regular VQA or captioning tasks, this argument can be safely set to `False` with minimal impact on performance (see the evaluation table above).
+
  **Using Flash-attention 2 to speed up generation**
 
  <details><summary>Click to expand.</summary>
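
For reference, the two memory-saving options described in this hunk can be combined when the processor is created. The following is a minimal sketch, not part of the commit: the helper names `low_memory_processor_kwargs` and `load_processor` are hypothetical, the checkpoint id `HuggingFaceM4/idefics2-8b` is an assumption (the diff does not name one), and rounding `longest_edge` down to a multiple of 14 is just one way to follow the README's recommendation.

```python
def low_memory_processor_kwargs(longest_edge=448, shortest_edge=378,
                                split_images=False, multiple=14):
    """Build keyword arguments for `AutoProcessor.from_pretrained`.

    Hypothetical helper. It rounds `longest_edge` down to a multiple of
    `multiple`, since the README recommends multiples of 14.
    """
    if longest_edge < multiple:
        raise ValueError(f"longest_edge must be at least {multiple}")
    rounded = (longest_edge // multiple) * multiple
    return {
        # Keep True for OCR on very large images; False is reported to
        # have minimal impact on regular VQA or captioning tasks.
        "do_image_splitting": split_images,
        "size": {"longest_edge": rounded, "shortest_edge": shortest_edge},
    }


def load_processor(model_id="HuggingFaceM4/idefics2-8b", **overrides):
    """Create the processor; the model id is an assumption, adjust as needed."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import AutoProcessor
    kwargs = low_memory_processor_kwargs(**overrides)
    return AutoProcessor.from_pretrained(model_id, **kwargs)
```

Calling `load_processor()` would download the processor configuration, so it needs network access. For an OCR workload, `load_processor(split_images=True)` keeps image splitting enabled while still capping the resolution. No change is needed on the model side in either case.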