---
language:
- en
license: apache-2.0
tags:
- multimodal
- vision
- image-text-to-text
datasets:
- pixparse/docvqa-single-page-questions
---
<p align="center">
<img src="https://huggingface.co/HuggingFaceM4/idefics-80b/resolve/main/assets/IDEFICS.png" alt="Idefics-Obelics logo" width="200" height="100">
</p>
*As of April 18th, 2024, Idefics2 is part of the `4.40.0` Transformers pypi release. Please upgrade your Transformers version (`pip install transformers --upgrade`).*
# Idefics2-8B Fine-tuned on the DocVQA Dataset
## Model Information
- Base Model: [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b)
- Dataset Used: [DocVQA dataset](https://huggingface.co/datasets/pixparse/docvqa-single-page-questions)
- Introduced in Mathew et al. (2021)
- Consists of 50,000 questions defined on 12,000+ document images
- For further information, visit the [challenge page](https://rrc.cvc.uab.es/?ch=17) and [paper](https://arxiv.org/abs/2007.00398)
## Training Details
- Training took approximately 38 hours on a single A100 80GB GPU; the model was fine-tuned using QLoRA.
- Trained on the ~39.5k-question training split of [DocVQA single page questions](https://huggingface.co/datasets/pixparse/docvqa-single-page-questions)
- Training Log:
| Epoch | Loss | Grad Norm | Learning Rate |
|-------|-------|-----------|---------------|
| 0.01 | 2.3776| 10.40 | 4.8e-05 |
| 0.25 | 0.5029| 6.10 | 9.5412e-05 |
| 0.50 | 0.434 | 5.74 | 7.5973e-05 |
| 0.75 | 0.4608| 7.46 | 7.3925e-05 |
| 1.0 | 0.3846| 4.77 | 5.0369e-05 |
| 1.25 | 0.3226| 3.63 | 4.9857e-05 |
| 1.5 | 0.3175| 5.03 | 2.5277e-05 |
| 1.75 | 0.2918| 5.63 | 2.5789e-05 |
| 2.0 | 0.2917| 4.58 | 2.0483e-07 |
Final training metrics: `{'train_runtime': 141781.6786, 'train_samples_per_second': 0.557, 'train_steps_per_second': 0.035, 'train_loss': 0.3973848872424526, 'epoch': 2.0}`
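For readers who want to reproduce a similar setup, a minimal QLoRA configuration for Idefics2 with `peft` and `bitsandbytes` might look like the sketch below. The LoRA rank, target-module pattern, and other hyperparameters are illustrative assumptions, not the exact values used for this checkpoint.

```python
# Illustrative QLoRA setup (hyperparameters are assumptions, not the exact ones used for this run)
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)

# Attach LoRA adapters to the attention/MLP projection layers (example regex)
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=r".*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$",
    init_lora_weights="gaussian",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```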
## Processor Configuration
```python
from transformers import AutoProcessor

# do_image_splitting=True matches the setting used during fine-tuning
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=True
)
```
## Vision Encoder Efficiency
Given the high resolution supported, the vision part of the model can be memory-hungry depending on your configuration. If you are GPU-memory-constrained, you can:
1. **Deactivate image splitting**: To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). No changes are required on the model side. Note that this model was fine-tuned with image splitting enabled.
2. **Decrease maximum image resolution**: To do so, add `size={"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need (the default value is 980). We recommend using values that are multiples of 14. There are no changes required on the model side.
`do_image_splitting=True` is especially helpful on OCR-heavy tasks where a very large image is used as input. For regular VQA or captioning tasks, this argument can usually be set to `False` with minimal impact on performance. Both memory-saving options are combined in the sketch below.
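A minimal sketch of a memory-constrained setup that applies both options at once, using the example resolution values from above:

```python
from transformers import AutoProcessor

# Memory-saving processor configuration:
# - image splitting disabled
# - longest image edge capped at 448 pixels (a multiple of 14)
processor = AutoProcessor.from_pretrained(
    "SalmanFaroz/idefics2-8b-DocVQA-SP",
    do_image_splitting=False,
    size={"longest_edge": 448, "shortest_edge": 378},
)
```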
## Testing and Inference
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

# Load two example document images (invoices)
image1 = load_image("https://templates.invoicehome.com/invoice-template-us-classic-white-750px.png")
image2 = load_image("https://cdn.vertex42.com/WordTemplates/images/word-invoice-template.png")

# Initialize the processor with image splitting, matching the fine-tuning setup
processor = AutoProcessor.from_pretrained("SalmanFaroz/idefics2-8b-DocVQA-SP", do_image_splitting=True)
```
**Full Precision:**
```python
# Load the fine-tuned model in full (float32) precision
model = AutoModelForVision2Seq.from_pretrained(
    "SalmanFaroz/idefics2-8b-DocVQA-SP",
).to(DEVICE)
```
**or**
**Half Precision Inference:**
```python
# Load in half precision (float16) to roughly halve GPU memory use
model = AutoModelForVision2Seq.from_pretrained(
    "SalmanFaroz/idefics2-8b-DocVQA-SP",
    torch_dtype=torch.float16,
).to(DEVICE)
```
**or**
**4-bit Quantization with bitsandbytes:**
Make sure `accelerate` and `bitsandbytes` are installed (`pip install accelerate bitsandbytes`):
```python
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and float16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)
# Note: .to(DEVICE) is not supported for 4-bit models; place it via device_map instead
model = AutoModelForVision2Seq.from_pretrained(
    "SalmanFaroz/idefics2-8b-DocVQA-SP",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map=DEVICE,
)
```
Then build the prompt and run generation:
```python
# Create inputs: a one-shot example on the first image, then the actual question on the second
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "what is invoice date?"},
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "11.02.2019"},
]
},
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "what is the total?"},
]
},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
```
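The decoded text includes the full prompt. If you only want the newly generated answer, one option (a small sketch, not part of the original snippet) is to slice off the prompt tokens before decoding:

```python
# Keep only the tokens generated after the prompt
prompt_length = inputs["input_ids"].shape[1]
answers = processor.batch_decode(
    generated_ids[:, prompt_length:], skip_special_tokens=True
)
print(answers)
```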