---
license: mit
datasets:
- nielsr/docvqa_1200_examples_donut
language:
- en
library_name: transformers
pipeline_tag: visual-question-answering
---

### IDEFICS2-OCR

Finetuned from Idefics2-8b with fp16 weight updates on the nielsr/docvqa_1200_examples_donut dataset of document VQA pairs.

## Usage

```python
import torch
from transformers import BitsAndBytesConfig, AutoModelForVision2Seq, AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("smishr-18/Idefics2-OCR", do_image_splitting=False)

# Load the model with 4-bit NF4 quantization and fp16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = AutoModelForVision2Seq.from_pretrained(
    "smishr-18/Idefics2-OCR",
    quantization_config=bnb_config,
    device_map=device,
    low_cpu_mem_usage=True
)

image = load_image("https://images.pokemontcg.io/pl1/1_hires.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain."},
            {"type": "image"},
            {"type": "text", "text": "What is the reflex energy in the image?"}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text.strip()], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate the answer
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# The reflex energy in the image is 70.
```

## Limitations

The model was finetuned on a single, memory-limited T4 GPU. It could be finetuned further with additional adapters on devices where `torch.cuda.get_device_capability()[0] >= 8` (Ampere GPUs or newer); see the dtype-selection sketch at the end of this card.

- **Developed by:** Shubh Mishra, Aug 2024
- **Model Type:** VLM
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** HuggingFaceM4/idefics2-8b
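
Below is a minimal sketch (not part of the original finetuning code) of how one might pick a compute dtype based on the GPU's compute capability, as referenced in the Limitations section; the `compute_dtype` variable name is illustrative.

```python
import torch

# Hedged sketch: prefer bfloat16 on Ampere-or-newer GPUs (compute capability >= 8),
# otherwise fall back to float16 (e.g. on the T4 used for this finetune).
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
    compute_dtype = torch.bfloat16
else:
    compute_dtype = torch.float16

print(compute_dtype)
```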