Model Card for qwen-for-jawi-v1

Model Description

This model is a fine-tuned version of Qwen/Qwen2-VL-2B-Instruct, specialized for Optical Character Recognition (OCR) of historical Malay texts written in Jawi script (the Arabic script adapted for the Malay language).

Model Architecture

Base Model: Qwen2-VL-2B-Instruct
Model Type: Vision-Language Model
Parameters: 2 billion
Language(s): Malay (Jawi script)

Intended Use

Primary Intended Uses

OCR for historical Malay manuscripts written in Jawi script
Digital preservation of Malay cultural heritage
Enabling computational analysis of historical Malay texts

Out-of-Scope Uses

General Arabic text recognition
Modern Malay text processing
Real-time OCR applications

Training Data

Dataset Description

This was trained and evaluated using

Training Procedure

Hardware used: 1 x H100
Training time: 6 hours

Performance and Limitations

Performance Metrics

Character Error Rate (CER): 8.66%
Word Error Rate (WER): 25.50%
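
The card does not say how these rates were computed. As a rough guide to reading them, the sketch below uses the conventional definitions: the Levenshtein (edit) distance between the model output and the reference transcription, divided by the reference length in characters (CER) or whitespace-separated words (WER), expressed as a percentage. The function names are hypothetical, and any text normalization applied before scoring in the actual evaluation is unknown, so treat this as an assumption rather than the authors' evaluation script.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, as a percentage of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, as a percentage of reference words (split on whitespace)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because one wrong character makes the whole word count as an error, WER is normally well above CER on the same output, which matches the gap between the two figures reported here.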

Comparison with Other Models

We compared this model with Surya (https://github.com/VikParuchuri/surya), which reports high accuracy rates for Arabic but performs poorly on our Jawi data:

Character Error Rate (CER): 70.89%
Word Error Rate (WER): 91.73%

How to Use

# Example code for loading and using the model
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch
from qwen_vl_utils import process_vision_info
from PIL import Image

model_name = 'mevsg/qwen-for-jawi-v1'

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Use the appropriate torch dtype if needed
    device_map='auto'            # Optional: automatically allocate model layers across devices
)

# Load the processor from Hugging Face Hub
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Load the image to transcribe (replace with the path to your own image)
image_path = 'path/to/image'
image = Image.open(image_path).convert('RGB')

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
            },
            {"type": "text", "text": "Convert this image to text"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # Move inputs to the device chosen by device_map='auto'

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)

Citation

@misc{qwen-for-jawi-v1,
  title     = {Qwen for Jawi v1: a model for Jawi OCR},
  author    = {Miguel Escobar Varela},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mevsg/qwen-for-Jawi-v1},
  note      = {Model created at National University of Singapore}
}

Acknowledgements

Special thanks to William Mattingly, whose finetuning script served as the base for our finetuning approach: https://github.com/wjbmattingly/qwen2-vl-finetune-huggingface