Italian OCR Error Correction Sequence-to-Sequence Model

Model Details

This model represents the first version of an experimental sequence-to-sequence architecture designed specifically for the Italian language. It aims to correct approximately 93% of the errors generated by low-quality Optical Character Recognition (OCR) systems, which tend to perform poorly on Italian text. By taking raw, OCR-scanned text as input, the model outputs the corrected version of the text, significantly reducing errors and improving readability and accuracy.

Intended Use

Primary Use: This model is intended for use in processing and correcting Italian text that has been digitized using OCR technology. It is particularly useful for texts scanned at low quality, where the OCR's error rate is noticeably high.
Users: It is designed for developers, researchers, and archivists working with Italian historical documents, books, and any digitized material where OCR errors are prevalent.

Limitations

While the model corrects approximately 93% of OCR errors, there may be certain types of errors or specific contexts where its performance could be lower.
The model is specifically trained on Italian text and may not perform well on texts in other languages or texts that include significant amounts of non-Italian languages.

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "DeepMount00/OCR_corrector"

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.to(device)
my_text = "In un'epca lontnaa, un re goernava le sue tere con saggez2a e giustiia. Sotot il suo regno, il rgeno prosperava e la getne era flice. Ma un gionro, un drgoa feroce attcò il regno, semniando ditruzione e paurra tra i suoi abtanti."
inputs = tokenizer(my_text, return_tensors="pt").to(device)
outputs = model.generate(input_ids=inputs['input_ids'],
               attention_mask=inputs['attention_mask'],
               num_beams=2, max_length=1050, top_k=10)
clean_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(clean_text)