---
library_name: transformers
license: apache-2.0
language:
  - it
---

# Model Card for Italian OCR Error Correction Sequence-to-Sequence Model

## Model Details

This model is the first version of an experimental sequence-to-sequence architecture designed specifically for Italian. It aims to correct approximately 93% of the errors produced by Optical Character Recognition (OCR) systems on low-quality scans, where recognition of Italian text is often poor. Given raw OCR output as input, the model returns a corrected version of the text, significantly reducing errors and improving readability and accuracy.

## Intended Use

- **Primary Use:** This model is intended for processing and correcting Italian text that has been digitized using OCR technology. It is particularly useful for texts scanned at low quality, where the OCR error rate is noticeably high.
- **Users:** It is designed for developers, researchers, and archivists working with Italian historical documents, books, and any digitized material where OCR errors are prevalent.

## Training Data

The model was trained on a diverse dataset of Italian texts, which includes a wide range of sources such as books, newspapers, and documents that have been digitized using various OCR systems. This dataset was specifically curated to include examples with common OCR errors observed in Italian texts, allowing the model to learn and correct these mistakes effectively.

## Model Architecture

The model is based on a sequence-to-sequence framework, leveraging the latest advancements in natural language processing to understand and correct text at the character and word levels. It incorporates attention mechanisms to focus on error-prone areas in the text, ensuring high accuracy in the correction output.

## Limitations

- While the model corrects approximately 93% of OCR errors, there may be certain types of errors or specific contexts where its performance is lower.
- The model is trained specifically on Italian text and may not perform well on texts in other languages, or on texts that include significant amounts of non-Italian content.

## How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "DeepMount00/OCR_corrector"

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.to(device)

my_text = ""  # insert the raw OCR output you want to correct
inputs = tokenizer(my_text, return_tensors="pt").to(device)
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    num_beams=2,
    max_length=1050,
    top_k=10,
)
clean_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(clean_text)
```
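Since `generate` is capped at `max_length=1050` tokens, long scanned documents may need to be split into smaller pieces and corrected one at a time. The helper below is an illustrative sketch (the `chunk_text` function and the 800-character budget are not part of this model card): it splits text at sentence boundaries so each piece stays within a rough size budget before being passed to the model.

```python
import re


def chunk_text(text: str, max_chars: int = 800) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    breaking at sentence boundaries where possible."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be run through the tokenizer and `model.generate` as in the snippet above, and the corrected pieces joined back together. Note that a single sentence longer than `max_chars` is kept whole rather than split mid-sentence, so very long sentences may still need special handling.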