---
language: en
datasets:
- wikitext
---

# ByT5 base English fine tuned for OCR Correction

This model is a fine-tuned version of [byt5-base](https://huggingface.co/google/byt5-base) for OCR correction. ByT5 was introduced in [this paper](https://arxiv.org/abs/2105.13626), and the idea and code for fine-tuning the model for OCR correction were taken from [here](https://blog.ml6.eu/ocr-correction-with-byt5-5994d1217c07).

## Model description

byt5-base-english-ocr-correction takes the byt5-base model and fine-tunes it on an OCR correction dataset. It is trained to take an input sentence that an OCR model has transcribed incorrectly and to output a sentence with those errors corrected.

The training data was built by taking the [wikitext dataset](https://huggingface.co/datasets/wikitext) and adding synthetic OCR errors with [nlpaug](https://github.com/makcedward/nlpaug).

## Intended uses & limitations

You can use the model for text-to-text generation to remove errors introduced by an OCR model.

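
The way the training pairs were built (clean text on one side, the same text with synthetic OCR errors on the other) can be illustrated with a small dependency-free sketch. The confusion table and the `add_ocr_noise` and `make_pair` helpers are hypothetical and only stand in for nlpaug's `OcrAug`; they are not the code used to train this model, and the `char_p` and `word_p` parameters only loosely mirror `aug_char_p` and `aug_word_p`.

```python
import random

# Hypothetical confusion table: visually similar characters that OCR engines
# tend to mix up. This is an illustration, not nlpaug's actual mapping.
CONFUSIONS = {
    "0": "O", "O": "0",
    "1": "l", "l": "1",
    "5": "S", "S": "5",
    "i": "1", "e": "c", "o": "0",
}


def add_ocr_noise(text, char_p=0.4, word_p=0.6, seed=None):
    """Return a noisy copy of `text`; word boundaries and length are preserved."""
    rng = random.Random(seed)
    noisy_words = []
    for word in text.split(" "):
        if rng.random() < word_p:
            # Corrupt this word: swap each confusable character with probability char_p.
            word = "".join(
                CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < char_p else ch
                for ch in word
            )
        noisy_words.append(word)
    return " ".join(noisy_words)


def make_pair(clean_text, seed=None):
    """Build one (model input, target) pair: noisy text in, clean text out."""
    return add_ocr_noise(clean_text, seed=seed), clean_text


noisy, clean = make_pair("Life is like a box of chocolates", seed=0)
print(noisy, "->", clean)
```

Because every substitution is one character for one character, the noisy and clean strings stay the same length here. Real OCR output can also drop, merge, or split characters, which this sketch does not model.
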
### How to use

The first example computes the training loss for a single pair of corrupted and clean sentences, encoding the text directly as UTF-8 bytes without a tokenizer:

```python
from transformers import T5ForConditionalGeneration
import torch
import nlpaug.augmenter.char as nac

# Simulate OCR errors on a clean sentence
aug = nac.OcrAug(aug_char_p=0.4, aug_word_p=0.6)
corrected_text = "Life is like a box of chocolates"
augmented_text = aug.augment(corrected_text)
if isinstance(augmented_text, list):  # some nlpaug versions return a list
    augmented_text = augmented_text[0]

model = T5ForConditionalGeneration.from_pretrained('yelpfeast/byt5-base-english-ocr-correction')

# ByT5 works on raw bytes; ids 0-2 are reserved for special tokens, so shift by 3
input_ids = torch.tensor([list(augmented_text.encode("utf-8"))]) + 3
labels = torch.tensor([list(corrected_text.encode("utf-8"))]) + 3

loss = model(input_ids, labels=labels).loss  # forward pass
```

The second example uses the tokenizer and `generate` to correct a corrupted sentence:

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer
import nlpaug.augmenter.char as nac

# Simulate OCR errors on a clean sentence
aug = nac.OcrAug(aug_char_p=0.4, aug_word_p=0.6)
corrected_text = "Life is like a box of chocolates"
augmented_text = aug.augment(corrected_text)
print(augmented_text)

model = T5ForConditionalGeneration.from_pretrained('yelpfeast/byt5-base-english-ocr-correction')
tokenizer = AutoTokenizer.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")

inputs = tokenizer(augmented_text, return_tensors="pt", padding=True)
output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False,  # disable sampling so the output is deterministic
    max_length=256,   # one token per byte; the library default may cut the output short
)
print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
```

### Limitations

The model was trained on text that was artificially corrupted to resemble OCR errors. Real OCR engines may produce different error patterns, so the model may not fully correct their output.
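
Because the synthetic errors may not match those of a particular OCR engine, it can help to check the model on a small hand-corrected sample before relying on it. One common way to do that is the character error rate (CER), computed before and after correction. The sketch below is a minimal, dependency-free version; the `levenshtein` and `cer` helpers and the example strings are illustrative assumptions, not part of this model or its training code.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def cer(hypothesis, reference):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


reference = "Life is like a box of chocolates"     # hand-corrected ground truth
ocr_output = "L1fe is 1ike a b0x of chocolates"    # hypothetical raw OCR text
model_output = "Life is like a box of chocolates"  # hypothetical model correction

print(f"CER before correction: {cer(ocr_output, reference):.3f}")
print(f"CER after correction:  {cer(model_output, reference):.3f}")
```

If the CER after correction is not clearly lower than before on your own data, the synthetic error distribution is probably a poor match for your OCR engine.
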