
Details of ByT5 - Base 🧠

ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5. ByT5 was only pre-trained on mC4, excluding any supervised training, with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task. ByT5 works especially well on noisy text data, e.g., google/byt5-base significantly outperforms mt5-base on TweetQA.

Paper: ByT5: Towards a token-free future with pre-trained byte-to-byte models

Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

Details of byt5-is-ocr-post-processing-modern-texts

Note: This model is almost the same as atlijas/byt5-is-ocr-post-processing-old-texts. The only difference is the number of epochs trained.
This model generates a revised version of a given Icelandic OCRed text. The model was trained with simpleT5 on 900.000 lines (~7.000.000 tokens), of which only 50.000 (~400.000 tokens) came from real OCRed texts. The rest were extracted from The Icelandic Gigaword Corpus and augmented with artificial errors. It can be assumed that increasing the amount of real OCRed data would significantly improve the model.
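The card does not describe how the artificial errors were generated. As a rough illustration only, a minimal character-level corruption scheme might look like the sketch below; the confusion pairs and probabilities are assumptions, not the actual augmentation recipe used for this model.

import random

# Hypothetical OCR-style confusions; the real recipe is not documented here.
CONFUSIONS = {'í': 'i', 'ð': 'd', 'þ': 'p', 'rn': 'm', 'l': '1'}

def add_ocr_noise(line, sub_prob=0.02, del_prob=0.01):
    """Corrupt a clean line with OCR-like substitutions and deletions."""
    out = []
    i = 0
    while i < len(line):
        # Apply a two-character confusion (e.g. 'rn' -> 'm') if it matches here.
        pair = line[i:i + 2]
        if pair in CONFUSIONS and random.random() < sub_prob:
            out.append(CONFUSIONS[pair])
            i += 2
            continue
        ch = line[i]
        if random.random() < del_prob:
            pass  # drop the character entirely
        elif ch in CONFUSIONS and random.random() < sub_prob:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
        i += 1
    return ''.join(out)

print(add_ocr_noise('Þetta er dæmi um íslenskan texta.'))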

For inference, it is recommended to feed the model one line (not necessarily whole sentences, though) at a time.

Usage

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset


# The model described by this card; the sibling model trained on old texts is
# atlijas/byt5-is-ocr-post-processing-old-texts.
MODEL = 'atlijas/byt5-is-ocr-post-processing-modern-texts'
correct_ocr = pipeline('text2text-generation', model=MODEL, tokenizer=MODEL, num_return_sequences=1)

# Load the OCRed file as a plain-text dataset (one line per example).
dataset = load_dataset('text', data_files='/path/to/my_ocred_file.txt')
lines = dataset['train']
file_length = len(lines)

# Correct the lines in batches and print the post-processed text.
for corrected in correct_ocr(KeyDataset(lines, 'text'), max_length=150, batch_size=32):
    print(corrected[0]['generated_text'])
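For a quick check without building a dataset, the same pipeline can also be called on a single string; the example line below is made up.

sample = 'Þessi setning var lesin með OCR og inniheldur vi1lur.'
print(correct_ocr(sample, max_length=150)[0]['generated_text'])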

Evaluation results

The test set for this model consists of various Icelandic texts from the 1980s and 1990s. On it, the model achieves a chrF error rate reduction of 30.1%, with the original text scoring 95.2 and the post-processed text 96.7. The model achieves a proportional BLEU improvement of 19.8%, with the original text's BLEU score being 97.55 and the post-processed text's 98.0.
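The exact scoring script is not included in the card. Assuming sacreBLEU-style corpus metrics, scores like the ones above could be reproduced roughly as in the sketch below; the file names are placeholders, with one line per example in each file.

import sacrebleu

# Placeholder file names: raw OCR output, model output, and gold references.
with open('ocr_output.txt') as f:
    ocr_lines = [l.rstrip('\n') for l in f]
with open('model_output.txt') as f:
    corrected_lines = [l.rstrip('\n') for l in f]
with open('gold.txt') as f:
    gold_lines = [l.rstrip('\n') for l in f]

refs = [gold_lines]  # sacrebleu expects a list of reference streams
print('chrF before:', sacrebleu.corpus_chrf(ocr_lines, refs).score)
print('chrF after: ', sacrebleu.corpus_chrf(corrected_lines, refs).score)
print('BLEU before:', sacrebleu.corpus_bleu(ocr_lines, refs).score)
print('BLEU after: ', sacrebleu.corpus_bleu(corrected_lines, refs).score)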

Acknowledgments

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.
