---
language: is
license: apache-2.0
widget:
- text: "Yonum vjer að pað pví fremur fái góðar viðtökur, par sem svo lítur út, sem aldrei muni verða svo heiðskýrt á pessum vetri að „Noi'ðurljósið“ sjáist, eu paðan væntum vér allir skemmtunar."
---

# Details of ByT5 - Base 🧠

ByT5 is a tokenizer-free version of [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and generally follows the architecture of [MT5](https://huggingface.co/google/mt5-base). Instead of a learned subword vocabulary, it operates directly on UTF-8 bytes (a short encoding sketch is included in the appendix below).

ByT5 was only pre-trained on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual), excluding any supervised training, with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task.

ByT5 works especially well on noisy text data, *e.g.*, `google/byt5-base` significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on [TweetQA](https://arxiv.org/abs/1907.06292).

Paper: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)

Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel*

# Details of byt5-is-ocr-post-processing-old-texts

This model generates a revised version of a given Icelandic OCRed text. The model was trained with [simpleT5](https://github.com/Shivanandroy/simpleT5) on 900,000 lines (\~7,000,000 tokens), of which only 50,000 lines (\~400,000 tokens) came from real OCRed texts. The rest were extracted from [The Icelandic Gigaword Corpus](https://clarin.is/en/resources/gigaword/) and augmented with artificial errors (a toy augmentation sketch is included in the appendix below). It can be assumed that increasing the amount of real OCRed data would significantly improve the model.

For inference, it is recommended to feed the model one line (not necessarily a whole sentence, though) at a time.

# Usage

```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset

MODEL = 'atlijas/byt5-is-ocr-post-processing-old-texts'
correct_ocr = pipeline('text2text-generation', model=MODEL, tokenizer=MODEL, num_return_sequences=1)

# Load the OCRed file as a plain-text dataset: one line per example, in a 'text' column.
dataset = load_dataset('text', data_files='/path/to/my_ocred_file.txt')
lines = dataset['train']

# Feed the model one line at a time; KeyDataset streams the 'text' column through the pipeline.
for corrected in correct_ocr(KeyDataset(lines, 'text'), max_length=150, batch_size=32):
    print(corrected[0]['generated_text'])
```

# Evaluation results

The test set for this model consists of various Icelandic texts from the 19th and early 20th centuries. On it, the model achieves a chrF error rate reduction of 39.3%, with the original text scoring 94.6 and the post-processed one 96.7. It likewise achieves a proportional BLEU improvement of 51.6%, with the original text's BLEU score being 97.2 and the post-processed one's 98.6. The appendix below shows how these reduction figures relate to the raw scores.

# Acknowledgments

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by [Almannarómur](https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture.
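# Appendix

ByT5's "tokenizer" is a fixed byte-to-id mapping rather than a learned vocabulary: each UTF-8 byte maps to `byte value + 3`, with ids 0-2 reserved for the pad, EOS, and UNK tokens. A minimal sketch of that equivalence (not part of the original card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google/byt5-base')

text = "Norðurljósið"
# Each UTF-8 byte maps to id = byte + 3; ids 0, 1, and 2 are reserved for <pad>, </s>, and <unk>.
byte_ids = [b + 3 for b in text.encode('utf-8')]

assert tokenizer(text, add_special_tokens=False)['input_ids'] == byte_ids
```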
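The exact error-augmentation scheme applied to the Gigaword lines is not documented in this card; the sketch below is a hypothetical illustration, with invented confusion pairs that mirror the errors visible in the widget example (e.g. þ → p):

```python
import random

# Hypothetical OCR-style confusion pairs; the actual augmentation scheme is not described here.
CONFUSIONS = [('þ', 'p'), ('ð', 'd'), ('é', 'e'), ('ý', 'y'), ('í', 'i')]

def add_ocr_noise(line: str, p: float = 0.3) -> str:
    """Randomly corrupt a clean line with OCR-style character confusions."""
    for src, dst in CONFUSIONS:
        if src in line and random.random() < p:
            line = line.replace(src, dst)
    return line
```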
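The reduction figures in the evaluation section appear to be proportional, i.e. the share of the remaining error (100 minus the score) that post-processing removes. Recomputing from the rounded scores above gives values close to, but not exactly, the reported ones, which were presumably derived from unrounded scores:

```python
def proportional_improvement(before: float, after: float) -> float:
    """Percentage of the remaining error (100 - score) removed by post-processing."""
    return (after - before) / (100 - before) * 100

print(proportional_improvement(94.6, 96.7))  # chrF: ~38.9 (the card reports 39.3)
print(proportional_improvement(97.2, 98.6))  # BLEU: ~50.0 (the card reports 51.6)
```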