---
language: is
license: apache-2.0
widget:
- text: "^Fyrsta bam ársins fæddist á Landspítalanum kl. 3.30 á nýársnótt."
---

# Details of ByT5 - Base 🧠

ByT5 is a tokenizer-free version of [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and generally follows the architecture of [MT5](https://huggingface.co/google/mt5-base).

ByT5 was only pre-trained on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual), excluding any supervised training, with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task.

ByT5 works especially well on noisy text data, *e.g.*, `google/byt5-base` significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on [TweetQA](https://arxiv.org/abs/1907.06292).

Paper: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)

Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel*

# Details of byt5-is-ocr-post-processing-modern-texts

*Note: This model is almost identical to [atlijas/byt5-is-ocr-post-processing-old-texts](https://huggingface.co/atlijas/byt5-is-ocr-post-processing-old-texts/). The only difference is the number of training epochs.*

This model generates a corrected version of a given Icelandic OCRed text. It was trained with [simpleT5](https://github.com/Shivanandroy/simpleT5) on 900,000 lines (~7,000,000 tokens), of which only 50,000 lines (~400,000 tokens) came from real OCRed texts. The rest were extracted from [The Icelandic Gigaword Corpus](https://clarin.is/en/resources/gigaword/) and augmented with artificial errors (an illustrative sketch of one possible corruption scheme is given in the appendix at the end of this card). Increasing the amount of real OCRed data would likely improve the model significantly.

For inference, it is recommended to feed the model one line at a time (the lines do not need to be whole sentences).

# Usage

```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset

MODEL = 'atlijas/byt5-is-ocr-post-processing-modern-texts'
correct_ocr = pipeline('text2text-generation', model=MODEL, tokenizer=MODEL, num_return_sequences=1)

# Load a plain-text file with one OCRed line per row; each row ends up in the 'text' column.
dataset = load_dataset('text', data_files='/path/to/my_ocred_file.txt')
lines = dataset['train']

# Feed the model one line at a time (batched under the hood for speed).
for corrected in correct_ocr(KeyDataset(lines, 'text'), max_length=150, batch_size=32):
    print(corrected[0]['generated_text'])
```

# Evaluation results

The test set for this model consists of various Icelandic texts from the 1980s and 1990s. On it, the model achieves a chrF error rate reduction of 30.1%, with the original text scoring 95.2 and the post-processed text 96.7. The model achieves a proportional BLEU improvement of 19.8%, with the original text's BLEU score at 97.55 and the post-processed text's at 98.0. (An illustrative sketch of how such scores can be computed is given in the appendix at the end of this card.)

# Acknowledgments

This project was funded by the Language Technology Programme for Icelandic 2019–2023. The programme, which is managed and coordinated by [Almannarómur](https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture.
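# Appendix: artificial error augmentation (illustrative)

The training data described above pairs clean Gigaword lines with artificially corrupted copies. The exact corruption scheme is not documented in this card; the following is a minimal sketch, assuming character-level substitutions between visually confusable glyphs plus occasional deletions. The confusion pairs, the rates, and the `add_ocr_noise` helper are all hypothetical.

```python
import random

# Hypothetical confusion pairs of visually similar glyphs. 'rn' -> 'm'
# reproduces the 'barn' -> 'bam' error seen in the widget example above.
SUBSTITUTIONS = [('rn', 'm'), ('í', 'i'), ('ð', 'o'), ('l', '1'), ('é', 'e')]

def add_ocr_noise(line, sub_rate=0.3, del_rate=0.01):
    """Corrupt a clean line so it resembles noisy OCR output."""
    # Replace each occurrence of a confusable sequence independently
    # with probability sub_rate.
    for src, tgt in SUBSTITUTIONS:
        parts = line.split(src)
        rebuilt = parts[0]
        for part in parts[1:]:
            rebuilt += (tgt if random.random() < sub_rate else src) + part
        line = rebuilt
    # Occasionally drop a character outright.
    return ''.join(ch for ch in line if random.random() >= del_rate)

clean = 'Fyrsta barn ársins fæddist á Landspítalanum.'
print(add_ocr_noise(clean))  # e.g. 'Fyrsta bam ársins fæddist á Landspita1anum.'
```

The resulting (noisy, clean) pairs would then serve as (input, target) examples for fine-tuning.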
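# Appendix: computing the evaluation metrics (illustrative)

The chrF and BLEU figures above compare the raw OCR output and the model's output against manually corrected references. This is a minimal sketch, assuming [sacrebleu](https://github.com/mjpost/sacrebleu) as the scorer and one line per file; the file names are placeholders, and this is not necessarily the script behind the reported numbers.

```python
import sacrebleu

def read_lines(path):
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in f]

ocr_lines = read_lines('ocr_output.txt')          # raw OCR text (placeholder path)
corrected_lines = read_lines('model_output.txt')  # model's post-processed text
references = read_lines('gold.txt')               # manually corrected references

# chrF before and after post-processing (sacrebleu reports it on a 0-100 scale).
chrf_before = sacrebleu.corpus_chrf(ocr_lines, [references]).score
chrf_after = sacrebleu.corpus_chrf(corrected_lines, [references]).score

# "Error rate reduction": the share of the remaining chrF error (100 - score)
# that the post-processing removes.
reduction = ((100 - chrf_before) - (100 - chrf_after)) / (100 - chrf_before) * 100
print(f'chrF {chrf_before:.1f} -> {chrf_after:.1f} ({reduction:.1f}% error rate reduction)')

# BLEU can be compared in the same way.
bleu_before = sacrebleu.corpus_bleu(ocr_lines, [references]).score
bleu_after = sacrebleu.corpus_bleu(corrected_lines, [references]).score
print(f'BLEU {bleu_before:.2f} -> {bleu_after:.2f}')
```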