---
language: is
license: apache-2.0
---

# Details of ByT5 - Base 🧠
ByT5 is a tokenizer-free version of [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and generally follows the architecture of [MT5](https://huggingface.co/google/mt5-base).

ByT5 was pre-trained only on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual), excluding any supervised training, with an average span mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task.

ByT5 works especially well on noisy text data, *e.g.*, `google/byt5-base` significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on [TweetQA](https://arxiv.org/abs/1907.06292).

Paper: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)

Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel*
# Details of byt5-is-ocr-post-processing-modern-texts

*Note: This model is almost identical to [atlijas/byt5-is-ocr-post-processing-old-texts](https://huggingface.co/atlijas/byt5-is-ocr-post-processing-old-texts/); the only difference is the number of training epochs.*

This model generates a revised version of a given OCRed Icelandic text. It was trained with [simpleT5](https://github.com/Shivanandroy/simpleT5) on 900,000 lines (\~7,000,000 tokens), of which only 50,000 (\~400,000 tokens) came from real OCRed texts. The rest were extracted from [The Icelandic Gigaword Corpus](https://clarin.is/en/resources/gigaword/) and augmented with artificial errors. It can be assumed that adding more real OCRed data would significantly improve the model.
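
For reference, below is a minimal sketch of what such a simpleT5 fine-tuning run could look like. The file names, data preparation, and hyperparameters are illustrative assumptions, not the exact configuration used for this model:

```python
import pandas as pd
from simplet5 import SimpleT5

# Hypothetical training data: one noisy OCRed line per row in 'source_text',
# its corrected counterpart in 'target_text' (the column names simpleT5 expects).
train_df = pd.read_csv('ocr_train.csv')
eval_df = pd.read_csv('ocr_eval.csv')

model = SimpleT5()
model.from_pretrained(model_type="byt5", model_name="google/byt5-base")

# Illustrative hyperparameters; byte-level models need longer maximum
# sequence lengths than token-level ones for the same amount of text.
model.train(
    train_df=train_df,
    eval_df=eval_df,
    source_max_token_len=256,
    target_max_token_len=256,
    batch_size=8,
    max_epochs=3,
    use_gpu=True,
)
```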
# Usage

```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset

# This card's model; the pipeline handles tokenization and generation.
MODEL = 'atlijas/byt5-is-ocr-post-processing-modern-texts'
correct_ocr = pipeline('text2text-generation', model=MODEL, tokenizer=MODEL, num_return_sequences=1)

# Load the OCRed file as a line-by-line text dataset (one example per line).
dataset = load_dataset('text', data_files='/path/to/my_ocred_file.txt')
lines = dataset['train']

# Run the correction in batches and print each revised line.
for corrected in correct_ocr(KeyDataset(lines, 'text'), max_length=150, batch_size=32):
    print(corrected[0]['generated_text'])
```
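
Since OCR post-processing usually targets whole files, a natural variation of the loop above is to write the corrected lines back to disk instead of printing them (the output file name is a placeholder):

```python
# Continues from the snippet above: write the corrected lines to a file.
with open('my_corrected_file.txt', 'w', encoding='utf-8') as out:
    for corrected in correct_ocr(KeyDataset(lines, 'text'), max_length=150, batch_size=32):
        out.write(corrected[0]['generated_text'] + '\n')
```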
# Evaluation results

The test set for this model consists of various Icelandic texts from the 1980s and 1990s. On it, the model achieves a chrF error rate reduction of 30.1%, with the original text scoring 95.2 and the processed text 96.7. It also achieves a proportional BLEU improvement of 19.8%, with the original text's BLEU score at 97.55 and the processed text's at 98.0.
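
As a rough sketch of how numbers like these can be reproduced, both metrics are available in [sacreBLEU](https://github.com/mjpost/sacrebleu), and the reported reductions are consistent with measuring how much of the remaining gap to a perfect score of 100 the model closes. The file names below are placeholders, and the exact evaluation setup is an assumption:

```python
import sacrebleu

# Placeholder files: one line per example, aligned across the three files.
ocr = open('ocr.txt', encoding='utf-8').read().splitlines()       # raw OCR output
fixed = open('fixed.txt', encoding='utf-8').read().splitlines()   # model output
gold = open('gold.txt', encoding='utf-8').read().splitlines()     # reference text

for name, metric in [('chrF', sacrebleu.corpus_chrf), ('BLEU', sacrebleu.corpus_bleu)]:
    before = metric(ocr, [gold]).score
    after = metric(fixed, [gold]).score
    # Share of the gap to a perfect score that the model closes,
    # e.g. chrF going from 95.2 to 96.7 closes roughly 30% of the 4.8-point gap.
    reduction = 100 * (after - before) / (100 - before)
    print(f'{name}: {before:.2f} -> {after:.2f} ({reduction:.1f}% relative improvement)')
```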
# Acknowledgments

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by [Almannarómur](https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture.