---
language: is
license: apache-2.0
widget:
- text: "Yonum vjer að pað pví fremur fái góðar viðtökur, par sem svo lítur út, sem aldrei muni verða svo heiðskýrt á pessum vetri að „Noi'ðurljósið“ sjáist, eu paðan væntum vér allir skemmtunar."
---

# Details of ByT5 - Base 🧠
ByT5 is a tokenizer-free version of [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and generally follows the architecture of [MT5](https://huggingface.co/google/mt5-base).
ByT5 was pre-trained only on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual), with no supervised training, using an average span mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task.
ByT5 works especially well on noisy text data, *e.g.*, `google/byt5-base` significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on [TweetQA](https://arxiv.org/abs/1907.06292).
Paper: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)
Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel* 

# Details of byt5-is-ocr-post-processing-old-texts
This model generates a revised version of a given Icelandic OCRed text. The model was trained with [simpleT5](https://github.com/Shivanandroy/simpleT5) on 900,000 lines (\~7,000,000 tokens), of which only 50,000 lines (\~400,000 tokens) came from real OCRed texts. The rest were extracted from [The Icelandic Gigaword Corpus](https://clarin.is/en/resources/gigaword/) and augmented with artificial errors. Increasing the amount of real OCRed training data would likely improve the model significantly.
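
The exact augmentation procedure is not documented here, so the following is only a hypothetical sketch of the general idea: clean Gigaword lines are corrupted with OCR-like character confusions to build (noisy, clean) training pairs. The confusion table and error rate below are illustrative assumptions, not the recipe actually used for this model.

```python
import random

# Hypothetical OCR-style character confusions seen in old Icelandic prints
# (e.g. þ -> p, é -> je); NOT the actual table used to train this model.
CONFUSIONS = [('þ', 'p'), ('Þ', 'P'), ('é', 'je'), ('ö', 'o'), ('ð', 'd')]

def add_artificial_errors(line, error_rate=0.3):
    """Corrupt a clean line with OCR-like substitutions to mimic noisy OCR output."""
    chars = []
    for ch in line:
        for src, dst in CONFUSIONS:
            if ch == src and random.random() < error_rate:
                chars.append(dst)
                break
        else:
            chars.append(ch)
    return ''.join(chars)

clean = 'Vonum vér að það því fremur fái góðar viðtökur.'
print(add_artificial_errors(clean), '->', clean)
```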

For inference, it is recommended to feed the model one line at a time (the lines do not need to be whole sentences).

# Usage
```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset


MODEL = 'atlijas/byt5-is-ocr-post-processing-old-texts'
correct_ocr = pipeline('text2text-generation', model=MODEL, tokenizer=MODEL, num_return_sequences=1)

# Load the OCRed file as a plain-text dataset (one line per example, in the 'text' column).
dataset = load_dataset('text', data_files='/path/to/my_ocred_file.txt')
lines = dataset['train']

# Feed the model one line at a time; batching is handled by the pipeline.
for corrected in correct_ocr(KeyDataset(lines, 'text'), max_length=150, batch_size=32):
    print(corrected[0]['generated_text'])
```
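
If you only need to correct a single line, the pipeline can also be called directly on a string. A minimal sketch, reusing the `correct_ocr` pipeline loaded above and the widget text from the top of this card as input:

```python
# Correct a single OCRed line; the pipeline returns a list of generated sequences.
line = "Yonum vjer að pað pví fremur fái góðar viðtökur, par sem svo lítur út, sem aldrei muni verða svo heiðskýrt á pessum vetri að „Noi'ðurljósið“ sjáist, eu paðan væntum vér allir skemmtunar."
result = correct_ocr(line, max_length=150)
print(result[0]['generated_text'])
```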

# Evaluation results
The test set for this model consists of various Icelandic texts from the 19th and early 20th century. On it, the model achieves a relative chrF error rate reduction of 39.3%: the original text scores 94.6 and the post-processed text 96.7. It also achieves a proportional BLEU improvement of 51.6%, with the original text's BLEU score at 97.2 and the post-processed text's at 98.6.
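
For reference, these relative figures measure how much of the remaining gap to a perfect score of 100 is closed by post-processing. The sketch below assumes that definition (the reported 39.3% and 51.6% presumably come from unrounded scores, so the rounded inputs give slightly different values):

```python
# Share of the gap to a perfect score (100) closed by post-processing.
def relative_improvement(original, processed, perfect=100.0):
    return (processed - original) / (perfect - original)

print(f"chrF: {relative_improvement(94.6, 96.7):.1%}")  # ~38.9% with rounded scores
print(f"BLEU: {relative_improvement(97.2, 98.6):.1%}")  # ~50.0% with rounded scores
```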

# Acknowledgments
This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by [Almannarómur](https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture.