---
language: is
license: apache-2.0
---

# Details of ByT5 - Base 🧠
ByT5 is a tokenizer-free version of [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and generally follows the architecture of [MT5](https://huggingface.co/google/mt5-base).

ByT5 was pre-trained only on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual), excluding any supervised training, with an average span mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task.

ByT5 works especially well on noisy text data, *e.g.*, `google/byt5-base` significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on [TweetQA](https://arxiv.org/abs/1907.06292).

Paper: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)

Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel*
# Details of byt5-is-ocr-post-processing-modern-texts

*Note: This model is almost identical to [atlijas/byt5-is-ocr-post-processing-old-texts](https://huggingface.co/atlijas/byt5-is-ocr-post-processing-old-texts/); the only difference is the number of training epochs.*

This model generates a revised version of a given OCRed Icelandic text. It was trained with [simpleT5](https://github.com/Shivanandroy/simpleT5) on 900,000 lines (\~7,000,000 tokens), of which only 50,000 (\~400,000 tokens) came from real OCRed texts. The rest were extracted from [The Icelandic Gigaword Corpus](https://clarin.is/en/resources/gigaword/) and augmented with artificial errors. It can be assumed that adding more real OCRed data would significantly improve the model.
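
For reference, below is a minimal sketch of what such a simpleT5 fine-tuning run could look like. The file names, data preparation, and hyperparameters are illustrative assumptions, not the exact configuration used for this model:

```python
import pandas as pd
from simplet5 import SimpleT5

# Hypothetical training data: one noisy OCRed line per row in 'source_text',
# its corrected counterpart in 'target_text' (the column names simpleT5 expects).
train_df = pd.read_csv('ocr_train.csv')
eval_df = pd.read_csv('ocr_eval.csv')

model = SimpleT5()
model.from_pretrained(model_type="byt5", model_name="google/byt5-base")

# Illustrative hyperparameters; byte-level models need longer maximum
# sequence lengths than token-level ones for the same amount of text.
model.train(
    train_df=train_df,
    eval_df=eval_df,
    source_max_token_len=256,
    target_max_token_len=256,
    batch_size=8,
    max_epochs=3,
    use_gpu=True,
)
```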
# Usage

```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset

# This card's model; the pipeline handles tokenization and generation.
MODEL = 'atlijas/byt5-is-ocr-post-processing-modern-texts'
correct_ocr = pipeline('text2text-generation', model=MODEL, tokenizer=MODEL, num_return_sequences=1)

# Load the OCRed file as a line-by-line text dataset (one example per line).
dataset = load_dataset('text', data_files='/path/to/my_ocred_file.txt')
lines = dataset['train']

# Run the correction in batches and print each revised line.
for corrected in correct_ocr(KeyDataset(lines, 'text'), max_length=150, batch_size=32):
    print(corrected[0]['generated_text'])
```
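
Since OCR post-processing usually targets whole files, a natural variation of the loop above is to write the corrected lines back to disk instead of printing them (the output file name is a placeholder):

```python
# Continues from the snippet above: write the corrected lines to a file.
with open('my_corrected_file.txt', 'w', encoding='utf-8') as out:
    for corrected in correct_ocr(KeyDataset(lines, 'text'), max_length=150, batch_size=32):
        out.write(corrected[0]['generated_text'] + '\n')
```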
# Evaluation results

The test set for this model consists of various Icelandic texts from the 1980s and 1990s. On it, the model achieves a chrF error rate reduction of 30.1%, with the original text scoring 95.2 and the processed text 96.7. It also achieves a proportional BLEU improvement of 19.8%, with the original text's BLEU score at 97.55 and the processed text's at 98.0.
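
As a rough sketch of how numbers like these can be reproduced, both metrics are available in [sacreBLEU](https://github.com/mjpost/sacrebleu), and the reported reductions are consistent with measuring how much of the remaining gap to a perfect score of 100 the model closes. The file names below are placeholders, and the exact evaluation setup is an assumption:

```python
import sacrebleu

# Placeholder files: one line per example, aligned across the three files.
ocr = open('ocr.txt', encoding='utf-8').read().splitlines()       # raw OCR output
fixed = open('fixed.txt', encoding='utf-8').read().splitlines()   # model output
gold = open('gold.txt', encoding='utf-8').read().splitlines()     # reference text

for name, metric in [('chrF', sacrebleu.corpus_chrf), ('BLEU', sacrebleu.corpus_bleu)]:
    before = metric(ocr, [gold]).score
    after = metric(fixed, [gold]).score
    # Share of the gap to a perfect score that the model closes,
    # e.g. chrF going from 95.2 to 96.7 closes roughly 30% of the 4.8-point gap.
    reduction = 100 * (after - before) / (100 - before)
    print(f'{name}: {before:.2f} -> {after:.2f} ({reduction:.1f}% relative improvement)')
```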
# Acknowledgments

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by [Almannarómur](https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture.