|
--- |
|
language: is |
|
license: apache-2.0 |
|
widget: |
|
- text: "^Fyrsta bam ársins fæddist á Landspítalanum kl. 3.30 á nýársnótt." |
|
--- |
|
|
|
# Details of ByT5 - Base 🧠 |
|
ByT5 is a tokenizer-free version of [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and generally follows the architecture of [MT5](https://huggingface.co/google/mt5-base). |
|
ByT5 was only pre-trained on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual) excluding any supervised training with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is useable on a downstream task. |
|
ByT5 works especially well on noisy text data, *e.g.*, `google/byt5-base` significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on [TweetQA](https://arxiv.org/abs/1907.06292).
|
Paper: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)
|
Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel* |
|
|
|
# Details of byt5-is-ocr-post-processing-modern-texts |
|
*Note: This model is almost the same as [atlijas/byt5-is-ocr-post-processing-old-texts](https://huggingface.co/atlijas/byt5-is-ocr-post-processing-old-texts/). The only difference is the number of training epochs.*
|
This model generates a revised version of a given Icelandic OCRed text. The model was trained with [simpleT5](https://github.com/Shivanandroy/simpleT5) on 900,000 lines (\~7,000,000 tokens), of which only 50,000 (\~400,000 tokens) came from real OCRed texts. The rest were extracted from [The Icelandic Gigaword Corpus](https://clarin.is/en/resources/gigaword/) and augmented with artificial errors. It can be assumed that increasing the amount of real OCRed data would significantly improve the model.
|
|
|
For inference, it is recommended to feed the model one line (not necessarily whole sentences, though) at a time. |
|
|
|
# Usage |
|
```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset

MODEL = 'atlijas/byt5-is-ocr-post-processing-modern-texts'
correct_ocr = pipeline('text2text-generation', model=MODEL, tokenizer=MODEL, num_return_sequences=1)

# Load the OCRed file as a line-by-line text dataset.
dataset = load_dataset('text', data_files='/path/to/my_ocred_file.txt')
lines = dataset['train']

for corrected in correct_ocr(KeyDataset(lines, 'text'), max_length=150, batch_size=32):
    print(corrected[0]['generated_text'])
```
|
|
|
# Evaluation results |
|
The test set for this model consists of various Icelandic texts from the 1980s and 1990s. On it, the model achieves a chrF error rate reduction of 30.1%, with the original text's score being 95.2 and the processed one's 96.7. The model achieves a proportional BLEU improvement of 19.8%, with the original text's BLEU score being 97.55 and the processed one's 98.0.
|
|
|
# Acknowledgments |
|
This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by [Almannarómur](https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture. |