|
--- |
|
language: is |
|
license: apache-2.0 |
|
widget: |
|
- text: "^Fyrsta bam ársins fæddist á Landspítalanum kl. 3.30 á nýársnótt." |
|
--- |
|
|
|
# Details of ByT5 - Base 🧠 |
|
ByT5 is a tokenizer-free version of [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and generally follows the architecture of [MT5](https://huggingface.co/google/mt5-base). |
|
ByT5 was only pre-trained on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual) excluding any supervised training with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is useable on a downstream task. |
|
ByT5 works especially well on noisy text data, *e.g.*, `google/byt5-base` significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on [TweetQA](https://arxiv.org/abs/1907.06292).
|
Paper: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626)
|
Authors: *Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel* |
|
|
|
# Details of byt5-is-ocr-post-processing-modern-texts |
|
*Note: This model is almost the same as [atlijas/byt5-is-ocr-post-processing-old-texts](https://huggingface.co/atlijas/byt5-is-ocr-post-processing-old-texts/). The only difference is the number of training epochs.*
|
This model generates a revised version of a given Icelandic OCRed text. The model was trained with [simpleT5](https://github.com/Shivanandroy/simpleT5) on 900,000 lines (\~7,000,000 tokens), of which only 50,000 (\~400,000 tokens) came from real OCRed texts. The rest were extracted from [The Icelandic Gigaword Corpus](https://clarin.is/en/resources/gigaword/) and augmented with artificial errors. It can be assumed that increasing the amount of real OCRed data would significantly improve the model.
|
|
|
For inference, it is recommended to feed the model one line (not necessarily whole sentences, though) at a time. |
|
|
|
# Usage |
|
```python
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset

MODEL = 'atlijas/byt5-is-ocr-post-processing-modern-texts'
correct_ocr = pipeline('text2text-generation', model=MODEL, tokenizer=MODEL, num_return_sequences=1)

# Load the OCRed file as a line-by-line text dataset.
dataset = load_dataset('text', data_files='/path/to/my_ocred_file.txt')
lines = dataset['train']

for corrected in correct_ocr(KeyDataset(lines, 'text'), max_length=150, batch_size=32):
    print(corrected[0]['generated_text'])
```
|
|
|
# Evaluation results |
|
The test set for this model consists of various Icelandic texts from the 1980s and 1990s. On it, the model achieves a chrF error rate reduction of 30.1%, with the original text's score being 95.2 and the processed one's 96.7. The model achieves a proportional BLEU improvement of 19.8%, with the original text's BLEU score being 97.55 and the processed one's 98.0.
|
|
|
# Acknowledgments |
|
This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by [Almannarómur](https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture. |