---
license: mit
language:
- de
metrics:
- bleu
pipeline_tag: text2text-generation
tags:
- ByT5
- historical
- t5
- ocr-correction
---

Fine-tuned version of [hmByT5](https://huggingface.co/hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax) on the DE1, DE2, DE3 and DE7 parts of the [ICDAR2019-POCR](https://drive.google.com/file/d/1wOhmsoxOVQEPgHSX1QrYWKg5XAdYkzwi/view) dataset to correct OCR mistakes. The `max_length` was set to 350.

## Performance

```
SacreBLEU eval dataset: 10.83
SacreBLEU eval model: 72.35
```

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Example line of historical German text containing OCR errors
example_sentence = "Anvpreiſungq. Haupidepot für Wien: In der Stadt, obere Bräunerſtraße Nr. 1137 in der Varfüͤmerie-Handlung zur"

tokenizer = AutoTokenizer.from_pretrained("Var3n/hmByT5_anno")
model = AutoModelForSeq2SeqLM.from_pretrained("Var3n/hmByT5_anno")

# ByT5 works on bytes, so the number of input ids is roughly the UTF-8 byte length
input_ids = tokenizer(example_sentence, return_tensors="pt").input_ids

# Allow as many new tokens as there are input tokens.
# num_beams=4 with do_sample=True samples during beam search, so results can vary between runs.
output = model.generate(input_ids, max_new_tokens=len(input_ids[0]), num_beams=4, do_sample=True)

text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```
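
Because `max_length` was set to 350 and ByT5 tokenizes text as UTF-8 bytes, longer passages may need to be split before correction. The following is a minimal sketch, not part of the original training or evaluation code. The helper name `split_by_bytes` and the default limit of 349 bytes (350 minus one position for the end-of-sequence token) are assumptions made for illustration.

```python
def split_by_bytes(text: str, max_bytes: int = 349) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most
    max_bytes UTF-8 bytes (hypothetical helper, not from the model repo)."""
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than max_bytes is kept whole, not cut.
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed through the tokenizer and `model.generate` as in the usage example, and the corrected chunks joined with spaces. Splitting on whitespace keeps words intact, but it discards original line breaks, so it may not suit layouts where those matter.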