metadata
license: mit
language:
- de
metrics:
- bleu
pipeline_tag: text2text-generation
tags:
- ByT5
- historical
- t5
- ocr-correction
Fine-tuned version of hmByT5, trained on the DE1, DE2, DE3 and DE7 parts of the ICDAR2019-POCR dataset to correct OCR mistakes in historical German text. The max_length was set to 350.
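ByT5 models tokenize text as raw UTF-8 bytes, one token per byte plus an end-of-sequence token, so a max_length of 350 corresponds to roughly 350 bytes of input rather than 350 characters. Historical German characters such as "ſ" or "ü" take two bytes each. The following is a minimal sketch of how longer OCR text could be split before correction. The helper name `split_by_byte_budget` and the default budget of 349 bytes (350 minus one end-of-sequence token) are assumptions for illustration, not part of this model's published code.

```python
from typing import List


def split_by_byte_budget(text: str, max_bytes: int = 349) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most
    max_bytes UTF-8 bytes (hypothetical helper, not part of the model repo).

    A single word longer than max_bytes is kept as its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for word in text.split():
        word_len = len(word.encode("utf-8"))
        # One extra byte for the joining space if the chunk already has words.
        extra = word_len + (1 if current else 0)
        if current and current_len + extra > max_bytes:
            chunks.append(" ".join(current))
            current, current_len = [word], word_len
        else:
            current.append(word)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


# Small budget chosen only to make the splitting visible.
sample = "Anvpreiſungq. Haupidepot für Wien"
chunks = split_by_byte_budget(sample, max_bytes=20)
for chunk in chunks:
    print(len(chunk.encode("utf-8")), chunk)
```

Each chunk can then be passed to the model separately and the corrected pieces joined with spaces. Splitting on whitespace keeps words intact, although it does not respect sentence boundaries.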
Performance
SacreBLEU eval dataset: 10.83
SacreBLEU eval model: 72.35
Usage
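from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# OCR output from a historical German text, including recognition errors
example_sentence = "Anvpreiſungq. Haupidepot für Wien: In der Stadt, obere Bräunerſtraße Nr. 1137 in der Varfüͤmerie-Handlung zur"
tokenizer = AutoTokenizer.from_pretrained("Var3n/hmByT5_anno")
model = AutoModelForSeq2SeqLM.from_pretrained("Var3n/hmByT5_anno")
# Named input_ids rather than input to avoid shadowing the Python built-in
input_ids = tokenizer(example_sentence, return_tensors="pt").input_ids
# num_beams=4 with do_sample=True samples within beam search, so the output can vary between runs
output = model.generate(input_ids, max_new_tokens=len(input_ids[0]), num_beams=4, do_sample=True)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)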