---
license: mit
language:
- de
metrics:
- bleu
pipeline_tag: text2text-generation
tags:
- ByT5
- historical
- t5
- ocr-correction
---
Fine-tuned version of [hmByT5](https://huggingface.co/hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax) on the DE1, DE2, DE3 and DE7 parts of the [ICDAR2019-POCR](https://drive.google.com/file/d/1wOhmsoxOVQEPgHSX1QrYWKg5XAdYkzwi/view) dataset, trained to correct OCR mistakes. The `max_length` was set to 350.
## Performance
```
SacreBLEU eval dataset: 10.83
SacreBLEU eval model: 72.35
```
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Raw OCR output containing recognition errors
example_sentence = "Anvpreiſungq. Haupidepot für Wien: In der Stadt, obere Bräunerſtraße Nr. 1137 in der Varfüͤmerie-Handlung zur"

tokenizer = AutoTokenizer.from_pretrained("Var3n/hmByT5_anno")
model = AutoModelForSeq2SeqLM.from_pretrained("Var3n/hmByT5_anno")

# Tokenize into byte-level input IDs (PyTorch tensor of shape [1, seq_len])
input_ids = tokenizer(example_sentence, return_tensors="pt").input_ids
# Generate at most as many new tokens as the input has; do_sample=True makes the result vary between runs
output = model.generate(input_ids, max_new_tokens=len(input_ids[0]), num_beams=4, do_sample=True)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```
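Since the `max_length` was set to 350 and ByT5 tokenizes text into UTF-8 bytes, longer passages may need to be split before correction. The following is a minimal sketch, not part of the original training or evaluation code: `split_by_bytes` is a hypothetical helper, and the default of 349 bytes assumes that one of the 350 positions is taken by the end-of-sequence token. Whether this limit also applies at inference time is an assumption.

```python
def split_by_bytes(text: str, max_bytes: int = 349) -> list[str]:
    """Split text at whitespace into chunks of at most max_bytes UTF-8 bytes.

    Hypothetical helper: the 349-byte default is an assumption
    (max_length 350 minus one end-of-sequence token).
    Runs of whitespace are collapsed into single spaces.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than max_bytes is kept whole, not cut
            current = word
    if current:
        chunks.append(current)
    return chunks


long_text = "Haupidepot für Wien: In der Stadt, obere Bräunerſtraße " * 20
for chunk in split_by_bytes(long_text):
    print(len(chunk.encode("utf-8")), chunk[:40])
```

Each chunk can then be passed through the tokenizer and `model.generate` as in the snippet above, and the corrected chunks joined with spaces. Splitting at whitespace keeps words intact, but it ignores sentence boundaries, so corrections near chunk edges may be less reliable.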