---
license: mit
language:
- de
metrics:
- bleu
pipeline_tag: text2text-generation
tags:
- ByT5
- historical
- t5
- ocr-correction
---

Fine-tuned version of [hmByT5](https://huggingface.co/hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax) on the DE1, DE2, DE3 and DE7 parts of the [ICDAR2019-POCR](https://drive.google.com/file/d/1wOhmsoxOVQEPgHSX1QrYWKg5XAdYkzwi/view) dataset for correcting OCR mistakes in historical German text. The `max_length` was set to 350 tokens (i.e. 350 UTF-8 bytes, since ByT5 tokenizes at the byte level).
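
Texts longer than this training length should be split before correction, since generation quality may degrade beyond it. A minimal sketch of byte-aware chunking (assuming whitespace-separated words; `split_into_chunks` is a hypothetical helper, not shipped with the model):

```python
def split_into_chunks(text: str, max_bytes: int = 350) -> list[str]:
    """Greedily pack words into chunks of at most max_bytes UTF-8 bytes,
    matching the max_length used during fine-tuning."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word  # assumes a single word fits within max_bytes
    if current:
        chunks.append(current)
    return chunks
```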

## Performance

SacreBLEU on the evaluation set (higher is better):

```
Uncorrected OCR text (baseline): 10.83
Model output:                    72.35
```
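
Scores of this kind can be reproduced with the `sacrebleu` package. A sketch under assumptions: `ocr_lines`, `corrected_lines`, and `gold_lines` are hypothetical placeholder lists, and the exact evaluation setup of this model is not documented here.

```python
import sacrebleu

# Hypothetical data: noisy OCR lines, model corrections, ground truth.
ocr_lines = ["Haupidepot für Wien"]
corrected_lines = ["Hauptdepot für Wien"]
gold_lines = ["Hauptdepot für Wien"]

# Baseline: how close is the uncorrected OCR text to the ground truth?
baseline = sacrebleu.corpus_bleu(ocr_lines, [gold_lines])
# Model: how close is the corrected output to the ground truth?
model_score = sacrebleu.corpus_bleu(corrected_lines, [gold_lines])

print(f"baseline: {baseline.score:.2f}, model: {model_score.score:.2f}")
```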

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

example_sentence = "Anvpreiſungq. Haupidepot für Wien: In der Stadt, obere Bräunerſtraße Nr. 1137 in der Varfüͤmerie-Handlung zur"

tokenizer = AutoTokenizer.from_pretrained("Var3n/hmByT5_anno")
model = AutoModelForSeq2SeqLM.from_pretrained("Var3n/hmByT5_anno")

# ByT5 tokenizes at the byte level, so input_ids has one id per UTF-8 byte.
input_ids = tokenizer(example_sentence, return_tensors="pt").input_ids

# Cap max_new_tokens at the input length: the corrected text is assumed
# to be at most as long as the noisy OCR input.
output = model.generate(input_ids, max_new_tokens=len(input_ids[0]), num_beams=4, do_sample=True)

text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```
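
Because `do_sample=True` is combined with `num_beams=4`, generation uses beam-search sampling, so the output can vary between runs; dropping `do_sample=True` yields deterministic beam search. For correcting many lines at once, a batched variant might look like the sketch below (`ocr_lines` is a hypothetical list of noisy strings):

```python
# Hypothetical batch of OCR lines to correct.
ocr_lines = [
    "Anvpreiſungq. Haupidepot für Wien:",
    "in der Varfüͤmerie-Handlung zur",
]

# Pad to the longest line in the batch so the tensors are rectangular.
batch = tokenizer(ocr_lines, return_tensors="pt", padding=True)

outputs = model.generate(
    batch.input_ids,
    attention_mask=batch.attention_mask,
    max_new_tokens=batch.input_ids.shape[1],
    num_beams=4,
)

corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for noisy, clean in zip(ocr_lines, corrected):
    print(f"{noisy!r} -> {clean!r}")
```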