Var3n committed on
Commit
6ed121b
1 Parent(s): 2d17242

Create README.md

Fine-tuned version of [hmByT5](https://huggingface.co/hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax) on the DE1, DE2, DE3 and DE7 parts of the [ICDAR2019-POCR](https://drive.google.com/file/d/1wOhmsoxOVQEPgHSX1QrYWKg5XAdYkzwi/view) dataset, trained to correct OCR mistakes. The `max_length` was set to 350.

## Performance
```
SacreBLEU eval dataset: 10.83
SacreBLEU eval model: 72.35
```

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

example_sentence = "Ben algoritme dat op ba8i8 van kunstmatige inte11i9entie vkijwel geautomatiseerd een tekst herstelt met OCR fuuten."

tokenizer = AutoTokenizer.from_pretrained("Var3n/hmByT5_anno")
model = AutoModelForSeq2SeqLM.from_pretrained("Var3n/hmByT5_anno")

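# Tokenize the noisy sentence (ByT5 operates on UTF-8 bytes).
input_ids = tokenizer(example_sentence, return_tensors="pt").input_ids
# Allow at most as many new tokens as the input has, since a correction should be of similar length.
output = model.generate(input_ids, max_new_tokens=len(input_ids[0]), num_beams=4, do_sample=True)

# Decode the corrected text and drop special tokens such as the end-of-sequence marker.
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)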
```
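
Since the `max_length` was set to 350, longer passages probably need to be split before they are passed to the model. The following is a minimal sketch, not part of the original model card. It assumes that `max_length` counts input tokens, and it relies on ByT5 encoding text as one token per UTF-8 byte plus one end-of-sequence token. The helper name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text: str, max_tokens: int = 350) -> list[str]:
    """Greedily pack whitespace-separated words into chunks that fit a byte budget.

    Assumption: one token per UTF-8 byte, plus one end-of-sequence token.
    """
    budget = max_tokens - 1  # reserve one position for the end-of-sequence token
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate.encode("utf-8")) <= budget:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than the budget is kept whole, not cut.
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be tokenized and passed to `model.generate` as in the usage example above, and the decoded outputs joined with spaces. Splitting on whitespace keeps words intact but ignores sentence boundaries, which may matter for correction quality.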

Files changed (1)
  1. README.md +13 -0
README.md ADDED
@@ -0,0 +1,13 @@
+ ---
+ license: mit
+ language:
+ - de
+ metrics:
+ - bleu
+ pipeline_tag: text2text-generation
+ tags:
+ - ByT5
+ - historical
+ - t5
+ - ocr-correction
+ ---