Doing nothing?

#1
by KeepKool - opened

How do I use it? All my trials lead to: out = in. Does it need a prompt with an example?

The model should work when loaded with T5ForConditionalGeneration.

Are you using the example code in the README? And what inputs did you use?


from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")
tokenizer = AutoTokenizer.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")

inputs = tokenizer("I l0ve anima1s", return_tensors="pt", padding=True)

output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False,  # disable sampling to test if batching affects output
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
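
If the model loads correctly, this should print the corrected text, i.e. ['I love animals'] for the example input.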

Also, not all inputs may be corrected. Since the model was trained on synthetically corrupted data, I assume that if the input text is very different from the training data, the model might just output the same text unaltered. I aim to improve the model at some point by training on a larger dataset with different levels of corrupted text, to make it work on a wider variety of input text.
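
A quick way to spot that behaviour is to check whether the generated text equals the input. A minimal sketch, reusing the model and tokenizer loaded above (the echoes_input helper name is my own, and max_new_tokens needs a reasonably recent transformers version):

def echoes_input(text, model, tokenizer):
    # Returns True if the model outputs `text` unchanged.
    inputs = tokenizer(text, return_tensors="pt")
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        do_sample=False,
        max_new_tokens=2 * inputs["input_ids"].shape[1],  # room for the full correction
    )
    decoded = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return decoded == text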

I tried it using the Hosted API, manually (no code).
With your example "I l0ve anima1s", it works, but not with my test sentences.
This certainly has to do with your training set.
Here is the type of OCR issue I have (here in French; on your model I tried in English, of course):

Lorsque ces sciences sont employées .à prédire l'avenir de .l'individu,
:lles offrent de nombreux dangers. Il est risqué d'indiquer à quelqu'un un .
-vènement ·fi tur probable. En effet, le· fait· de cannai tre cet évènement in;.. .:

I used the nlpaug library for creating the synthetic OCR errors. Looking at the source code here, they have mappings of common OCR errors, e.g. 0 -> o. However, these are quite limited, so you could fine-tune the model on the OCR errors you have in your data to make it perform better.
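
For reference, a minimal sketch of generating synthetic errors with nlpaug's character-level OcrAug (in recent nlpaug versions, augment returns a list of strings rather than a single string):

import nlpaug.augmenter.char as nac

# OcrAug substitutes characters using a built-in table of common
# OCR confusions (e.g. o <-> 0, i <-> 1).
aug = nac.OcrAug()

clean = "I love animals"
corrupted = aug.augment(clean)  # e.g. ['I l0ve anima1s']
print(corrupted)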

I also plan on fine-tuning the model on some OCR correction datasets, as well as the synthetic data, to make it better; for example, the ALTA 2017 shared task data. I can upload my training code to GitHub as well.
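
In the meantime, a rough sketch of what fine-tuning on (corrupted, clean) pairs looks like in plain PyTorch; the pairs below are placeholders, not my actual training data:

import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")
tokenizer = AutoTokenizer.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")

# Placeholder (corrupted, clean) pairs -- substitute your own OCR data.
pairs = [
    ("I l0ve anima1s", "I love animals"),
    ("Th1s 1s a test", "This is a test"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for corrupted, clean in pairs:
    inputs = tokenizer(corrupted, return_tensors="pt")
    labels = tokenizer(clean, return_tensors="pt").input_ids
    # T5 computes the cross-entropy loss internally when labels are passed.
    loss = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        labels=labels,
    ).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()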
