
OCRonos is a specialized LLM that rephrases and rewrites noisy OCR output.

OCRonos is a Mistral 7B base model fine-tuned on 1,532 pairs of noisy OCR text and its correction in French, English, German, Polish and other European languages, drawn from Common Corpus (Chronicling America, Gallica, Europeana, Internet Archive, etc.).

OCRonos is a rewriting tool rather than a strict correction tool. It provides a synthetic interpretation of texts that are too deteriorated to be faithfully reconstructed.

Examples

Original input with a high rate of OCR errors from Chronicling America:

The Douglas Island News.

??????II ????I ????mmmm? n i i n i ??mmmmmm?i????????????????????????????

VOL. G. DOUGLAS CITY AND TREADWELL, ALASKA, WEDNESDAY, AUGUST 3, L90L NO. 37

I Wall Paper $ Carpets f

? 5?!? *i i g Spring House Cleaning is almost here, and g; ?J such a lot of trouble it brings with it. The 3 1 ff house has to be papered from top to bottom, ? ?J the floor Carpeted, Lace Curtains put up, J + Linoleum on kitchen floor, and various other 3 ? things, when carried out are the making of ? our Alaska homes. g

Now we have prepared a way out of this trou- ?* ble for you. We have had a special depart- Jment made, where we are showing 40,000 rolls of Paper ranging in price from 10 c per roll ? J up; 50 different styles of carpets, from 50 c J 4 up, such as genuine home made rag carpets 4} J /washable), Brussels, Velvets, Axminstersand ? ? Ingraius; all grades and styles of Lace Cur- g 4 tains, inported and domestic portiers and c J couch covers; Rugs, small as 12x20 inches ? J and large as 15x20 feet; Linoleum in grades J 4 A, B, C, D, E; all widths floor oil cloth, roller < J shados, carpet sweepers, curtain poles, ? g mouldings, brass fixtures and trimmings of g 4 all kinds, in fact everything carried in an up g J to date department can be obtained from ?

Rewrite with a reasonably high temperature (0.7):

The Douglas Island News.

VOL. G. DOUGLAS CITY AND TREADWELL, ALASKA, WEDNESDAY, AUGUST 3, 1904 NO. 37

Wall Paper & Carpets

Spring House Cleaning is almost here, and it brings with it a lot of trouble. The house has to be papered from top to bottom, the floor carpeted, lace curtains put up, linoleum on kitchen floor, and various other things, when carried out are the making of our Alaska homes.

Now we have prepared a way out of this trouble for you. We have had a special department made, where we are showing 40,000 rolls of Paper ranging in price from 10 c per roll up; 50 different styles of carpets, from 50 c up, such as genuine home made rag carpets (washable), Brussels, Velvets, Axminsters, and Ingrains; all grades and styles of Lace Curtains, imported and domestic portiers and couch covers; Rugs, small as 12x20 inches and large as 15x20 feet; Linoleum in grades A, B, C, D, E; all widths, floor oil cloth, roller shades, carpet sweepers, curtain poles, mouldings, brass fixtures and trimmings of all kinds, in fact everything carried in an up to date department can be obtained from B. M. BEHRENDS CO., INC.

Usage

OCRonos uses a custom instruction structure, "### Text ###\n[text]\n\n### Correction ###\n", and a custom end-of-sequence marker, #END#.

Typical usage with vllm:

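from vllm import LLM, SamplingParams

# `llm` is assumed to be an LLM instance already loaded with the OCRonos weights; `user_input` is the noisy OCR text.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=4000, presence_penalty=0, stop=["#END#"])
prompts = ["### Text ###\n" + user_input + "\n\n### Correction ###\n"]
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)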
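The instruction structure can be wrapped in two small helpers, sketched below under stated assumptions. The names `build_prompt` and `clean_completion` are hypothetical and not part of OCRonos or vllm; `clean_completion` is only a safeguard for setups where the #END# marker is not already removed by the `stop` parameter.

```python
PROMPT_TEMPLATE = "### Text ###\n{text}\n\n### Correction ###\n"
END_MARKER = "#END#"


def build_prompt(noisy_text: str) -> str:
    """Wrap noisy OCR text in the instruction structure described above."""
    return PROMPT_TEMPLATE.format(text=noisy_text)


def clean_completion(completion: str) -> str:
    """Cut the completion at the end marker, if present, and trim whitespace."""
    return completion.split(END_MARKER, 1)[0].strip()
```

A list of prompts built this way can be passed to `llm.generate` in a single call to correct several texts at once.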

Issues

OCRonos occasionally struggles with the start of the text and can miss or distort the first few words. The remainder of the text is usually fine.

A partly effective solution is to append the first word (or first few words) of the input to the end of the prompt, so that the model continues the correction from there.

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=4000, presence_penalty=0, stop=["#END#"])
# Prime the correction with the first word of the input so the model continues from it.
prompts = ["### Text ###\n" + user_input + "\n\n### Correction ###\n" + user_input.split()[0] + " "]
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)
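Because the primed words sit in the prompt, they will not be part of the generated completion and need to be put back in front of it. The sketch below shows one way to do this; `build_primed_prompt` and `reattach_prefix` are hypothetical helper names, and the handling of empty input is an added assumption rather than something the model card specifies.

```python
from typing import Tuple


def build_primed_prompt(noisy_text: str, n_words: int = 1) -> Tuple[str, str]:
    """Return (prompt, prefix), where prefix holds the first n_words of the input.

    The prefix is appended after the correction header so that generation
    continues from it. Empty input yields an empty prefix instead of an error.
    """
    words = noisy_text.split()
    prefix = " ".join(words[:n_words])
    prompt = "### Text ###\n" + noisy_text + "\n\n### Correction ###\n"
    if prefix:
        prefix += " "
        prompt += prefix
    return prompt, prefix


def reattach_prefix(prefix: str, completion: str) -> str:
    """Put the primed words back in front of the generated continuation."""
    return prefix + completion.lstrip()
```

Note that the primed words are copied verbatim from the noisy input, so an OCR error in the first word carries over into the result.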

A common issue with OCR correction is language switching: apparently because of the noise in the input text, an LLM may transcribe into a different language or even a different script (such as Cyrillic). OCRonos mitigates this issue better than general-purpose models used for the same task, but does not eliminate it.
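Script switching, at least, can be flagged after the fact with a simple heuristic. The sketch below is an illustration and not part of OCRonos: it takes the first word of each letter's Unicode name (for example LATIN or CYRILLIC) as its script and reports scripts that make up a noticeable share of the output but never occur in the input. The function names and the 10% threshold are assumptions. It will not catch a switch between two languages that share a script.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters by the first word of their Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def unexpected_scripts(source: str, output: str, min_share: float = 0.1) -> set:
    """Return scripts covering at least min_share of the output letters
    that do not appear anywhere in the source text."""
    src = script_counts(source)
    out = script_counts(output)
    total = sum(out.values())
    if total == 0:
        return set()
    return {s for s, n in out.items() if n / total >= min_share and s not in src}
```

An output flagged this way could, for instance, be regenerated with a different seed or a lower temperature.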

Model size: 7.24B params (Safetensors), tensor type BF16.