BART-base-ocr

This model was released as part of the paper Leveraging LLMs for Post-OCR Correction of Historical Newspapers and is designed to correct OCR errors in digitised text. It is a BART-base model fine-tuned for post-OCR correction of historical English, using BLN600, a parallel corpus of 19th-century newspaper text with machine and human transcriptions.

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForSeq2SeqLM.from_pretrained('pykale/bart-base-ocr')
tokenizer = AutoTokenizer.from_pretrained('pykale/bart-base-ocr')

# The pipeline moves the model to the given device itself, so an explicit
# model.to('cuda') is unnecessary; fall back to CPU when no GPU is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
generator = pipeline('text2text-generation', model=model, tokenizer=tokenizer, device=device, max_length=1024)

# Run a noisy OCR line through the model and print the corrected text.
ocr = "The defendant wits'fined £5 and costs."
pred = generator(ocr)[0]['generated_text']
print(pred)
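
The pipeline's max_length of 1024 tokens bounds how much text can be corrected in one pass, so longer transcripts are best split into smaller pieces and corrected piecewise. Below is a minimal sketch of one way to do this; the correct_document helper, its paragraph-based chunking, and the sample input are illustrative assumptions, not part of the released model:

def correct_document(text, generator):
    # Hypothetical helper: correct one paragraph at a time so each piece
    # stays within the pipeline's 1024-token generation limit (assumes
    # newline-separated paragraphs that are each short enough to fit).
    paragraphs = [p for p in text.split('\n') if p.strip()]
    corrected = [generator(p)[0]['generated_text'] for p in paragraphs]
    return '\n'.join(corrected)

# Illustrative multi-paragraph OCR input with typical recognition noise.
long_ocr = "The defendant wits'fined £5 and costs.\nTlie jury returned a verdiot of guilty."
print(correct_document(long_ocr, generator))

Splitting on newlines is deliberately simple; for real documents whose paragraphs exceed the token limit, a sentence tokeniser (e.g. NLTK's sent_tokenize) would give finer-grained chunks.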

Citation

@inproceedings{thomas-etal-2024-leveraging,
    title = "Leveraging {LLM}s for Post-{OCR} Correction of Historical Newspapers",
    author = "Thomas, Alan and Gaizauskas, Robert and Lu, Haiping",
    editor = "Sprugnoli, Rachele and Passarotti, Marco",
    booktitle = "Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024",
    month = "may",
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lt4hala-1.14",
    pages = "116--121",
}