19c RoBERTa v5 Newspaper Language Model (In Development)
This repository is set up for the upcoming RoBERTa v5 language model, which will be pretrained on a newly reconstructed, high-resolution 19th-century American newspaper corpus.
The Core Strategy: Replacing Noisy OCR
The existing Tesseract OCR from Chronicling America (loc.gov) suffers from error rates of 12–40% due to degraded microfilms, mixed case, long-s (ſ) ligatures, and poor line segmentation.
To build a clean corpus for RoBERTa v5, we bypass the original OCR and re-transcribe the text directly from the page images using a custom pipeline:
- High-Resolution Harvest: Digitized pages downloaded at 50% resolution (doubling linear pixel density for small body text).
- YOLO Column Segmentation: A fine-tuned YOLO detector segments the newspaper pages into clean column crops.
- Strip Tiling: Slices columns into overlapping horizontal strips to keep text density within VLM visual attention boundaries.
- Local VLM Transcription: Transcribes strips using Gemma MoE via LM Studio APIs (achieving 90–95% transcription accuracy).
- Deduplication & Merge: Reassembles transcripts using sequence matching and de-hyphenation.
- Quality Gating via RoBERTa v4:
  - Perplexity (PPL) Filtering: Each assembled column text is scored by the RoBERTa v4 language model. High-perplexity columns (indicating gibberish, heavy distortion, or transcription failure) are automatically discarded.
  - Looping & Hallucination Mitigation: Microfilm defects or visual artifacts can cause the VLM to repeat paragraphs (looping), output text for blank areas, or generate plausible but hallucinated paragraphs. RoBERTa v4's perplexity scores are used to flag and filter out these repeated blocks, empty segments, and other anomalies.
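
The tiling, merge, and gating steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual code: the function names (`strip_bounds`, `merge_strips`, `dehyphenate`, `perplexity`, `keep_column`) and the parameters `strip_h`, `overlap`, `window`, and `max_ppl` are all hypothetical. It also assumes that overlapping lines are transcribed identically in both strips (a real pipeline would likely need fuzzy matching) and that per-token log-probabilities have already been obtained from RoBERTa v4.

```python
import math
from difflib import SequenceMatcher


def strip_bounds(height, strip_h, overlap):
    """Return (top, bottom) pixel rows of overlapping horizontal strips."""
    if not 0 <= overlap < strip_h:
        raise ValueError("overlap must be non-negative and smaller than strip_h")
    step = strip_h - overlap
    bounds, top = [], 0
    while True:
        bottom = min(top + strip_h, height)
        bounds.append((top, bottom))
        if bottom == height:
            return bounds
        top += step


def merge_strips(prev_lines, next_lines, window=6):
    """Join two strip transcripts, dropping lines repeated in the overlap."""
    tail = prev_lines[-window:]
    head = next_lines[:window]
    matcher = SequenceMatcher(None, tail, head, autojunk=False)
    match = matcher.find_longest_match(0, len(tail), 0, len(head))
    if match.size == 0:
        return prev_lines + next_lines  # no shared lines found: concatenate
    # Keep prev up to the shared run, then continue from the shared run in next.
    cut = len(prev_lines) - len(tail) + match.a
    return prev_lines[:cut] + next_lines[match.b:]


def dehyphenate(lines):
    """Rejoin words split across line ends, e.g. 'news-' + 'paper'.

    Simplification: true hyphenated compounds broken at a line end
    (such as 'well-' + 'known') are also joined.
    """
    out = []
    for line in lines:
        if out and out[-1].endswith("-") and line[:1].islower():
            first, _, rest = line.partition(" ")
            out[-1] = out[-1][:-1] + first
            line = rest
            if not line:
                continue
        out.append(line)
    return " ".join(out)


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        return math.inf  # an empty transcript is treated as a failure
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def keep_column(token_logprobs, max_ppl=50.0):
    """Gate a column: keep it only if its perplexity is at or below max_ppl."""
    return perplexity(token_logprobs) <= max_ppl


# Small demonstration with made-up strip transcripts.
print(strip_bounds(1000, 400, 100))
merged = merge_strips(
    ["The editor of the news-", "paper has re-"],
    ["paper has re-", "signed his post."],
)
print(dehyphenate(merged))
```

The threshold `max_ppl=50.0` is a placeholder; a usable cutoff would have to be calibrated against columns of known quality.
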
Pretraining Corpus: The 19c Newspaper Harvest
The source corpus has been harvested and banked in the Hugging Face dataset ambrosfitz/19c_newspapers_images_alto, which contains 1,361 Parquet shards (~68,050 images).
Decade-by-Decade Banked Image Counters
| Decade | Banked / Target Images (Redis) | Shards on HF | Est. Images | Status |
|---|---|---|---|---|
| 1800s | 1,115 / 1,040 | 28 | 1,400 | Satisfied ✓ |
| 1810s | 1,178 / 1,040 | 28 | 1,400 | Satisfied ✓ |
| 1820s | 1,651 / 1,560 | 41 | 2,050 | Satisfied ✓ |
| 1830s | 2,647 / 2,600 | 25 | 1,250 | Satisfied ✓ |
| 1840s | 5,850 / 4,160 | 39 | 1,950 | Satisfied ✓ |
| 1850s | 7,889 / 7,800 | 93 | 4,650 | Satisfied ✓ |
| 1860s | 10,486 / 10,400 | 169 | 8,450 | Satisfied ✓ |
| 1870s | 7,919 / 7,800 | 172 | 8,600 | Satisfied ✓ |
| 1880s | 10,519 / 10,400 | 215 | 10,750 | Satisfied ✓ |
| 1890s | 10,421 / 10,400 | 551 | 27,550 | Satisfied ✓ |
| Total | 68,050 Banked | 1,361 | 68,050 | ALL COMPLETED |
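
The Est. Images column appears to follow from the shard counts at 50 images per shard. That constant is inferred from the table rather than stated anywhere, so the short check below is only a sanity test of the arithmetic; the variable names are illustrative.

```python
# Shard counts per decade, copied from the "Shards on HF" column.
shards = {
    "1800s": 28, "1810s": 28, "1820s": 41, "1830s": 25, "1840s": 39,
    "1850s": 93, "1860s": 169, "1870s": 172, "1880s": 215, "1890s": 551,
}

IMAGES_PER_SHARD = 50  # assumption inferred from the table, not documented

est_images = {decade: n * IMAGES_PER_SHARD for decade, n in shards.items()}
total_shards = sum(shards.values())
total_images = sum(est_images.values())

print(total_shards, total_images)
```
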
How to Load (Once Released)
Once training is complete, the model will be loadable using the Hugging Face transformers library:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "ambrosfitz/19c_roberta_v5_newspaper"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)
```