19c RoBERTa v5 Newspaper Language Model (In Development)

This repository is set up for the upcoming RoBERTa v5 language model, which will be pretrained on a newly reconstructed, high-resolution 19th-century American newspaper corpus.

The Core Strategy: Replacing Noisy OCR

The existing Tesseract OCR from Chronicling America (loc.gov) suffers from error rates of 12–40% due to degraded microfilms, mixed case, long-s (ſ) ligatures, and poor line segmentation.

To build a clean corpus for RoBERTa v5, we bypass the original OCR entirely and re-transcribe the text from the original page images using a custom pipeline:

1. High-Resolution Harvest: Digitized pages are downloaded at 50% resolution, doubling the linear pixel density available for small body text.
2. YOLO Column Segmentation: A fine-tuned YOLO detector segments each newspaper page into clean column crops.
3. Strip Tiling: Each column is sliced into overlapping horizontal strips to keep text density within the VLM's visual attention limits.
4. Local VLM Transcription: Strips are transcribed with Gemma MoE through the LM Studio API (achieving 90–95% transcription accuracy).
5. Deduplication & Merge: Strip transcripts are reassembled using sequence matching and de-hyphenation.
6. Quality Gating via RoBERTa v4:
   - Perplexity (PPL) Filtering: Assembled column texts are scored by the RoBERTa v4 language model. High-perplexity columns (indicating gibberish, heavy distortion, or transcription failure) are automatically discarded.
   - Looping & Hallucination Mitigation: Microfilm defects or visual artifacts can cause the VLM to repeat paragraphs (looping), output text for blank areas, or generate plausible but hallucinated paragraphs. RoBERTa v4's perplexity scores are used to flag these repeated blocks, empty segments, and anomalies so they can be filtered out.
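The tiling, merge, and gating steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: every function name, the 0.9 similarity threshold, the five-line overlap window, and the `score_fn` callable (a stand-in for whatever returns per-token log-probabilities from RoBERTa v4) are hypothetical.

```python
import math
from difflib import SequenceMatcher


def strip_bounds(height, strip_h, overlap):
    """Return (top, bottom) pixel rows of overlapping horizontal strips."""
    if not 0 <= overlap < strip_h:
        raise ValueError("need 0 <= overlap < strip_h")
    bounds, top = [], 0
    while True:
        bottom = min(top + strip_h, height)
        bounds.append((top, bottom))
        if bottom >= height:
            return bounds
        top = bottom - overlap  # step back so adjacent strips share rows


def similar(a, b, threshold=0.9):
    """Fuzzy line equality; tolerates small transcription differences."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def merge_strips(prev_lines, next_lines, max_overlap=5):
    """Append next_lines, dropping leading lines that repeat prev's tail."""
    limit = min(max_overlap, len(prev_lines), len(next_lines))
    for k in range(limit, 0, -1):  # prefer the longest matching overlap
        pairs = zip(prev_lines[-k:], next_lines[:k])
        if all(similar(p, n) for p, n in pairs):
            return prev_lines + next_lines[k:]
    return prev_lines + next_lines  # no overlap detected


def dehyphenate(lines):
    """Naive heuristic: join 'cot-' + 'ton ...' if the next line starts lowercase.

    This also merges genuine hyphenated compounds split at a line break.
    """
    out = []
    for line in lines:
        if out and out[-1].endswith("-") and line[:1].islower():
            out[-1] = out[-1][:-1] + line
        else:
            out.append(line)
    return out


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def gate(columns, score_fn, max_ppl):
    """Keep columns scoring at or below max_ppl; empty scores are dropped.

    score_fn is a hypothetical callable returning per-token log-probs.
    """
    kept = []
    for text in columns:
        log_probs = score_fn(text)
        if log_probs and perplexity(log_probs) <= max_ppl:
            kept.append(text)
    return kept


# Two strip transcripts that share two lines (one with an OCR-style slip).
first = ["The steamer arrived", "at the port of New", "Orleans on Tuesday"]
second = ["at the port of New", "Orleans on Tuesdav",
          "with a cargo of cot-", "ton and sugar."]
print(dehyphenate(merge_strips(first, second)))
print(strip_bounds(1000, 400, 100))
```

Because RoBERTa is a masked language model, a real `score_fn` would most likely compute a pseudo-log-likelihood by masking one token at a time; that detail, like the threshold passed as `max_ppl`, is an assumption here and would need tuning against held-out transcripts.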

Pretraining Corpus: The 19c Newspaper Harvest

The source corpus has been harvested and banked in the ambrosfitz/19c_newspapers_images_alto dataset, which contains 1,361 Parquet shards (~68,050 images).

Decade-by-Decade Banked Image Counters

Decade	Banked / Target Images (Redis)	Shards on HF	Est. Images	Status
1800s	`1,115 / 1,040`	28	1,400	Satisfied ✓
1810s	`1,178 / 1,040`	28	1,400	Satisfied ✓
1820s	`1,651 / 1,560`	41	2,050	Satisfied ✓
1830s	`2,647 / 2,600`	25	1,250	Satisfied ✓
1840s	`5,850 / 4,160`	39	1,950	Satisfied ✓
1850s	`7,889 / 7,800`	93	4,650	Satisfied ✓
1860s	`10,486 / 10,400`	169	8,450	Satisfied ✓
1870s	`7,919 / 7,800`	172	8,600	Satisfied ✓
1880s	`10,519 / 10,400`	215	10,750	Satisfied ✓
1890s	`10,421 / 10,400`	551	27,550	Satisfied ✓
Total	`59,675 / 57,200`	1,361	68,050	ALL COMPLETED
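The table's columns can be cross-checked with a short script. The sketch below copies the per-decade figures from the table and assumes 50 images per shard, a value inferred from the ratio of Est. Images to Shards on HF in every row rather than stated anywhere in this document. Summing the Redis counters gives 59,675 banked against 57,200 targeted, which is a different quantity from the 68,050 shard-based estimate.

```python
# (decade, banked, target, shards) copied from the table above.
ROWS = [
    ("1800s", 1115, 1040, 28),
    ("1810s", 1178, 1040, 28),
    ("1820s", 1651, 1560, 41),
    ("1830s", 2647, 2600, 25),
    ("1840s", 5850, 4160, 39),
    ("1850s", 7889, 7800, 93),
    ("1860s", 10486, 10400, 169),
    ("1870s", 7919, 7800, 172),
    ("1880s", 10519, 10400, 215),
    ("1890s", 10421, 10400, 551),
]

# Assumption: inferred from Est. Images / Shards, not documented.
IMAGES_PER_SHARD = 50


def summarize(rows, per_shard=IMAGES_PER_SHARD):
    """Total each column and list decades whose target is not yet met."""
    shards = sum(r[3] for r in rows)
    return {
        "banked": sum(r[1] for r in rows),
        "target": sum(r[2] for r in rows),
        "shards": shards,
        "est_images": shards * per_shard,
        "unsatisfied": [r[0] for r in rows if r[1] < r[2]],
    }


summary = summarize(ROWS)
print(summary)
```

An empty `unsatisfied` list corresponds to every decade being marked Satisfied.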

How to Load (Once Released)

Once training is complete, the model will be loadable using the Hugging Face transformers library:

from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "ambrosfitz/19c_roberta_v5_newspaper"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)
