Pierre-Carl Langlais

Pclanglais

·

Dorialexander

AI & ML interests

Open data & open LLMs

Recent Activity

updated a dataset 2 days ago

Pclanglais/heritage

liked a model 4 days ago

silx-ai/Quasar-3.0-Instract-v2

liked a model 5 days ago

meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8

Posts 6

Post

3075

Today we are releasing our first foundation model and experimenting with a new category: specialized pre-training.

OCRonos-Vintage is a 124M-parameter model trained end-to-end by Pleias with llm.c on 18 billion tokens from cultural heritage archives. Despite its small size, it achieves nearly state-of-the-art results for OCR correction of historical English sources. OCRonos-Vintage is also a historical model with an unusual cut-off date: December 29th, 1955…

We look forward to replicating this approach very soon on other "hard" tasks commonly associated with generalist LLMs/SLMs: RAG, function calling, summarization, document segmentation…

OCRonos-Vintage: PleIAs/OCRonos-Vintage
CPU Demo: PleIAs/OCRonos-Vintage-CPU
GPU Demo: PleIAs/OCRonos-Vintage-GPU
Our announcement and call for specialized pre-training: https://huggingface.co/blog/Pclanglais/specialized-pre-training

Articles 7

Article

84

They Said It Couldn’t Be Done

Papers 1

arxiv:2501.08365

Spaces 9

Reversed Zotero

Editorialization

Correction-OCR

Tchap

Motta

tag_theme

Models 38

Pclanglais/Popeye-1929

Text-to-Image • Updated Dec 31, 2024 • 12

Pclanglais/Pleias-Nano-onnx

Text Generation • Updated Dec 9, 2024 • 1

Pclanglais/Pleias-Pico-onnx

Updated Dec 9, 2024 • 4

Pclanglais/Headlines-OCR-Correction

Updated Oct 25, 2024 • 4

Pclanglais/SynthRag3

Updated Sep 11, 2024 • 1

Pclanglais/SynthRag2

Updated Sep 9, 2024 • 2

Pclanglais/SynthRag1

Updated Sep 8, 2024 • 2

Pclanglais/Experiment1

Updated Sep 5, 2024 • 2

Pclanglais/Segmentext-Marianne

Updated Aug 28, 2024 • 2

Pclanglais/OCRonos-Vintage-GGUF

Updated Aug 11, 2024

Datasets 14

Pclanglais/heritage

Preview • Updated 2 days ago • 51

Pclanglais/course-material

Viewer • Updated 12 days ago • 89k • 59.7k

Pclanglais/gutenberg_set

Viewer • Updated 20 days ago • 7.53M • 102

Pclanglais/tokenized_sample

Viewer • Updated Feb 10 • 1.54M • 423

Pclanglais/pdf_sample_10k

Viewer • Updated Nov 30, 2024 • 415k • 29 • 1

Pclanglais/open-science

Viewer • Updated Nov 15, 2024 • 10.8M • 125

Pclanglais/LLM-for-DH

Viewer • Updated Jul 14, 2024 • 1.62k • 14

Pclanglais/youtube-commons-metadata

Viewer • Updated Jun 19, 2024 • 6.91M • 33

Pclanglais/OCR-test

Viewer • Updated Apr 22, 2024 • 20.1k • 26 • 1

Pclanglais/AllWikidataCharacters

Viewer • Updated Apr 14, 2024 • 180k • 41 • 7