Pierre-Carl Langlais

Pclanglais

AI & ML interests

Open data & open LLMs

Recent Activity

updated a dataset about 3 hours ago
PleIAs/post-ocr
updated a dataset about 22 hours ago
PleIAs/post-ocr

Pclanglais's activity

posted an update 4 months ago
Today we release our first foundation model and experiment with a new category: specialized pre-training.

OCRonos-Vintage is a 124M-parameter model trained end-to-end by Pleias with llm.c on 18 billion tokens from cultural heritage archives. Despite its small size, it achieves nearly state-of-the-art results for OCR correction of historical English sources. OCRonos-Vintage is also a historical model with an unusual cut-off date: December 29th, 1955…

We look forward to replicating this approach very soon on other "hard" tasks commonly associated with generalist LLMs/SLMs: RAG, function calling, summarization, document segmentation…

OCRonos-Vintage: PleIAs/OCRonos-Vintage
CPU Demo: PleIAs/OCRonos-Vintage-CPU
GPU Demo: PleIAs/OCRonos-Vintage-GPU
Our announcement and call for specialized pre-training: https://huggingface.co/blog/Pclanglais/specialized-pre-training
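
For illustration, here is a minimal sketch of how one might query OCRonos-Vintage as a standard causal LM with transformers; the "### Text ###" / "### Correction ###" prompt delimiters are an assumption and should be checked against the model card.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch, not an official snippet: load OCRonos-Vintage as a causal LM.
# The prompt delimiters below are an assumption -- check the model card for the exact format.
model_id = "PleIAs/OCRonos-Vintage"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

ocr_text = "Tiiis is an exaiuple of noisv OCR outpnt from an old newspaper."
prompt = f"### Text ###\n{ocr_text}\n\n### Correction ###\n"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))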
posted an update 4 months ago
Since it is release season, at PleIAs we are announcing our first suite of specialized language models for document processing tasks (OCR correction, text segmentation, bibliographic extraction), along with the release of Finance Commons, the largest multimodal dataset of financial documents: https://huggingface.co/blog/Pclanglais/finance-commons-bad-data-toolbox

LLM research is currently focused on quality data. We went in the opposite direction and deliberately trained models on bad data. Far from degrading the models, it made them more resilient to the text sources commonly used in production.

Having a wider range of real-life data proved critical for this project. A few months after the release of Common Corpus, we expanded our pool of "training data commons" with a major multimodal resource: documents released as open financial data. Finance Commons comprises 17 billion tokens and 1.25 million PDF corporate documents released by the SEC, WTO, AMF and EU Tenders, in multiple languages, with a large variety of document layouts and challenging sources to train more robust models.

With HuggingFace compute support, we release an entire pipeline to process bad data sources and make them usable in production for LLMOps or simply retrieval: PleIAs/PleIAs-Editor

This approach is based on our new series of specialized models for document processing, the "bad data toolbox" comprising:
* OCRonos, the best available model to date for OCR correction. PleIAs/OCRonos
* Segmentext, a small, purely semantic model for text segmentation, working without any visual reference. PleIAs/Segmentext
* Bibtexer, a small model for bibliographic data extraction acting as a "reversed Zotero". PleIAs/BibTexer
reacted to davanstrien's post with 👍 5 months ago
📁✨ Meet Corpus Creator!

This Gradio app (davanstrien/corpus-creator) takes you from your local files to a Hugging Face Dataset via Llama Index.

The goal of the tool is to make it quicker and easier to get local files you want ready for ML tasks into a Hugging Face Dataset. Perfect for building datasets for:
- synthetic data pipelines
- annotation
- RAG
- Other ML tasks that start from a HF dataset

I'll share something more substantial that uses this tomorrow 🤗
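
For context, the underlying pattern looks roughly like this (a sketch, not Corpus Creator's actual code; the import path assumes a recent llama_index release and the repo id is hypothetical):

from llama_index.core import SimpleDirectoryReader
from datasets import Dataset

# Read local files with Llama Index, then turn them into a Hugging Face Dataset.
documents = SimpleDirectoryReader("./my_local_files").load_data()
records = [
    {"text": doc.text, "source": doc.metadata.get("file_name", "")}
    for doc in documents
]
ds = Dataset.from_list(records)
ds.push_to_hub("my-username/my-corpus")  # hypothetical repo id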
posted an update 7 months ago
Announcing that we are on our way to solving a long-standing issue of document processing: correction of OCR mistakes. Pleias publishes the largest dataset to date with automated OCR correction: 1 billion words in English, French, German and Italian.

OCR quality is a long-standing issue of digitization. Cultural heritage texts are especially affected, due to the primary sources being old documents (with many artifacts, blots and degradation) and to the limitations of OCR technology for historical scripts. When we released Common Corpus, a 500 billion word corpus in the public domain, this was the primary criticism.

This recent breakthrough in post-OCR correction has been made possible by progress in open LLM research and several months of dedicated training and alignment by Pleias, as well as the HPC resources from GENCI–IDRIS (Grant 2023-AD011014736) on Jean-Zay.

Announcement: https://huggingface.co/blog/Pclanglais/post-ocr-correction

Post-OCR-Correction dataset: https://huggingface.co/datasets/PleIAs/Post-OCR-Correction
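
A quick way to peek at the dataset without downloading a billion words is to list its configurations and stream the first record; field names and splits are whatever the dataset card defines, so treat this as a hedged sketch.

from datasets import get_dataset_config_names, load_dataset

# Discover per-language subsets (if any), then stream one record to inspect the fields.
configs = get_dataset_config_names("PleIAs/Post-OCR-Correction")
print(configs)
ds = load_dataset("PleIAs/Post-OCR-Correction", configs[0], split="train", streaming=True)
print(next(iter(ds)))  # raw OCR vs. corrected text fields, as defined by the dataset card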
reacted to conceptofmind's post with 🔥 8 months ago
Teraflop AI is excited to help support the Caselaw Access Project and Harvard Library Innovation Lab in the release of over 6.6 million state and federal court decisions published throughout U.S. history. It is important to democratize fair access to data for the public, the legal community, and researchers. This is a processed and cleaned version of the original CAP data.

During the digitization of these texts, OCR errors occurred. We post-processed each of the texts for model training, fixing encoding, normalization, repetition, redundancy, parsing, and formatting issues.

Teraflop AI’s data engine allows for the massively parallel processing of web-scale datasets into cleaned text form.

Link to the processed dataset: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project

The Caselaw Access Project dataset is licensed under the CC0 License.

We plan to release trillions of commercially licensed text tokens, images, audio, videos, and other datasets spanning numerous domains and modalities over the next months. If you are interested in contributing commercially licensed data be sure to reach out: https://twitter.com/EnricoShippole

Follow us for the next collaborative dataset releases: https://twitter.com/TeraflopAI
reacted to Molbap's post with 🔥 8 months ago
🚀🚀 Exciting times for the document AI community!

We're thrilled to announce the release of some of the largest OCR datasets available to the public.
🔥 With over 26 million pages, 18 billion text tokens, and 6TB of data, these resources are a significant leap forward for document AI research.

Here's how to access these datasets quickly:

from datasets import load_dataset

pdfa_dataset = load_dataset('pixparse/pdfa-eng-wds', streaming=True)
IDL_dataset = load_dataset('pixparse/idl-wds', streaming=True)

This enables you to stream them directly, integrating seamlessly with your projects using the Hugging Face datasets library. On the hub, you can find them here:

pixparse/pdfa-eng-wds
pixparse/idl-wds

For lean data loading, the new [chug](https://github.com/huggingface/chug) library offers a solution with pdf decoding:


import chug

# Document-reading task: decode every page of each PDF
task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
# Data source configuration: stream the train split of pdfa-eng-wds
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
sample = next(iter(data_loader))



We owe a huge thank you to Peter Wyatt, Kate Tasker, Rachel Taketa, Ali Furkan Biten, Ruben Tito, and their colleagues for their contributions. Their work putting these datasets together has been invaluable. 🤗

Looking Ahead:

We're on a mission to enhance document AI capabilities, and these datasets are just the beginning. With your engagement and innovation, we're confident in the community's ability to develop robust OCR solutions. We encourage you to explore these datasets, experiment with the code, and contribute to the collective progress in document AI.

For detailed information on usage and licensing, please refer to the dataset cards on the Hugging Face hub.
posted an update 8 months ago
Announcing today the release of Common Corpus, the largest collection of fully open corpora on HuggingFace: nearly 500 billion words (600-700 billion tokens) in the public domain.

https://huggingface.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613

Common Corpus is an international initiative coordinated by @pleias_fr with the support of the state start-up LANGU:IA (start-up d'Etat), backed by the French Ministry of Culture and DINUM, and with the involvement of the open science LLM community (Occiglot, EleutherAI) and cultural heritage researchers.

We aim to create at the pretraining stage the same kind of ecosystem that now exists for fine-tuning, by building a strong commons without copyright issues or "trade secret" gatekeeping. Contrary to what many AI companies say, Common Corpus shows it is possible to train large language models on a fully open corpus. Due to the complexity of copyright checks, we have only released part of the text we hold and will release much more in the coming months.

Common Corpus is multilingual. To date it includes the largest open collections in French (110 billion words), German (30 billion words), Spanish (23 billion words), Dutch (18 billion words) and Italian (10 billion words), as well as a very long tail of mid- to low-resource languages.

Our conviction is that open corpora make future models more inclusive, democratic, and respectful of cultural diversity, as well as of higher quality. Common Corpus holds many long, editorialized texts in book form, with reasoning-rich content that has never been used to date for LLM pretraining.

Common Corpus is an ongoing work and still needs to be enhanced and completed. Sharing is caring: Common Corpus still needs more care to become "a common" like Wikipedia or Wikisource.

https://huggingface.co/blog/Pclanglais/common-corpus
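
One simple way to browse what the collection contains programmatically is to enumerate the PleIAs datasets on the Hub; this sketch with huggingface_hub hard-codes nothing beyond the organization name.

from huggingface_hub import HfApi

# List every dataset published under the PleIAs organization.
api = HfApi()
for ds in api.list_datasets(author="PleIAs"):
    print(ds.id)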
reacted to akhaliq's post with ❤️ 9 months ago
Aya Dataset

An Open-Access Collection for Multilingual Instruction Tuning

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning (2402.06619)

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.
posted an update 9 months ago
Today I'm releasing marginalia, a Python library to perform corpus analysis and retrieve structured annotations with open LLMs like Mistral Open-Hermes-2.5: https://github.com/Pleias/marginalia

marginalia leverages vllm's inference speed to re-generate until all the output matches an expected JSON structure, and to send batches of several unstructured elements for enhanced pattern detection. It works especially well for bibliographies. The demo transforms a very old list (Benjamin Franklin's favorite books from 1744) into well-structured data: https://colab.research.google.com/drive/1xKjK2mDDpXMaKG5YLpFhOM7jehxt0kEt?usp=sharing

While marginalia can be quite flexible, it definitely isn't a general-purpose tool for JSON generation (like outlines). So far I don't intend to extend support to more complex JSON structures, but I'm really looking forward to potential feedback and suggestions.
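
To make the "re-generate until the output matches an expected JSON structure" idea concrete, here is a bare-bones version of that loop written directly against vllm. This is not marginalia's API; the model id, prompt handling and schema check are placeholders.

import json
from vllm import LLM, SamplingParams

# Bare-bones regenerate-until-valid-JSON loop (NOT marginalia's API).
llm = LLM(model="teknium/OpenHermes-2.5-Mistral-7B")  # assumed model id
params = SamplingParams(temperature=0.7, max_tokens=512)

def annotate(prompt: str, required_keys: set, max_retries: int = 5) -> dict:
    for _ in range(max_retries):
        text = llm.generate([prompt], params)[0].outputs[0].text
        try:
            parsed = json.loads(text)
            if required_keys.issubset(parsed):
                return parsed  # output matches the expected structure
        except json.JSONDecodeError:
            pass  # malformed JSON: sample again
    raise ValueError("no valid JSON after retries")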

replied to JustinLin610's post 10 months ago

Congratulations! With all the US/EU big players being more secretive than ever, you're not just bringing good models, but really making an incredible contribution to open research.

And I slightly disagree on one point: Qwen-500m is SOTA. I never thought it would be possible to get results like this from such a small multilingual model for RAG tasks in French.

reacted to JustinLin610's post with ❤️ 10 months ago
Yesterday we released Qwen1.5. Maybe someday I can tell more about the experience. But this is at least a good release, even if it is not yet SOTA. There are not so many SOTA models, by the way. This time, we actually fixed a lot of problems.

1. Context lengths are finally unified for all sizes. Previously, a lot of users kept telling us that the 14B only supports 2K (yeah, even dynamic NTK does not work that well and it can only be extended to around 4-5K, let alone for those who know nothing about how to use dynamic NTK).

2. If you carefully use our base language models, you will find that they understand the special tokens of ChatML, which means that you can directly use LoRA to train on data in ChatML format (see the sketch after this list). Why couldn't you do this before? Because if the base language model does not understand the special tokens, you need to train them, which means turning on training of the embeddings. This is disgusting and it often leads to problems when you use ZeRO3.

3. We did strengthen our base language models, except for the 72B. You should find better base language models, especially the 7B and 14B. Why not the 72B? Nah, hard to say, but we will make it better.

4. About the multilingual capabilities: yes, we finally built up our multilingual evaluation system and found that our new base language models have nice performance in multilingual evaluation for base language models. This tells us that we should pay more attention to post-training with multilingual data. And we did that too. This is why this time we tell you something about multilingual performance. It is for sure much, much better than our models before this release.

5. Chat models are the most promising part. Before this release, we gave you the SFT models. But this time, we have very nice SFT+DPO models. Not only do annotators like them, but users like them too. I am sure you developers will feel that way as well.
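
As a concrete illustration of point 2, the ChatML markup whose special tokens the base models already recognize can be rendered with a chat template; the model id below is just an example.

from transformers import AutoTokenizer

# Render ChatML-formatted text; the special tokens (<|im_start|>, <|im_end|>)
# are the ones the base models already understand.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")  # example model id
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this paragraph."},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))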

reacted to clem's post with ❤️ 10 months ago
reacted to their post with ❤️ 10 months ago
posted an update 10 months ago
Hi everyone,
For my first post, I'm announcing a big release (in multiple ways): probably the largest open corpus in French to date, with 85 billion words in the public domain.
The dataset has been prepared in collaboration with Benoît de Courson and Benjamin Azoulay from Gallicagram (https://shiny.ens-paris-saclay.fr/app/gallicagram). Gallicagram is a major cultural analytics project in French, an open and better version of the Ngram Viewer for large-scale search of word and ngram occurrences.
The corpus is made of two different datasets: monographs (16B words, PleIAs/French-PD-Books) and newspapers/periodicals (69B words, PleIAs/French-PD-Newspapers). Along with the full text, it also includes core provenance metadata.
Beyond research in digital humanities, the corpus can also be used to train open and reproducible LLMs. Being in the public domain means it can be released anywhere, in any shape, without restrictions.
The corpus is not perfect: digitization of cultural heritage is challenging and, especially for newspapers, we have to tackle layout issues and a significant rate of optical character recognition mistakes. Our conviction is that releasing the corpus as a commons is the best way to improve on this. Sharing is caring.
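
As a small Gallicagram-flavoured example, one can stream a slice of the corpus and count occurrences of a word; this is a sketch, and the text column name is an assumption to check against the dataset card.

from datasets import load_dataset

# Stream a few records and count a word's occurrences; "complete_text" is an assumed column name.
ds = load_dataset("PleIAs/French-PD-Newspapers", split="train", streaming=True)
count = 0
for i, row in enumerate(ds):
    text = row.get("complete_text") or row.get("text") or ""
    count += text.lower().count("république")
    if i == 99:
        break
print(f"'république' occurrences in the first 100 records: {count}")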
reacted to merve's post with ❤️ 10 months ago
Sharing a super-fast segmentation model today 💨
SlimSAM is a pruned-distilled version of the SAM model; it's (up to 8.6x) faster and smaller, yet very powerful! ⚡️
It has the same architecture as SAM, meaning you can use the 🤗 transformers code for SAM on SlimSAM models ⬇️ (yes only 3 lines of code!)
from transformers import pipeline
# `image` is a PIL image (or an image path/URL) loaded beforehand
generator = pipeline(model="nielsr/slimsam-50-uniform", task="mask-generation")
outputs = generator(image)

Lastly, I have built an app for you to compare SlimSAM and SAM outputs
merve/slimsam