We are thrilled to introduce CroissantLLM, a small but capable 1.3 billion parameter language model trained on 3T tokens that is fully open and truly bilingual! The goal is to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. Our approach is rooted in transparency: along with the model and various checkpoints, we release new high-quality French datasets sourced from legal, administrative, cultural, business, scientific and translation data, as well as FrenchBench, a novel evaluation benchmark to assess LLM performance in French!
Most recent models have been trained on predominantly English corpora, leading to performance drops in other languages and to English-centered cultural bias. With CroissantLLM, we aim to train a model in which English is not the dominant language, so we use a 1:1 ratio of English and French data!
One of the challenges was to gather sufficient amounts of high-quality data in French. We collected, filtered and cleaned data from many varied sources to target various domains (legal, administrative, cultural, scientific, etc.) and cover different text modalities (speech transcriptions, movie subtitles, encyclopedias, forums, webpages). All data collected is explicitly listed in the technical report, falls under permissive licenses, and is shared with the rest of the project artefacts.
In total, we collected more than 303 billion tokens of monolingual French data (1.3 terabytes), as well as 36 billion tokens of high-quality French-English translation data, and aggregated them with English and code data! We crafted our final 3 trillion token dataset so that it contains equal amounts of French and English data after upsampling.
For reference, training an LLM on 3 trillion tokens is huge! It is more than the number of tokens seen during training by the Llama2 models, and almost 10 times as many as for the Bloom models, making CroissantLLM the model that has been trained on the most French data to this day!
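To make the idea of upsampling concrete, here is a minimal sketch of how an upsampling factor could be derived from a token budget. Only the 3 trillion token total and the 303 billion French tokens come from the text above; the share reserved for code and translation data (`OTHER_SHARE`) is a hypothetical placeholder, and the helper names are ours, not the project's. The actual data mix is described in the technical report.

```python
def balanced_budgets(total_tokens: float, other_share: float) -> tuple[float, float]:
    """Split what is left after 'other' data equally between two languages."""
    if not 0 <= other_share < 1:
        raise ValueError("other_share must be in [0, 1)")
    per_language = total_tokens * (1 - other_share) / 2
    return per_language, per_language


def upsampling_factor(raw_tokens: float, target_tokens: float) -> float:
    """Average number of passes over a corpus needed to fill its token budget."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return target_tokens / raw_tokens


TOTAL_TOKENS = 3e12        # from the text: final dataset size
FRENCH_RAW_TOKENS = 303e9  # from the text: monolingual French data
OTHER_SHARE = 0.10         # HYPOTHETICAL: share given to code + translation data

french_budget, english_budget = balanced_budgets(TOTAL_TOKENS, OTHER_SHARE)
french_factor = upsampling_factor(FRENCH_RAW_TOKENS, french_budget)

print(f"French budget: {french_budget / 1e12:.2f}T tokens")
print(f"French upsampling factor: {french_factor:.2f}x")
```

Whatever the exact shares, the point is the same: with a few hundred billion raw French tokens and a budget above a trillion, each French document is seen several times on average during training.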
CroissantLLM is a 1.3 billion parameter model with a Llama model architecture. We selected this model size after realizing that one of the largest bottlenecks to widespread model adoption is the difficulty of getting models to run quickly on consumer-grade hardware. In fact, looking at HuggingFace downloads, the most downloaded models are not the best performing (Llama2-70B, Mixtral 8x7B) but rather the smaller ones (Llama2-7B, Mistral 7B), which are easier and cheaper to serve and finetune.
With its 1.3B model size, CroissantLLM is able to run extremely quickly on lower-end GPU servers, enabling high throughput and low latency, but can also run on CPUs or even mobile devices at decent speeds!
The tradeoff is obviously that CroissantLLM is not going to display the same generalist capabilities in reasoning, math and coding that larger models have, but it will be perfect for more specific industrial applications, translation or even Chat use cases in which the big guns are not always needed!
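As a rough back-of-the-envelope illustration of why this size fits small devices, the sketch below estimates the memory needed just to hold the weights at different precisions. It assumes the nominal 1.3 billion parameter count and idealized bytes-per-parameter figures; real quantized formats add some overhead, and activations and the KV cache are not counted.

```python
# Idealized storage cost per parameter, in bytes (ASSUMPTION: no format overhead).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate memory (in GB, 1 GB = 1e9 bytes) to store the weights only."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


N_PARAMS = 1.3e9  # nominal parameter count from the text

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(N_PARAMS, dtype):.2f} GB")
```

In half precision the weights come to about 2.6 GB, which is why the model is within reach of laptops and recent phones, whereas a 7B model needs roughly five times as much.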
To assess the model's performance beyond English, we introduce FrenchBench, a novel benchmark encompassing various classification and generation tasks in French. FrenchBench Gen includes tasks like title generation, summarization, question generation, and question answering, relying on the high-quality French Question Answering dataset, FQuaD. The Multiple Choice section of FrenchBench focuses on reasoning, factual knowledge, and linguistic capabilities.
CroissantLLM is the best performing model of its size in French, edging out models up to 3 times bigger, such as Bloom 3B, on most tasks.
We also assess the model on English benchmarks, where it matches or surpasses the best models of its size!
So far, we have talked about the base model only! However, it is now understood that base models are only the foundations of most modern LLM systems, and that to extract the best performance, it is important to run a second phase of training called supervised fine-tuning! We finetune CroissantLLM on Chat data, including some ChatGPT interactions, and assess CroissantLLMChat's capabilities on various tasks in French and English such as MT-Bench, translation and French Trivia.
MT-Bench aims at assessing the capabilities of LLMs on eight domains. CroissantLLMChat exhibits good performance on French-understanding tasks like Writing and Roleplay, surpassing models of the same size. It also shows good general knowledge in STEM and humanities.
One question this work attempts to tackle is whether training on bilingual data goes beyond augmenting the language understanding and writing capabilities of a model in another language, and also equips the model with novel knowledge and different cultural biases. We evaluate French cultural knowledge on a Trivia task, consisting of questions about France-related topics, asked in English. The results on FrenchTrivia show that pre-training on a very large corpus induces significantly higher knowledge capabilities.
The benefits of training on French and English data in a 1:1 ratio, and on parallel data, can also be seen on translation tasks. In fact, CroissantLLM outperforms larger models like Llama and Mistral 7B in few-shot settings, and is on par with the state-of-the-art specialized translation model of the same size, NLLB 1.3B, while retaining its generalist Chat capabilities.
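For readers unfamiliar with few-shot translation, the idea is to show the model a handful of translated sentence pairs before the sentence to translate. The sketch below builds such a prompt; the exact template used in our evaluations is not given here, so the `English:` / `French:` layout and the `build_few_shot_prompt` helper are purely illustrative assumptions.

```python
def build_few_shot_prompt(
    examples: list[tuple[str, str]],
    source_sentence: str,
    src_lang: str = "English",
    tgt_lang: str = "French",
) -> str:
    """Build a few-shot translation prompt (HYPOTHETICAL template)."""
    lines = []
    for src, tgt in examples:
        # Each demonstration is a source line followed by its translation.
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
        lines.append("")
    # The final pair is left open so the model completes the translation.
    lines.append(f"{src_lang}: {source_sentence}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


demonstrations = [
    ("Good morning.", "Bonjour."),
    ("Thank you very much.", "Merci beaucoup."),
]
prompt = build_few_shot_prompt(demonstrations, "Where is the train station?")
print(prompt)
```

The resulting string is passed to the base model as-is, and the text it generates after the last `French:` is taken as the translation.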
State-of-the-art models, both proprietary and open-weight, are often designed and trained by heavily investor-backed companies that aim to retain a moat by keeping their training data mix and strategy secret, hindering the rest of the field's ability to fully study and understand these models.
Additionally, there are ongoing debates about who actually owns the data used to train these language models, with legal implications becoming more prominent. Recent political discussions, such as the EU AI Act and US Senate hearings, highlight the growing need for transparency in AI development to ensure legal compliance and build trust with users.
The CroissantLLM initiative was designed from the start with transparency in mind. We validate 81% of the transparency criteria of the FMTI framework, far beyond the scores of even most open initiatives, by releasing the data, models, training procedure and all the code used to curate the data and train the model.
More than just a high-performing model, CroissantLLM and the associated artefacts also aim to support further research on multilingual language models, on the impact of pretraining data on internal knowledge, and on the dynamics of models trained well past the Chinchilla optimal threshold. We expect this to lead to further publications on model memorization and the split capacity of bilingual language models.
The models, datasets, training code and evaluation benchmarks are fully open-sourced.
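To give a sense of how far past the Chinchilla threshold this training run goes, here is a quick calculation. It assumes the commonly quoted approximation of about 20 training tokens per parameter for compute-optimal training; that figure is a rule of thumb rather than an exact constant, and the helper name is ours.

```python
# ASSUMPTION: rough "Chinchilla" rule of thumb, about 20 tokens per parameter.
TOKENS_PER_PARAM_RULE = 20.0


def compute_optimal_tokens(
    n_params: float, tokens_per_param: float = TOKENS_PER_PARAM_RULE
) -> float:
    """Approximate compute-optimal number of training tokens for a model size."""
    return n_params * tokens_per_param


N_PARAMS = 1.3e9     # from the text
TRAIN_TOKENS = 3e12  # from the text

optimal_tokens = compute_optimal_tokens(N_PARAMS)
overshoot = TRAIN_TOKENS / optimal_tokens

print(f"Compute-optimal budget: {optimal_tokens / 1e9:.0f}B tokens")
print(f"Actual budget is about {overshoot:.0f}x larger")
```

Under this assumption the compute-optimal budget would be roughly 26 billion tokens, so 3 trillion tokens is more than a hundred times that amount. The extra training compute is traded for a model that is small and cheap at inference time.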
This work is a collaboration of academic and industrial partners. On the academic side, core authors are affiliated with CentraleSupélec (Université Paris Saclay) and Instituto Superior Técnico de Lisboa, and other contributors are linked to Sorbonne Université and Imperial College London. On the industrial side, core authors receive funding from Illuin Technology (Paris), Unbabel (Lisboa) and Equall (New York, Lisboa, Paris). Training compute was mainly obtained on the Jean Zay supercomputer operated by GENCI IDRIS through compute grant 2023-AD011014668R1.