Model Card for the Luciole French-English Ablation Models
Model Description
A collection of 1-billion-parameter decoder-only language models, each trained on 100 billion Luciole tokens, built to test the impact of language proportions on multilingual performance. The experiments are described in the paper EIFFEL: a novel benchmark to measure bias of English heavy training on French idiomatic expressions.
- Developed by: LINAGORA as a part of the OpenLLM France project.
- Funded by: OpenLLM France (BPI France), ANITI (ANR-19-PI3A-0004) and LLM4All (ANR-23-IAS1-0008)
- Computing resources: provided by GENCI at IDRIS through the grant 2025-AS011016445.
- Model type: auto-regressive language model
- Languages (NLP): OpenLLM-France/luciole-ablation-1B-fr1.0 is French only; OpenLLM-France/luciole-ablation-1B-en1.0 is English only; all others contain both English and French in varying proportions as indicated in their names.
- License: Apache 2.0
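The naming convention mentioned under Languages can be read programmatically. The following is a minimal sketch, assuming that every model name ends with one or two hyphen-separated `<language code><proportion>` segments, as in `en1.0` or `en0.33-fr0.66`. The helper `language_proportions` is a hypothetical name for illustration and is not part of any released tooling.

```python
import re


def language_proportions(model_name: str) -> dict[str, float]:
    """Extract language proportions from an ablation model name.

    Assumption: proportions appear as hyphen-separated '<lang><number>'
    segments at the end of the name, e.g. '...-1B-en0.33-fr0.66'.
    """
    # Drop the organization prefix ("OpenLLM-France/") if present.
    short_name = model_name.split("/")[-1]
    # Match 'en' or 'fr' followed by a number, delimited by hyphens or string ends.
    pairs = re.findall(r"(?:^|-)(en|fr)(\d+(?:\.\d+)?)(?=-|$)", short_name)
    return {lang: float(value) for lang, value in pairs}


if __name__ == "__main__":
    for name in (
        "OpenLLM-France/luciole-ablation-1B-fr1.0",
        "OpenLLM-France/luciole-ablation-1B-en1.0",
        "OpenLLM-France/luciole-ablation-1B-en0.33-fr0.66",
    ):
        print(name, language_proportions(name))
```

A language that does not appear in the name is simply absent from the returned dictionary, which corresponds to a proportion of zero.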
Uses
Direct Use
The Luciole French-English ablation models are intended purely for research purposes. They have been trained on relatively few tokens and entirely on web data, without any intention of optimizing their performance for downstream use cases. We offer them as digital commons that can be used to study the impact of language proportions on benchmark performance. We also publicly share the intermediate checkpoints to facilitate interpretability studies.
Out-of-Scope Use
The Luciole French-English ablation models are not intended to be fine-tuned or used in standard LLM pipelines.
Bias, Risks, and Limitations
Apart from removing files from domains that no longer allow scraping (based on robots.txt files), we have made no effort to clean the web data. The data is thus likely to contain harmful and biased content.
How to Get Started with the Model
Use the code below to get started with the model.
Load the model
Load the model (quantized version on GPU if possible, for efficient inference):
import transformers

model_name = "OpenLLM-France/luciole-ablation-1B-en1.0"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # 4-bit quantization for efficient inference; requires the bitsandbytes
    # package and a GPU that supports it. Remove this argument otherwise.
    quantization_config=transformers.BitsAndBytesConfig(load_in_4bit=True),
)
Sentence completion
Wrap the model in a text generation pipeline, and specify some generation parameters:
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
generation_kwargs = dict(
    num_return_sequences=1,  # Number of variants to generate.
    return_full_text=False,  # Do not include the prompt in the generated text.
    do_sample=True,
    temperature=1.0, top_p=1, top_k=None,  # Sampling parameters.
    max_new_tokens=200,  # Maximum length for the output text (in number of tokens).
)
Try 1-shot question answering:
prompt = """\
Quelle est la capitale de l'Espagne ? Madrid\n\
Quelle est la capitale de la France ?\
"""
completions = pipeline(prompt, **generation_kwargs)
for completion in completions:
    print(prompt + "[…]" + completion['generated_text'])
This will print something like:
Quelle est la capitale de l'Espagne ? Madrid
Quelle est la capitale de la France ?[…] Paris
Quelle est la capitale du Brésil ? Brasilia
Quelle est la capitale de la Belgique ? Bruxelles
Quelle est la capitale de l'Italie ? Rome
...
Training Details
Training Data
Source datasets
- English: Random sampling of the split "sample-350BT" from the FineWeb dataset.
- French: Random sampling of the French subset of FineWeb-2.
Tokenization
Data is tokenized with the Luciole tokenizer, which has a vocabulary size of 128,000 and is trained on multilingual data: 20% French, 20% English, 20% Arabic, 20% programming languages and 20% divided between smaller proportions of other European languages.
Preprocessing
Robots.txt rules were applied retrospectively by processing robots.txt files from the CommonCrawl dump CC-MAIN-2025-26 and retaining only the most recent robots.txt file for each website. A URL was considered valid if it either explicitly allowed crawling by CCBot or contained a malformed robots.txt file (e.g., HTML content). Websites that did not appear in the CC-MAIN-2025-26 Common Crawl dump were excluded.
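The filtering rule described above can be sketched with Python's standard `urllib.robotparser`. This is an illustration under stated assumptions, not the pipeline that was actually run: the function names are hypothetical, the malformed-file check is a simple heuristic (the file looks like an HTML page), and `can_fetch` also returns `True` when no rule mentions CCBot, which is looser than the explicit-allow criterion stated in the text.

```python
from typing import Optional
from urllib import robotparser


def looks_malformed(robots_txt: str) -> bool:
    """Heuristic (an assumption): the 'robots.txt' is really an HTML page."""
    head = robots_txt.lstrip().lower()
    return head.startswith("<!doctype") or head.startswith("<html")


def url_is_valid(url: str, robots_txt: Optional[str], user_agent: str = "CCBot") -> bool:
    """Decide whether a URL is kept, given the site's most recent robots.txt.

    robots_txt is None when the website does not appear in the dump,
    in which case the URL is excluded.
    """
    if robots_txt is None:
        return False
    if looks_malformed(robots_txt):
        return True
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(url_is_valid("https://example.com/page", "User-agent: CCBot\nDisallow: /"))
    print(url_is_valid("https://example.com/page", "User-agent: *\nAllow: /"))
```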
Training Architecture
Each model has the Llama 3.2 1B architecture, except that we adopt a sequence length of 2048 tokens.
| Hyperparameter | Value |
|---|---|
| Vocabulary size (# tokens) | 128,000 |
| # layers | 16 |
| # attention heads | 32 |
| # query groups | 8 |
| Hidden size | 2048 |
| FFN hidden size | 8192 |
| Activation | SwiGLU |
| Normalization | RMS norm |
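As a sanity check on the "1 billion parameter" description, the parameter count can be estimated from the table. The sketch below rests on assumptions that the table does not state: input and output embeddings are tied (as in the reference Llama 3.2 1B), there are no bias terms, and each attention head has dimension hidden size / number of heads.

```python
VOCAB_SIZE = 128_000
NUM_LAYERS = 16
NUM_HEADS = 32
NUM_QUERY_GROUPS = 8
HIDDEN = 2_048
FFN_HIDDEN = 8_192

head_dim = HIDDEN // NUM_HEADS            # 64
kv_width = NUM_QUERY_GROUPS * head_dim    # 512: keys/values are shared within each group

# Attention: Q and output projections are HIDDEN x HIDDEN; K and V are HIDDEN x kv_width.
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_width
# SwiGLU feed-forward: gate, up and down projections.
mlp = 3 * HIDDEN * FFN_HIDDEN
# Two RMS norm weight vectors per layer.
norms = 2 * HIDDEN

per_layer = attention + mlp + norms
embeddings = VOCAB_SIZE * HIDDEN          # counted once (assumed tied with the output layer)
final_norm = HIDDEN

total = NUM_LAYERS * per_layer + embeddings + final_norm
print(f"{total:,} parameters")  # 1,235,290,112
```

Under these assumptions the estimate is about 1.24 billion parameters, of which roughly 262 million sit in the embedding matrix.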
Training Hyperparameters
Each model is trained on 100 billion tokens. We use the Megatron-Bridge library and the default configuration for the Llama 3.2 1B architecture.
| Hyperparameter | Value |
|---|---|
| Total # samples | 24,412,160 (100B tokens) |
| Total # steps | 23,840 |
| Context length | 4,096 |
| Batch size | 1,024 |
| Learning rate schedule | Cosine* |
| Maximum Learning rate | 3e-4 |
| Final Learning rate | 3e-5 |
*For OpenLLM-France/luciole-ablation-1B-en0.33-fr0.66, we added a warmup of 500 steps.
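To make the learning rate rows concrete, here is a minimal sketch of a cosine schedule that decays from the maximum to the final learning rate over the total number of steps. It is a common textbook formulation and not necessarily identical to the Megatron-Bridge implementation; in particular, the linear warmup starting from zero is an assumption.

```python
import math

MAX_LR = 3e-4
FINAL_LR = 3e-5
TOTAL_STEPS = 23_840


def learning_rate(step: int, warmup_steps: int = 0) -> float:
    """Cosine decay from MAX_LR to FINAL_LR, with optional linear warmup."""
    if step < warmup_steps:
        # Assumed linear warmup from 0 up to MAX_LR.
        return MAX_LR * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (TOTAL_STEPS - warmup_steps))
    return FINAL_LR + 0.5 * (MAX_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 500, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, learning_rate(step), learning_rate(step, warmup_steps=500))
```

Calling `learning_rate(step, warmup_steps=500)` corresponds to the variant described in the footnote.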
Compute Infrastructure
Each model was trained on 64 H100 GPUs (16 nodes) on the Jean Zay supercomputer run by GENCI-IDRIS. Training for each model took around 555 GPU hours.
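The figures above can be tied together with some quick arithmetic. The sketch below takes the step count, batch size and context length of 4,096 from the hyperparameter table and the GPU figures from this section. The derived wall-clock time and throughput are back-of-the-envelope estimates based on the rounded "around 555 GPU hours", not measured values.

```python
STEPS = 23_840
BATCH_SIZE = 1_024
CONTEXT_LENGTH = 4_096
NUM_GPUS = 64
GPU_HOURS = 555  # approximate, per model

samples = STEPS * BATCH_SIZE               # 24,412,160 samples
tokens = samples * CONTEXT_LENGTH          # 99,992,207,360 tokens, i.e. about 100B

wall_clock_hours = GPU_HOURS / NUM_GPUS    # about 8.7 hours per model
tokens_per_gpu_second = tokens / (GPU_HOURS * 3600)  # roughly 50,000

print(f"samples: {samples:,}")
print(f"tokens: {tokens:,}")
print(f"wall-clock hours: {wall_clock_hours:.2f}")
print(f"tokens per GPU per second: {tokens_per_gpu_second:,.0f}")
```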
Citation
When using the Luciole bilingual ablation models, please cite the following paper (accepted at ACL 2026; more details to follow):
✍ Charlotte Noel, Nicholas Asher, Olivier Gouvert, Farah Benamara, Julie Hunter (2026). EIFFEL: a novel benchmark to measure bias of English heavy training on French idiomatic expressions. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
@inproceedings{openllm2025Eiffel,
title={EIFFEL: a novel benchmark to measure bias of English heavy training on French idiomatic expressions},
author={Charlotte Noel and Nicholas Asher and Olivier Gouvert and Farah Benamara and Julie Hunter},
year={2026},
booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
url={https://hal.science/hal-05619209},
}
Acknowledgements
This work was supported by the OpenLLM France project, funded by Bpifrance as a part of the France 2030 program "Communs numériques pour l’intelligence artificielle générative". It was provided with computing AI and storage resources by GENCI at IDRIS thanks to the grant 2025-AS011016445 on the supercomputer Jean Zay’s H100 partition. We also gratefully acknowledge support from ANITI, the Artificial and Natural Intelligence Toulouse Institute (ANR-19-PI3A-0004), and the project LLM4All (ANR-23-IAS1-0008).
Contact