Model Card for the Luciole French-English Ablation Models

Model Description

A collection of 1-billion-parameter, decoder-only language models, each trained on 100 billion Luciole tokens. The models are designed to test the impact of language proportions on multilingual performance, as described in the paper "EIFFEL: a novel benchmark to measure bias of English heavy training on French idiomatic expressions".

Uses

Direct Use

The Luciole French-English ablation models are intended purely for research purposes. They have been trained on relatively few tokens and entirely on web data, with no attempt to optimize their performance for downstream use cases. We offer them as digital commons that can be used to study the impact of language proportions on benchmark performance. We also publicly share the intermediate checkpoints to facilitate interpretability studies.

Out-of-Scope Use

The Luciole French-English ablation models are not intended to be fine-tuned or used in standard LLM pipelines.

Bias, Risks, and Limitations

Apart from removing files from domains that no longer allow scraping (based on robots.txt files), we have made no effort to clean the web data. The data is thus likely to contain harmful and biased content.

How to Get Started with the Model

Use the code below to get started with the model.

Load the model

Load the model (quantized version on GPU if possible, for efficient inference):

import transformers

model_name = "OpenLLM-France/luciole-ablation-1B-en1.0"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # 4-bit quantization for efficient inference.
    # Requires the bitsandbytes package and a GPU that supports it.
    quantization_config=transformers.BitsAndBytesConfig(load_in_4bit=True),
)

Sentence completion

Wrap the model in a text generation pipeline, and specify some generation parameters:

pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)

generation_kwargs = dict(
    num_return_sequences=1,               # Number of variants to generate.
    return_full_text=False,               # Do not include the prompt in the generated text.
    do_sample=True,
    temperature=1.0, top_p=1, top_k=None, # Sampling parameters.
    max_new_tokens=200,                   # Maximum length for the output text (in number of tokens).
)

Try 1-shot question answering:

prompt = """\
Quelle est la capitale de l'Espagne ? Madrid\n\
Quelle est la capitale de la France ?\
"""
completions = pipeline(prompt, **generation_kwargs)
for completion in completions:
    print(prompt + "[…]" + completion['generated_text'])

This will print something like:

Quelle est la capitale de l'Espagne ? Madrid
Quelle est la capitale de la France ?[…] Paris
Quelle est la capitale du Brésil ? Brasilia
Quelle est la capitale de la Belgique ? Bruxelles
Quelle est la capitale de l'Italie ? Rome
...

Training Details

Training Data

Source datasets

  • English: Random sampling of the split "sample-350BT" from the FineWeb dataset.
  • French: Random sampling of the French subset of FineWeb-2.
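To relate these two sources to the language proportions being ablated, here is a minimal sketch that reads proportions from a model name and splits the 100 billion token budget accordingly. It assumes that suffixes such as `en1.0` or `en0.33-fr0.66` in the model names encode the English and French shares, and that shares which do not sum exactly to 1 should be normalized. Both points are assumptions, and the function names are hypothetical.

```python
import re

TOTAL_TOKENS = 100_000_000_000  # training budget stated in this model card


def parse_language_proportions(model_name: str) -> dict[str, float]:
    """Extract pairs such as {'en': 0.33, 'fr': 0.66} from a model name.

    Assumption: each '-xxN.NN' suffix is a two-letter language code
    followed by its share of the training data.
    """
    suffix = model_name.rsplit("/", 1)[-1]
    pairs = re.findall(r"-([a-z]{2})(\d+(?:\.\d+)?)(?=-|$)", suffix)
    return {lang: float(value) for lang, value in pairs}


def tokens_per_language(model_name: str, total_tokens: int = TOTAL_TOKENS) -> dict[str, int]:
    """Split the token budget across languages, normalizing the shares."""
    proportions = parse_language_proportions(model_name)
    total_share = sum(proportions.values())
    if total_share <= 0:
        raise ValueError(f"No language proportions found in {model_name!r}")
    return {
        lang: round(total_tokens * share / total_share)
        for lang, share in proportions.items()
    }


if __name__ == "__main__":
    for name in (
        "OpenLLM-France/luciole-ablation-1B-en1.0",
        "OpenLLM-France/luciole-ablation-1B-en0.33-fr0.66",
    ):
        print(name, tokens_per_language(name))
```

Normalizing means that `en0.33-fr0.66` is read as one third English and two thirds French; if the actual data mix used the raw values, the per-language counts would differ slightly.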

Tokenization

Data is tokenized with the Luciole tokenizer, which has a vocabulary size of 128,000 and was trained on multilingual data: 20% French, 20% English, 20% Arabic, 20% programming languages, and the remaining 20% divided among other European languages in smaller proportions.

Preprocessing

Robots.txt rules were applied retrospectively by processing robots.txt files from the Common Crawl dump CC-MAIN-2025-26 and retaining only the most recent robots.txt file for each website. A URL was considered valid if its website's robots.txt file either explicitly allowed crawling by CCBot or was malformed (e.g., contained HTML content). Websites that did not appear in the CC-MAIN-2025-26 dump were excluded.
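The validity rule can be sketched with Python's standard `urllib.robotparser`. This is an illustration under stated assumptions, not the pipeline that was actually used: the function names and the HTML heuristic are hypothetical, and `can_fetch` also returns `True` when no rule mentions the crawler at all, which may be more permissive than "explicitly allowed".

```python
from urllib.robotparser import RobotFileParser

CRAWLER = "CCBot"  # Common Crawl's user agent


def looks_malformed(robots_txt: str) -> bool:
    """Hypothetical heuristic: HTML served in place of robots.txt is malformed."""
    head = robots_txt.lstrip().lower()
    return head.startswith("<!doctype") or head.startswith("<html")


def url_is_valid(url: str, robots_txt: str) -> bool:
    """Return True if the URL may be kept under the rule described above."""
    if looks_malformed(robots_txt):
        # Malformed files are treated as permitting the crawl.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(CRAWLER, url)


if __name__ == "__main__":
    allow_all = "User-agent: *\nAllow: /\n"
    block_ccbot = "User-agent: CCBot\nDisallow: /\n"
    html_page = "<!DOCTYPE html><html><body>Not found</body></html>"
    for label, content in [("allow", allow_all), ("block", block_ccbot), ("html", html_page)]:
        print(label, url_is_valid("https://example.com/page", content))
```

Websites missing from the dump have no robots.txt content to pass in, so they would need to be excluded before this check is reached.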

Training Architecture

Each model is trained on 100 billion tokens and has the Llama 3.2 1B architecture. We use the Megatron-Bridge library and its default configuration for that architecture, except that we adopt a sequence length of 2048 tokens.

| Hyperparameter | Value |
| --- | --- |
| Vocabulary size (# tokens) | 128,000 |
| # layers | 16 |
| # attention heads | 32 |
| # query groups | 8 |
| Hidden size | 2048 |
| FFN hidden size | 8192 |
| Activation | SwiGLU |
| Normalization | RMS norm |

Training Hyperparameters

Each model is trained on 100 billion tokens with the following hyperparameters.

| Hyperparameter | Value |
| --- | --- |
| Total # samples | 24,412,160 (100B tokens) |
| Total # steps | 23,840 |
| Context length | 4,096 |
| Batch size | 1,024 |
| Learning rate schedule | Cosine* |
| Maximum learning rate | 3e-4 |
| Final learning rate | 3e-5 |

*For OpenLLM-France/luciole-ablation-1B-en0.33-fr0.66, we added a warmup of 500 steps.
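As a sanity check on these values, the short sketch below recomputes the sample and token totals from the step count, batch size, and context length listed in the table, and evaluates a cosine learning rate schedule between the maximum and final rates. The schedule shape is an assumption: a plain cosine decay over all steps, with an optional linear warmup. The scheduler in Megatron-Bridge may differ in its details, and the function name is hypothetical.

```python
import math

TOTAL_STEPS = 23_840
BATCH_SIZE = 1_024
CONTEXT_LENGTH = 4_096
MAX_LR = 3e-4
FINAL_LR = 3e-5

# Samples = steps x batch size; tokens = samples x context length.
total_samples = TOTAL_STEPS * BATCH_SIZE
total_tokens = total_samples * CONTEXT_LENGTH


def learning_rate(step: int, warmup_steps: int = 0) -> float:
    """Assumed schedule: linear warmup to MAX_LR, then cosine decay to FINAL_LR."""
    if warmup_steps > 0 and step < warmup_steps:
        return MAX_LR * (step + 1) / warmup_steps
    decay_steps = max(1, TOTAL_STEPS - warmup_steps)
    progress = min(max((step - warmup_steps) / decay_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (MAX_LR - FINAL_LR) * cosine


if __name__ == "__main__":
    print(f"samples: {total_samples:,}")
    print(f"tokens:  {total_tokens:,}")
    for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, f"{learning_rate(step):.2e}")
```

The product of the three table values comes out just under 100 billion tokens, which matches the "100B tokens" annotation on the sample count.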

Compute Infrastructure

Each model was trained on 64 H100 GPUs (16 nodes) on the Jean Zay supercomputer run by GENCI-IDRIS. Training for each model took around 555 GPU hours.
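For a rough sense of scale, the figures above can be turned into a wall-clock time and a per-GPU throughput. This is a back-of-the-envelope calculation that assumes all 64 GPUs were busy for the whole run and that the "around 555 GPU hours" figure covers the full 100 billion tokens; it is not a measured benchmark.

```python
GPUS = 64
NODES = 16
GPU_HOURS = 555                  # approximate figure from the model card
TOTAL_TOKENS = 100_000_000_000   # training budget per model

gpus_per_node = GPUS // NODES
wall_clock_hours = GPU_HOURS / GPUS
tokens_per_gpu_second = TOTAL_TOKENS / (GPU_HOURS * 3600)

if __name__ == "__main__":
    print(f"GPUs per node:         {gpus_per_node}")
    print(f"Wall-clock time:       {wall_clock_hours:.1f} hours")
    print(f"Throughput per GPU:    {tokens_per_gpu_second:,.0f} tokens/s")
```

Under these assumptions, each model takes a little under nine hours of wall-clock time on the full allocation.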

Citation

When using the Luciole bilingual ablation models, please cite the following paper (accepted at ACL 2026; more details to follow):

✍ Charlotte Noel, Nicholas Asher, Olivier Gouvert, Farah Benamara, Julie Hunter (2026). EIFFEL: a novel benchmark to measure bias of English heavy training on French idiomatic expressions. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

@inproceedings{openllm2025Eiffel,
      title={EIFFEL: a novel benchmark to measure bias of English heavy training on French idiomatic expressions}, 
      author={Charlotte Noel and Nicholas Asher and Olivier Gouvert and Farah Benamara and Julie Hunter},
      year={2026},
      booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
      url={https://hal.science/hal-05619209}, 
}

Acknowledgements

This work was supported by the OpenLLM France project, funded by Bpifrance as a part of the France 2030 program "Communs numériques pour l’intelligence artificielle générative". It was provided with computing AI and storage resources by GENCI at IDRIS thanks to the grant 2025-AS011016445 on the supercomputer Jean Zay’s H100 partition. We also gratefully acknowledge support from ANITI, the Artificial and Natural Intelligence Toulouse Institute (ANR-19-PI3A-0004), and the project LLM4All (ANR-23-IAS1-0008).

Contact

contact@openllm-france.fr
