---
license: mit
---

# CamemBERTa: A French language model based on DeBERTa V3

CamemBERTa is a French language model based on DeBERTa V3, which is DeBERTa V2 with ELECTRA-style pretraining using the Replaced Token Detection (RTD) objective.
RTD uses a generator model, trained with the MLM objective, to replace masked tokens with plausible candidates, and a discriminator model trained to detect which tokens were replaced by the generator.
Usually the generator and discriminator share the same embedding matrix, but the authors of DeBERTa V3 propose a new technique, called gradient-disentangled embedding sharing (GDES), to disentangle the gradients of the shared embedding between the generator and discriminator.
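
To make the RTD objective concrete, here is a minimal pure-Python sketch of how one discriminator training example could be built. It is an illustration only, not the CamemBERTa pre-training code: `make_rtd_example`, `generator_sample`, the `[MASK]` string, and the 15% masking rate are assumptions chosen for the example, and the "generator" is a toy function rather than a trained MLM.

```python
import random

def make_rtd_example(tokens, generator_sample, mask_prob=0.15, seed=0):
    """Build one RTD example (hypothetical helper, for illustration).

    tokens: list of original token strings.
    generator_sample: callable (masked_tokens, position) -> proposed token;
        a stand-in for sampling from the MLM generator.
    Returns (corrupted_tokens, labels), where labels[i] is 1 if the token at
    position i differs from the original and 0 otherwise.
    """
    rng = random.Random(seed)
    # 1. Choose the positions to mask.
    masked_positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    masked = list(tokens)
    for i in masked_positions:
        masked[i] = "[MASK]"
    # 2. Let the generator fill each masked position with a candidate.
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = generator_sample(masked, i)
    # 3. Discriminator targets: a position counts as "replaced" only if the
    #    generator's token differs from the original one.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels

# Toy usage: a "generator" that always proposes the same word.
sentence = ["le", "chat", "dort", "sur", "le", "canapé"]
corrupted, labels = make_rtd_example(sentence, lambda masked, i: "chien", mask_prob=0.5)
print(corrupted, labels)
```

Note that a position where the generator happens to reproduce the original token is labelled as not replaced, so the discriminator is only asked to flag tokens that actually changed.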

*This is the first publicly available implementation of DeBERTa V3, and the first publicly available DeBERTa V3 model outside of the original Microsoft release.*

Preprint Paper: https://inria.hal.science/hal-03963729/

Pre-training Code: https://gitlab.inria.fr/almanach/CamemBERTa

## How to use CamemBERTa
Our pretrained weights are available on the HuggingFace model hub. You can load them using the following code:

```python
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

CamemBERTa = AutoModel.from_pretrained("almanach/camemberta-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/camemberta-base")

CamemBERTa_gen = AutoModelForMaskedLM.from_pretrained("almanach/camemberta-base-generator")
tokenizer_gen = AutoTokenizer.from_pretrained("almanach/camemberta-base-generator")
```

We also provide TF2 weights, which include the RTD head for the discriminator and the MLM head for the generator.
CamemBERTa is compatible with most fine-tuning scripts from the transformers library.

## Pretraining Setup

The model was trained on the French subset of the CCNet corpus (the same subset used in CamemBERT and PAGnol). The resulting models are available on the HuggingFace model hub: CamemBERTa and CamemBERTa Generator.
To speed up the pre-training experiments, pre-training was split into two phases.
In phase 1, the model is trained with a maximum sequence length of 128 tokens for 10,000 steps with 2,000 warm-up steps and a very large batch size of 67,584.
In phase 2, the maximum sequence length is increased to the full model capacity of 512 tokens for 3,300 steps with 200 warm-up steps and a batch size of 27,648.
In total, the model sees about 133B tokens, compared to 419B tokens for CamemBERT-CCNet, which was trained for 100K steps. This represents roughly 30% of CamemBERT’s full training.
For a fair comparison, we trained a RoBERTa model, CamemBERT (30%), using exactly the same pretraining setup but with the MLM objective.
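
As a sanity check on the token budget, the arithmetic behind the two-phase schedule above can be sketched as follows. This assumes that the batch sizes are counted in sequences and that every sequence is filled to the maximum length, so the result is an upper bound. The names `PHASES` and `tokens_seen` are illustrative only.

```python
# (max sequence length, training steps, batch size in sequences),
# taken from the phase 1 and phase 2 descriptions above.
PHASES = {
    "phase 1": (128, 10_000, 67_584),
    "phase 2": (512, 3_300, 27_648),
}
CAMEMBERT_CCNET_TOKENS = 419e9  # tokens seen by CamemBERT-CCNet, per the text

def tokens_seen(seq_len, steps, batch_size):
    # Assumption: every sequence is packed to seq_len tokens.
    return seq_len * steps * batch_size

total = sum(tokens_seen(*phase) for phase in PHASES.values())
ratio = total / CAMEMBERT_CCNET_TOKENS

print(f"{total / 1e9:.1f}B tokens, {ratio:.1%} of CamemBERT-CCNet")
# prints "133.2B tokens, 31.8% of CamemBERT-CCNet"
```

Under these assumptions the total matches the 133B figure quoted above, and the ratio is consistent with the "roughly 30%" estimate.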

## Pretraining Loss Curves
Check the TensorBoard logs and plots.

## Fine-tuning results

Datasets: POS tagging and Dependency Parsing (GSD, Rhapsodie, Sequoia, FSMB), NER (FTB), the FLUE benchmark (XNLI, CLS, PAWS-X), and the French Question Answering Dataset (FQuAD)

| Model             | UPOS      | LAS       | NER       | CLS       | PAWS-X    | XNLI      | F1 (FQuAD) | EM (FQuAD) |
|-------------------|-----------|-----------|-----------|-----------|-----------|-----------|------------|------------|
| CamemBERT (CCNet) | **97.59** | **88.69** | 89.97     | 94.62     | 91.36     | 81.95     | 80.98      | **62.51**  |
| CamemBERT (30%)   | 97.53     | 87.98     | **91.04** | 93.28     | 88.94     | 79.89     | 75.14      | 56.19      |
| CamemBERTa        | 97.57     | 88.55     | 90.33     | **94.92** | **91.67** | **82.00** | **81.15**  | 62.01      |

The following table compares CamemBERTa's performance on XNLI against other models under different training setups, which demonstrates the data efficiency of CamemBERTa.


| Model             | XNLI (Acc.) | Training Steps | Tokens seen in pre-training | Dataset Size in Tokens |
|-------------------|-------------|----------------|-----------------------------|------------------------|
| mDeBERTa          | 84.4        | 500k           | 2T                          | 2.5T                   |
| CamemBERTa        | 82.0        | 33k            | 0.139T                      | 0.319T                 |
| XLM-R             | 81.4        | 1.5M           | 6T                          | 2.5T                   |
| CamemBERT - CCNet | 81.95       | 100k           | 0.419T                      | 0.319T                 |

*Note: The number of CamemBERTa training steps was adjusted for a batch size of 8192.*

## License

The public model weights are licensed under the MIT License.
This code is licensed under the Apache License 2.0.

## Citation

Paper accepted to Findings of ACL 2023.

You can use the preprint citation for now:

```
@article{antoun2023camemberta,
  TITLE = {{Data-Efficient French Language Modeling with CamemBERTa}},
  AUTHOR = {Antoun, Wissam and Sagot, Beno{\^i}t and Seddah, Djam{\'e}},
  URL = {https://inria.hal.science/hal-03963729},
  NOTE = {working paper or preprint},
  YEAR = {2023},
  MONTH = Jan,
  PDF = {https://inria.hal.science/hal-03963729/file/French_DeBERTa___ACL_2023%20to%20be%20uploaded.pdf},
  HAL_ID = {hal-03963729},
  HAL_VERSION = {v1},
}
```

## Contact

Wissam Antoun: `wissam (dot) antoun (at) inria (dot) fr`

Benoit Sagot: `benoit (dot) sagot (at) inria (dot) fr`

Djame Seddah: `djame (dot) seddah (at) inria (dot) fr`