wissamantoun committed · Commit 7e8ec04 (verified) · 1 Parent(s): 011d275

Upload folder using huggingface_hub

Files changed (1):
  1. Readme.md +13 -5
Readme.md CHANGED
@@ -10,9 +10,9 @@ tags:
 - roberta
 - camembert
 ---
-# CamemBERTv2: A Smarter French Language Model Aged to Perfection
+# CamemBERT(a)-v2: A Smarter French Language Model Aged to Perfection
 
-[CamemBERTv2](PAPER_LINK_SOON) is a French language model pretrained on a large corpus of 275B tokens of French text. It is the second version of the CamemBERT model, which is based on the RoBERTa architecture. CamemBERTv2 is trained using the Masked Language Modeling (MLM) objective with a 40% mask rate for 3 epochs on 32 H100 GPUs. The dataset used for training is a combination of French [OSCAR](https://oscar-project.org/) dumps from the [CulturaX Project](https://huggingface.co/datasets/uonlp/CulturaX), French scientific documents from [HALvest](https://huggingface.co/datasets/almanach/HALvest), and the French Wikipedia.
+[CamemBERTv2](https://arxiv.org/abs/2411.08868) is a French language model pretrained on a large corpus of 275B tokens of French text. It is the second version of the CamemBERT model, which is based on the RoBERTa architecture. CamemBERTv2 is trained using the Masked Language Modeling (MLM) objective with a 40% mask rate for 3 epochs on 32 H100 GPUs. The dataset used for training is a combination of French [OSCAR](https://oscar-project.org/) dumps from the [CulturaX Project](https://huggingface.co/datasets/uonlp/CulturaX), French scientific documents from [HALvest](https://huggingface.co/datasets/almanach/HALvest), and the French Wikipedia.
 
 The model is a drop-in replacement for the original CamemBERT model. Note that the new tokenizer differs from the original CamemBERT tokenizer, so you will need to use Fast Tokenizers with this model. It works with `CamemBERTTokenizerFast` from the `transformers` library even though the original `CamemBERTTokenizer` was SentencePiece-based.
 
@@ -24,13 +24,13 @@ The new update includes:
 - A newly built tokenizer based on WordPiece with 32,768 tokens, addition of the newline and tab characters, support for emojis, and better handling of numbers (numbers are split into two-digit tokens)
 - Extended context window of 1024 tokens
 
-More details are available in the [CamemBERTv2 paper](PAPER_LINK_SOON).
+More details are available in the [CamemBERTv2 paper](https://arxiv.org/abs/2411.08868).
 
 ## How to use
 ```python
 from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
 
-CamemBERTa = AutoModel.from_pretrained("almanach/camembertv2-base")
+camembertv2 = AutoModelForMaskedLM.from_pretrained("almanach/camembertv2-base")
 tokenizer = AutoTokenizer.from_pretrained("almanach/camembertv2-base")
 ```
 
@@ -56,5 +56,13 @@ We use the pretraining codebase from the [CamemBERTa repository](https://github.
 ## Citation
 
 ```bibtex
-CITATION_SOON
+@misc{antoun2024camembert20smarterfrench,
+  title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
+  author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
+  year={2024},
+  eprint={2411.08868},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2411.08868},
+}
 ```
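The card above says pretraining uses the MLM objective with a 40% mask rate. As a rough illustration of what that objective means, here is a minimal pure-Python sketch of selecting ~40% of positions for masking — not the authors' training code; the `MASK_ID` value and the all-`[MASK]` replacement strategy are simplifying assumptions (RoBERTa-style masking also uses random/keep substitutions):

```python
import random

MASK_ID = 4  # hypothetical [MASK] token id, for illustration only

def mask_for_mlm(token_ids, rate=0.40, rng=random):
    """Select ~`rate` of positions for MLM.

    Returns (inputs, labels): masked positions carry MASK_ID in `inputs`
    and the original token in `labels`; every other label is -100, the
    value cross-entropy losses conventionally ignore.
    """
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < rate:
            inputs.append(MASK_ID)   # hide the token from the model
            labels.append(tok)       # ...and ask it to predict it back
        else:
            inputs.append(tok)
            labels.append(-100)      # position does not contribute to the loss
    return inputs, labels

inputs, labels = mask_for_mlm(list(range(100, 120)), rng=random.Random(0))
```

In expectation, 40% of the labels are real token ids and the rest are -100, so only masked positions drive the loss.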
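The tokenizer notes above mention that numbers are split into two-digit tokens. A toy sketch of that idea (assuming left-to-right chunking; the released WordPiece pre-tokenizer's exact rule may differ):

```python
def split_digits(num_str, chunk=2):
    """Chunk a digit string left-to-right into pieces of at most
    `chunk` digits, e.g. "12345" -> ["12", "34", "5"]."""
    return [num_str[i:i + chunk] for i in range(0, len(num_str), chunk)]

pieces = split_digits("275000000000")  # the 275B token count, as digits
```

Fixed-size digit chunks keep the vocabulary small (at most 100 two-digit pieces) while giving the model a consistent view of numerals of any length.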