julien-c HF staff committed on
Commit
2856950
1 Parent(s): 9275231

Use the model card from the authors instead

Files changed (1)
  1. README.md +41 -86
README.md CHANGED
@@ -7,58 +7,15 @@ datasets:
7
 
8
  # CamemBERT: a Tasty French Language Model
9
 
10
- ## Table of Contents
11
- - [Model Details](#model-details)
12
- - [Uses](#uses)
13
- - [Risks, Limitations and Biases](#risks-limitations-and-biases)
14
- - [Training](#training)
15
- - [Evaluation](#evaluation)
16
- - [Citation Information](#citation-information)
17
- - [How to Get Started With the Model](#how-to-get-started-with-the-model)
18
 
 
19
 
20
- ## Model Details
21
- - **Model Description:**
22
- CamemBERT is a state-of-the-art language model for French based on the RoBERTa model.
23
- It is now available on Hugging Face in 6 different versions with varying number of parameters, amount of pretraining data and pretraining data source domains.
24
- - **Developed by:** Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
25
- - **Model Type:** Fill-Mask
26
- - **Language(s):** French
27
- - **License:** MIT
28
- - **Parent Model:** See the [RoBERTa base model](https://huggingface.co/roberta-base) for more information about the RoBERTa base model.
29
- - **Resources for more information:**
30
- - [Research Paper](https://arxiv.org/abs/1911.03894)
31
- - [Camembert Website](https://camembert-model.fr/)
32
-
33
-
34
- ## Uses
35
 
36
- #### Direct Use
37
 
38
- This model can be used for Fill-Mask tasks.
39
-
40
-
41
- ## Risks, Limitations and Biases
42
- **CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**
43
-
44
- Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
45
-
46
- This model was pretrained on a subcorpus of OSCAR multilingual corpus. Some of the limitations and risks associated with the OSCAR dataset, which are further detailed in the [OSCAR dataset card](https://huggingface.co/datasets/oscar), include the following:
47
-
48
- > The quality of some OSCAR sub-corpora might be lower than expected, specifically for the lowest-resource languages.
49
-
50
- > Constructed from Common Crawl, Personal and sensitive information might be present.
51
-
52
-
53
-
54
- ## Training
55
-
56
-
57
- #### Training Data
58
- OSCAR or Open Super-large Crawled Aggregated coRpus is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
59
-
60
-
61
- #### Training Procedure
62
 
63
  | Model | #params | Arch. | Training data |
64
  |--------------------------------|--------------------------------|-------|-----------------------------------|
@@ -69,33 +26,15 @@ OSCAR or Open Super-large Crawled Aggregated coRpus is a multilingual corpus obt
69
  | `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
70
  | `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |
71
 
72
- ## Evaluation
73
-
74
-
75
- The model developers evaluated CamemBERT using four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI).
76
-
77
-
78
-
79
- ## Citation Information
80
-
81
- ```bibtex
82
- @inproceedings{martin2020camembert,
83
- title={CamemBERT: a Tasty French Language Model},
84
- author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
85
- booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
86
- year={2020}
87
- }
88
- ```
89
-
90
- ## How to Get Started With the Model
91
 
92
  ##### Load CamemBERT and its sub-word tokenizer :
93
  ```python
94
  from transformers import CamembertModel, CamembertTokenizer
95
 
96
  # You can replace "camembert-base" with any other model from the table, e.g. "camembert/camembert-large".
97
- tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
98
- camembert = CamembertModel.from_pretrained("camembert-base")
99
 
100
  camembert.eval() # disable dropout (or leave in train mode to finetune)
101
 
@@ -105,15 +44,14 @@ camembert.eval() # disable dropout (or leave in train mode to finetune)
105
  ```python
106
  from transformers import pipeline
107
 
108
- camembert_fill_mask = pipeline("fill-mask", model="camembert-base", tokenizer="camembert-base")
109
- results = camembert_fill_mask("Le camembert est <mask> :)")
110
  # results
111
- #[{'sequence': '<s> Le camembert est délicieux :)</s>', 'score': 0.4909103214740753, 'token': 7200},
112
- # {'sequence': '<s> Le camembert est excellent :)</s>', 'score': 0.10556930303573608, 'token': 2183},
113
- # {'sequence': '<s> Le camembert est succulent :)</s>', 'score': 0.03453315049409866, 'token': 26202},
114
- # {'sequence': '<s> Le camembert est meilleur :)</s>', 'score': 0.03303130343556404, 'token': 528},
115
- # {'sequence': '<s> Le camembert est parfait :)</s>', 'score': 0.030076518654823303, 'token': 1654}]
116
-
117
  ```
118
 
119
  ##### Extract contextual embedding features from Camembert output
@@ -125,7 +63,7 @@ tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
125
 
126
  # Convert the tokens to vocabulary ids and add the special start and end tokens
127
  encoded_sentence = tokenizer.encode(tokenized_sentence)
128
- # [5, 121, 11, 660, 16, 730, 25543, 110, 83, 6]
129
  # NB: Can be done in one step: tokenizer.encode("J'aime le camembert !")
130
 
131
  # Feed tokens to Camembert as a torch tensor (batch dim 1)
@@ -133,9 +71,9 @@ encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
133
  embeddings, _ = camembert(encoded_sentence)
134
  # embeddings.detach()
135
  # embeddings.size torch.Size([1, 10, 768])
136
- # tensor([[[-0.0254, 0.0235, 0.1027, ..., -0.1459, -0.0205, -0.0116],
137
- # [ 0.0606, -0.1811, -0.0418, ..., -0.1815, 0.0880, -0.0766],
138
- # [-0.1561, -0.1127, 0.2687, ..., -0.0648, 0.0249, 0.0446],
139
  # ...,
140
  ```
141
 
@@ -143,16 +81,33 @@ embeddings, _ = camembert(encoded_sentence)
143
  ```python
144
  from transformers import CamembertConfig
145
  # (Need to reload the model with new config)
146
- config = CamembertConfig.from_pretrained("camembert-base", output_hidden_states=True)
147
- camembert = CamembertModel.from_pretrained("camembert-base", config=config)
148
 
149
  embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
150
  # all_layer_embeddings list of len(all_layer_embeddings) == 13 (input embedding layer + 12 self attention layers)
151
  all_layer_embeddings[5]
152
  # layer 5 contextual embedding : size torch.Size([1, 10, 768])
153
- #tensor([[[-0.0032, 0.0075, 0.0040, ..., -0.0025, -0.0178, -0.0210],
154
- # [-0.0996, -0.1474, 0.1057, ..., -0.0278, 0.1690, -0.2982],
155
- # [ 0.0557, -0.0588, 0.0547, ..., -0.0726, -0.0867, 0.0699],
156
  # ...,
157
  ```
158
 
7
 
8
  # CamemBERT: a Tasty French Language Model
9
 
10
+ ## Introduction
11
 
12
+ [CamemBERT](https://arxiv.org/abs/1911.03894) is a state-of-the-art language model for French based on the RoBERTa model.
13
 
14
+ It is now available on Hugging Face in 6 different versions, which vary in number of parameters, amount of pretraining data and pretraining data source domain.
15
 
16
+ For further information or requests, please go to the [Camembert Website](https://camembert-model.fr/).
17
 
18
+ ## Pre-trained models
19
 
20
  | Model | #params | Arch. | Training data |
21
  |--------------------------------|--------------------------------|-------|-----------------------------------|
26
  | `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
27
  | `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |
28
 
29
+ ## How to use CamemBERT with HuggingFace
30
 
31
  ##### Load CamemBERT and its sub-word tokenizer :
32
  ```python
33
  from transformers import CamembertModel, CamembertTokenizer
34
 
35
  # You can replace "camembert-base" with any other model from the table, e.g. "camembert/camembert-large".
36
+ tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
37
+ camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb")
38
 
39
  camembert.eval() # disable dropout (or leave in train mode to finetune)
40
 
44
  ```python
45
  from transformers import pipeline
46
 
47
+ camembert_fill_mask = pipeline("fill-mask", model="camembert/camembert-base-wikipedia-4gb", tokenizer="camembert/camembert-base-wikipedia-4gb")
48
+ results = camembert_fill_mask("Le camembert est un fromage de <mask>!")
49
  # results
50
+ # [{'sequence': '<s> Le camembert est un fromage de chèvre!</s>', 'score': 0.4937814474105835, 'token': 19370},
51
+ #  {'sequence': '<s> Le camembert est un fromage de brebis!</s>', 'score': 0.06255942583084106, 'token': 30616},
52
+ #  {'sequence': '<s> Le camembert est un fromage de montagne!</s>', 'score': 0.04340197145938873, 'token': 2364},
53
+ #  {'sequence': '<s> Le camembert est un fromage de Noël!</s>', 'score': 0.02823255956172943, 'token': 3236},
54
+ #  {'sequence': '<s> Le camembert est un fromage de vache!</s>', 'score': 0.021357402205467224, 'token': 12329}]
 
55
  ```
56
 
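As a small illustration of how the `results` list printed above might be consumed, the sketch below picks the highest-scoring completion and strips the `<s>`/`</s>` markers. The helper name `best_completion` is hypothetical, and the sample data is a two-entry copy of the output shown above, so the snippet runs without downloading the model.

```python
def best_completion(results):
    """Return (sentence, score) for the highest-scoring fill-mask candidate."""
    top = max(results, key=lambda r: r["score"])
    # Drop the sentence-boundary markers that appear in the printed output.
    sentence = top["sequence"].replace("<s>", "").replace("</s>", "").strip()
    return sentence, top["score"]


# First two entries of the fill-mask output shown above, copied verbatim.
sample_results = [
    {"sequence": "<s> Le camembert est un fromage de chèvre!</s>",
     "score": 0.4937814474105835, "token": 19370},
    {"sequence": "<s> Le camembert est un fromage de brebis!</s>",
     "score": 0.06255942583084106, "token": 30616},
]

sentence, score = best_completion(sample_results)
print(sentence, round(score, 3))
```

With a loaded pipeline, the same helper would presumably be called as `best_completion(camembert_fill_mask("Le camembert est un fromage de <mask>!"))`.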
57
  ##### Extract contextual embedding features from Camembert output
63
 
64
  # Convert the tokens to vocabulary ids and add the special start and end tokens
65
  encoded_sentence = tokenizer.encode(tokenized_sentence)
66
+ # [5, 221, 10, 10600, 14, 8952, 10540, 75, 1114, 6]
67
  # NB: Can be done in one step: tokenizer.encode("J'aime le camembert !")
68
 
69
  # Feed tokens to Camembert as a torch tensor (batch dim 1)
71
  embeddings, _ = camembert(encoded_sentence)
72
  # embeddings.detach()
73
  # embeddings.size torch.Size([1, 10, 768])
74
+ #tensor([[[-0.0928, 0.0506, -0.0094, ..., -0.2388, 0.1177, -0.1302],
75
+ # [ 0.0662, 0.1030, -0.2355, ..., -0.4224, -0.0574, -0.2802],
76
+ # [-0.0729, 0.0547, 0.0192, ..., -0.1743, 0.0998, -0.2677],
77
  # ...,
78
  ```
79
 
81
  ```python
82
  from transformers import CamembertConfig
83
  # (Need to reload the model with new config)
84
+ config = CamembertConfig.from_pretrained("camembert/camembert-base-wikipedia-4gb", output_hidden_states=True)
85
+ camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb", config=config)
86
 
87
  embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
88
  # all_layer_embeddings list of len(all_layer_embeddings) == 13 (input embedding layer + 12 self attention layers)
89
  all_layer_embeddings[5]
90
  # layer 5 contextual embedding : size torch.Size([1, 10, 768])
91
+ #tensor([[[-0.0059, -0.0227, 0.0065, ..., -0.0770, 0.0369, 0.0095],
92
+ # [ 0.2838, -0.1531, -0.3642, ..., -0.0027, -0.8502, -0.7914],
93
+ # [-0.0073, -0.0338, -0.0011, ..., 0.0533, -0.0250, -0.0061],
94
  # ...,
95
  ```
96
 
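The tuple unpacking used in the snippets above (`embeddings, _ = camembert(encoded_sentence)`) assumes the model call returns a plain tuple. In more recent `transformers` releases, models return an output object by default instead, so a small helper can normalise both cases. This is only a sketch: `as_tuple` is a hypothetical name, and `FakeOutput` merely stands in for a real model output so the example runs without loading a model.

```python
def as_tuple(outputs):
    """Return model outputs as a plain tuple.

    Assumption: non-tuple outputs expose a `to_tuple()` method, as the
    output objects in recent `transformers` releases do.
    """
    if isinstance(outputs, tuple):
        return outputs
    return outputs.to_tuple()


class FakeOutput:
    """Hypothetical stand-in for a model output object (illustration only)."""

    def __init__(self, last_hidden_state, pooler_output):
        self.last_hidden_state = last_hidden_state
        self.pooler_output = pooler_output

    def to_tuple(self):
        return (self.last_hidden_state, self.pooler_output)


# Both return styles unpack the same way once normalised.
embeddings, pooled = as_tuple(("hidden", "pooled"))
embeddings2, pooled2 = as_tuple(FakeOutput("hidden", "pooled"))
```

With a real model, the call would read `embeddings, _ = as_tuple(camembert(encoded_sentence))`.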
97
+
98
+ ## Authors
99
+
100
+ CamemBERT was trained and evaluated by Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
101
+
102
+
103
+ ## Citation
104
+ If you use our work, please cite:
105
+
106
+ ```bibtex
107
+ @inproceedings{martin2020camembert,
108
+ title={CamemBERT: a Tasty French Language Model},
109
+ author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
110
+ booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
111
+ year={2020}
112
+ }
113
+ ```