Ezi committed on
Commit
aca67c9
1 Parent(s): 482393b

Model Card

Some additional information for the model card, based on the format we are using as part of our effort to standardise model cards.
Feel free to merge if you are ok with the changes! (cc @Marissa @Meg)

Files changed (1)
  1. README.md +67 -24
README.md CHANGED
@@ -7,15 +7,58 @@ datasets:

  # CamemBERT: a Tasty French Language Model

- ## Introduction

- [CamemBERT](https://arxiv.org/abs/1911.03894) is a state-of-the-art language model for French based on the RoBERTa model.

- It is now available on Hugging Face in 6 different versions with varying number of parameters, amount of pretraining data and pretraining data source domains.

- For further information or requests, please go to [Camembert Website](https://camembert-model.fr/)

- ## Pre-trained models

  | Model | #params | Arch. | Training data |
  |--------------------------------|--------------------------------|-------|-----------------------------------|
@@ -26,7 +69,25 @@ For further information or requests, please go to [Camembert Website](https://ca
  | `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
  | `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |

- ## How to use CamemBERT with HuggingFace

  ##### Load CamemBERT and its sub-word tokenizer :
  ```python
@@ -95,21 +156,3 @@ all_layer_embeddings[5]
  # ...,
  ```

-
- ## Authors
-
- CamemBERT was trained and evaluated by Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
-
-
- ## Citation
- If you use our work, please cite:
-
- ```bibtex
- @inproceedings{martin2020camembert,
- title={CamemBERT: a Tasty French Language Model},
- author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
- booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
- year={2020}
- }
- ```
-
@@ -7,15 +7,58 @@ datasets:

  # CamemBERT: a Tasty French Language Model

+ ## Table of Contents
+ - [Model Details](#model-details)
+ - [Uses](#uses)
+ - [Risks, Limitations and Biases](#risks-limitations-and-biases)
+ - [Training](#training)
+ - [Evaluation](#evaluation)
+ - [Citation Information](#citation-information)
+ - [How to Get Started With the Model](#how-to-get-started-with-the-model)
+
+
+ ## Model Details
+ - **Model Description:**
+ CamemBERT is a state-of-the-art language model for French based on the RoBERTa model.
+ It is now available on Hugging Face in 6 different versions with varying numbers of parameters, amounts of pretraining data and pretraining data source domains.
+ - **Developed by:** Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
+ - **Model Type:** Fill-Mask
+ - **Language(s):** French
+ - **License:** MIT
+ - **Parent Model:** See the [RoBERTa base model](https://huggingface.co/roberta-base) for more information.
+ - **Resources for more information:**
+ - [Research Paper](https://arxiv.org/abs/1911.03894)
+ - [Camembert Website](https://camembert-model.fr/)
+
+
+ ## Uses

+ #### Direct Use

+ This model can be used for Fill-Mask tasks.
+
+
+ ## Risks, Limitations and Biases
+ **CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
+
+ This model was pretrained on a subcorpus of the OSCAR multilingual corpus. Some of the limitations and risks associated with the OSCAR dataset, which are further detailed in the [OSCAR dataset card](https://huggingface.co/datasets/oscar), include the following:
+
+ > The quality of some OSCAR sub-corpora might be lower than expected, specifically for the lowest-resource languages.
+
+ > Constructed from Common Crawl, Personal and sensitive information might be present.
+
+
+
+ ## Training


+ #### Training Data
+ OSCAR or Open Super-large Crawled Aggregated coRpus is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
+
+
+ #### Training Procedure

  | Model | #params | Arch. | Training data |
  |--------------------------------|--------------------------------|-------|-----------------------------------|
@@ -26,7 +69,25 @@ For further information or requests, please go to [Camembert Website](https://ca
  | `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
  | `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |

+ ## Evaluation
+
+
+ The model developers evaluated CamemBERT using four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI).
+
+
+
+ ## Citation Information
+
+ ```bibtex
+ @inproceedings{martin2020camembert,
+ title={CamemBERT: a Tasty French Language Model},
+ author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
+ booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
+ year={2020}
+ }
+ ```
+
+ ## How to Get Started With the Model

  ##### Load CamemBERT and its sub-word tokenizer :
  ```python
@@ -95,21 +156,3 @@ all_layer_embeddings[5]
  # ...,
  ```
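
Since the card lists Fill-Mask as the direct use, a minimal sketch of that workflow may help reviewers check the new sections against actual usage. It is not part of the committed README. It assumes the `transformers` library (plus `sentencepiece`) is installed, that the `camembert/camembert-base-oscar-4gb` checkpoint from the table ships a masked-language-model head, and that the tokenizer's mask token is `<mask>`. The helper names `mask_word`, `top_tokens` and `fill_mask` are hypothetical and exist only for illustration.

```python
from typing import Dict, List

MASK_TOKEN = "<mask>"  # assumed mask token of the CamemBERT tokenizer
# One of the checkpoint IDs listed in the table; assumed to work with fill-mask.
MODEL_ID = "camembert/camembert-base-oscar-4gb"


def mask_word(sentence: str, word: str) -> str:
    """Replace the first occurrence of `word` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} not found in sentence")
    return sentence.replace(word, MASK_TOKEN, 1)


def top_tokens(predictions: List[Dict], k: int = 3) -> List[str]:
    """Return the k highest-scoring candidate tokens from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]


def fill_mask(sentence: str, model_id: str = MODEL_ID, top_k: int = 5) -> List[Dict]:
    """Run the fill-mask pipeline (downloads the checkpoint on first use)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id, tokenizer=model_id)
    return unmasker(sentence, top_k=top_k)


# Example (needs network access to download the checkpoint):
# masked = mask_word("Le camembert est délicieux :)", "délicieux")
# print(top_tokens(fill_mask(masked)))
```

The pipeline returns a list of dictionaries with `score` and `token_str` keys, which is what `top_tokens` ranks. Swapping `MODEL_ID` for another checkpoint from the table should be the only change needed to compare versions, under the same assumptions.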