MisterAI committed
Commit 39a7c1d
1 Parent(s): 4ec7e0f

Upload 6 files

Files changed (6)
  1. README.md +113 -0
  2. config.json +22 -0
  3. gitattributes +9 -0
  4. sentencepiece.bpe.model +0 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +1 -0
README.md ADDED
@@ -0,0 +1,113 @@
---
language: fr
license: mit
datasets:
- oscar
---

# CamemBERT: a Tasty French Language Model

## Introduction

[CamemBERT](https://arxiv.org/abs/1911.03894) is a state-of-the-art language model for French, based on the RoBERTa architecture.

It is available on Hugging Face in 6 different versions, which vary in their number of parameters, the amount of pretraining data, and the source domain of that data.

For further information or requests, please visit the [CamemBERT website](https://camembert-model.fr/).

## Pre-trained models

| Model                                     | #params | Arch. | Training data                      |
|-------------------------------------------|---------|-------|------------------------------------|
| `camembert-base`                          | 110M    | Base  | OSCAR (138 GB of text)             |
| `camembert/camembert-large`               | 335M    | Large | CCNet (135 GB of text)             |
| `camembert/camembert-base-ccnet`          | 110M    | Base  | CCNet (135 GB of text)             |
| `camembert/camembert-base-wikipedia-4gb`  | 110M    | Base  | Wikipedia (4 GB of text)           |
| `camembert/camembert-base-oscar-4gb`      | 110M    | Base  | Subsample of OSCAR (4 GB of text)  |
| `camembert/camembert-base-ccnet-4gb`      | 110M    | Base  | Subsample of CCNet (4 GB of text)  |

## How to use CamemBERT with HuggingFace

##### Load CamemBERT and its sub-word tokenizer:
```python
from transformers import CamembertModel, CamembertTokenizer

# You can replace "camembert/camembert-base-wikipedia-4gb" with any other model
# from the table above, e.g. "camembert/camembert-large".
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb")

camembert.eval()  # disable dropout (or leave in train mode to finetune)
```

##### Filling masks using pipeline
```python
from transformers import pipeline

camembert_fill_mask = pipeline(
    "fill-mask",
    model="camembert/camembert-base-wikipedia-4gb",
    tokenizer="camembert/camembert-base-wikipedia-4gb",
)
results = camembert_fill_mask("Le camembert est un fromage de <mask>!")
# results
# [{'sequence': '<s> Le camembert est un fromage de chèvre!</s>', 'score': 0.4937814474105835, 'token': 19370},
#  {'sequence': '<s> Le camembert est un fromage de brebis!</s>', 'score': 0.06255942583084106, 'token': 30616},
#  {'sequence': '<s> Le camembert est un fromage de montagne!</s>', 'score': 0.04340197145938873, 'token': 2364},
#  {'sequence': '<s> Le camembert est un fromage de Noël!</s>', 'score': 0.02823255956172943, 'token': 3236},
#  {'sequence': '<s> Le camembert est un fromage de vache!</s>', 'score': 0.021357402205467224, 'token': 12329}]
```
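
The pipeline returns the five highest-scoring completions by default. With recent `transformers` releases you can ask for a different number via the `top_k` argument; a minimal sketch, assuming a recent `transformers` version:

```python
# Keep only the 3 highest-scoring predictions for the masked token.
results = camembert_fill_mask("Le camembert est un fromage de <mask>!", top_k=3)
```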

##### Extract contextual embedding features from Camembert output
```python
import torch

# Tokenize into sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
# ['▁J', "'", 'aime', '▁le', '▁ca', 'member', 't', '▁!']

# Convert to vocabulary IDs and add the special start/end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [5, 221, 10, 10600, 14, 8952, 10540, 75, 1114, 6]
# NB: this can be done in one step: tokenizer.encode("J'aime le camembert !")

# Feed tokens to Camembert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# embeddings.detach()
# embeddings.size() == torch.Size([1, 10, 768])
# tensor([[[-0.0928,  0.0506, -0.0094,  ..., -0.2388,  0.1177, -0.1302],
#          [ 0.0662,  0.1030, -0.2355,  ..., -0.4224, -0.0574, -0.2802],
#          [-0.0729,  0.0547,  0.0192,  ..., -0.1743,  0.0998, -0.2677],
#          ...,
```

##### Extract contextual embedding features from all Camembert layers
```python
from transformers import CamembertConfig

# Reload the model with a config that exposes all hidden states
config = CamembertConfig.from_pretrained("camembert/camembert-base-wikipedia-4gb", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb", config=config)

embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
# all_layer_embeddings is a list of 13 tensors (input embedding layer + 12 self-attention layers)
all_layer_embeddings[5]
# layer 5 contextual embedding: size torch.Size([1, 10, 768])
# tensor([[[-0.0059, -0.0227,  0.0065,  ..., -0.0770,  0.0369,  0.0095],
#          [ 0.2838, -0.1531, -0.3642,  ..., -0.0027, -0.8502, -0.7914],
#          [-0.0073, -0.0338, -0.0011,  ...,  0.0533, -0.0250, -0.0061],
#          ...,
```
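
The tuple unpacking above matches older `transformers` releases, where models returned plain tuples. With `transformers` v4+, models return output objects by default; a minimal sketch of the equivalent call, assuming a v4+ installation:

```python
# With transformers v4+, outputs are dataclass-like objects rather than tuples.
outputs = camembert(encoded_sentence)
last_hidden = outputs.last_hidden_state  # shape [1, 10, 768]
all_layers = outputs.hidden_states       # 13 tensors, available because output_hidden_states=True
```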

## Authors

CamemBERT was trained and evaluated by Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot.

## Citation

If you use our work, please cite:

```bibtex
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
```
config.json ADDED
@@ -0,0 +1,22 @@
{
  "architectures": [
    "CamembertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 5,
  "eos_token_id": 6,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "camembert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 32005
}
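
These values describe a standard 12-layer base architecture. A minimal sketch of inspecting them programmatically; "camembert-base" is an assumed checkpoint id here, substitute the actual repo id:

```python
from transformers import AutoConfig

# Loads config.json from the Hub; fields mirror the file above.
config = AutoConfig.from_pretrained("camembert-base")  # assumed checkpoint id
print(config.hidden_size)        # 768
print(config.num_hidden_layers)  # 12
print(config.vocab_size)         # 32005
```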
gitattributes ADDED
@@ -0,0 +1,9 @@
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
model.safetensors filter=lfs diff=lfs merge=lfs -text
sentencepiece.bpe.model ADDED
Binary file (811 kB).
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"model_max_length": 512}