cservan committed
Commit 200afd1
1 Parent(s): 32ce8b0

Initial commit

README.md ADDED
@@ -0,0 +1,173 @@
---
language: fr
license: apache-2.0
datasets:
- wikipedia
---

# FrALBERT Base Cased

Pretrained model on the French language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1909.11942) and first released in
[this repository](https://github.com/google-research/albert).
This model, unlike other ALBERT models, is cased: it makes a difference between french and French.

## Model description

FrALBERT is a transformers model pretrained on 4 GB of French Wikipedia in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like
GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
sentence.
- Sentence Ordering Prediction (SOP): FrALBERT uses a pretraining loss based on predicting the ordering of two consecutive segments of text.

This way, the model learns an inner representation of the French language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
classifier using the features produced by the FrALBERT model as inputs.

FrALBERT is particular in that it shares its layers across its Transformer. Therefore, all layers have the same weights. Using repeating layers results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same number of (repeating) layers.

This is the second version of the base model.

This model has the following configuration (see the snippet after the list to inspect these values programmatically):

- 12 repeating layers
- 128 embedding dimension
- 768 hidden dimension
- 12 attention heads
- 11M parameters

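A minimal sketch for checking these values against the released `config.json`, assuming only that the `transformers` library is installed:

```python
from transformers import AlbertConfig

# Reads the config.json shipped with this checkpoint; the printed values
# should match the list above.
config = AlbertConfig.from_pretrained("cservan/fralbert-base-cased")
print(config.num_hidden_layers)    # 12 repeating layers
print(config.embedding_size)       # 128 embedding dimension
print(config.hidden_size)          # 768 hidden dimension
print(config.num_attention_heads)  # 12 attention heads
```
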
## Intended uses & limitations

You can use the raw model for either masked language modeling or sentence ordering prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=fralbert-base-cased) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation, you should look at models like GPT-2.

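As an illustration of such fine-tuning, the sketch below attaches a randomly initialised sequence-classification head to this checkpoint; the two example sentences, the labels and `num_labels=2` are placeholders for your own task and data, not part of the released model:

```python
import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

# Placeholder two-class setup: replace the texts, labels and num_labels with your task.
tokenizer = AlbertTokenizer.from_pretrained("cservan/fralbert-base-cased")
model = AlbertForSequenceClassification.from_pretrained(
    "cservan/fralbert-base-cased", num_labels=2
)

texts = ["Ce film est excellent.", "Ce film est décevant."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # one backward pass of a standard fine-tuning loop
```
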
### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cservan/fralbert-base-cased')
>>> unmasker("Paris est la capitale de la [MASK] .")
[
  {
    "sequence": "paris est la capitale de la france.",
    "score": 0.6231236457824707,
    "token": 3043,
    "token_str": "france"
  },
  {
    "sequence": "paris est la capitale de la region.",
    "score": 0.2993471622467041,
    "token": 10531,
    "token_str": "region"
  },
  {
    "sequence": "paris est la capitale de la societe.",
    "score": 0.02028230018913746,
    "token": 24622,
    "token_str": "societe"
  },
  {
    "sequence": "paris est la capitale de la bretagne.",
    "score": 0.012089950032532215,
    "token": 24987,
    "token_str": "bretagne"
  },
  {
    "sequence": "paris est la capitale de la chine.",
    "score": 0.010002839379012585,
    "token": 14860,
    "token_str": "chine"
  }
]
```
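
If you prefer not to hard-code the `[MASK]` placeholder, it can be taken from the pipeline's tokenizer; a small follow-up sketch reusing the `unmasker` object above (the sentence and the `top_k` value are arbitrary):

```python
>>> mask = unmasker.tokenizer.mask_token  # "[MASK]" for this checkpoint
>>> unmasker(f"La Seine traverse la ville de {mask}.", top_k=3)
```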

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AlbertTokenizer, AlbertModel

# Load the tokenizer and the pretrained encoder
tokenizer = AlbertTokenizer.from_pretrained('cservan/fralbert-base-cased')
model = AlbertModel.from_pretrained("cservan/fralbert-base-cased")

text = "Remplacez-moi par le texte en français que vous souhaitez."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.last_hidden_state holds the token-level features
```

and in TensorFlow:

```python
from transformers import AlbertTokenizer, TFAlbertModel

# Same checkpoint, used through the TensorFlow model class
tokenizer = AlbertTokenizer.from_pretrained('cservan/fralbert-base-cased')
model = TFAlbertModel.from_pretrained("cservan/fralbert-base-cased")

text = "Remplacez-moi par le texte en français que vous souhaitez."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

## Training data

The FrALBERT model was pretrained on 4 GB of [French Wikipedia](https://fr.wikipedia.org/wiki/French_Wikipedia) (excluding lists, tables and
headers).

## Training procedure

### Preprocessing

The texts are tokenized using SentencePiece with a vocabulary size of 32,000. Since this is the cased model, the texts are not lowercased and accents are kept (see `tokenizer_config.json`). The inputs of the model are
then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
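
For sentence pairs, the tokenizer produces this layout automatically; a minimal sketch (the two French sentences are purely illustrative):

```python
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("cservan/fralbert-base-cased")

# Encode an illustrative sentence pair and decode it back to inspect the layout
encoded = tokenizer("Paris est la capitale de la France.", "La Seine traverse la ville.")
print(tokenizer.decode(encoded["input_ids"]))
# The decoded string starts with [CLS]; each sentence is followed by [SEP]
```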

### Training

The FrALBERT procedure follows the BERT setup.

The details of the masking procedure for each sentence are the following (a minimal sketch of this rule is given after the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token, different from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.

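To make this 80/10/10 rule concrete, here is an illustrative Python sketch of the selection logic; it restates the list above and is not the original training code:

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15):
    """Apply the masking rule described above to one sequence of token ids."""
    input_ids, labels = [], []
    for token_id in token_ids:
        if random.random() < mlm_probability:      # 15% of the tokens are selected
            labels.append(token_id)                # the model has to recover the original token
            draw = random.random()
            if draw < 0.8:                         # 80% of those: replaced by [MASK]
                input_ids.append(mask_id)
            elif draw < 0.9:                       # 10%: replaced by a random token
                input_ids.append(random.randrange(vocab_size))
            else:                                  # 10%: left unchanged
                input_ids.append(token_id)
        else:
            labels.append(-100)                    # ignored by the MLM loss
            input_ids.append(token_id)
    return input_ids, labels
```
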
## Evaluation results

When fine-tuned on downstream tasks, the FrALBERT models achieve the following results:

Slot-filling:

|       | FrALBERT-base | FrALBERT-base-cased |
|-------|---------------|---------------------|
| MEDIA | 81.76 (0.59)  | 85.09 (0.14)        |

### BibTeX entry and citation info

```bibtex
@inproceedings{cattan2021fralbert,
  author    = {Oralie Cattan and
               Christophe Servan and
               Sophie Rosset},
  booktitle = {Recent Advances in Natural Language Processing, RANLP 2021},
  title     = {{On the Usability of Transformers-based models for a French Question-Answering task}},
  year      = {2021},
  address   = {Online},
  month     = sep,
}
```

Link to the paper: [PDF](https://hal.archives-ouvertes.fr/hal-03336060)
config.json ADDED
@@ -0,0 +1,33 @@
{
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "eos_token_ids": null,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "output_past": true,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "vocab_size": 32000
}
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:86789a8f6e22063bce5c8c944db629b7c1c5544761d4a6ecf0bc9b8540a653bc
size 64676287
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
{"bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "<unk>", "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]", "mask_token": "[MASK]"}
spiece.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f84b84d236af93fa229f8c3e400c02b31c8532e61b1b927f609a8b470d4afb71
size 737469
spiece.vocab ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,2 @@
{"keep_accents": true, "do_lower_case": false}