Commit 4027d82 by qanastek (parent: 073fb1c): Update README.md

README.md ADDED
---
tags:
- flair
- token-classification
- sequence-tagger-model
language: fr
widget:
- text: "George Washington est allé à Washington"
---

# POET: A French Extended Part-of-Speech Tagger

- Corpus: [ANTILLES](https://github.com/qanastek/ANTILLES)
- Embeddings: [CamemBERT](https://arxiv.org/abs/1911.03894)
- Sequence Labelling: [Transformers](https://arxiv.org/abs/1706.03762)
- Number of Epochs: 115

**People Involved**

* [LABRAK Yanis](https://www.linkedin.com/in/yanis-labrak-8a7412145/) (1)
* [DUFOUR Richard](https://cv.archives-ouvertes.fr/richard-dufour) (2)

**Affiliations**

1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France.
2. [LS2N, TALN team](https://www.ls2n.fr/equipe/taln/), Nantes University, Nantes, France.

## Demo: How to use in HuggingFace Transformers

Requires [transformers](https://pypi.org/project/transformers/): `pip install transformers`

```python
from transformers import CamembertTokenizer, CamembertForTokenClassification, TokenClassificationPipeline

# Load the fine-tuned model and its tokenizer (here from the current directory)
tokenizer = CamembertTokenizer.from_pretrained('./')
model = CamembertForTokenClassification.from_pretrained('./')
pos = TokenClassificationPipeline(model=model, tokenizer=tokenizer)

def make_prediction(sentence):
    # Pair each whitespace-separated word with its predicted tag
    labels = [l['entity'] for l in pos(sentence)]
    return list(zip(sentence.split(" "), labels))

res = make_prediction("George Washington est allé à Washington")
```

Output:

![Preview Output](preview.PNG)

## Training data

`ANTILLES` is a part-of-speech tagging corpus based on [UD_French-GSD](https://universaldependencies.org/treebanks/fr_gsd/index.html), which was originally created in 2015 and is itself based on the [universal dependency treebank v2.0](https://github.com/ryanmcd/uni-dep-tb).

The original corpus consists of 400,399 words (16,341 sentences) annotated with 17 distinct classes. After applying our tag augmentation, we obtain 60 distinct classes, which add linguistic and semantic information such as gender, number, mood, person, tense and verb form, derived from the CoNLL-U fields of the original corpus.

We based our tags on the level of detail provided by the [LIA_TAGG](http://pageperso.lif.univ-mrs.fr/frederic.bechet/download.html) statistical POS tagger, written by [Frédéric Béchet](http://pageperso.lif.univ-mrs.fr/frederic.bechet/index-english.html) in 2001.

The corpus used for this model is available on [GitHub](https://github.com/qanastek/ANTILLES) in the [CoNLL-U format](https://universaldependencies.org/format.html).

The training data are fed to the model as raw text and do not pass through a normalization phase, which makes the model case- and punctuation-sensitive.

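Since the corpus is distributed in CoNLL-U, a minimal pure-Python sketch of reading (token, tag) pairs from such a file may help. The sample sentence and the placement of the extended tag in the XPOS column (index 4) are illustrative assumptions, not taken from the actual ANTILLES files; the column layout follows the CoNLL-U specification.

```python
# Minimal CoNLL-U reader: extracts (FORM, XPOS) pairs per sentence.
def read_conllu(text):
    sentences = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level metadata
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multi-word ranges and empty nodes
        current.append((cols[1], cols[4]))  # FORM, XPOS
    if current:
        sentences.append(current)
    return sentences

# Hypothetical two-token sentence in CoNLL-U layout
sample = (
    "# text = Il dort\n"
    "1\tIl\til\tPRON\tPPER3MS\t_\t2\tnsubj\t_\t_\n"
    "2\tdort\tdormir\tVERB\tVERB\t_\t0\troot\t_\t_\n"
)
print(read_conllu(sample))
```
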
## Original Tags

```plain
PRON VERB SCONJ ADP CCONJ DET NOUN ADJ AUX ADV PUNCT PROPN NUM SYM PART X INTJ
```

## New additional POS tags

| Abbreviation | Description | Examples |
|:--------:|:--------:|:--------:|
| PREP | Preposition | de |
| AUX | Auxiliary Verb | est |
| ADV | Adverb | toujours |
| COSUB | Subordinating conjunction | que |
| COCO | Coordinating Conjunction | et |
| PART | Demonstrative particle | -t |
| PRON | Pronoun | qui ce quoi |
| PDEMMS | Demonstrative Pronoun - Singular Masculine | ce |
| PDEMMP | Demonstrative Pronoun - Plural Masculine | ceux |
| PDEMFS | Demonstrative Pronoun - Singular Feminine | cette |
| PDEMFP | Demonstrative Pronoun - Plural Feminine | celles |
| PINDMS | Indefinite Pronoun - Singular Masculine | tout |
| PINDMP | Indefinite Pronoun - Plural Masculine | autres |
| PINDFS | Indefinite Pronoun - Singular Feminine | chacune |
| PINDFP | Indefinite Pronoun - Plural Feminine | certaines |
| PROPN | Proper noun | Houston |
| XFAMIL | Last name | Levy |
| NUM | Numerical Adjective | trentaine vingtaine |
| DINTMS | Masculine Numerical Adjective | un |
| DINTFS | Feminine Numerical Adjective | une |
| PPOBJMS | Pronoun complements of objects - Singular Masculine | le lui |
| PPOBJMP | Pronoun complements of objects - Plural Masculine | eux y |
| PPOBJFS | Pronoun complements of objects - Singular Feminine | moi la |
| PPOBJFP | Pronoun complements of objects - Plural Feminine | en y |
| PPER1S | Personal Pronoun First-Person - Singular | je |
| PPER2S | Personal Pronoun Second-Person - Singular | tu |
| PPER3MS | Personal Pronoun Third-Person - Singular Masculine | il |
| PPER3MP | Personal Pronoun Third-Person - Plural Masculine | ils |
| PPER3FS | Personal Pronoun Third-Person - Singular Feminine | elle |
| PPER3FP | Personal Pronoun Third-Person - Plural Feminine | elles |
| PREFS | Reflexive Pronoun First-Person - Singular | me m' |
| PREF | Reflexive Pronoun Third-Person - Singular | se s' |
| PREFP | Reflexive Pronoun First / Second-Person - Plural | nous vous |
| VERB | Verb | obtient |
| VPPMS | Past Participle - Singular Masculine | formulé |
| VPPMP | Past Participle - Plural Masculine | classés |
| VPPFS | Past Participle - Singular Feminine | appelée |
| VPPFP | Past Participle - Plural Feminine | sanctionnées |
| DET | Determiner | les l' |
| DETMS | Determiner - Singular Masculine | les |
| DETFS | Determiner - Singular Feminine | la |
| ADJ | Adjective | capable sérieux |
| ADJMS | Adjective - Singular Masculine | grand important |
| ADJMP | Adjective - Plural Masculine | grands petits |
| ADJFS | Adjective - Singular Feminine | française petite |
| ADJFP | Adjective - Plural Feminine | légères petites |
| NOUN | Noun | temps |
| NMS | Noun - Singular Masculine | drapeau |
| NMP | Noun - Plural Masculine | journalistes |
| NFS | Noun - Singular Feminine | tête |
| NFP | Noun - Plural Feminine | ondes |
| PREL | Relative Pronoun | qui dont |
| PRELMS | Relative Pronoun - Singular Masculine | lequel |
| PRELMP | Relative Pronoun - Plural Masculine | lesquels |
| PRELFS | Relative Pronoun - Singular Feminine | laquelle |
| PRELFP | Relative Pronoun - Plural Feminine | lesquelles |
| INTJ | Interjection | merci bref |
| CHIF | Numbers | 1979 10 |
| SYM | Symbol | € % |
| YPFOR | Sentence endpoint | . |
| PUNCT | Punctuation | : , |
| MOTINC | Unknown words | Technology Lady |
| X | Typos & others | sfeir 3D statu |

## Evaluation results

The test corpus used for this evaluation is available on [GitHub](https://github.com/qanastek/ANTILLES/blob/main/ANTILLES/test.conllu).

```plain
              precision    recall  f1-score   support

         ADJ     0.9040    0.8828    0.8933       128
       ADJFP     0.9811    0.9585    0.9697       434
       ADJFS     0.9606    0.9826    0.9715       918
       ADJMP     0.9613    0.9357    0.9483       451
       ADJMS     0.9561    0.9611    0.9586       952
         ADV     0.9870    0.9948    0.9908      1524
         AUX     0.9956    0.9964    0.9960      1124
        CHIF     0.9798    0.9774    0.9786      1239
        COCO     1.0000    0.9989    0.9994       884
       COSUB     0.9939    0.9939    0.9939       328
         DET     0.9972    0.9972    0.9972      2897
       DETFS     0.9990    1.0000    0.9995      1007
       DETMS     1.0000    0.9993    0.9996      1426
      DINTFS     0.9967    0.9902    0.9934       306
      DINTMS     0.9923    0.9948    0.9935       387
        INTJ     0.8000    0.8000    0.8000         5
      MOTINC     0.5049    0.5827    0.5410       266
         NFP     0.9807    0.9675    0.9740       892
         NFS     0.9778    0.9699    0.9738      2588
         NMP     0.9687    0.9495    0.9590      1367
         NMS     0.9759    0.9560    0.9659      3181
        NOUN     0.6164    0.8673    0.7206       113
         NUM     0.6250    0.8333    0.7143         6
        PART     1.0000    0.9375    0.9677        16
      PDEMFP     1.0000    1.0000    1.0000         3
      PDEMFS     1.0000    1.0000    1.0000        89
      PDEMMP     1.0000    1.0000    1.0000        20
      PDEMMS     1.0000    1.0000    1.0000       222
      PINDFP     1.0000    1.0000    1.0000         3
      PINDFS     0.8571    1.0000    0.9231        12
      PINDMP     0.9000    1.0000    0.9474         9
      PINDMS     0.9286    0.9701    0.9489        67
      PINTFS     0.0000    0.0000    0.0000         2
      PPER1S     1.0000    1.0000    1.0000        62
      PPER2S     0.7500    1.0000    0.8571         3
     PPER3FP     1.0000    1.0000    1.0000         9
     PPER3FS     1.0000    1.0000    1.0000        96
     PPER3MP     1.0000    1.0000    1.0000        31
     PPER3MS     1.0000    1.0000    1.0000       377
     PPOBJFP     1.0000    0.7500    0.8571         4
     PPOBJFS     0.9167    0.8919    0.9041        37
     PPOBJMP     0.7500    0.7500    0.7500        12
     PPOBJMS     0.9371    0.9640    0.9504       139
        PREF     1.0000    1.0000    1.0000       332
       PREFP     1.0000    1.0000    1.0000        64
       PREFS     1.0000    1.0000    1.0000        13
        PREL     0.9964    0.9964    0.9964       277
      PRELFP     1.0000    1.0000    1.0000         5
      PRELFS     0.8000    1.0000    0.8889         4
      PRELMP     1.0000    1.0000    1.0000         3
      PRELMS     1.0000    1.0000    1.0000        11
        PREP     0.9971    0.9977    0.9974      6161
        PRON     0.9836    0.9836    0.9836        61
       PROPN     0.9468    0.9503    0.9486      4310
       PUNCT     1.0000    1.0000    1.0000      4019
         SYM     0.9394    0.8158    0.8732        76
        VERB     0.9956    0.9921    0.9938      2273
       VPPFP     0.9145    0.9469    0.9304       113
       VPPFS     0.9562    0.9597    0.9580       273
       VPPMP     0.8827    0.9728    0.9256       147
       VPPMS     0.9778    0.9794    0.9786       630
       VPPRE     0.0000    0.0000    0.0000         1
           X     0.9604    0.9935    0.9766      1073
      XFAMIL     0.9386    0.9113    0.9248      1342
       YPFOR     1.0000    1.0000    1.0000      2750

    accuracy                         0.9778     47574
   macro avg     0.9151    0.9285    0.9202     47574
weighted avg     0.9785    0.9778    0.9780     47574
```
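The per-class figures above are the usual precision, recall and F1. As a reference, a small sketch computing them for one class from parallel gold/predicted tag lists (toy data, not the actual test set):

```python
def prf(gold, pred, cls):
    # Precision / recall / F1 for a single class, as in the table above
    tp = sum(1 for g, p in zip(gold, pred) if p == cls and g == cls)
    fp = sum(1 for g, p in zip(gold, pred) if p == cls and g != cls)
    fn = sum(1 for g, p in zip(gold, pred) if p != cls and g == cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one NMS token was mis-tagged as VERB
gold = ["NMS", "VERB", "NMS", "ADV"]
pred = ["NMS", "VERB", "VERB", "ADV"]
print(prf(gold, pred, "NMS"))  # precision 1.0, recall 0.5
```
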

## BibTeX Citations

Please cite the following works when using this model.

UD_French-GSD corpus:

```latex
@misc{universaldependencies,
    title={UniversalDependencies/UD_French-GSD},
    url={https://github.com/UniversalDependencies/UD_French-GSD},
    journal={GitHub},
    author={UniversalDependencies}
}
```

LIA_TAGG:

```latex
@techreport{LIA_TAGG,
    author = {Frédéric Béchet},
    title = {LIA_TAGG: a statistical POS tagger + syntactic bracketer},
    institution = {Aix-Marseille University & CNRS},
    year = {2001}
}
```

Flair Embeddings:

```latex
@inproceedings{akbik2018coling,
    title={Contextual String Embeddings for Sequence Labeling},
    author={Akbik, Alan and Blythe, Duncan and Vollgraf, Roland},
    booktitle = {{COLING} 2018, 27th International Conference on Computational Linguistics},
    pages = {1638--1649},
    year = {2018}
}
```

## Acknowledgment

This work was financially supported by [Zenidoc](https://zenidoc.fr/).
config.json ADDED
{
  "_name_or_path": "camembert-base",
  "architectures": [
    "CamembertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 5,
  "classifier_dropout": null,
  "eos_token_id": 6,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "PART",
    "1": "PDEMMP",
    "10": "NOUN",
    "11": "PPER3MS",
    "12": "AUX",
    "13": "COSUB",
    "14": "ADJ",
    "15": "VPPRE",
    "16": "COCO",
    "17": "ADJMP",
    "18": "X",
    "19": "NMS",
    "2": "PREFS",
    "20": "PINDMS",
    "21": "DETFS",
    "22": "PPER2S",
    "23": "PREFP",
    "24": "PPER3MP",
    "25": "PRELMP",
    "26": "PINDFS",
    "27": "PRON",
    "28": "PREP",
    "29": "PPOBJMP",
    "3": "PINDMP",
    "30": "ADJFS",
    "31": "DET",
    "32": "ADJFP",
    "33": "PDEMFP",
    "34": "PREL",
    "35": "PPER3FS",
    "36": "VPPFS",
    "37": "PPER3FP",
    "38": "CHIF",
    "39": "NMP",
    "4": "DINTMS",
    "40": "SYM",
    "41": "NFS",
    "42": "VERB",
    "43": "PREF",
    "44": "VPPFP",
    "45": "PDEMMS",
    "46": "XFAMIL",
    "47": "PINDFP",
    "48": "VPPMP",
    "49": "YPFOR",
    "5": "NUM",
    "50": "ADV",
    "51": "PRELFS",
    "52": "DINTFS",
    "53": "DETMS",
    "54": "PPOBJFP",
    "55": "PPOBJMS",
    "56": "VPPMS",
    "57": "INTJ",
    "58": "PROPN",
    "59": "PDEMFS",
    "6": "PINTFS",
    "60": "PPER1S",
    "61": "PRELFP",
    "62": "MOTINC",
    "63": "ADJMS",
    "64": "PPOBJFS",
    "7": "NFP",
    "8": "PUNCT",
    "9": "PRELMS"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "ADJ": "14",
    "ADJFP": "32",
    "ADJFS": "30",
    "ADJMP": "17",
    "ADJMS": "63",
    "ADV": "50",
    "AUX": "12",
    "CHIF": "38",
    "COCO": "16",
    "COSUB": "13",
    "DET": "31",
    "DETFS": "21",
    "DETMS": "53",
    "DINTFS": "52",
    "DINTMS": "4",
    "INTJ": "57",
    "MOTINC": "62",
    "NFP": "7",
    "NFS": "41",
    "NMP": "39",
    "NMS": "19",
    "NOUN": "10",
    "NUM": "5",
    "PART": "0",
    "PDEMFP": "33",
    "PDEMFS": "59",
    "PDEMMP": "1",
    "PDEMMS": "45",
    "PINDFP": "47",
    "PINDFS": "26",
    "PINDMP": "3",
    "PINDMS": "20",
    "PINTFS": "6",
    "PPER1S": "60",
    "PPER2S": "22",
    "PPER3FP": "37",
    "PPER3FS": "35",
    "PPER3MP": "24",
    "PPER3MS": "11",
    "PPOBJFP": "54",
    "PPOBJFS": "64",
    "PPOBJMP": "29",
    "PPOBJMS": "55",
    "PREF": "43",
    "PREFP": "23",
    "PREFS": "2",
    "PREL": "34",
    "PRELFP": "61",
    "PRELFS": "51",
    "PRELMP": "25",
    "PRELMS": "9",
    "PREP": "28",
    "PRON": "27",
    "PROPN": "58",
    "PUNCT": "8",
    "SYM": "40",
    "VERB": "42",
    "VPPFP": "44",
    "VPPFS": "36",
    "VPPMP": "48",
    "VPPMS": "56",
    "VPPRE": "15",
    "X": "18",
    "XFAMIL": "46",
    "YPFOR": "49"
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "camembert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.12.5",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 32005
}
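The `id2label` / `label2id` tables above are what the pipeline uses to convert class indices into tag strings. A small sketch of that round trip, using an excerpt of the mapping (full table omitted):

```python
# Excerpt of the mappings from the config above (keys are strings, as in the JSON)
id2label = {"0": "PART", "11": "PPER3MS", "42": "VERB", "49": "YPFOR"}
label2id = {v: k for k, v in id2label.items()}

logit_argmax = 42                 # index of the highest logit for some token
tag = id2label[str(logit_argmax)] # class index -> tag string
print(tag)                        # VERB
```
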
logs/logs.txt ADDED
              precision    recall  f1-score   support

         ADJ     0.9040    0.8828    0.8933       128
       ADJFP     0.9811    0.9585    0.9697       434
       ADJFS     0.9606    0.9826    0.9715       918
       ADJMP     0.9613    0.9357    0.9483       451
       ADJMS     0.9561    0.9611    0.9586       952
         ADV     0.9870    0.9948    0.9908      1524
         AUX     0.9956    0.9964    0.9960      1124
        CHIF     0.9798    0.9774    0.9786      1239
        COCO     1.0000    0.9989    0.9994       884
       COSUB     0.9939    0.9939    0.9939       328
         DET     0.9972    0.9972    0.9972      2897
       DETFS     0.9990    1.0000    0.9995      1007
       DETMS     1.0000    0.9993    0.9996      1426
      DINTFS     0.9967    0.9902    0.9934       306
      DINTMS     0.9923    0.9948    0.9935       387
        INTJ     0.8000    0.8000    0.8000         5
      MOTINC     0.5049    0.5827    0.5410       266
         NFP     0.9807    0.9675    0.9740       892
         NFS     0.9778    0.9699    0.9738      2588
         NMP     0.9687    0.9495    0.9590      1367
         NMS     0.9759    0.9560    0.9659      3181
        NOUN     0.6164    0.8673    0.7206       113
         NUM     0.6250    0.8333    0.7143         6
        PART     1.0000    0.9375    0.9677        16
      PDEMFP     1.0000    1.0000    1.0000         3
      PDEMFS     1.0000    1.0000    1.0000        89
      PDEMMP     1.0000    1.0000    1.0000        20
      PDEMMS     1.0000    1.0000    1.0000       222
      PINDFP     1.0000    1.0000    1.0000         3
      PINDFS     0.8571    1.0000    0.9231        12
      PINDMP     0.9000    1.0000    0.9474         9
      PINDMS     0.9286    0.9701    0.9489        67
      PINTFS     0.0000    0.0000    0.0000         2
      PPER1S     1.0000    1.0000    1.0000        62
      PPER2S     0.7500    1.0000    0.8571         3
     PPER3FP     1.0000    1.0000    1.0000         9
     PPER3FS     1.0000    1.0000    1.0000        96
     PPER3MP     1.0000    1.0000    1.0000        31
     PPER3MS     1.0000    1.0000    1.0000       377
     PPOBJFP     1.0000    0.7500    0.8571         4
     PPOBJFS     0.9167    0.8919    0.9041        37
     PPOBJMP     0.7500    0.7500    0.7500        12
     PPOBJMS     0.9371    0.9640    0.9504       139
        PREF     1.0000    1.0000    1.0000       332
       PREFP     1.0000    1.0000    1.0000        64
       PREFS     1.0000    1.0000    1.0000        13
        PREL     0.9964    0.9964    0.9964       277
      PRELFP     1.0000    1.0000    1.0000         5
      PRELFS     0.8000    1.0000    0.8889         4
      PRELMP     1.0000    1.0000    1.0000         3
      PRELMS     1.0000    1.0000    1.0000        11
        PREP     0.9971    0.9977    0.9974      6161
        PRON     0.9836    0.9836    0.9836        61
       PROPN     0.9468    0.9503    0.9486      4310
       PUNCT     1.0000    1.0000    1.0000      4019
         SYM     0.9394    0.8158    0.8732        76
        VERB     0.9956    0.9921    0.9938      2273
       VPPFP     0.9145    0.9469    0.9304       113
       VPPFS     0.9562    0.9597    0.9580       273
       VPPMP     0.8827    0.9728    0.9256       147
       VPPMS     0.9778    0.9794    0.9786       630
       VPPRE     0.0000    0.0000    0.0000         1
           X     0.9604    0.9935    0.9766      1073
      XFAMIL     0.9386    0.9113    0.9248      1342
       YPFOR     1.0000    1.0000    1.0000      2750

    accuracy                         0.9778     47574
   macro avg     0.9151    0.9285    0.9202     47574
weighted avg     0.9785    0.9778    0.9780     47574

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags'],
        num_rows: 14453
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags'],
        num_rows: 1477
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags'],
        num_rows: 417
    })
})
optimizer.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6d4765236b34abe2f3ff11d7ad1724fd097175f43d78574642c1b1329128f8b2
size 880766181
predict.py ADDED
from transformers import CamembertTokenizer, CamembertForTokenClassification, TokenClassificationPipeline

OUTPUT_PATH = './'

# Load the fine-tuned model and its tokenizer
tokenizer = CamembertTokenizer.from_pretrained(OUTPUT_PATH)
model = CamembertForTokenClassification.from_pretrained(OUTPUT_PATH)

pos = TokenClassificationPipeline(model=model, tokenizer=tokenizer)

def make_prediction(sentence):
    # Pair each whitespace-separated word with its predicted tag
    labels = [l['entity'] for l in pos(sentence)]
    return list(zip(sentence.split(" "), labels))

res = make_prediction("George Washington est allé à Washington")
preview.PNG ADDED
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:995aba8183b98bcbeeeb548ea48ea9daea4f9143c62c0097738df40bdc21aa54
size 440410097
rng_state.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:4bf1777801e2479389bdfb66a8aa6f361d4127c6a9b4249841100ffe937982fb
size 16543
scheduler.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:9917c381e092c6b1349bc30c0b89ccbac8dda8de23bce2c62425520bd40ac616
size 623
sentencepiece.bpe.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:988bc5a00281c6d210a5d34bd143d0363741a432fefe741bf71e61b1869d4314
size 810912
special_tokens_map.json ADDED
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}, "additional_special_tokens": ["<s>NOTUSED", "</s>NOTUSED"]}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "additional_special_tokens": ["<s>NOTUSED", "</s>NOTUSED"], "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "camembert-base", "tokenizer_class": "CamembertTokenizer"}
trainer_state.json ADDED
{
  "best_metric": null,
  "best_model_checkpoint": null,
  "epoch": 19.90049751243781,
  "global_step": 12000,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {"epoch": 0.83, "learning_rate": 4.792703150912106e-05, "loss": 1.636, "step": 500},
    {"epoch": 1.0, "eval_accuracy": 0.9660949257998066, "eval_f1": 0.9648470438617105, "eval_loss": 0.40596073865890503, "eval_precision": 0.9632890778105937, "eval_recall": 0.9664100575985822, "eval_runtime": 12.913, "eval_samples_per_second": 114.381, "eval_steps_per_second": 4.801, "step": 603},
    {"epoch": 1.66, "learning_rate": 4.5854063018242126e-05, "loss": 0.3601, "step": 1000},
    {"epoch": 2.0, "eval_accuracy": 0.9747971581115735, "eval_f1": 0.9757520460075205, "eval_loss": 0.16257813572883606, "eval_precision": 0.9742435954063604, "eval_recall": 0.9772651750110767, "eval_runtime": 13.0776, "eval_samples_per_second": 112.941, "eval_steps_per_second": 4.741, "step": 1206},
    {"epoch": 2.49, "learning_rate": 4.3781094527363184e-05, "loss": 0.1654, "step": 1500},
    {"epoch": 3.0, "eval_accuracy": 0.9755118341951486, "eval_f1": 0.9770373954517177, "eval_loss": 0.11993579566478729, "eval_precision": 0.9755404025066946, "eval_recall": 0.9785389898094816, "eval_runtime": 12.7573, "eval_samples_per_second": 115.777, "eval_steps_per_second": 4.86, "step": 1809},
    {"epoch": 3.32, "learning_rate": 4.170812603648425e-05, "loss": 0.1051, "step": 2000},
    {"epoch": 4.0, "eval_accuracy": 0.9774877033673856, "eval_f1": 0.9789900275245854, "eval_loss": 0.10314257442951202, "eval_precision": 0.9779755160693067, "eval_recall": 0.9800066459902526, "eval_runtime": 12.7609, "eval_samples_per_second": 115.744, "eval_steps_per_second": 4.859, "step": 2412},
    {"epoch": 4.15, "learning_rate": 3.9635157545605314e-05, "loss": 0.072, "step": 2500},
    {"epoch": 4.98, "learning_rate": 3.756218905472637e-05, "loss": 0.0557, "step": 3000},
    {"epoch": 5.0, "eval_accuracy": 0.9762895699331567, "eval_f1": 0.979079382198808, "eval_loss": 0.10313071310520172, "eval_precision": 0.9777679582424259, "eval_recall": 0.9803943287549844, "eval_runtime": 12.8291, "eval_samples_per_second": 115.129, "eval_steps_per_second": 4.833, "step": 3015},
    {"epoch": 5.8, "learning_rate": 3.548922056384743e-05, "loss": 0.0406, "step": 3500},
    {"epoch": 6.0, "eval_accuracy": 0.977109345440787, "eval_f1": 0.9801621337465068, "eval_loss": 0.10502293705940247, "eval_precision": 0.9793221650909493, "eval_recall": 0.9810035445281347, "eval_runtime": 12.7666, "eval_samples_per_second": 115.693, "eval_steps_per_second": 4.856, "step": 3618},
    {"epoch": 6.63, "learning_rate": 3.341625207296849e-05, "loss": 0.0332, "step": 4000},
    {"epoch": 7.0, "eval_accuracy": 0.9776558624458738, "eval_f1": 0.9800943409276397, "eval_loss": 0.10666096210479736, "eval_precision": 0.9791868210840543, "eval_recall": 0.9810035445281347, "eval_runtime": 12.7474, "eval_samples_per_second": 115.867, "eval_steps_per_second": 4.864, "step": 4221},
    {"epoch": 7.46, "learning_rate": 3.1343283582089554e-05, "loss": 0.0248, "step": 4500},
    {"epoch": 8.0, "eval_accuracy": 0.9782654391053937, "eval_f1": 0.9810459324847814, "eval_loss": 0.10935021936893463, "eval_precision": 0.9802864410528644, "eval_recall": 0.9818066016836509, "eval_runtime": 12.9986, "eval_samples_per_second": 113.628, "eval_steps_per_second": 4.77, "step": 4824},
    {"epoch": 8.29, "learning_rate": 2.9270315091210616e-05, "loss": 0.0209, "step": 5000},
    {"epoch": 9.0, "eval_accuracy": 0.9786648169168033, "eval_f1": 0.9818035894668382, "eval_loss": 0.1135268285870552, "eval_precision": 0.9812197483059051, "eval_recall": 0.9823881258307487, "eval_runtime": 12.5526, "eval_samples_per_second": 117.665, "eval_steps_per_second": 4.939, "step": 5427},
    {"epoch": 9.12, "learning_rate": 2.7197346600331674e-05, "loss": 0.0177, "step": 5500},
    {"epoch": 9.95, "learning_rate": 2.512437810945274e-05, "loss": 0.0145, "step": 6000},
    {"epoch": 10.0, "eval_accuracy": 0.9779291209484172, "eval_f1": 0.9809597608900205, "eval_loss": 0.12104799598455429, "eval_precision": 0.980362871999115, "eval_recall": 0.9815573770491803, "eval_runtime": 12.8411, "eval_samples_per_second": 115.021, "eval_steps_per_second": 4.828, "step": 6030},
    {"epoch": 10.78, "learning_rate": 2.3051409618573798e-05, "loss": 0.0112, "step": 6500},
    {"epoch": 11.0, "eval_accuracy": 0.9774246437129525, "eval_f1": 0.9806444472114999, "eval_loss": 0.12379806488752365, "eval_precision": 0.9798988027760113, "eval_recall": 0.9813912272928667, "eval_runtime": 12.8231, "eval_samples_per_second": 115.183, "eval_steps_per_second": 4.835, "step": 6633},
    {"epoch": 11.61, "learning_rate": 2.097844112769486e-05, "loss": 0.0105, "step": 7000},
    {"epoch": 12.0, "eval_accuracy": 0.9780552402572834, "eval_f1": 0.9810323598179328, "eval_loss": 0.1279357671737671, "eval_precision": 0.9802593381072189, "eval_recall": 0.9818066016836509, "eval_runtime": 12.5624, "eval_samples_per_second": 117.573, "eval_steps_per_second": 4.935, "step": 7236},
    {"epoch": 12.44, "learning_rate": 1.890547263681592e-05, "loss": 0.0088, "step": 7500},
    {"epoch": 13.0, "eval_accuracy": 0.9773405641737083, "eval_f1": 0.9802169221404461, "eval_loss": 0.1307593435049057, "eval_precision": 0.9794039588632091, "eval_recall": 0.981031236154187, "eval_runtime": 12.6149, "eval_samples_per_second": 117.084, "eval_steps_per_second": 4.915, "step": 7839},
    {"epoch": 13.27, "learning_rate": 1.6832504145936983e-05, "loss": 0.0078, "step": 8000},
    {"epoch": 14.0, "eval_accuracy": 0.9781182999117165, "eval_f1": 0.980741027698608, "eval_loss": 0.13244280219078064, "eval_precision": 0.9800088480893657, "eval_recall": 0.9814743021710235, "eval_runtime": 12.7732, "eval_samples_per_second": 115.633, "eval_steps_per_second": 4.854, "step": 8442},
    {"epoch": 14.1, "learning_rate": 1.4759535655058043e-05, "loss": 0.0063, "step": 8500},
    {"epoch": 14.93, "learning_rate": 1.2686567164179105e-05, "loss": 0.0056, "step": 9000},
    {"epoch": 15.0, "eval_accuracy": 0.9781393197965275, "eval_f1": 0.9811566131710017, "eval_loss": 0.13830867409706116, "eval_precision": 0.9803970360539703, "eval_recall": 0.98191736818786, "eval_runtime": 12.5885, "eval_samples_per_second": 117.329, "eval_steps_per_second": 4.925, "step": 9045},
    {"epoch": 15.75, "learning_rate": 1.0613598673300167e-05, "loss": 0.0046, "step": 9500},
    {"epoch": 16.0, "eval_accuracy": 0.9780972800269054, "eval_f1": 0.9813613029099613, "eval_loss": 0.1351945400238037, "eval_precision": 0.9807506153718505, "eval_recall": 0.9819727514399645, "eval_runtime": 12.6382, "eval_samples_per_second": 116.868, "eval_steps_per_second": 4.906, "step": 9648},
    {"epoch": 16.58, "learning_rate": 8.540630182421228e-06, "loss": 0.0046, "step": 10000},
    {"epoch": 17.0, "eval_accuracy": 0.977845041409173, "eval_f1": 0.9810062666869561, "eval_loss": 0.14340569078922272, "eval_precision": 0.9801520387007602, "eval_recall": 0.9818619849357554, "eval_runtime": 12.6576, "eval_samples_per_second": 116.688, "eval_steps_per_second": 4.898, "step": 10251},
    {"epoch": 17.41, "learning_rate": 6.467661691542288e-06, "loss": 0.0033, "step": 10500},
    {"epoch": 18.0, "eval_accuracy": 0.9776979022154959, "eval_f1": 0.9810448835021307, "eval_loss": 0.1425262689590454, "eval_precision": 0.9803395642074991, "eval_recall": 0.9817512184315463, "eval_runtime": 12.8452, "eval_samples_per_second": 114.985, "eval_steps_per_second": 4.827, "step": 10854},
    {"epoch": 18.24, "learning_rate": 4.39469320066335e-06, "loss": 0.0032, "step": 11000},
    {"epoch": 19.0, "eval_accuracy": 0.9780972800269054, "eval_f1": 0.9813638816253684, "eval_loss": 0.14492034912109375, "eval_precision": 0.9806176901595377, "eval_recall": 0.982111209570226, "eval_runtime": 12.4458, "eval_samples_per_second": 118.675, "eval_steps_per_second": 4.982, "step": 11457},
    {"epoch": 19.07, "learning_rate": 2.3217247097844113e-06, "loss": 0.003, "step": 11500},
    {"epoch": 19.9, "learning_rate": 2.4875621890547267e-07, "loss": 0.0029, "step": 12000}
  ],
  "max_steps": 12060,
  "num_train_epochs": 20,
  "total_flos": 1.21984229568579e+16,
  "trial_name": null,
  "trial_params": null
}
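The logged learning rates are consistent with a linear decay to zero over `max_steps`, starting from an assumed initial rate of 5e-5 (inferred from the logged values, not stated explicitly in the file). A sketch:

```python
BASE_LR = 5e-5      # assumed initial rate, inferred from the logged values
MAX_STEPS = 12060   # "max_steps" from the trainer state above

def linear_lr(step):
    # Linear decay to zero over the full training run
    return BASE_LR * (1 - step / MAX_STEPS)

print(linear_lr(500))    # matches the first logged value, ~4.7927e-05
print(linear_lr(12000))  # matches the last logged value, ~2.4876e-07
```
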
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6f69d3406e6a240d9dca8d53ded465b0836fb3e46b9706b94b6e58db548f7051
size 2863