NorbertRop committed on
Commit
18726e6
1 Parent(s): 18f6ae9

Upload README.md with huggingface_hub

  1. README.md +243 -0
README.md ADDED
@@ -0,0 +1,243 @@
+ ---
+ language: pl
+ license: mit
+ tags:
+ - ner
+ datasets:
+ - clarin-pl/kpwr-ner
+ metrics:
+ - f1
+ - accuracy
+ - precision
+ - recall
+ widget:
+ - text: "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
+   example_title: "Example"
+ ---
+
+ # FastPDN
+
+ FastPolDeepNer is a model designed for easy use, training and configuration. Its forerunner is [PolDeepNer2](https://gitlab.clarin-pl.eu/information-extraction/poldeepner2). The project implements a data processing and training pipeline built on hydra, pytorch, pytorch-lightning and transformers.
+
+ ## How to use
+
+ Here is how to use this model to get the named entities in a text:
+
+ ```python
+ from transformers import pipeline
+
+ # Load the token-classification (NER) pipeline with the FastPDN model.
+ ner = pipeline('ner', model='clarin-pl/FastPDN')
+
+ text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
+ ner_results = ner(text)
+ for output in ner_results:
+     print(output)
+
+ # Output:
+ # {'entity': 'B-nam_liv_person', 'score': 0.99957544, 'index': 4, 'word': 'Jan</w>', 'start': 12, 'end': 15}
+ # {'entity': 'I-nam_liv_person', 'score': 0.99963534, 'index': 5, 'word': 'Kowalski</w>', 'start': 16, 'end': 24}
+ # {'entity': 'B-nam_loc_gpe_city', 'score': 0.998931, 'index': 9, 'word': 'Wrocławiu</w>', 'start': 39, 'end': 48}
+ ```
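The pipeline output above is token-level: `Jan` and `Kowalski` come back as separate `B-` and `I-` entries. A minimal sketch of merging them into whole entities using only the `start`/`end` offsets is given below. The helper name `group_entities` is hypothetical and not part of this project; the `transformers` pipeline also accepts an `aggregation_strategy` argument for a similar purpose, which the example above does not use.

```python
def group_entities(text, tokens):
    """Merge B-/I- tagged tokens into entity spans with their surface text."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if not label:
            # Tags without a B-/I- prefix (e.g. "O") carry no entity.
            continue
        # An I- tag continues the previous span only when the label matches.
        if prefix == "I" and spans and spans[-1]["label"] == label:
            spans[-1]["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
# Token-level predictions copied from the example output above (scores omitted).
tokens = [
    {"entity": "B-nam_liv_person", "start": 12, "end": 15},
    {"entity": "I-nam_liv_person", "start": 16, "end": 24},
    {"entity": "B-nam_loc_gpe_city", "start": 39, "end": 48},
]
entities = group_entities(text, tokens)
for entity in entities:
    print(entity["label"], "->", entity["text"])
```

Working from character offsets rather than the `word` field avoids having to strip the `</w>` suffix that the tokenizer appends.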
+
+ Here is how to use this model to get the logits for every token in a text:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("clarin-pl/FastPDN")
+ model = AutoModelForTokenClassification.from_pretrained("clarin-pl/FastPDN")
+
+ text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ # Logits have shape (batch_size, sequence_length, num_labels).
+ logits = output.logits
+ ```
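To turn per-token logits into tag names, one common approach is to take the highest-scoring label index at each position and look it up in the model's `id2label` mapping. The sketch below shows that step on plain Python lists; `decode_logits` is a hypothetical helper, and the toy logits and label map are made up for illustration rather than taken from the model.

```python
def decode_logits(logits, id2label):
    """Pick the highest-scoring label for each token position.

    logits: one list of per-label scores per token.
    id2label: mapping from label index to label name.
    """
    labels = []
    for scores in logits:
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(id2label[best])
    return labels


# Toy values for illustration only; they are not real model outputs.
toy_id2label = {0: "O", 1: "B-nam_liv_person", 2: "I-nam_liv_person"}
toy_logits = [
    [4.1, 0.2, -1.0],
    [0.3, 5.0, 0.1],
    [-0.5, 0.4, 3.7],
]
predicted = decode_logits(toy_logits, toy_id2label)
print(predicted)

# With the real model this would presumably be called as (not run here):
# decode_logits(output.logits[0].detach().tolist(), model.config.id2label)
```

Note that the tokenizer's special tokens also receive a prediction, so positions and words do not line up one-to-one without the tokenizer's offset information.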
+
+ ### Developing
+
+ The model pipeline consists of two steps, plus an optional third:
+
+ - Data processing
+ - Training
+ - (optional) Share model to Hugging Face Hub
+
+ #### Config
+
+ This project uses hydra for configuration. Every configuration used in this module
+ is placed in a `.yaml` file in the `config` directory.
+
+ The directory has the following structure:
+
+ - prepare_data.yaml - main configuration for the data processing stage
+ - train.yaml - main configuration for the training stage
+ - share_mode.yaml - main configuration for sharing the model to Hugging Face Hub
+ - callbacks - contains callbacks for the pytorch_lightning trainer
+   - default.yaml
+   - early_stopping.yaml
+   - learning_rate_monitor.yaml
+   - model_checkpoint.yaml
+   - rich_progress_bar.yaml
+ - datamodule - contains the pytorch_lightning datamodule configuration
+   - pdn.yaml
+ - experiment - contains all the configurations of executed experiments
+ - hydra - hydra configuration files
+ - loggers - contains loggers for the trainer
+   - csv.yaml
+   - many_loggers.yaml
+   - tensorboards.yaml
+   - wandb.yaml
+ - model - contains model architecture hyperparameters
+   - default.yaml
+   - distiluse.yaml
+   - custom_classification_head.yaml
+   - multilabel.yaml
+ - paths - contains paths for IO
+ - prepare_data - contains configuration for the data processing stage
+   - cen_n82
+   - default
+ - trainer - contains trainer configurations
+   - default.yaml
+   - cpu.yaml
+   - gpu.yaml
+
+ #### Training
+
+ 1. Install requirements with poetry
+
+ ```
+ poetry install
+ ```
+
+ 2. Use the poetry environment in the next steps
+
+ ```
+ poetry shell
+ ```
+
+ or
+
+ ```
+ poetry run <command>
+ ```
+
+ 3. Prepare the dataset
+
+ ```
+ python3 src/prepare_data.py experiment=<experiment-name>
+ ```
+
+ 4. Train the model
+
+ ```
+ CUDA_VISIBLE_DEVICES=<device-id> python3 src/train.py experiment=<experiment-name>
+ ```
+
+ 5. (optional) Share the model to Hugging Face Hub
+
+ ```
+ python3 src/share_model.py
+ ```
+
+ ## Evaluation
+
+ Runs trained on `cen_n82` and `kpwr_n82`:
+ | name |test/f1|test/pdn2_f1|test/acc|test/precision|test/recall|
+ |---------|-------|------------|--------|--------------|-----------|
+ |distiluse| 0.53 | 0.61 | 0.95 | 0.55 | 0.54 |
+ | herbert | 0.68 | 0.78 | 0.97 | 0.7 | 0.69 |
+
+ Runs trained and validated only on `cen_n82`:
+ | name |test/f1|test/pdn2_f1|test/acc|test/precision|test/recall|
+ |----------------|-------|------------|--------|--------------|-----------|
+ | distiluse_cen | 0.58 | 0.7 | 0.96 | 0.6 | 0.59 |
+ |herbert_cen_bs32| 0.71 | 0.84 | 0.97 | 0.72 | 0.72 |
+ | herbert_cen | 0.72 | 0.84 | 0.97 | 0.73 | 0.73 |
+
+ Detailed results for `herbert`:
+ | tag | f1 |precision|recall|support|
+ |-------------------------|----|---------|------|-------|
+ | nam_eve_human_cultural |0.65| 0.53 | 0.83 | 88 |
+ | nam_pro_title_document |0.87| 0.82 | 0.92 | 50 |
+ | nam_loc_gpe_country |0.82| 0.76 | 0.9 | 258 |
+ | nam_oth_www |0.71| 0.85 | 0.61 | 18 |
+ | nam_liv_person |0.94| 0.89 | 1.0 | 8 |
+ | nam_adj_country |0.44| 0.42 | 0.46 | 94 |
+ | nam_org_institution |0.15| 0.16 | 0.14 | 22 |
+ | nam_loc_land_continent | 0.5| 0.57 | 0.44 | 9 |
+ | nam_org_organization |0.64| 0.59 | 0.71 | 58 |
+ | nam_liv_god |0.13| 0.09 | 0.25 | 4 |
+ | nam_loc_gpe_city |0.56| 0.51 | 0.62 | 87 |
+ | nam_org_company | 0.0| 0.0 | 0.0 | 4 |
+ | nam_oth_currency |0.71| 0.86 | 0.6 | 10 |
+ | nam_org_group_team |0.87| 0.79 | 0.96 | 106 |
+ | nam_fac_road |0.67| 0.67 | 0.67 | 6 |
+ | nam_fac_park |0.39| 0.7 | 0.27 | 26 |
+ | nam_pro_title_tv |0.17| 1.0 | 0.09 | 11 |
+ | nam_loc_gpe_admin3 |0.91| 0.97 | 0.86 | 35 |
+ | nam_adj |0.47| 0.5 | 0.44 | 9 |
+ | nam_loc_gpe_admin1 |0.92| 0.91 | 0.93 | 1146 |
+ | nam_oth_tech | 0.0| 0.0 | 0.0 | 4 |
+ | nam_pro_brand |0.93| 0.88 | 1.0 | 14 |
+ | nam_fac_goe | 0.1| 0.07 | 0.14 | 7 |
+ | nam_eve_human |0.76| 0.73 | 0.78 | 74 |
+ | nam_pro_vehicle |0.81| 0.79 | 0.83 | 36 |
+ | nam_oth | 0.8| 0.82 | 0.79 | 47 |
+ | nam_org_nation |0.85| 0.87 | 0.84 | 516 |
+ | nam_pro_media_periodic |0.95| 0.94 | 0.96 | 603 |
+ | nam_adj_city |0.43| 0.39 | 0.47 | 19 |
+ | nam_oth_position |0.56| 0.54 | 0.58 | 26 |
+ | nam_pro_title |0.63| 0.68 | 0.59 | 22 |
+ | nam_pro_media_tv |0.29| 0.2 | 0.5 | 2 |
+ | nam_fac_system |0.29| 0.2 | 0.5 | 2 |
+ | nam_eve_human_holiday | 1.0| 1.0 | 1.0 | 2 |
+ | nam_loc_gpe_admin2 |0.83| 0.91 | 0.76 | 51 |
+ | nam_adj_person |0.86| 0.75 | 1.0 | 3 |
+ | nam_pro_software |0.67| 1.0 | 0.5 | 2 |
+ | nam_num_house |0.88| 0.9 | 0.86 | 43 |
+ | nam_pro_media_web |0.32| 0.43 | 0.25 | 12 |
+ | nam_org_group | 0.5| 0.45 | 0.56 | 9 |
+ | nam_loc_hydronym_river |0.67| 0.61 | 0.74 | 19 |
+ | nam_liv_animal |0.88| 0.79 | 1.0 | 11 |
+ | nam_pro_award | 0.8| 1.0 | 0.67 | 3 |
+ | nam_pro |0.82| 0.8 | 0.83 | 243 |
+ | nam_org_political_party |0.34| 0.38 | 0.32 | 19 |
+ | nam_eve_human_sport |0.65| 0.73 | 0.58 | 19 |
+ | nam_pro_title_book |0.94| 0.93 | 0.95 | 149 |
+ | nam_org_group_band |0.74| 0.73 | 0.75 | 359 |
+ | nam_oth_data_format |0.82| 0.88 | 0.76 | 88 |
+ | nam_loc_astronomical |0.75| 0.72 | 0.79 | 341 |
+ | nam_loc_hydronym_sea | 0.4| 1.0 | 0.25 | 4 |
+ | nam_loc_land_mountain |0.95| 0.96 | 0.95 | 74 |
+ | nam_loc_land_island |0.55| 0.52 | 0.59 | 46 |
+ | nam_num_phone |0.91| 0.91 | 0.91 | 137 |
+ | nam_pro_model_car |0.56| 0.64 | 0.5 | 14 |
+ | nam_loc_land_region |0.52| 0.5 | 0.55 | 11 |
+ | nam_liv_habitant |0.38| 0.29 | 0.54 | 13 |
+ | nam_eve |0.47| 0.38 | 0.61 | 85 |
+ | nam_loc_historical_region|0.44| 0.8 | 0.31 | 26 |
+ | nam_fac_bridge |0.33| 0.26 | 0.46 | 24 |
+ | nam_oth_license |0.65| 0.74 | 0.58 | 24 |
+ | nam_pro_media |0.33| 0.32 | 0.35 | 52 |
+ | nam_loc_gpe_subdivision | 0.0| 0.0 | 0.0 | 9 |
+ | nam_loc_gpe_district |0.84| 0.86 | 0.81 | 108 |
+ | nam_loc |0.67| 0.6 | 0.75 | 4 |
+ | nam_pro_software_game |0.75| 0.61 | 0.95 | 20 |
+ | nam_pro_title_album | 0.6| 0.56 | 0.65 | 52 |
+ | nam_loc_country_region |0.81| 0.74 | 0.88 | 26 |
+ | nam_pro_title_song |0.52| 0.6 | 0.45 | 111 |
+ | nam_org_organization_sub| 0.0| 0.0 | 0.0 | 3 |
+ | nam_loc_land | 0.4| 0.31 | 0.56 | 36 |
+ | nam_fac_square | 0.5| 0.6 | 0.43 | 7 |
+ | nam_loc_hydronym |0.67| 0.56 | 0.82 | 11 |
+ | nam_loc_hydronym_lake |0.51| 0.44 | 0.61 | 96 |
+ | nam_fac_goe_stop |0.35| 0.3 | 0.43 | 7 |
+ | nam_pro_media_radio | 0.0| 0.0 | 0.0 | 2 |
+ | nam_pro_title_treaty | 0.3| 0.56 | 0.21 | 24 |
+ | nam_loc_hydronym_ocean |0.35| 0.38 | 0.33 | 33 |
+
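As a reading aid for the per-tag table, the sketch below recomputes `f1` from `precision` and `recall` for three rows. It assumes the per-tag `f1` column is the usual harmonic mean of the two, which the README does not state explicitly. Because the table values are rounded to two decimals, other rows may differ in the last digit, and the aggregate `test/f1` figures in the run tables need not follow this formula.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from three rows of the table above.
rows = {
    "nam_eve_human_cultural": (0.53, 0.83, 0.65),
    "nam_liv_person": (0.89, 1.0, 0.94),
    "nam_fac_park": (0.7, 0.27, 0.39),
}
for tag, (precision, recall, reported) in rows.items():
    print(tag, round(f1_score(precision, recall), 2), reported)
```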
+ To see all the experiments and graphs, head over to wandb: https://wandb.ai/clarin-pl/FastPDN
+
+ ## Authors
+
+ - Grupa Wieszcze CLARIN-PL
+
+ ## Contact
+
+ - Norbert Ropiak (norbert.ropiak@pwr.edu.pl)