---
language: pl
license: mit
tags:
  - ner
datasets:
  - clarin-pl/kpwr-ner
metrics:
  - f1
  - accuracy
  - precision
  - recall
widget:
  - text: Nazywam się Jan Kowalski i mieszkam we Wrocławiu.
    example_title: Example
---

# FastPDN

FastPolDeepNer is a model designed for easy use, training, and configuration. The forerunner of this project is PolDeepNer2. The model implements a pipeline consisting of data processing and training, built with Hydra, PyTorch, PyTorch Lightning, and Transformers.

## How to use

Here is how to use this model to get named entities from a text:

```python
from transformers import pipeline

ner = pipeline('ner', model='clarin-pl/FastPDN')

text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
ner_results = ner(text)
for output in ner_results:
    print(output)
```

```
{'entity': 'B-nam_liv_person', 'score': 0.99957544, 'index': 4, 'word': 'Jan</w>', 'start': 12, 'end': 15}
{'entity': 'I-nam_liv_person', 'score': 0.99963534, 'index': 5, 'word': 'Kowalski</w>', 'start': 16, 'end': 24}
{'entity': 'B-nam_loc_gpe_city', 'score': 0.998931, 'index': 9, 'word': 'Wrocławiu</w>', 'start': 39, 'end': 48}
```
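Depending on your `transformers` version, the pipeline can also merge subword tokens into whole entity spans via the `aggregation_strategy` argument; a minimal sketch:

```python
from transformers import pipeline

# Group subword tokens into whole entity spans (e.g. "Jan Kowalski").
ner = pipeline('ner', model='clarin-pl/FastPDN', aggregation_strategy='simple')

for entity in ner("Nazywam się Jan Kowalski i mieszkam we Wrocławiu."):
    print(entity)
```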

Here is how to use this model to get the logits for every token in a text:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("clarin-pl/FastPDN")
model = AutoModelForTokenClassification.from_pretrained("clarin-pl/FastPDN")

text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
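Continuing from the snippet above, the logits can be mapped to label names by taking the argmax over the label dimension and looking the indices up in `model.config.id2label`; a minimal sketch:

```python
# Pick the highest-scoring label id for each token.
predictions = output.logits.argmax(dim=-1)[0]

# Align predicted labels with the tokenizer's tokens.
tokens = tokenizer.convert_ids_to_tokens(encoded_input["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id.item()])
```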

## Developing

The model pipeline consists of two steps, plus an optional third:

- Data processing
- Training
- (optional) Sharing the model to the Hugging Face Hub

### Config

This project uses Hydra for configuration. Every configuration used in this module is placed in .yaml files in the config directory, which has the following structure:

- prepare_data.yaml - main configuration for the data processing stage
- train.yaml - main configuration for the training stage
- share_mode.yaml - main configuration for sharing the model to the Hugging Face Hub
- callbacks - contains callbacks for the pytorch_lightning trainer
  - default.yaml
  - early_stopping.yaml
  - learning_rate_monitor.yaml
  - model_checkpoint.yaml
  - rich_progress_bar.yaml
- datamodule - contains the pytorch_lightning datamodule configuration
  - pdn.yaml
- experiment - contains all the configurations of executed experiments (see the sketch after this list)
- hydra - hydra configuration files
- loggers - contains loggers for the trainer
  - csv.yaml
  - many_loggers.yaml
  - tensorboards.yaml
  - wandb.yaml
- model - contains model architecture hyperparameters
  - default.yaml
  - distiluse.yaml
  - custom_classification_head.yaml
  - multilabel.yaml
- paths - contains paths for IO
- prepare_data - contains configuration for the data processing stage
  - cen_n82
  - default
- trainer - contains trainer configurations
  - default.yaml
  - cpu.yaml
  - gpu.yaml
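For illustration, an experiment file typically composes the config groups above and overrides selected values. The file name and override keys below are hypothetical, not taken from this repository:

```yaml
# config/experiment/example.yaml (hypothetical file name)
# @package _global_
defaults:
  - override /datamodule: pdn
  - override /model: default
  - override /trainer: gpu

# Hypothetical overrides for a single run.
trainer:
  max_epochs: 10
```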

### Training

1. Install requirements with poetry:

   ```bash
   poetry install
   ```

2. Use the poetry environment in the next steps:

   ```bash
   poetry shell
   ```

   or

   ```bash
   poetry run <command>
   ```

3. Prepare the dataset:

   ```bash
   python3 src/prepare_data.py experiment=<experiment-name>
   ```

4. Train the model:

   ```bash
   CUDA_VISIBLE_DEVICES=<device-id> python3 src/train.py experiment=<experiment-name>
   ```

5. (optional) Share the model to the Hugging Face Hub:

   ```bash
   python3 src/share_model.py
   ```
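Because the pipeline is configured with Hydra, individual config values can also be overridden from the command line. The keys below (`trainer.max_epochs`, `datamodule.batch_size`) are hypothetical examples, not verified against this repository's configs:

```bash
# Hypothetical Hydra overrides appended to the training command.
python3 src/train.py experiment=<experiment-name> trainer.max_epochs=5 datamodule.batch_size=32
```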

## Evaluation

Runs trained on cen_n82 and kpwr_n82:

| name | test/f1 | test/pdn2_f1 | test/acc | test/precision | test/recall |
|------|---------|--------------|----------|----------------|-------------|
| distiluse | 0.53 | 0.61 | 0.95 | 0.55 | 0.54 |
| herbert | 0.68 | 0.78 | 0.97 | 0.7 | 0.69 |

Runs trained and validated only on cen_n82:

| name | test/f1 | test/pdn2_f1 | test/acc | test/precision | test/recall |
|------|---------|--------------|----------|----------------|-------------|
| distiluse_cen | 0.58 | 0.7 | 0.96 | 0.6 | 0.59 |
| herbert_cen_bs32 | 0.71 | 0.84 | 0.97 | 0.72 | 0.72 |
| herbert_cen | 0.72 | 0.84 | 0.97 | 0.73 | 0.73 |

Detailed results for herbert:

| tag | f1 | precision | recall | support |
|-----|----|-----------|--------|---------|
| nam_eve_human_cultural | 0.65 | 0.53 | 0.83 | 88 |
| nam_pro_title_document | 0.87 | 0.82 | 0.92 | 50 |
| nam_loc_gpe_country | 0.82 | 0.76 | 0.9 | 258 |
| nam_oth_www | 0.71 | 0.85 | 0.61 | 18 |
| nam_liv_person | 0.94 | 0.89 | 1.0 | 8 |
| nam_adj_country | 0.44 | 0.42 | 0.46 | 94 |
| nam_org_institution | 0.15 | 0.16 | 0.14 | 22 |
| nam_loc_land_continent | 0.5 | 0.57 | 0.44 | 9 |
| nam_org_organization | 0.64 | 0.59 | 0.71 | 58 |
| nam_liv_god | 0.13 | 0.09 | 0.25 | 4 |
| nam_loc_gpe_city | 0.56 | 0.51 | 0.62 | 87 |
| nam_org_company | 0.0 | 0.0 | 0.0 | 4 |
| nam_oth_currency | 0.71 | 0.86 | 0.6 | 10 |
| nam_org_group_team | 0.87 | 0.79 | 0.96 | 106 |
| nam_fac_road | 0.67 | 0.67 | 0.67 | 6 |
| nam_fac_park | 0.39 | 0.7 | 0.27 | 26 |
| nam_pro_title_tv | 0.17 | 1.0 | 0.09 | 11 |
| nam_loc_gpe_admin3 | 0.91 | 0.97 | 0.86 | 35 |
| nam_adj | 0.47 | 0.5 | 0.44 | 9 |
| nam_loc_gpe_admin1 | 0.92 | 0.91 | 0.93 | 1146 |
| nam_oth_tech | 0.0 | 0.0 | 0.0 | 4 |
| nam_pro_brand | 0.93 | 0.88 | 1.0 | 14 |
| nam_fac_goe | 0.1 | 0.07 | 0.14 | 7 |
| nam_eve_human | 0.76 | 0.73 | 0.78 | 74 |
| nam_pro_vehicle | 0.81 | 0.79 | 0.83 | 36 |
| nam_oth | 0.8 | 0.82 | 0.79 | 47 |
| nam_org_nation | 0.85 | 0.87 | 0.84 | 516 |
| nam_pro_media_periodic | 0.95 | 0.94 | 0.96 | 603 |
| nam_adj_city | 0.43 | 0.39 | 0.47 | 19 |
| nam_oth_position | 0.56 | 0.54 | 0.58 | 26 |
| nam_pro_title | 0.63 | 0.68 | 0.59 | 22 |
| nam_pro_media_tv | 0.29 | 0.2 | 0.5 | 2 |
| nam_fac_system | 0.29 | 0.2 | 0.5 | 2 |
| nam_eve_human_holiday | 1.0 | 1.0 | 1.0 | 2 |
| nam_loc_gpe_admin2 | 0.83 | 0.91 | 0.76 | 51 |
| nam_adj_person | 0.86 | 0.75 | 1.0 | 3 |
| nam_pro_software | 0.67 | 1.0 | 0.5 | 2 |
| nam_num_house | 0.88 | 0.9 | 0.86 | 43 |
| nam_pro_media_web | 0.32 | 0.43 | 0.25 | 12 |
| nam_org_group | 0.5 | 0.45 | 0.56 | 9 |
| nam_loc_hydronym_river | 0.67 | 0.61 | 0.74 | 19 |
| nam_liv_animal | 0.88 | 0.79 | 1.0 | 11 |
| nam_pro_award | 0.8 | 1.0 | 0.67 | 3 |
| nam_pro | 0.82 | 0.8 | 0.83 | 243 |
| nam_org_political_party | 0.34 | 0.38 | 0.32 | 19 |
| nam_eve_human_sport | 0.65 | 0.73 | 0.58 | 19 |
| nam_pro_title_book | 0.94 | 0.93 | 0.95 | 149 |
| nam_org_group_band | 0.74 | 0.73 | 0.75 | 359 |
| nam_oth_data_format | 0.82 | 0.88 | 0.76 | 88 |
| nam_loc_astronomical | 0.75 | 0.72 | 0.79 | 341 |
| nam_loc_hydronym_sea | 0.4 | 1.0 | 0.25 | 4 |
| nam_loc_land_mountain | 0.95 | 0.96 | 0.95 | 74 |
| nam_loc_land_island | 0.55 | 0.52 | 0.59 | 46 |
| nam_num_phone | 0.91 | 0.91 | 0.91 | 137 |
| nam_pro_model_car | 0.56 | 0.64 | 0.5 | 14 |
| nam_loc_land_region | 0.52 | 0.5 | 0.55 | 11 |
| nam_liv_habitant | 0.38 | 0.29 | 0.54 | 13 |
| nam_eve | 0.47 | 0.38 | 0.61 | 85 |
| nam_loc_historical_region | 0.44 | 0.8 | 0.31 | 26 |
| nam_fac_bridge | 0.33 | 0.26 | 0.46 | 24 |
| nam_oth_license | 0.65 | 0.74 | 0.58 | 24 |
| nam_pro_media | 0.33 | 0.32 | 0.35 | 52 |
| nam_loc_gpe_subdivision | 0.0 | 0.0 | 0.0 | 9 |
| nam_loc_gpe_district | 0.84 | 0.86 | 0.81 | 108 |
| nam_loc | 0.67 | 0.6 | 0.75 | 4 |
| nam_pro_software_game | 0.75 | 0.61 | 0.95 | 20 |
| nam_pro_title_album | 0.6 | 0.56 | 0.65 | 52 |
| nam_loc_country_region | 0.81 | 0.74 | 0.88 | 26 |
| nam_pro_title_song | 0.52 | 0.6 | 0.45 | 111 |
| nam_org_organization_sub | 0.0 | 0.0 | 0.0 | 3 |
| nam_loc_land | 0.4 | 0.31 | 0.56 | 36 |
| nam_fac_square | 0.5 | 0.6 | 0.43 | 7 |
| nam_loc_hydronym | 0.67 | 0.56 | 0.82 | 11 |
| nam_loc_hydronym_lake | 0.51 | 0.44 | 0.61 | 96 |
| nam_fac_goe_stop | 0.35 | 0.3 | 0.43 | 7 |
| nam_pro_media_radio | 0.0 | 0.0 | 0.0 | 2 |
| nam_pro_title_treaty | 0.3 | 0.56 | 0.21 | 24 |
| nam_loc_hydronym_ocean | 0.35 | 0.38 | 0.33 | 33 |

To see all the experiments and graphs, head over to wandb: https://wandb.ai/clarin-pl/FastPDN

## Authors

- Grupa Wieszcze CLARIN-PL

## Contact