---
language: pl
license: mit
tags:
  - ner
datasets:
  - clarin-pl/kpwr-ner
metrics:
  - f1
  - accuracy
  - precision
  - recall
widget:
  - text: Nazywam się Jan Kowalski i mieszkam we Wrocławiu.
    example_title: Example
---

# FastPDN

FastPolDeepNer is a model designed for easy use, training, and configuration. The forerunner of this project is PolDeepNer2. The model implements a pipeline consisting of data processing and training, built with Hydra, PyTorch, PyTorch Lightning, and Transformers.

## How to use

Here is how to use this model to get named entities from a text:

```python
from transformers import pipeline

ner = pipeline('ner', model='clarin-pl/FastPDN')

text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
ner_results = ner(text)
for output in ner_results:
    print(output)
```

```
{'entity': 'B-nam_liv_person', 'score': 0.99957544, 'index': 4, 'word': 'Jan</w>', 'start': 12, 'end': 15}
{'entity': 'I-nam_liv_person', 'score': 0.99963534, 'index': 5, 'word': 'Kowalski</w>', 'start': 16, 'end': 24}
{'entity': 'B-nam_loc_gpe_city', 'score': 0.998931, 'index': 9, 'word': 'Wrocławiu</w>', 'start': 39, 'end': 48}
```
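Depending on your `transformers` version, the pipeline can also merge subword tokens into whole entity spans via the `aggregation_strategy` argument; a minimal sketch:

```python
from transformers import pipeline

# Group subword tokens into whole entity spans (e.g. "Jan Kowalski").
ner = pipeline('ner', model='clarin-pl/FastPDN', aggregation_strategy='simple')

for entity in ner("Nazywam się Jan Kowalski i mieszkam we Wrocławiu."):
    print(entity)
```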

Here is how to use this model to get the logits for every token in a text:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("clarin-pl/FastPDN")
model = AutoModelForTokenClassification.from_pretrained("clarin-pl/FastPDN")

text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
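Continuing from the snippet above, the logits can be mapped to label names by taking the argmax over the label dimension and looking the indices up in `model.config.id2label`; a minimal sketch:

```python
# Pick the highest-scoring label id for each token.
predictions = output.logits.argmax(dim=-1)[0]

# Align predicted labels with the tokenizer's tokens.
tokens = tokenizer.convert_ids_to_tokens(encoded_input["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id.item()])
```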

## Developing

The model pipeline consists of two steps, plus an optional third:

- Data processing
- Training
- (optional) Sharing the model to the Hugging Face Hub

### Config

This project uses Hydra for configuration. Every configuration used in this module is placed in .yaml files in the config directory, which has the following structure:

- prepare_data.yaml - main configuration for the data processing stage
- train.yaml - main configuration for the training stage
- share_mode.yaml - main configuration for sharing the model to the Hugging Face Hub
- callbacks - contains callbacks for the pytorch_lightning trainer
  - default.yaml
  - early_stopping.yaml
  - learning_rate_monitor.yaml
  - model_checkpoint.yaml
  - rich_progress_bar.yaml
- datamodule - contains the pytorch_lightning datamodule configuration
  - pdn.yaml
- experiment - contains all the configurations of executed experiments (see the sketch after this list)
- hydra - hydra configuration files
- loggers - contains loggers for the trainer
  - csv.yaml
  - many_loggers.yaml
  - tensorboards.yaml
  - wandb.yaml
- model - contains model architecture hyperparameters
  - default.yaml
  - distiluse.yaml
  - custom_classification_head.yaml
  - multilabel.yaml
- paths - contains paths for IO
- prepare_data - contains configuration for the data processing stage
  - cen_n82
  - default
- trainer - contains trainer configurations
  - default.yaml
  - cpu.yaml
  - gpu.yaml
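For illustration, an experiment file typically composes the config groups above and overrides selected values. The file name and override keys below are hypothetical, not taken from this repository:

```yaml
# config/experiment/example.yaml (hypothetical file name)
# @package _global_
defaults:
  - override /datamodule: pdn
  - override /model: default
  - override /trainer: gpu

# Hypothetical overrides for a single run.
trainer:
  max_epochs: 10
```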

### Training

1. Install requirements with poetry:

   ```bash
   poetry install
   ```

2. Use the poetry environment in the next steps:

   ```bash
   poetry shell
   ```

   or

   ```bash
   poetry run <command>
   ```

3. Prepare the dataset:

   ```bash
   python3 src/prepare_data.py experiment=<experiment-name>
   ```

4. Train the model:

   ```bash
   CUDA_VISIBLE_DEVICES=<device-id> python3 src/train.py experiment=<experiment-name>
   ```

5. (optional) Share the model to the Hugging Face Hub:

   ```bash
   python3 src/share_model.py
   ```
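Because the pipeline is configured with Hydra, individual config values can also be overridden from the command line. The keys below (`trainer.max_epochs`, `datamodule.batch_size`) are hypothetical examples, not verified against this repository's configs:

```bash
# Hypothetical Hydra overrides appended to the training command.
python3 src/train.py experiment=<experiment-name> trainer.max_epochs=5 datamodule.batch_size=32
```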

## Evaluation

Runs trained on cen_n82 and kpwr_n82:

| name | test/f1 | test/pdn2_f1 | test/acc | test/precision | test/recall |
|------|---------|--------------|----------|----------------|-------------|
| distiluse | 0.53 | 0.61 | 0.95 | 0.55 | 0.54 |
| herbert | 0.68 | 0.78 | 0.97 | 0.7 | 0.69 |

Runs trained and validated only on cen_n82:

| name | test/f1 | test/pdn2_f1 | test/acc | test/precision | test/recall |
|------|---------|--------------|----------|----------------|-------------|
| distiluse_cen | 0.58 | 0.7 | 0.96 | 0.6 | 0.59 |
| herbert_cen_bs32 | 0.71 | 0.84 | 0.97 | 0.72 | 0.72 |
| herbert_cen | 0.72 | 0.84 | 0.97 | 0.73 | 0.73 |

Detailed results for herbert:

| tag | f1 | precision | recall | support |
|-----|----|-----------|--------|---------|
| nam_eve_human_cultural | 0.65 | 0.53 | 0.83 | 88 |
| nam_pro_title_document | 0.87 | 0.82 | 0.92 | 50 |
| nam_loc_gpe_country | 0.82 | 0.76 | 0.9 | 258 |
| nam_oth_www | 0.71 | 0.85 | 0.61 | 18 |
| nam_liv_person | 0.94 | 0.89 | 1.0 | 8 |
| nam_adj_country | 0.44 | 0.42 | 0.46 | 94 |
| nam_org_institution | 0.15 | 0.16 | 0.14 | 22 |
| nam_loc_land_continent | 0.5 | 0.57 | 0.44 | 9 |
| nam_org_organization | 0.64 | 0.59 | 0.71 | 58 |
| nam_liv_god | 0.13 | 0.09 | 0.25 | 4 |
| nam_loc_gpe_city | 0.56 | 0.51 | 0.62 | 87 |
| nam_org_company | 0.0 | 0.0 | 0.0 | 4 |
| nam_oth_currency | 0.71 | 0.86 | 0.6 | 10 |
| nam_org_group_team | 0.87 | 0.79 | 0.96 | 106 |
| nam_fac_road | 0.67 | 0.67 | 0.67 | 6 |
| nam_fac_park | 0.39 | 0.7 | 0.27 | 26 |
| nam_pro_title_tv | 0.17 | 1.0 | 0.09 | 11 |
| nam_loc_gpe_admin3 | 0.91 | 0.97 | 0.86 | 35 |
| nam_adj | 0.47 | 0.5 | 0.44 | 9 |
| nam_loc_gpe_admin1 | 0.92 | 0.91 | 0.93 | 1146 |
| nam_oth_tech | 0.0 | 0.0 | 0.0 | 4 |
| nam_pro_brand | 0.93 | 0.88 | 1.0 | 14 |
| nam_fac_goe | 0.1 | 0.07 | 0.14 | 7 |
| nam_eve_human | 0.76 | 0.73 | 0.78 | 74 |
| nam_pro_vehicle | 0.81 | 0.79 | 0.83 | 36 |
| nam_oth | 0.8 | 0.82 | 0.79 | 47 |
| nam_org_nation | 0.85 | 0.87 | 0.84 | 516 |
| nam_pro_media_periodic | 0.95 | 0.94 | 0.96 | 603 |
| nam_adj_city | 0.43 | 0.39 | 0.47 | 19 |
| nam_oth_position | 0.56 | 0.54 | 0.58 | 26 |
| nam_pro_title | 0.63 | 0.68 | 0.59 | 22 |
| nam_pro_media_tv | 0.29 | 0.2 | 0.5 | 2 |
| nam_fac_system | 0.29 | 0.2 | 0.5 | 2 |
| nam_eve_human_holiday | 1.0 | 1.0 | 1.0 | 2 |
| nam_loc_gpe_admin2 | 0.83 | 0.91 | 0.76 | 51 |
| nam_adj_person | 0.86 | 0.75 | 1.0 | 3 |
| nam_pro_software | 0.67 | 1.0 | 0.5 | 2 |
| nam_num_house | 0.88 | 0.9 | 0.86 | 43 |
| nam_pro_media_web | 0.32 | 0.43 | 0.25 | 12 |
| nam_org_group | 0.5 | 0.45 | 0.56 | 9 |
| nam_loc_hydronym_river | 0.67 | 0.61 | 0.74 | 19 |
| nam_liv_animal | 0.88 | 0.79 | 1.0 | 11 |
| nam_pro_award | 0.8 | 1.0 | 0.67 | 3 |
| nam_pro | 0.82 | 0.8 | 0.83 | 243 |
| nam_org_political_party | 0.34 | 0.38 | 0.32 | 19 |
| nam_eve_human_sport | 0.65 | 0.73 | 0.58 | 19 |
| nam_pro_title_book | 0.94 | 0.93 | 0.95 | 149 |
| nam_org_group_band | 0.74 | 0.73 | 0.75 | 359 |
| nam_oth_data_format | 0.82 | 0.88 | 0.76 | 88 |
| nam_loc_astronomical | 0.75 | 0.72 | 0.79 | 341 |
| nam_loc_hydronym_sea | 0.4 | 1.0 | 0.25 | 4 |
| nam_loc_land_mountain | 0.95 | 0.96 | 0.95 | 74 |
| nam_loc_land_island | 0.55 | 0.52 | 0.59 | 46 |
| nam_num_phone | 0.91 | 0.91 | 0.91 | 137 |
| nam_pro_model_car | 0.56 | 0.64 | 0.5 | 14 |
| nam_loc_land_region | 0.52 | 0.5 | 0.55 | 11 |
| nam_liv_habitant | 0.38 | 0.29 | 0.54 | 13 |
| nam_eve | 0.47 | 0.38 | 0.61 | 85 |
| nam_loc_historical_region | 0.44 | 0.8 | 0.31 | 26 |
| nam_fac_bridge | 0.33 | 0.26 | 0.46 | 24 |
| nam_oth_license | 0.65 | 0.74 | 0.58 | 24 |
| nam_pro_media | 0.33 | 0.32 | 0.35 | 52 |
| nam_loc_gpe_subdivision | 0.0 | 0.0 | 0.0 | 9 |
| nam_loc_gpe_district | 0.84 | 0.86 | 0.81 | 108 |
| nam_loc | 0.67 | 0.6 | 0.75 | 4 |
| nam_pro_software_game | 0.75 | 0.61 | 0.95 | 20 |
| nam_pro_title_album | 0.6 | 0.56 | 0.65 | 52 |
| nam_loc_country_region | 0.81 | 0.74 | 0.88 | 26 |
| nam_pro_title_song | 0.52 | 0.6 | 0.45 | 111 |
| nam_org_organization_sub | 0.0 | 0.0 | 0.0 | 3 |
| nam_loc_land | 0.4 | 0.31 | 0.56 | 36 |
| nam_fac_square | 0.5 | 0.6 | 0.43 | 7 |
| nam_loc_hydronym | 0.67 | 0.56 | 0.82 | 11 |
| nam_loc_hydronym_lake | 0.51 | 0.44 | 0.61 | 96 |
| nam_fac_goe_stop | 0.35 | 0.3 | 0.43 | 7 |
| nam_pro_media_radio | 0.0 | 0.0 | 0.0 | 2 |
| nam_pro_title_treaty | 0.3 | 0.56 | 0.21 | 24 |
| nam_loc_hydronym_ocean | 0.35 | 0.38 | 0.33 | 33 |

To see all the experiments and graphs, head over to wandb: https://wandb.ai/clarin-pl/FastPDN

## Authors

- Grupa Wieszcze CLARIN-PL

## Contact