	---
	language: pl
	license: mit
	tags:
	- ner
	datasets:
	- clarin-pl/kpwr-ner
	metrics:
	- f1
	- accuracy
	- precision
	- recall
	widget:
	- text: "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
	  example_title: "Example"
	---

	# FastPDN

	FastPolDeepNer is a named entity recognition model designed for easy use, training and configuration. It is the successor to [PolDeepNer2](https://gitlab.clarin-pl.eu/information-extraction/poldeepner2). The project implements a pipeline for data processing and training built on hydra, pytorch, pytorch-lightning and transformers.

	## How to use

	Here is how to use this model to extract the named entities from a text:

	```python
	from transformers import pipeline

	# Load a token-classification pipeline backed by the FastPDN checkpoint.
	ner = pipeline('ner', model='clarin-pl/FastPDN')

	text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
	ner_results = ner(text)
	for output in ner_results:
	    print(output)

	# Prints:
	# {'entity': 'B-nam_liv_person', 'score': 0.99957544, 'index': 4, 'word': 'Jan</w>', 'start': 12, 'end': 15}
	# {'entity': 'I-nam_liv_person', 'score': 0.99963534, 'index': 5, 'word': 'Kowalski</w>', 'start': 16, 'end': 24}
	# {'entity': 'B-nam_loc_gpe_city', 'score': 0.998931, 'index': 9, 'word': 'Wrocławiu</w>', 'start': 39, 'end': 48}
	```
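The pipeline output above lists one dictionary per tagged token, so a multi-token name such as "Jan Kowalski" arrives as a `B-` token followed by an `I-` token. A minimal sketch of merging such tokens into entity spans follows. `group_entities` is a hypothetical helper, not part of FastPDN or transformers, and the token list is copied from the output above, trimmed to the fields the helper reads.

```python
def group_entities(text, tokens):
    """Merge B-/I- tagged tokens into entity spans (illustrative helper)."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        last = spans[-1] if spans else None
        # Extend the open span only for an I- tag that carries the same label
        # and directly follows the previous token.
        if (
            prefix == "I"
            and last is not None
            and last["label"] == label
            and tok["index"] == last["last_index"] + 1
        ):
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            spans.append({
                "label": label,
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
            })
    # Recover the surface form from the character offsets.
    return [
        {"label": s["label"], "text": text[s["start"]:s["end"]],
         "start": s["start"], "end": s["end"]}
        for s in spans
    ]


text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
ner_results = [
    {"entity": "B-nam_liv_person", "index": 4, "start": 12, "end": 15},
    {"entity": "I-nam_liv_person", "index": 5, "start": 16, "end": 24},
    {"entity": "B-nam_loc_gpe_city", "index": 9, "start": 39, "end": 48},
]
entities = group_entities(text, ner_results)
for entity in entities:
    print(entity["label"], entity["text"])
```

The transformers pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs similar grouping; the sketch only illustrates the idea.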

	Here is how to use this model to get the logits for every token in a text:

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	tokenizer = AutoTokenizer.from_pretrained("clarin-pl/FastPDN")
	model = AutoModelForTokenClassification.from_pretrained("clarin-pl/FastPDN")

	text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
	encoded_input = tokenizer(text, return_tensors='pt')
	output = model(**encoded_input)
	# output.logits has shape (batch_size, sequence_length, num_labels)
	```
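The logits hold one row of scores per token, with one column per label, so the predicted tag for a token is the label with the highest score in its row. A minimal sketch of that decoding step follows. It uses plain lists, made-up logits and a made-up three-label mapping so that it runs without downloading the model; `decode_logits` is a hypothetical helper. With the real model, the inputs would presumably be `output.logits[0].tolist()` and `model.config.id2label`.

```python
def decode_logits(logits, id2label):
    """Return the highest-scoring label name for each token row."""
    labels = []
    for row in logits:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over labels
        labels.append(id2label[best])
    return labels


# Toy values for illustration only; they are not produced by FastPDN.
toy_id2label = {0: "O", 1: "B-nam_liv_person", 2: "I-nam_liv_person"}
toy_logits = [
    [2.1, 0.3, -1.0],
    [0.2, 3.5, 0.1],
    [-0.5, 0.4, 2.9],
]
predicted = decode_logits(toy_logits, toy_id2label)
print(predicted)
```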

	### Developing

	The model pipeline consists of two steps, plus an optional third:

	- Data processing
	- Training
	- (optional) Sharing the model on the Hugging Face Hub

	#### Config

	This project uses hydra for configuration. Every configuration used in this module
	is stored in `.yaml` files in the `config` directory.

	The directory has the following structure:

	- prepare_data.yaml - main configuration for the data processing stage
	- train.yaml - main configuration for the training stage
	- share_mode.yaml - main configuration for sharing the model on the Hugging Face Hub
	- callbacks - contains callbacks for pytorch_lightning trainer
	  - default.yaml
	  - early_stopping.yaml
	  - learning_rate_monitor.yaml
	  - model_checkpoint.yaml
	  - rich_progress_bar.yaml
	- datamodule - contains pytorch_lightning datamodule configuration
	  - pdn.yaml
	- experiment - contains all the configurations of executed experiments
	- hydra - hydra configuration files
	- loggers - contains loggers for trainer
	  - csv.yaml
	  - many_loggers.yaml
	  - tensorboards.yaml
	  - wandb.yaml
	- model - contains model architecture hyperparameters
	  - default.yaml
	  - distiluse.yaml
	  - custom_classification_head.yaml
	  - multilabel.yaml
	- paths - contains paths for IO
	- prepare_data - contains configuration for data processing stage
	  - cen_n82
	  - default
	- trainer - contains trainer configurations
	  - default.yaml
	  - cpu.yaml
	  - gpu.yaml


	#### Training

	1. Install requirements with poetry

	```
	poetry install
	```

	2. Run the following steps inside the poetry environment

	```
	poetry shell
	```

	or

	```
	poetry run <command>
	```

	3. Prepare dataset

	```
	python3 src/prepare_data.py experiment=<experiment-name>
	```

	4. Train model

	```
	CUDA_VISIBLE_DEVICES=<device-id> python3 src/train.py experiment=<experiment-name>
	```

	5. (optional) Share the model on the Hugging Face Hub

	```
	python3 src/share_model.py
	```
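Steps 3 to 5 above can also be assembled from a script, for instance when the same experiment is run repeatedly. The sketch below only builds the command lines quoted above and does not execute them. `build_commands` and the experiment name `my_experiment` are hypothetical; the script paths and the `experiment=` override are taken from the steps above.

```python
import shlex


def build_commands(experiment, device_id=None, share=False):
    """Return the shell command lines for steps 3-5, in order."""
    override = f"experiment={shlex.quote(experiment)}"
    train = f"python3 src/train.py {override}"
    if device_id is not None:
        # Restrict training to one GPU, as in step 4.
        train = f"CUDA_VISIBLE_DEVICES={device_id} {train}"
    commands = [f"python3 src/prepare_data.py {override}", train]
    if share:
        commands.append("python3 src/share_model.py")
    return commands


for line in build_commands("my_experiment", device_id=0, share=True):
    print(line)
```

Each returned line can be prefixed with `poetry run` when working outside `poetry shell`.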

	## Evaluation

	Runs trained on `cen_n82` and `kpwr_n82`:

	| name | test/f1 | test/pdn2_f1 | test/acc | test/precision | test/recall |
	|-----------|------|------|------|------|------|
	| distiluse | 0.53 | 0.61 | 0.95 | 0.55 | 0.54 |
	| herbert | 0.68 | 0.78 | 0.97 | 0.7 | 0.69 |

	Runs trained and validated only on `cen_n82`:

	| name | test/f1 | test/pdn2_f1 | test/acc | test/precision | test/recall |
	|------------------|------|------|------|------|------|
	| distiluse_cen | 0.58 | 0.7 | 0.96 | 0.6 | 0.59 |
	| herbert_cen_bs32 | 0.71 | 0.84 | 0.97 | 0.72 | 0.72 |
	| herbert_cen | 0.72 | 0.84 | 0.97 | 0.73 | 0.73 |

	Detailed results for `herbert`:

	| tag | f1 | precision | recall | support |
	|---------------------------|------|-----------|--------|---------|
	| nam_eve_human_cultural | 0.65 | 0.53 | 0.83 | 88 |
	| nam_pro_title_document | 0.87 | 0.82 | 0.92 | 50 |
	| nam_loc_gpe_country | 0.82 | 0.76 | 0.9 | 258 |
	| nam_oth_www | 0.71 | 0.85 | 0.61 | 18 |
	| nam_liv_person | 0.94 | 0.89 | 1.0 | 8 |
	| nam_adj_country | 0.44 | 0.42 | 0.46 | 94 |
	| nam_org_institution | 0.15 | 0.16 | 0.14 | 22 |
	| nam_loc_land_continent | 0.5 | 0.57 | 0.44 | 9 |
	| nam_org_organization | 0.64 | 0.59 | 0.71 | 58 |
	| nam_liv_god | 0.13 | 0.09 | 0.25 | 4 |
	| nam_loc_gpe_city | 0.56 | 0.51 | 0.62 | 87 |
	| nam_org_company | 0.0 | 0.0 | 0.0 | 4 |
	| nam_oth_currency | 0.71 | 0.86 | 0.6 | 10 |
	| nam_org_group_team | 0.87 | 0.79 | 0.96 | 106 |
	| nam_fac_road | 0.67 | 0.67 | 0.67 | 6 |
	| nam_fac_park | 0.39 | 0.7 | 0.27 | 26 |
	| nam_pro_title_tv | 0.17 | 1.0 | 0.09 | 11 |
	| nam_loc_gpe_admin3 | 0.91 | 0.97 | 0.86 | 35 |
	| nam_adj | 0.47 | 0.5 | 0.44 | 9 |
	| nam_loc_gpe_admin1 | 0.92 | 0.91 | 0.93 | 1146 |
	| nam_oth_tech | 0.0 | 0.0 | 0.0 | 4 |
	| nam_pro_brand | 0.93 | 0.88 | 1.0 | 14 |
	| nam_fac_goe | 0.1 | 0.07 | 0.14 | 7 |
	| nam_eve_human | 0.76 | 0.73 | 0.78 | 74 |
	| nam_pro_vehicle | 0.81 | 0.79 | 0.83 | 36 |
	| nam_oth | 0.8 | 0.82 | 0.79 | 47 |
	| nam_org_nation | 0.85 | 0.87 | 0.84 | 516 |
	| nam_pro_media_periodic | 0.95 | 0.94 | 0.96 | 603 |
	| nam_adj_city | 0.43 | 0.39 | 0.47 | 19 |
	| nam_oth_position | 0.56 | 0.54 | 0.58 | 26 |
	| nam_pro_title | 0.63 | 0.68 | 0.59 | 22 |
	| nam_pro_media_tv | 0.29 | 0.2 | 0.5 | 2 |
	| nam_fac_system | 0.29 | 0.2 | 0.5 | 2 |
	| nam_eve_human_holiday | 1.0 | 1.0 | 1.0 | 2 |
	| nam_loc_gpe_admin2 | 0.83 | 0.91 | 0.76 | 51 |
	| nam_adj_person | 0.86 | 0.75 | 1.0 | 3 |
	| nam_pro_software | 0.67 | 1.0 | 0.5 | 2 |
	| nam_num_house | 0.88 | 0.9 | 0.86 | 43 |
	| nam_pro_media_web | 0.32 | 0.43 | 0.25 | 12 |
	| nam_org_group | 0.5 | 0.45 | 0.56 | 9 |
	| nam_loc_hydronym_river | 0.67 | 0.61 | 0.74 | 19 |
	| nam_liv_animal | 0.88 | 0.79 | 1.0 | 11 |
	| nam_pro_award | 0.8 | 1.0 | 0.67 | 3 |
	| nam_pro | 0.82 | 0.8 | 0.83 | 243 |
	| nam_org_political_party | 0.34 | 0.38 | 0.32 | 19 |
	| nam_eve_human_sport | 0.65 | 0.73 | 0.58 | 19 |
	| nam_pro_title_book | 0.94 | 0.93 | 0.95 | 149 |
	| nam_org_group_band | 0.74 | 0.73 | 0.75 | 359 |
	| nam_oth_data_format | 0.82 | 0.88 | 0.76 | 88 |
	| nam_loc_astronomical | 0.75 | 0.72 | 0.79 | 341 |
	| nam_loc_hydronym_sea | 0.4 | 1.0 | 0.25 | 4 |
	| nam_loc_land_mountain | 0.95 | 0.96 | 0.95 | 74 |
	| nam_loc_land_island | 0.55 | 0.52 | 0.59 | 46 |
	| nam_num_phone | 0.91 | 0.91 | 0.91 | 137 |
	| nam_pro_model_car | 0.56 | 0.64 | 0.5 | 14 |
	| nam_loc_land_region | 0.52 | 0.5 | 0.55 | 11 |
	| nam_liv_habitant | 0.38 | 0.29 | 0.54 | 13 |
	| nam_eve | 0.47 | 0.38 | 0.61 | 85 |
	| nam_loc_historical_region | 0.44 | 0.8 | 0.31 | 26 |
	| nam_fac_bridge | 0.33 | 0.26 | 0.46 | 24 |
	| nam_oth_license | 0.65 | 0.74 | 0.58 | 24 |
	| nam_pro_media | 0.33 | 0.32 | 0.35 | 52 |
	| nam_loc_gpe_subdivision | 0.0 | 0.0 | 0.0 | 9 |
	| nam_loc_gpe_district | 0.84 | 0.86 | 0.81 | 108 |
	| nam_loc | 0.67 | 0.6 | 0.75 | 4 |
	| nam_pro_software_game | 0.75 | 0.61 | 0.95 | 20 |
	| nam_pro_title_album | 0.6 | 0.56 | 0.65 | 52 |
	| nam_loc_country_region | 0.81 | 0.74 | 0.88 | 26 |
	| nam_pro_title_song | 0.52 | 0.6 | 0.45 | 111 |
	| nam_org_organization_sub | 0.0 | 0.0 | 0.0 | 3 |
	| nam_loc_land | 0.4 | 0.31 | 0.56 | 36 |
	| nam_fac_square | 0.5 | 0.6 | 0.43 | 7 |
	| nam_loc_hydronym | 0.67 | 0.56 | 0.82 | 11 |
	| nam_loc_hydronym_lake | 0.51 | 0.44 | 0.61 | 96 |
	| nam_fac_goe_stop | 0.35 | 0.3 | 0.43 | 7 |
	| nam_pro_media_radio | 0.0 | 0.0 | 0.0 | 2 |
	| nam_pro_title_treaty | 0.3 | 0.56 | 0.21 | 24 |
	| nam_loc_hydronym_ocean | 0.35 | 0.38 | 0.33 | 33 |

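The per-tag rows above appear consistent with the usual definition of F1 as the harmonic mean of precision and recall, although this README does not state how the metrics were computed. A small check under that assumption, using three rows copied from the detailed table (`f1_score` is a hypothetical helper, not part of the project):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the detailed results table.
rows = {
    "nam_eve_human_cultural": (0.53, 0.83, 0.65),
    "nam_pro_title_document": (0.82, 0.92, 0.87),
    "nam_loc_gpe_country": (0.76, 0.9, 0.82),
}
for tag, (precision, recall, reported) in rows.items():
    print(tag, round(f1_score(precision, recall), 2), reported)
```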
	To see all the experiments and graphs, head over to wandb: https://wandb.ai/clarin-pl/FastPDN

	## Authors

	- Grupa Wieszcze CLARIN-PL

	## Contact

	- Norbert Ropiak (norbert.ropiak@pwr.edu.pl)