bert-base-spanish-wwm-uncased-finetuned-NER-medical

This model is a fine-tuned version of dccuchile/bert-base-spanish-wwm-uncased on an adaptation of the eHealth-KD Challenge 2020 dataset, filtered to keep only the NER task. The NER annotations in the dataset are ['Concept', 'Action', 'Predicate', 'Reference'].

Before training, the dataset was processed to label it in the BIO annotation format. Some cleaning and adaptations were needed, for example to handle overlapping entities.
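As an illustration of the BIO conversion described above, here is a minimal sketch. The function name `spans_to_bio`, the token-index span format, and the example sentence are hypothetical. The overlap rule (the longest span wins, overlapping shorter spans are dropped) is an assumption, since the card does not say how overlapping entities were resolved.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label) over a tokenised sentence.
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert token-level entity spans to BIO tags.

    Assumption: when two spans overlap, the longer one is kept and the
    other is dropped. The original preprocessing may have done otherwise.
    """
    tags = ["O"] * num_tokens
    # Sort by negative length so the longest spans claim their tokens first.
    for start, end, label in sorted(spans, key=lambda s: s[0] - s[1]):
        if any(tag != "O" for tag in tags[start:end]):
            continue  # overlaps a span that was already placed: skip it
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Illustrative sentence and spans, not taken from the dataset files.
tokens = ["el", "asma", "afecta", "las", "vías", "respiratorias"]
spans = [(1, 2, "Concept"), (2, 3, "Action"), (4, 6, "Concept"), (5, 6, "Concept")]
bio_tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Here the nested span over "respiratorias" is discarded because it overlaps the longer "vías respiratorias" span.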

It achieves the following results on the evaluation set:

Loss: 0.6433
Precision: 0.8297
Recall: 0.8367
F1: 0.8332
Accuracy: 0.8876
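The reported F1 can be cross-checked against precision and recall. The snippet below assumes the usual definition of F1 as the harmonic mean of the two; the card does not state which metric implementation was used.

```python
# Values copied from the evaluation results above.
precision = 0.8297
recall = 0.8367

# F1 as the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8332
```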

Model description

A BERT adaptation for Spanish medical NER. Models of this type can be part of NLP pipelines built, for example, to analyse clinical documents or to label them automatically according to standard classifications such as CIE-10 (ICD-10) or SNOMED.
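A minimal sketch of how the model might be used inside such a pipeline, via the `pipeline` helper from Transformers. The helper names `load_ner_pipeline`, `summarise` and `main` are hypothetical, and the example sentence is illustrative. The sketch assumes the published model configuration maps label ids to BIO names, so that aggregated entities carry groups such as `Concept`.

```python
from typing import Dict, List

MODEL_ID = "fmmolina/bert-base-spanish-wwm-uncased-finetuned-NER-medical"


def load_ner_pipeline():
    """Build a token-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    # aggregation_strategy="simple" merges word pieces into whole entities.
    return pipeline(
        "token-classification",
        model=MODEL_ID,
        aggregation_strategy="simple",
    )


def summarise(entities: List[Dict]) -> List[str]:
    """Format aggregated pipeline output as 'text [label]' strings."""
    return [f"{e['word']} [{e['entity_group']}]" for e in entities]


def main() -> None:
    ner = load_ner_pipeline()
    # The base model is uncased, so lower-case input is used here.
    print(summarise(ner("el asma afecta las vías respiratorias")))
```

Calling `main()` requires the `transformers` package, a backend such as PyTorch, and network access for the first download.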

Training and evaluation data

The adapted dataset has this structure:

Training: 800 labelled sentences
Development: 199 labelled sentences
Testing: 100 labelled sentences

Training procedure

The chapter “Token classification” in the Hugging Face online course was used as the starting point for the training process. We made some adaptations because our dataset follows a slightly different structure. In addition, a conversion between string labels and integer labels was needed.
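The conversion between string labels and integer labels could look like the sketch below. The ordering of the labels ("O" first, then a B-/I- pair per entity type) and the helper names `encode` and `decode` are assumptions; the ids actually used by the model may differ.

```python
from typing import Dict, List

ENTITY_TYPES = ["Concept", "Action", "Predicate", "Reference"]

# "O" first, then a B-/I- pair per entity type (this ordering is an assumption).
label_names: List[str] = ["O"] + [
    f"{prefix}-{entity}" for entity in ENTITY_TYPES for prefix in ("B", "I")
]
label2id: Dict[str, int] = {name: i for i, name in enumerate(label_names)}
id2label: Dict[int, str] = {i: name for name, i in label2id.items()}


def encode(tags: List[str]) -> List[int]:
    """Map BIO tag strings to integer ids."""
    return [label2id[tag] for tag in tags]


def decode(ids: List[int]) -> List[str]:
    """Map integer ids back to BIO tag strings."""
    return [id2label[i] for i in ids]


print(encode(["O", "B-Concept", "I-Concept", "B-Action"]))  # [0, 1, 2, 3]
```

Dictionaries like these are typically also passed to the model configuration as `id2label` and `label2id`, so that predictions come back with readable names.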

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 12
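To make the linear schedule concrete, the sketch below derives the step counts from the figures on this card and computes the learning rate at a given step. It assumes a single device, no gradient accumulation, and no warmup, none of which are stated here; the function name `linear_lr` is hypothetical.

```python
import math

TRAIN_SENTENCES = 800   # size of the training split
TRAIN_BATCH_SIZE = 16
NUM_EPOCHS = 12
BASE_LR = 2e-5

steps_per_epoch = math.ceil(TRAIN_SENTENCES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS


def linear_lr(step: int, warmup_steps: int = 0) -> float:
    """Learning rate under linear decay to zero, with optional linear warmup."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return BASE_LR * remaining


print(steps_per_epoch, total_steps)  # 50 600
for step in (0, 300, 600):
    print(step, linear_lr(step))
```

The 50 steps per epoch and 600 total steps agree with the Step column of the training results.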

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.1139	1.0	50	0.3932	0.8671	0.8378	0.8522	0.9004
0.074	2.0	100	0.4334	0.8682	0.8367	0.8522	0.9004
0.0564	3.0	150	0.4498	0.8654	0.8353	0.8501	0.8993
0.0431	4.0	200	0.4683	0.8629	0.8425	0.8526	0.8985
0.0328	5.0	250	0.4850	0.8508	0.8454	0.8481	0.8964
0.027	6.0	300	0.4983	0.8608	0.8432	0.8519	0.8988
0.0253	7.0	350	0.5334	0.8618	0.8457	0.8537	0.9004
0.0242	8.0	400	0.5546	0.8636	0.8450	0.8542	0.9009
0.0233	9.0	450	0.5507	0.8543	0.8436	0.8489	0.8961
0.0203	10.0	500	0.5410	0.8605	0.8432	0.8518	0.9001
0.0179	11.0	550	0.5547	0.8603	0.8507	0.8555	0.9006
0.0149	12.0	600	0.5568	0.8616	0.8446	0.8531	0.9012
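The table can be scanned programmatically to see which epoch scores best under different criteria. This is only a reading aid: the card does not state which checkpoint was published or how it was selected.

```python
# (epoch, validation_loss, f1) copied from the table above.
rows = [
    (1, 0.3932, 0.8522), (2, 0.4334, 0.8522), (3, 0.4498, 0.8501),
    (4, 0.4683, 0.8526), (5, 0.4850, 0.8481), (6, 0.4983, 0.8519),
    (7, 0.5334, 0.8537), (8, 0.5546, 0.8542), (9, 0.5507, 0.8489),
    (10, 0.5410, 0.8518), (11, 0.5547, 0.8555), (12, 0.5568, 0.8531),
]

best_f1 = max(rows, key=lambda row: row[2])
lowest_loss = min(rows, key=lambda row: row[1])

print(f"best F1: epoch {best_f1[0]} ({best_f1[2]})")
print(f"lowest validation loss: epoch {lowest_loss[0]} ({lowest_loss[1]})")
```

F1 peaks at epoch 11, while validation loss is lowest at epoch 1 and rises afterwards as training loss keeps falling.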

Framework versions

Transformers 4.17.0
Pytorch 1.10.0+cu111
Datasets 2.0.0
Tokenizers 0.11.6

fmmolina/bert-base-spanish-wwm-uncased-finetuned-NER-medical