Model Card for HCSCRheuma/Occupations

These models recognise occupation mentions (NER) in Spanish clinical notes and identify to whom each occupation belongs.

Model Details

PLM Model	Learning rate	Batch size	Epochs	Max length	Optimizer	Max clip grad norm	Epsilon
PlanTL-GOB-ES/roberta-base-biomedical-es	2e-05	8	10	510	AdamW	1	1e-08
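
As a rough illustration of how the optimizer settings in the table fit together, the sketch below runs one training step with AdamW, the listed learning rate and epsilon, and gradient clipping at norm 1. The tiny linear layer, the random data, and the inflated loss are assumptions made only so the example runs on its own; this is not the training script used for these models.

```python
import torch

# Hyperparameters taken from the table above.
LEARNING_RATE = 2e-5
EPSILON = 1e-8
MAX_GRAD_NORM = 1.0
BATCH_SIZE = 8

# Hypothetical stand-in for the fine-tuned network (assumption for illustration).
torch.manual_seed(0)
toy_model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=LEARNING_RATE, eps=EPSILON)

# One training step on random data; the loss is scaled up so clipping has an effect.
inputs = torch.randn(BATCH_SIZE, 4)
targets = torch.randint(0, 2, (BATCH_SIZE,))
loss = 1000.0 * torch.nn.functional.cross_entropy(toy_model(inputs), targets)
optimizer.zero_grad()
loss.backward()

# Rescale all gradients so that their global L2 norm is at most MAX_GRAD_NORM.
torch.nn.utils.clip_grad_norm_(toy_model.parameters(), MAX_GRAD_NORM)
clipped_norm = torch.norm(
    torch.stack([p.grad.norm(2) for p in toy_model.parameters()])
).item()
optimizer.step()
```

Clipping bounds the size of each update, which is a common safeguard when fine-tuning with small batches.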

Model Description

The PlanTL-GOB-ES/roberta-base-biomedical-es model was fine-tuned on the MEDDOPROF corpus (Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Briva-Iglesias, & Martin Krallinger (2022). MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7116201).

Two models were built: a model for occupation recognition (MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08) and a model that detects to whom the occupation belongs (MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08).

More details can be found in the MEDDOPROF shared task overview: Lima-López, S., Farré-Maduell, E., Miranda-Escalada, A., Briva-Iglesias, V., & Krallinger, M. (2021). NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. Procesamiento del Lenguaje Natural, 67, 243-256.

Developed by: Alfredo Madrid
Language(s) (NLP): Spanish
License: CC BY-SA 4.0
Finetuned from model: PlanTL-GOB-ES/roberta-base-biomedical-es

Model Sources

Repository: https://huggingface.co/HCSCRheuma/Occupations
Paper: Madrid García, A. (2023). Recognition of professions in medical documentation.

Uses

Model 1: occupation recognition (NER)

import numpy as np
import pandas as pd
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Path to the occupation-recognition (NER) model
model_path = "MEDDO_FINAL_ROBERTA_ner_sentencia_510_8_10_2e-05_1e-08"
model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

note = "El paciente trabaja en una empresa de construccion los jueves"

# Tokenize once; word_ids() maps every sub-word token to its source word (None for special tokens)
encoding = tokenizer(note, truncation=True, return_tensors="pt")
word_ids = encoding.word_ids(0)
with torch.no_grad():
    output = model(**encoding)

# Most likely label index for every token
label_indices = np.argmax(output.logits.cpu().numpy(), axis=2)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())

# Column "labels" holds the token text and column "tokens" holds the predicted tag
df = pd.DataFrame(zip(tokens, label_indices[0], word_ids), columns=["labels", "tokens", "relation"])
df['labels'] = df['labels'].str.replace('##', '', regex=False)
df['tokens'] = df['tokens'].map({0: 'B-PROFESION', 1: 'B-SITUACION_LABORAL', 2: 'I-SITUACION_LABORAL', 3: 'I-ACTIVIDAD', 4: 'I-PROFESION', 5: 'O', 6: 'B-ACTIVIDAD', 7: 'PAD'})
# Drop the special tokens at both ends of the sequence
df = df[1:-1].copy()
df['relation'] = df['relation'].astype('int')
# Join the sub-word pieces of each word and keep the tag of its first piece
df['labels'] = df.groupby('relation')['labels'].transform(lambda x: ''.join(x))
df = df.groupby('relation').first()
print(df)

Output

relation	labels	tokens
0	ĠEl	O
1	Ġpaciente	O
2	Ġtrabaja	B-PROFESION
3	Ġen	I-PROFESION
4	Ġuna	I-PROFESION
5	Ġempresa	I-PROFESION
6	Ġde	I-PROFESION
7	Ġconstruccion	I-PROFESION
8	Ġlos	O
9	Ġjueves	O
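
The per-word BIO tags in the table above can be collapsed into whole mentions. The helper below is a minimal sketch of that grouping step; `bio_to_spans` is a hypothetical name that is not part of the released code, and the inputs are copied by hand from the example output.

```python
def bio_to_spans(words, tags):
    """Group per-word BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_type):
            # A B- tag (or an I- tag of a different type) opens a new span.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = entity, [word]
        elif prefix == "I":
            # An I- tag of the same type extends the open span.
            current_words.append(word)
        else:
            # "O" or "PAD" closes any open span.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


# Words and tags copied from the example output; the leading "Ġ" marker is stripped.
raw_words = ["ĠEl", "Ġpaciente", "Ġtrabaja", "Ġen", "Ġuna", "Ġempresa",
             "Ġde", "Ġconstruccion", "Ġlos", "Ġjueves"]
words = [w.lstrip("Ġ") for w in raw_words]
tags = ["O", "O", "B-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION",
        "I-PROFESION", "I-PROFESION", "O", "O"]
spans = bio_to_spans(words, tags)
print(spans)  # [('PROFESION', 'trabaja en una empresa de construccion')]
```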

Model 2: occupation holder classification

import numpy as np
import pandas as pd
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Path to the model that detects to whom the occupation belongs
model_path = "MEDDO_FINAL_ROBERTA_class_sentencia_510_8_10_2e-05_1e-08"
model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

note = "El paciente trabaja en una empresa de construccion los jueves"

# Tokenize once; word_ids() maps every sub-word token to its source word (None for special tokens)
encoding = tokenizer(note, truncation=True, return_tensors="pt")
word_ids = encoding.word_ids(0)
with torch.no_grad():
    output = model(**encoding)

# Most likely label index for every token
label_indices = np.argmax(output.logits.cpu().numpy(), axis=2)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())

# Column "labels" holds the token text and column "tokens" holds the predicted tag
df = pd.DataFrame(zip(tokens, label_indices[0], word_ids), columns=["labels", "tokens", "relation"])
df['labels'] = df['labels'].str.replace('##', '', regex=False)
df['tokens'] = df['tokens'].map({0: 'B-FAMILIAR', 1: 'I-PACIENTE', 2: 'I-OTROS', 3: 'B-SANITARIO', 4: 'B-PACIENTE', 5: 'I-FAMILIAR', 6: 'O', 7: 'B-OTROS', 8: 'I-SANITARIO', 9: 'PAD'})
# Drop the special tokens at both ends of the sequence
df = df[1:-1].copy()
df['relation'] = df['relation'].astype('int')
# Join the sub-word pieces of each word and keep the tag of its first piece
df['labels'] = df.groupby('relation')['labels'].transform(lambda x: ''.join(x))
df = df.groupby('relation').first()
print(df)

Output

relation	labels	tokens
0	ĠEl	O
1	Ġpaciente	O
2	Ġtrabaja	B-PACIENTE
3	Ġen	I-PACIENTE
4	Ġuna	I-PACIENTE
5	Ġempresa	I-PACIENTE
6	Ġde	I-PACIENTE
7	Ġconstruccion	I-PACIENTE
8	Ġlos	O
9	Ġjueves	O
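
Because both models tag the same words, their outputs can be joined word by word to obtain each mention together with its type and its holder. The sketch below shows one possible way to do this; `merge_predictions` is a hypothetical helper, and assigning the holder by majority vote over the words of a mention is an assumption rather than a documented part of the pipeline.

```python
from collections import Counter


def merge_predictions(words, ner_tags, holder_tags):
    """Return (text, occupation_type, holder) for every mention found by model 1."""
    mentions = []
    current = None
    for word, ner, holder in zip(words, ner_tags, holder_tags):
        if ner.startswith("B-") or (ner.startswith("I-") and current is None):
            # Open a new mention (a stray I- tag is treated as a start).
            current = {"text": [word], "type": ner[2:], "holders": []}
            mentions.append(current)
        elif ner.startswith("I-"):
            # Simplification: any I- tag extends the open mention.
            current["text"].append(word)
        else:
            # "O" or "PAD": no mention is open after this word.
            current = None
            continue
        if holder[:2] in ("B-", "I-"):
            current["holders"].append(holder[2:])

    results = []
    for mention in mentions:
        # Majority vote over the holder tags of the mention (assumption).
        votes = Counter(mention["holders"]).most_common(1)
        results.append((" ".join(mention["text"]), mention["type"],
                        votes[0][0] if votes else None))
    return results


# Tags copied from the two example outputs in this card.
words = ["El", "paciente", "trabaja", "en", "una", "empresa",
         "de", "construccion", "los", "jueves"]
ner_tags = ["O", "O", "B-PROFESION"] + ["I-PROFESION"] * 5 + ["O", "O"]
holder_tags = ["O", "O", "B-PACIENTE"] + ["I-PACIENTE"] * 5 + ["O", "O"]
merged = merge_predictions(words, ner_tags, holder_tags)
print(merged)
```

For the example note this yields a single mention of type PROFESION attributed to PACIENTE.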