This is a model for named entity recognition of Japanese medical documents.
Introduction
This repository contains the base model and a supporting prediction script that runs the model and produces XML-tagged text output.
The original model was trained on the MedTxt-CR-JA dataset, so the provided prediction code outputs XML tags in the same format.
The script also provides a normalization method for the output entities, which is not embedded in the model.
If you want to re-train or update the model, we provide additional support scripts in this GitHub repository. Issues and suggestions can also be submitted there.
A note about loading the model using standard HuggingFace methods
This model should also be loadable using the standard HuggingFace from_pretrained methods. However, the model by itself only outputs generic labels in the format "LABEL_0", "LABEL_1", etc.
The conversion of model outputs to the actual entity labels is not yet embedded in the model, so the extra id_to_tags.pkl file is needed for the conversion. It contains a mapping from the model output ids to the actual labels.
This conversion can be done manually if needed, but the predict.py script already does it.
We are currently working to better standardize the model to HuggingFace's standards.
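The manual conversion mentioned above can be sketched as follows. This is only a minimal sketch, assuming that id_to_tags.pkl unpickles to a dict keyed by integer label id; the function names load_id_to_tags and convert_labels are hypothetical and are not taken from predict.py.

```python
import pickle


def load_id_to_tags(path="id_to_tags.pkl"):
    """Load the id-to-label mapping (assumed here to be a dict of int -> str)."""
    # Only unpickle files from a source you trust.
    with open(path, "rb") as f:
        return pickle.load(f)


def convert_labels(raw_labels, id_to_tags):
    """Turn generic labels such as "LABEL_3" into the labels stored in the mapping."""
    converted = []
    for raw in raw_labels:
        # "LABEL_3" -> 3
        label_id = int(raw.rsplit("_", 1)[1])
        converted.append(id_to_tags[label_id])
    return converted
```

If the mapping turns out to have a different structure (for example a list), only the lookup in convert_labels would need to change.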
How to use
Clone the repository and install the requirements:
pip install -r requirements.txt
The code has been developed and tested with Python 3.9 on macOS 14.1 (M1 MacBook Pro).
Prediction
The prediction script outputs the input text annotated with XML tags, in the same format as the MedTxt-CR-JA dataset. It can be run with the following command:
python3 predict.py
By default, the script uses the model located in pytorch_model.bin and the input file text.txt.
The resulting predictions will be output to the screen.
To select a different model or input file, use the -m and -i parameters, respectively:
python3 predict.py -m <model_path> -i <your_input_file>.txt
The input can be a single text file or a folder containing multiple .txt files, for batch processing. For example:
python3 predict.py -m <model_path> -i <your_input_folder>
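To illustrate the file-or-folder behaviour, here is a minimal sketch of how an input path could be resolved into a list of .txt files. It is not the code used by predict.py; the function name collect_input_files and the sorted processing order are assumptions made for the example.

```python
from pathlib import Path


def collect_input_files(input_path):
    """Return the text files to process for a file path or a folder path."""
    path = Path(input_path)
    if path.is_dir():
        # Only .txt files directly inside the folder; sorted for a stable order.
        return sorted(path.glob("*.txt"))
    if path.is_file():
        return [path]
    raise FileNotFoundError(f"No such file or folder: {input_path}")
```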
Entity normalization
This model supports entity normalization via dictionary matching. The dictionary is a list of medical terms or drugs and their standard forms.
Two different dictionaries are used for drug and disease normalization, stored in the dictionaries folder as drug_dict.csv and disease_dict.csv, respectively.
To enable normalization, add the --normalize flag to the predict.py command:
python3 predict.py -m <model_path> --normalize
Normalization adds the norm attribute to the output XML tags. This attribute can be empty if a normalized form of the term is not found.
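As a rough illustration of the behaviour described above, the sketch below wraps an entity in an XML tag and adds a norm attribute looked up in a dictionary, leaving it empty when no match exists. The function name render_entity and the exact-match lookup are assumptions for the example; the matching logic in predict.py may differ.

```python
from xml.sax.saxutils import escape, quoteattr


def render_entity(tag, text, attributes, norm_dict):
    """Wrap text in an XML tag and add a norm attribute taken from norm_dict."""
    attrs = dict(attributes)
    # Empty string when no normalized form is known for this surface form.
    attrs["norm"] = norm_dict.get(text, "")
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in attrs.items())
    return f"<{tag}{attr_text}>{escape(text)}</{tag}>"
```

With a dictionary that maps WF to ワルファリンカリウム, calling render_entity("m-key", "WF", {"state": "executed"}, norm_dict) gives a tag of the same shape as the one in the output example.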
The provided disease normalization dictionary (dictionaries/disease_dict.csv) is based on the Manbyo Dictionary and normalizes diseases to their standard ICD codes.
The default drug dictionary (dictionaries/drug_dict.csv) is based on the Hyakuyaku Dictionary.
Each dictionary is a CSV file with three columns: the first column is the surface form of the term and the third column contains its standard form. The second column is not used.
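The column layout above can be read with Python's standard csv module. The following is a minimal sketch, assuming a UTF-8 file without a header row; the function name load_norm_dict is hypothetical, and the default column indices (0 and 2) follow the layout just described.

```python
import csv


def load_norm_dict(csv_path, surface_col=0, norm_col=2):
    """Read a normalization CSV into a dict of surface form -> standard form."""
    norm_dict = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            # Skip blank or short rows that lack the needed columns.
            if len(row) <= max(surface_col, norm_col):
                continue
            norm_dict[row[surface_col]] = row[norm_col]
    return norm_dict
```

Passing different values for surface_col and norm_col covers dictionaries whose columns are arranged differently.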
Replacing the default dictionaries
Users can freely change the dictionaries to fit their needs by passing the path to a custom dictionary file. The dictionary file must have at least one column containing the surface forms and one column containing the normalized forms.
The parameters --drug_dict and --disease_dict specify the paths to the drug and disease dictionaries, respectively. When using them, the parameters giving the column indices of the surface form and the normalized form (--drug_surface_form and --drug_norm_form, or --disease_surface_form and --disease_norm_form) must also be provided.
You don't need to replace both dictionaries at the same time; you can replace only one of them.
For example:
python3 predict.py --normalize --drug_dict dictionaries/drug_dict.csv --drug_surface_form 0 --drug_norm_form 2 --disease_dict dictionaries/disease_dict.csv --disease_surface_form 0 --disease_norm_form 2
Input Example
肥大型心筋症、心房細動に対してWF投与が開始となった。
治療経過中に非持続性心室頻拍が認められたためアミオダロンが併用となった。
Output Example
<d certainty="positive" norm="I422">肥大型心筋症、心房細動</d>に対して<m-key state="executed" norm="ワルファリンカリウム">WF</m-key>投与が開始となった。
<timex3 type="med">治療経過中</timex3>に<d certainty="positive" norm="I472">非持続性心室頻拍</d>が認められたため<m-key state="executed" norm="アミオダロン塩酸塩">アミオダロン</m-key>が併用となった。
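Because the output is tagged text rather than a complete XML document, downstream code may need to wrap it in a root element before parsing. The sketch below shows one way to list the entities in the output using Python's standard xml.etree.ElementTree. The function name extract_entities is hypothetical, and the sketch assumes the output is well-formed once wrapped.

```python
import xml.etree.ElementTree as ET


def extract_entities(tagged_text):
    """Return (tag, text, attributes) for each entity in XML-tagged output."""
    # The output is a text fragment, so wrap it in a root element to parse it.
    root = ET.fromstring(f"<root>{tagged_text}</root>")
    return [
        (el.tag, "".join(el.itertext()), dict(el.attrib))
        for el in root.iter()
        if el is not root
    ]
```

Applied to the first line of the output example, this yields one d entity with the attributes certainty and norm, followed by one m-key entity with the attributes state and norm.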
Publication
This model can be cited as:
@misc {social_computing_lab_2023,
author = { {Social Computing Lab} },
title = { MedNERN-CR-JA (Revision 13dbcb6) },
year = 2023,
url = { https://huggingface.co/sociocom/MedNERN-CR-JA },
doi = { 10.57967/hf/0620 },
publisher = { Hugging Face }
}