
A model that estimates the start and end of a Named Entity (NE) span from a point annotation, as used in the paper "Is boundary annotation necessary? Evaluating boundary-free approaches to improve clinical named entity annotation efficiency".

In short, the goal of this model is to convert a point annotation into the corresponding span annotation with correct boundaries.

The model locates an identifier token (⧫) and, based on its surrounding context, estimates where the NE concept starts and ends.

The model is trained to estimate the spans of disease and symptom names in Japanese medical texts.

If you want to re-train the model for a different language or domain, dataset preprocessing and training scripts are available here.

Concepts

Point annotation

Unlike span-based paradigms, a point annotation consists of a single position within the NE span. It is a simple and fast way to annotate NEs, but it introduces ambiguity about the exact extent of the entity.

In this repository's implementation, a point annotation is represented by a lozenge character (⧫).

Example:

The patient has a history of dia⧫betes.
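
In code, a point annotation amounts to a single character offset. A minimal illustration (the variable names here are ours, for illustration only):

text = "The patient has a history of dia⧫betes."
point = text.index("⧫")  # the annotation is a single character offset (here, 32)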

Span annotation

A span annotation consists of two markings that identify the start and end positions of the NE span.

The implementation on this repository is based on the span annotation schema defined by Yada et al. (2020).

Example:

The patient has a history of <C>diabetes</C>.
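
For illustration, the annotated span can be recovered from the tagged text with a regular expression (a sketch, not code from this repository):

import re

annotated = "The patient has a history of <C>diabetes</C>."
match = re.search(r"<C>(.*?)</C>", annotated)
print(match.group(1))       # diabetes
start, end = match.span(1)  # offsets within the tagged string, tags included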

Model architecture

This model was fine-tuned on top of [cl-tohoku/bert-base-japanese-char-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-char-v2).

The model architecture is the same as the original BERT base model: 12 layers, a hidden-state size of 768, and 12 attention heads.

To be executed, this model requires the following dependencies:

  • fugashi
  • unidic-lite
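
Both are required by the MeCab-based tokenizer of the underlying Japanese BERT model and can be installed from PyPI (pip install fugashi unidic-lite).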

Training data

The model was fine-tuned on a dataset of Japanese medical texts (not publicly available), comprising 1027 synthetic medication history notes generated through crowd-sourcing.

Ten experienced dispensing pharmacists were hired as writers to craft the corpus. Each writer was assigned one of 285 drug names and tasked with creating a "typical" clinical narrative. This corpus was later fully annotated for symptoms and disease names.

Each annotation received a ⧫ token placed within its span at a position sampled from a truncated normal distribution.

The model was then trained to identify this token and output a span corresponding to the surrounding concept.
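
For illustration, the ⧫ placement for one gold span could be sketched as follows. The distribution parameters (midpoint mean, span/4 standard deviation) are our assumptions; the exact values used for the dataset are not documented in this card.

import numpy as np
from scipy.stats import truncnorm

def insert_point(text, start, end):
    # Sample a position inside the gold span [start, end) from a truncated
    # normal distribution. Centering on the span midpoint with std = span/4
    # is an illustrative assumption, not the documented training setup.
    mean = (start + end) / 2
    std = max((end - start) / 4, 1e-6)
    a, b = (start - mean) / std, (end - mean) / std  # standardized bounds
    pos = int(round(truncnorm.rvs(a, b, loc=mean, scale=std)))
    pos = min(max(pos, start), end - 1)  # keep the marker inside the span
    return text[:pos] + "⧫" + text[pos:]

print(insert_point("The patient has a history of diabetes.", 29, 37))
# e.g. "The patient has a history of dia⧫betes."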

Usage

The requirements.txt file contains all the dependencies needed to run the example code.

import mojimoji
import numpy as np
from transformers import AutoTokenizer, AutoModelForTokenClassification

import iob_util  # pip install git+https://github.com/gabrielandrade2/IOB-util.git

model_name = "gabrielandrade2/point-to-span-estimation"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Point-annotated text
text = "肥大型心⧫筋症、心房⧫細動に対してWF投与が開始となった。\
治療経過中に非持続性心⧫室頻拍が認められたためアミオダロンが併用となった。"

# Convert to zenkaku and tokenize
text = mojimoji.han_to_zen(text)
tokenized = tokenizer.tokenize(text)

# Encode text
input_ids = tokenizer.encode(text, return_tensors="pt")

# Predict spans
output = model(input_ids)
logits = output[0].detach().cpu().numpy()
tags = np.argmax(logits, axis=2)[0].tolist()

# Convert model output to IOB format
id2label = model.config.id2label
tags = [id2label[t] for t in tags]

# Convert input_ids back to chars
tokens = [tokenizer.convert_ids_to_tokens(t) for t in input_ids][0]

# Remove model special tokens (CLS, SEP, PAD)
tags = [y for x, y in zip(tokens, tags) if x not in ['[CLS]', '[SEP]', '[PAD]']]
tokens = [x for x in tokens if x not in ['[CLS]', '[SEP]', '[PAD]']]

# Convert from IOB to XML tag format
xml_text = iob_util.convert_iob_to_xml(tokens, tags)
xml_text = xml_text.replace('⧫', '')
print(xml_text)

Output

<C>肥大型心筋症</C><C>心房細動</C>に対してWF投与が開始となった。治療経過中に<C>非持続性心室頻拍</C>が認められたためアミオダロンが併用となった。