Neg-CamemBERT-bio: Language model for negation detection in radiological and other clinical texts for the French language.
Neg-CamemBERT-bio is a version of the transformer-based CamemBERT-bio-base model, fine-tuned to recognize negations in clinical texts. It automatically detects both the negation cues and their scope. A public model is available for negation recognition in biomedical texts. Two additional models, one dedicated to radiology reports and the other designed more broadly for various medical texts, are kept private because they were trained on potentially sensitive data from Lyon University Hospital (Hospices Civils de Lyon, HCL).
Model Details
1- Neg-CamemBERT-bio: Fine-tuning of the CamemBERT-bio-base model for negation recognition in biomedical texts in French.
2- Neg-Radio-CamemBERT-bio: Fine-tuning of the CamemBERT-bio-base model for negation recognition in anonymized texts extracted from thoracic CT scan reports written in French, provided by the Radiology Department of the Hospices Civils de Lyon.
3- Neg-Medical-CamemBERT-bio: Fine-tuning of the CamemBERT-bio-base model for negation recognition in clinical texts across various medical domains (Radiology, Biomedical, Medical literature,...) in French.
Model name | Type | Training corpus | Number of sentences | Negative sentences |
---|---|---|---|---|
Neg-CamemBERT-bio | public | ESSAI + CAS | 11 037 | 1 812 |
Neg-Radio-CamemBERT-bio | private | RADIO | 10 798 | 2 321 |
Neg-Medical-CamemBERT-bio | private | RADIO + ESSAI + CAS + QUAERO | 21 956 | 4 244 |
Training Data
Corpus | Details | Licence |
---|---|---|
RADIO | Corpus extracted from thoracic scan reports provided by the Radiology Department of Lyon University Hospital (Hospices Civils de Lyon, HCL). | Private corpus |
ESSAI | ESSAI Clinical is a freely available corpus of clinical trial protocols in French, collected from the registry of the National Cancer Institute (data available here). | CC BY-NC-SA 4.0 DEED |
CAS | CAS is a freely available corpus in French containing clinical cases as published in scientific, legal, or educational literature (data available here). | CC BY-NC-SA 4.0 DEED |
QUAERO | QUAERO is a freely available corpus that contains a vast amount of information in the biomedical field, in the form of free text in natural language (data available here). | GNU Free Documentation License |
The training dataset marks each negation entity with a BIO annotation scheme, which distinguishes tokens at the beginning (B) of an entity, inside (I) an entity, and outside (O) any entity.
Abbreviation | Description |
---|---|
O | Outside of a named entity, represents the affirmative part of the sentence. |
B-cue | Beginning of the negation cue. |
I-cue | Inside of the negation cue. |
B-scope | Beginning of the negation scope. |
I-scope | Inside of the negation scope. |
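To make the tag set above concrete, here is a minimal sketch that turns token-level spans into BIO tags. The helper name `spans_to_bio`, the word-level tokenization, and the choice of which words form the cue and the scope are illustrative assumptions, not taken from the annotated corpora.

```python
def spans_to_bio(num_tokens, spans):
    """Turn (label, start, end) token spans into BIO tags; `end` is exclusive."""
    tags = ["O"] * num_tokens  # every token starts outside any entity
    for label, start, end in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity
    return tags


# Hypothetical tokenization and annotation, for illustration only.
tokens = ["pas", "de", "signe", "de", "piégeage", "."]
spans = [("cue", 0, 1), ("scope", 1, 5)]

tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In this sketch the final period stays tagged `O` because it belongs to neither the cue nor the scope.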
Fine-tuning
The CamemBERT-bio-base model, along with its tokenizer, was fine-tuned on the token classification task to identify negation cues and their scope. We used the Adam optimizer with a learning rate of 5e-5 and a batch size of 16. The maximum sequence length was set to 512 tokens. The model was trained with 10-fold cross-validation (90% training / 10% validation), for 25 epochs per fold. We finally selected the best model out of the 10 folds.
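The cross-validation setup described above can be sketched in plain Python. This is only an illustration: the text does not say how the folds were built, so the shuffling, the seed, and the helper name `k_fold_indices` are assumptions. The hyperparameter values are the ones reported above, and the sentence count is the training size listed for the public model.

```python
import random

# Hyperparameters as reported in the text above.
HPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 5e-5,
    "batch_size": 16,
    "max_length": 512,
    "epochs_per_fold": 25,
    "num_folds": 10,
}


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 11 037 sentences: the ESSAI + CAS training size of the public model.
splits = list(k_fold_indices(11037, HPARAMS["num_folds"]))
for fold_id, (train_idx, val_idx) in enumerate(splits):
    print(fold_id, len(train_idx), len(val_idx))
```

Each fold holds out roughly 10% of the sentences for validation and trains on the remaining 90%; one model would be trained per fold and the best one kept.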
Evaluation
To evaluate the performance of the model and quantify the results, we used the following metrics: precision (P), recall (R), and the F1 score (F1). The scores were measured using the seqeval tool.
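seqeval scores predictions at the entity level: a predicted span counts as correct only when both its label and its boundaries match a gold span exactly. The sketch below illustrates that idea in pure Python for a single sentence. It is not seqeval's own code, and the helper names `extract_entities` and `precision_recall_f1` are hypothetical.

```python
def extract_entities(tags):
    """Decode a BIO tag sequence into a set of (label, start, end) spans."""
    entities, label, start = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, name = tag.partition("-")
        # An open span ends on "O", on a new "B-", or when the label changes.
        if label is not None and (prefix != "I" or name != label):
            entities.add((label, start, i))
            label = None
        # A span opens on "B-", or on a stray "I-" with no open span.
        if prefix == "B" or (prefix == "I" and label is None):
            label, start = name, i
    return entities


def precision_recall_f1(gold_tags, pred_tags, label):
    """Entity-level scores for one label; spans must match exactly."""
    gold = {e for e in extract_entities(gold_tags) if e[0] == label}
    pred = {e for e in extract_entities(pred_tags) if e[0] == label}
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative tags: the predicted scope is one token too short.
gold = ["B-cue", "B-scope", "I-scope", "I-scope", "I-scope", "O"]
pred = ["B-cue", "B-scope", "I-scope", "I-scope", "O", "O"]

cue_scores = precision_recall_f1(gold, pred, "cue")
scope_scores = precision_recall_f1(gold, pred, "scope")
print("cue:", cue_scores)
print("scope:", scope_scores)
```

Because matching is exact, a scope that is off by a single token earns no credit, which is one reason scope scores tend to be lower than cue scores.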
Results
Validation was performed on a 10% sample of sentences from the training set for each model.
Model | Validation dataset | Metric | Entity 1: cue | Entity 2: scope |
---|---|---|---|---|
Neg-CamemBERT-bio | 10% (ESSAI + CAS) | P | 95.70 ± 0.94 | 86.43 ± 1.12 |
Neg-CamemBERT-bio | 10% (ESSAI + CAS) | R | 97.70 ± 0.56 | 85.55 ± 1.33 |
Neg-CamemBERT-bio | 10% (ESSAI + CAS) | F1 | 96.68 ± 0.46 | 87.46 ± 0.93 |
Neg-Radio-CamemBERT-bio | 10% (RADIO) | P | 99.35 ± 0.24 | 94.19 ± 0.94 |
Neg-Radio-CamemBERT-bio | 10% (RADIO) | R | 99.37 ± 0.31 | 94.80 ± 0.84 |
Neg-Radio-CamemBERT-bio | 10% (RADIO) | F1 | 99.36 ± 0.25 | 94.49 ± 0.80 |
Neg-Medical-CamemBERT-bio | 10% (NegRADIO + ESSAI + CAS + QUAERO) | P | 97.75 ± 0.45 | 90.48 ± 0.74 |
Neg-Medical-CamemBERT-bio | 10% (NegRADIO + ESSAI + CAS + QUAERO) | R | 98.67 ± 0.20 | 91.34 ± 0.60 |
Neg-Medical-CamemBERT-bio | 10% (NegRADIO + ESSAI + CAS + QUAERO) | F1 | 98.20 ± 0.21 | 90.90 ± 0.585 |
Model Description
- Developed by: Salim SADOUNE, Antoine Richard, François Talbot, Thomas Guyet, Loic Boussel and Hugues Berry
- Model type: NER
- Language(s) (NLP): French
- License: MIT
- Finetuned from model: See the CamemBERT-bio-base model for more information on this model.
Direct Use
You can use the public model with the Transformers pipeline for NER.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the public model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("aistrosight/Neg-CamemBERT-bio")
model = AutoModelForTokenClassification.from_pretrained("aistrosight/Neg-CamemBERT-bio")

# aggregation_strategy="simple" groups sub-word tokens into whole entity spans.
NegNamedEntityRecogniser = pipeline(task="token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# The input is a list of sentences; the output holds one list of entities per sentence.
text = ["absence de signe d'anomalie des petites voies aériennes, notamment pas de signe de piégeage."]
sample = NegNamedEntityRecogniser(text)
print(sample)
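With `aggregation_strategy="simple"`, each detected entity is a dictionary with `entity_group`, `score`, `word`, `start`, and `end` keys. The sketch below shows one possible way to group the detected text by label using the character offsets. The helper name `group_negations` is hypothetical, and the entities are hand-written mock values rather than real model output; the label names `cue_neg` and `scope_neg` are assumed from the visualization snippet in this card.

```python
def group_negations(sentence, entities):
    """Group detected text by entity label, slicing the sentence by offsets."""
    grouped = {}
    for ent in entities:
        span_text = sentence[ent["start"]:ent["end"]]
        grouped.setdefault(ent["entity_group"], []).append(span_text)
    return grouped


# Mock pipeline output for one sentence (invented scores and offsets).
sentence = "pas de signe de piégeage."
mock_entities = [
    {"entity_group": "cue_neg", "score": 0.99, "word": "pas", "start": 0, "end": 3},
    {"entity_group": "scope_neg", "score": 0.98, "word": "de signe de piégeage",
     "start": 4, "end": len(sentence) - 1},
]

grouped = group_negations(sentence, mock_entities)
print(grouped)
```

Slicing the original sentence by `start` and `end` avoids relying on the `word` field, which is rebuilt from tokens and may not match the input text exactly.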
You can visualize the cue and the scope of the negation with the spaCy library in a Jupyter notebook.
from spacy import displacy

def visualize(sample, text):
    """Render the cue and scope spans of each sentence with displaCy."""
    colors = {"scope_neg": "#61ffab", "cue_neg": "#ff6961"}
    options = {"ents": ["scope_neg", "cue_neg"], "colors": colors}
    for i in range(len(sample)):
        # Convert the pipeline output into displaCy's manual entity format.
        entities = []
        for ents in sample[i]:
            entities.append({"end": ents["end"], "label": ents["entity_group"], "start": ents["start"]})
        displacy.render({"ents": entities, "text": text[i]}, style="ent", manual=True, options=options, jupyter=True)

visualize(sample, text)
Limitations and bias
The capacity of the Neg-CamemBERT-bio model is constrained by the limited size of its training set, which contains few examples of the less frequent negation cues ("sauf", "jamais", "hormis", ...). This makes it harder for the model to generalize to other cases. The tokenizer was not specifically designed for radiology, which can confuse the Neg-Radio-CamemBERT-bio model when it predicts the scope. Note also that the RADIO corpus and the ESSAI + CAS corpus were deliberately annotated by different annotators, which may reduce the ability of the Neg-Medical-CamemBERT-bio model to predict the scope accurately.