Neg-CamemBERT-bio: Language model for negation detection in radiological and other clinical texts for the French language.

Neg-CamemBERT-bio is a version of the transformer-based CamemBERT-bio-base model fine-tuned to recognize negation in clinical texts. It automatically detects both negation cues and their scope. A public model is available for negation recognition in biomedical texts. Two additional models, one dedicated to radiology reports and the other designed more broadly for various medical texts, are kept private because they were trained on potentially sensitive data from Lyon University Hospital (Hospices Civils de Lyon, HCL).

Model Details

1- Neg-CamemBERT-bio: Fine-tuning of the CamemBERT-bio-base model for negation recognition in biomedical texts in French.

2- Neg-Radio-CamemBERT-bio: Fine-tuning of the CamemBERT-bio-base model for negation recognition in anonymized texts extracted from thoracic CT scan reports written in French, provided by the Radiology Department of the Hospices Civils de Lyon.

3- Neg-Medical-CamemBERT-bio: Fine-tuning of the CamemBERT-bio-base model for negation recognition in French clinical texts across various medical domains (radiology, biomedical, medical literature, ...).

Model name	Type	Training corpus	Number of sentences	Negative sentences
`Neg-CamemBERT-bio`	public	ESSAI + CAS	11 037	1 812
`Neg-Radio-CamemBERT-bio`	private	RADIO	10 798	2 321
`Neg-Medical-CamemBERT-bio`	private	RADIO + ESSAI + CAS + QUAERO	21 956	4 244

Training Data

Corpus	Details	Licence
RADIO	Corpus extracted from thoracic scan reports provided by the Radiology Department of Lyon University Hospital (Hospices Civils de Lyon, HCL).	Private corpus
ESSAI	ESSAI Clinical is a freely available corpus of clinical trial protocols in French, collected from the registry of the National Cancer Institute (data available here).	CC BY-NC-SA 4.0 DEED
CAS	CAS is a freely available corpus in French containing clinical cases as published in scientific, legal, or educational literature (data available here).	CC BY-NC-SA 4.0 DEED
QUAERO	QUAERO is a freely available corpus that contains a vast amount of information in the biomedical field, in the form of free text in natural language (data available here).	GNU Free Documentation License

The training dataset marks the beginning and the inside of each negation entity (cue or scope) using a BIO annotation scheme; every other token is tagged as outside.

Abbreviation	Description
O	Outside of any negation entity; corresponds to the affirmative part of the sentence.
B-cue	Beginning of the negation cue.
I-cue	Inside of the negation cue.
B-scope	Beginning of the negation scope.
I-scope	Inside of the negation scope.
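
To make the tag set above concrete, here is a minimal sketch of how a BIO-tagged sentence can be decoded into cue and scope spans. The word-level tokens and the labels in the example are hypothetical and chosen only for illustration (they are not taken from the training corpora), and `bio_to_spans` is a helper name introduced here, not part of the released code.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end_exclusive) token spans."""
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        prefix, _, entity = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_type):
            # A new entity starts here (an orphan I- tag is treated like B-).
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type, start = entity, i
        elif prefix == "O":
            # Outside tag: close any open entity.
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type, start = None, None
        # Otherwise: an I- tag of the same type, so the current span continues.
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


# Hypothetical annotation, for illustration only.
tokens = ["pas", "de", "signe", "de", "piégeage", "."]
labels = ["B-cue", "I-cue", "B-scope", "I-scope", "I-scope", "O"]

spans = bio_to_spans(labels)
for entity, start, end in spans:
    print(entity, "->", " ".join(tokens[start:end]))
```

Treating an orphan `I-` tag as the start of a new span is a design choice of this sketch; stricter decoders may discard such tags instead.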

Fine-tuning

The CamemBERT-bio-base model, along with its tokenizer, was fine-tuned for token classification in order to identify negation cues and their scope. We used the Adam optimizer with a learning rate of 5e-5 and a batch size of 16. The maximum sequence length was set to 512 tokens. The model was trained using 10-fold cross-validation (90% training / 10% validation). For each fold, the model was trained for 25 epochs. Finally, we selected the best model among the 10 folds.
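
The cross-validation setup described above can be sketched as follows. The hyperparameter values are those quoted in the text; everything else is an assumption made for illustration: the `k_fold_indices` helper, the shuffling, the seed, and the choice to split at the sentence level are not documented in this card. The actual training loop is omitted.

```python
import random
from typing import Iterator, List, Tuple

# Hyperparameters quoted from the fine-tuning description.
CONFIG = {
    "learning_rate": 5e-5,
    "batch_size": 16,
    "max_length": 512,
    "num_epochs": 25,
    "num_folds": 10,
}


def k_fold_indices(
    n_samples: int, n_folds: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffled split
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


# 11 037 is the sentence count reported for Neg-CamemBERT-bio.
folds = list(k_fold_indices(11037, CONFIG["num_folds"]))
for fold_id, (train, val) in enumerate(folds):
    print(f"fold {fold_id}: {len(train)} train / {len(val)} validation")
```

Each sentence appears in exactly one validation fold, which gives the 90% / 10% split mentioned above for every fold.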

Evaluation

We evaluated the models using precision (P), recall (R), and the F1 score (F1). The scores were computed with the seqeval library.
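
seqeval scores whole entities rather than individual tokens: a predicted cue or scope counts as correct only if its type and boundaries match the reference exactly. The sketch below illustrates that idea with the standard library only; it is not seqeval's implementation, `entity_scores` is a helper name introduced here, and the gold and predicted spans are hypothetical.

```python
from typing import Dict, Set, Tuple

# (sentence_id, entity_type, start_token, end_token_exclusive)
Span = Tuple[int, str, int, int]


def entity_scores(gold: Set[Span], pred: Set[Span], entity_type: str) -> Dict[str, float]:
    """Exact-match precision, recall and F1 for one entity type."""
    gold_t = {s for s in gold if s[1] == entity_type}
    pred_t = {s for s in pred if s[1] == entity_type}
    true_pos = len(gold_t & pred_t)
    precision = true_pos / len(pred_t) if pred_t else 0.0
    recall = true_pos / len(gold_t) if gold_t else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"P": precision, "R": recall, "F1": f1}


# Hypothetical spans: the second predicted scope ends one token too early,
# so it counts as both a false positive and a false negative.
gold = {(0, "cue", 0, 2), (0, "scope", 2, 5), (1, "cue", 3, 4), (1, "scope", 4, 8)}
pred = {(0, "cue", 0, 2), (0, "scope", 2, 5), (1, "cue", 3, 4), (1, "scope", 4, 7)}

cue_scores = entity_scores(gold, pred, "cue")
scope_scores = entity_scores(gold, pred, "scope")
print("cue:", cue_scores)
print("scope:", scope_scores)
```

This strict matching is one reason scope scores tend to be lower than cue scores: scopes are longer, so a single boundary error invalidates the whole span.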

Results

Validation was performed on a 10% sample of sentences from the training set for each model.

Model	Validation dataset	Metric	Entity 1: cue	Entity 2: scope
`Neg-CamemBERT-bio`	10% (ESSAI+CAS)	P	95.70 ± 0.94	86.43 ± 1.12
		R	97.70 ± 0.56	85.55 ± 1.33
		F1	96.68 ± 0.46	87.46 ± 0.93
`Neg-Radio-CamemBERT-bio`	10% (RADIO)	P	99.35 ± 0.24	94.19 ± 0.94
		R	99.37 ± 0.31	94.80 ± 0.84
		F1	99.36 ± 0.25	94.49 ± 0.80
`Neg-Medical-CamemBERT-bio`	10% (NegRADIO + ESSAI + CAS + QUAERO)	P	97.75 ± 0.45	90.48 ± 0.74
		R	98.67 ± 0.20	91.34 ± 0.60
		F1	98.20 ± 0.21	90.90 ± 0.585
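
The card does not state how the "±" values in the table were obtained. One plausible reading, assumed here, is the mean and sample standard deviation of a metric over the 10 cross-validation folds. The sketch below shows that aggregation; the per-fold scores are placeholders invented for the example and are not the authors' results.

```python
from statistics import mean, stdev
from typing import List


def summarize(fold_scores: List[float]) -> str:
    """Format per-fold scores as 'mean ± sample standard deviation'."""
    return f"{mean(fold_scores):.2f} ± {stdev(fold_scores):.2f}"


# Placeholder per-fold F1 scores (hypothetical, NOT the reported results).
fold_f1 = [96.1, 96.9, 96.4, 97.2, 96.5, 96.8, 96.3, 97.0, 96.6, 97.1]
print(summarize(fold_f1))
```

If the population standard deviation was used instead, `statistics.pstdev` would replace `stdev`.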

Model Description

Developed by: Salim SADOUNE, Antoine Richard, François Talbot, Thomas Guyet, Loic Boussel and Hugues Berry
Model type: NER
Language(s) (NLP): French
License: MIT
Fine-tuned from model: CamemBERT-bio-base (see that model's card for more information).

Direct Use

You can use the public model with the Transformers pipeline for NER.

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the public model and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("aistrosight/Neg-CamemBERT-bio")
model = AutoModelForTokenClassification.from_pretrained("aistrosight/Neg-CamemBERT-bio")

# aggregation_strategy="simple" merges sub-word tokens into entity groups.
NegNamedEntityRecogniser = pipeline(task="token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = ["absence de signe d'anomalie des petites voies aériennes, notamment pas de signe de piégeage."]

# One list of detected entities is returned per input text.
sample = NegNamedEntityRecogniser(text)
print(sample)
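
With `aggregation_strategy="simple"`, each detected entity is a dictionary with `entity_group`, `score`, `word`, `start` and `end` keys. The sketch below groups the detected text spans by label for one input text. To stay runnable without downloading the model, it uses a mocked output: the sentence, offsets and scores are hypothetical, the label names `cue_neg` and `scope_neg` are assumed from the visualization snippet in this card, and `group_entities` is a helper name introduced here.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(entities: List[dict], text: str) -> Dict[str, List[str]]:
    """Map each entity label to the text spans detected for it, in reading order."""
    grouped = defaultdict(list)
    for ent in sorted(entities, key=lambda e: e["start"]):
        # Slice the original text rather than relying on the detokenized "word".
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


text = "pas de signe de nodule."

# Mocked pipeline output for one text (hypothetical offsets and scores).
mock_output = [
    {"entity_group": "scope_neg", "score": 0.98, "word": "signe de nodule", "start": 7, "end": 22},
    {"entity_group": "cue_neg", "score": 0.99, "word": "pas de", "start": 0, "end": 6},
]

grouped = group_entities(mock_output, text)
print(grouped)
```

With the real pipeline and a list of input texts, the same function would be called as `group_entities(sample[i], text[i])` for each index `i`.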

You can visualize the cue and the scope of the negation with spaCy's displacy visualizer in a Jupyter notebook.

from spacy import displacy


def visualize(sample, text):
    """Highlight the negation cues and scopes of each input text."""
    colors = {"scope_neg": "#61ffab", "cue_neg": "#ff6961"}
    options = {"ents": ["scope_neg", "cue_neg"], "colors": colors}

    for i in range(len(sample)):
        # Convert the pipeline output to the dict format displacy expects in manual mode.
        entities = []
        for ents in sample[i]:
            entities.append({"start": ents["start"], "end": ents["end"], "label": ents["entity_group"]})
        displacy.render({"text": text[i], "ents": entities}, style="ent", manual=True, options=options, jupyter=True)


visualize(sample, text)

Limitations and bias

The Neg-CamemBERT-bio model is limited by the small size of its training set, which contains few examples of the less frequent negation cues ("sauf", "jamais", "hormis", ...). This makes it harder for the model to generalize to such cases. The tokenizer was not designed specifically for radiology, which can confuse the Neg-Radio-CamemBERT-bio model when it predicts the scope. Note also that the RADIO corpus and the ESSAI + CAS corpora were deliberately annotated by different annotators, which may reduce the ability of the Neg-Medical-CamemBERT-bio model to predict the scope accurately.
