Neg-CamemBERT-bio: Language model for negation detection in radiological and other clinical texts for the French language.

Neg-CamemBERT-bio is a version of the transformer-based CamemBERT-bio-base model, fine-tuned to recognize negation in clinical texts. It automatically detects both negation cues and their scope. One model, for negation recognition in biomedical texts, is public. Two additional models, one dedicated to radiology reports and the other designed more broadly for various medical texts, are kept private because they were trained on potentially sensitive data from Lyon University Hospital (Hospices Civils de Lyon, HCL).

Model Details

1- Neg-CamemBERT-bio: Fine-tuning of the CamemBERT-bio-base model for negation recognition in biomedical texts in French.

2- Neg-Radio-CamemBERT-bio: Fine-tuning of the CamemBERT-bio-base model for negation recognition in anonymized French-language thoracic CT scan reports provided by the Radiology Department of the Hospices Civils de Lyon.

3- Neg-Medical-CamemBERT-bio: Fine-tuning of the CamemBERT-bio-base model for negation recognition in French clinical texts across various medical domains (radiology, biomedical, medical literature, ...).

Model name | Type | Training corpus | Number of sentences | Negative sentences
--- | --- | --- | --- | ---
Neg-CamemBERT-bio | public | ESSAI + CAS | 11 037 | 1 812
Neg-Radio-CamemBERT-bio | private | RADIO | 10 798 | 2 321
Neg-Medical-CamemBERT-bio | private | RADIO + ESSAI + CAS + QUAERO | 21 956 | 4 244

Training Data

Corpus | Details | License
--- | --- | ---
RADIO | Corpus extracted from thoracic scan reports provided by the Radiology Department of Lyon University Hospital (Hospices Civils de Lyon, HCL). | Private corpus
ESSAI | ESSAI is a freely available corpus of clinical trial protocols in French, collected from the registry of the National Cancer Institute. | CC BY-NC-SA 4.0
CAS | CAS is a freely available corpus in French containing clinical cases as published in scientific, legal, or educational literature. | CC BY-NC-SA 4.0
QUAERO | QUAERO is a freely available corpus of biomedical information in the form of free text in natural language. | GNU Free Documentation License

The training dataset marks the beginning and the inside of each negation entity (cue or scope) using a BIO annotation scheme; every other token is labeled O.

Abbreviation | Description
--- | ---
O | Outside any negation entity; the affirmative part of the sentence.
B-cue | Beginning of the negation cue.
I-cue | Inside of the negation cue.
B-scope | Beginning of the negation scope.
I-scope | Inside of the negation scope.
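
To make the scheme concrete, here is a minimal sketch that decodes a BIO-tagged token sequence into cue and scope spans. The example sentence and its labels are hand-written for illustration; they are not taken from the training corpora and may differ from the corpora's annotation guidelines. `bio_to_spans` is a hypothetical helper, not part of the model's code.

```python
def bio_to_spans(labels):
    """Decode BIO labels into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type, start = label[2:], i
        elif label.startswith("I-") and current_type == label[2:]:
            # The open entity continues.
            continue
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type, start = None, None
            if label.startswith("I-"):
                # Tolerate a stray I- tag by treating it as an entity start.
                current_type, start = label[2:], i
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


# Illustrative annotation only (assumption, not corpus data).
tokens = ["pas", "de", "signe", "de", "piégeage", "."]
labels = ["B-cue", "I-cue", "B-scope", "I-scope", "I-scope", "O"]

for entity_type, start, end in bio_to_spans(labels):
    print(entity_type, tokens[start:end])
```

Under these illustrative labels, the cue covers the first two tokens and the scope the next three.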

Fine-tuning

The CamemBERT-bio-base model, along with its tokenizer, was fine-tuned on a token classification task to identify negation cues and their scope. We used the Adam optimizer with a learning rate of 5e-5 and a batch size of 16. The maximum sequence length was set to 512 tokens. The model was trained using 10-fold cross-validation (90% training / 10% validation), with 25 epochs per fold. Finally, we selected the best model among the 10 folds.
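
The cross-validation setup can be sketched as follows. This is a minimal illustration, not the authors' training script: the shuffling, the seed, and the helper name `k_fold_indices` are assumptions, since the card does not say how the folds were drawn. The hyperparameter values and the sentence count (11 037, the Neg-CamemBERT-bio corpus size) come from this card.

```python
import random

# Hyperparameters as stated in the card.
HYPERPARAMS = {
    "learning_rate": 5e-5,
    "batch_size": 16,
    "max_length": 512,
    "num_epochs": 25,
    "num_folds": 10,
}


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: shuffled split
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(11037, HYPERPARAMS["num_folds"]))
for fold, (train, validation) in enumerate(splits):
    # A real run would fine-tune for HYPERPARAMS["num_epochs"] epochs here.
    print(f"fold {fold}: {len(train)} training / {len(validation)} validation sentences")
```

Each sentence appears in exactly one validation fold, so every fold trains on roughly 90% of the data and validates on the remaining 10%.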

Evaluation

We evaluated the models with three metrics: precision (P), recall (R), and F1 score (F1). The scores were computed with the seqeval library.
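
seqeval scores predictions at the entity level: a predicted entity counts as correct only if its type and boundaries match a gold entity exactly. The sketch below is a simplified stand-in for that computation, not seqeval's own code. Spans are `(entity_type, start, end)` tuples, and the gold and predicted sets are made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores with exact matching on type and boundaries."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def score_by_type(gold, predicted, entity_type):
    """Restrict the scores to one entity type, e.g. "cue" or "scope"."""
    return precision_recall_f1(
        {span for span in gold if span[0] == entity_type},
        {span for span in predicted if span[0] == entity_type},
    )


# Illustrative spans only: the predicted scope ends one token too early.
gold = {("cue", 0, 2), ("scope", 2, 5)}
predicted = {("cue", 0, 2), ("scope", 2, 4)}

print(score_by_type(gold, predicted, "cue"))
print(score_by_type(gold, predicted, "scope"))
```

Because matching is exact, a scope that is off by a single token counts as both a false positive and a false negative, which is one reason scope scores tend to be lower than cue scores.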

Results

Validation was performed on a 10% sample of sentences from the training set for each model.

Model | Validation dataset | Metric | Cue | Scope
--- | --- | --- | --- | ---
Neg-CamemBERT-bio | 10% (ESSAI + CAS) | P | 95.70 ± 0.94 | 86.43 ± 1.12
 | | R | 97.70 ± 0.56 | 85.55 ± 1.33
 | | F1 | 96.68 ± 0.46 | 87.46 ± 0.93
Neg-Radio-CamemBERT-bio | 10% (RADIO) | P | 99.35 ± 0.24 | 94.19 ± 0.94
 | | R | 99.37 ± 0.31 | 94.80 ± 0.84
 | | F1 | 99.36 ± 0.25 | 94.49 ± 0.80
Neg-Medical-CamemBERT-bio | 10% (RADIO + ESSAI + CAS + QUAERO) | P | 97.75 ± 0.45 | 90.48 ± 0.74
 | | R | 98.67 ± 0.20 | 91.34 ± 0.60
 | | F1 | 98.20 ± 0.21 | 90.90 ± 0.585

Model Description

  • Developed by: Salim SADOUNE, Antoine Richard, François Talbot, Thomas Guyet, Loic Boussel and Hugues Berry
  • Model type: NER
  • Language(s) (NLP): French
  • License: MIT
  • Fine-tuned from model: CamemBERT-bio-base (see that model's card for more information).

Direct Use

You can use the public model with the Transformers pipeline for token classification (NER).

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the public model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("aistrosight/Neg-CamemBERT-bio")
model = AutoModelForTokenClassification.from_pretrained("aistrosight/Neg-CamemBERT-bio")

# aggregation_strategy="simple" groups sub-word tokens into entity spans
NegNamedEntityRecogniser = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = ["absence de signe d'anomalie des petites voies aériennes, notamment pas de signe de piégeage."]

# The result holds one list of detected entities per input sentence
sample = NegNamedEntityRecogniser(text)
print(sample)
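
With `aggregation_strategy="simple"`, each detected entity is a dictionary that includes `entity_group`, `start`, and `end` keys (alongside `score` and `word`). The sketch below groups the matched text by entity type. `group_entities` is a hypothetical helper, the `entities` list is a hand-written stand-in rather than real model output, and the label names `cue_neg` and `scope_neg` are assumed from the visualization snippet in this card.

```python
from collections import defaultdict


def group_entities(text, entities):
    """Map each entity group to the list of text spans it covers."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


sentence = "pas de signe de piégeage."
# Hand-written stand-in for one sentence of pipeline output (assumption);
# real dictionaries also carry "score" and "word".
entities = [
    {"entity_group": "cue_neg", "start": 0, "end": 6},
    {"entity_group": "scope_neg", "start": 7, "end": 24},
]

print(group_entities(sentence, entities))
```

When the pipeline is called on a list of sentences, as above, apply the helper to each sentence together with its own list of entities.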

You can visualize the cue and the scope of the negation with the spaCy library in a Jupyter notebook.

from spacy import displacy

def visualize(sample, text):
    """Render the cue and scope spans of each sentence in a Jupyter notebook."""
    colors = {"scope_neg": "#61ffab", "cue_neg": "#ff6961"}
    options = {"ents": ["scope_neg", "cue_neg"], "colors": colors}

    for sentence, ents in zip(text, sample):
        # displaCy's manual mode expects character offsets and a label per entity
        entities = [
            {"start": ent["start"], "end": ent["end"], "label": ent["entity_group"]}
            for ent in ents
        ]
        displacy.render({"text": sentence, "ents": entities}, style="ent", manual=True, options=options, jupyter=True)

visualize(sample, text)

Limitations and bias

The capacity of the Neg-CamemBERT-bio model is limited by the small size of its training set, which contains few examples of the less frequent negation cues ("sauf", "jamais", "hormis", ...). This makes it harder for the model to generalize to such cases. The tokenizer was not designed specifically for radiology, which can confuse the Neg-Radio-CamemBERT-bio model when it predicts the scope. Note also that the RADIO corpus and the ESSAI + CAS corpus were deliberately annotated by different annotators, which may reduce the ability of the Neg-Medical-CamemBERT-bio model to predict the scope accurately.
