quinten-datalab/AliBERT-7GB: AliBERT is a pre-trained language model for French biomedical text.

Introduction

AliBERT is a pre-trained language model for French biomedical text. It is trained with a masked language modeling objective, like RoBERTa.

Here are the main contributions of our work:

A French biomedical language model, a language-specific and domain-specific pre-trained language model (PLM), which can be used to represent French biomedical text for different downstream tasks.
A normalization of Unigram sub-word tokenization for French biomedical text, which improves the vocabulary and the overall performance of the trained models.
A foundation model that achieves state-of-the-art results on French biomedical text.
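
To make the Unigram sub-word tokenization mentioned above more concrete, here is a minimal sketch of how a Unigram model segments a word: among all ways to split the word into known pieces, it keeps the split with the highest total log-probability (a Viterbi search). The vocabulary and scores below are invented for illustration and are not AliBERT's actual vocabulary, and the normalization step described in the paper is not reproduced here.

```python
import math

# Hypothetical toy vocabulary: sub-word piece -> log-probability.
# These values are made up for illustration only.
TOY_LOG_PROBS = {
    "cardio": -4.0,
    "card": -5.0,
    "io": -6.0,
    "myo": -5.0,
    "pathie": -4.5,
    "path": -5.5,
    "ie": -6.0,
}


def unigram_segment(word, log_probs):
    """Return the segmentation of `word` with the highest total log-probability."""
    n = len(word)
    best_score = [-math.inf] * (n + 1)  # best score of any split of word[:i]
    back_pointer = [0] * (n + 1)        # start index of the last piece in that split
    best_score[0] = 0.0

    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece not in log_probs or best_score[start] == -math.inf:
                continue
            score = best_score[start] + log_probs[piece]
            if score > best_score[end]:
                best_score[end] = score
                back_pointer[end] = start

    if best_score[n] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")

    # Walk the back pointers from the end of the word to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back_pointer[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


segments = unigram_segment("cardiomyopathie", TOY_LOG_PROBS)
print(segments)
```

With this toy vocabulary, the three-piece split wins because its total log-probability (-13.5) is higher than that of any split using the shorter pieces.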

The paper can be found here: https://aclanthology.org/2023.bionlp-1.19/

Data

The pre-training corpus consists of 7 GB of French biomedical text gathered from several sub-corpora. Scientific articles were collected from ScienceDirect through a subscription-based API, keeping only French articles in the biomedical domain. Summaries of thesis manuscripts were collected from the "Système universitaire de documentation (SuDoc)", a catalog of the university documentation system. Short texts and some complete sentences were collected from the public drug database, which lists the characteristics of tens of thousands of drugs. In addition, a similar drug database, the "Résumé des Caractéristiques du Produit (RCP)", provides descriptions of medications intended for use by biomedical professionals.

How to use quinten-datalab/AliBERT-7GB with HuggingFace

Load the quinten-datalab/AliBERT-7GB fill-mask model and the tokenizer used to train AliBERT:

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and the masked language model from the Hub
tokenizer = AutoTokenizer.from_pretrained("quinten-datalab/AliBERT-7GB")
model = AutoModelForMaskedLM.from_pretrained("quinten-datalab/AliBERT-7GB")

# Build a fill-mask pipeline and predict the masked token
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
nlp_AliBERT = fill_mask("La prise de greffe a été systématiquement réalisée au niveau de la face interne de la [MASK] afin de limiter la plaie cicatricielle.")

[{'score': 0.7724128365516663,
  'token': 6749,
  'token_str': 'cuisse',
  'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la cuisse afin de limiter la plaie cicatricielle.'},
 {'score': 0.09472355246543884,
  'token': 4915,
  'token_str': 'jambe',
  'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la jambe afin de limiter la plaie cicatricielle.'},
 {'score': 0.03340734913945198,
  'token': 2050,
  'token_str': 'main',
  'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la main afin de limiter la plaie cicatricielle.'},
 {'score': 0.030924487859010696,
  'token': 844,
  'token_str': 'face',
  'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la face afin de limiter la plaie cicatricielle.'},
 {'score': 0.012518334202468395,
  'token': 3448,
  'token_str': 'joue',
  'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la joue afin de limiter la plaie cicatricielle.'}]
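
The pipeline returns a list of dictionaries, one per candidate token. As a minimal sketch of how such a result might be post-processed, the snippet below picks the top prediction and keeps only candidates above a score threshold. The scores are copied (rounded) from the output above; the `confident_tokens` helper and the 0.05 threshold are illustrative assumptions, not part of the model card.

```python
# Scores and tokens copied (rounded) from the fill-mask output shown above.
predictions = [
    {"score": 0.7724, "token_str": "cuisse"},
    {"score": 0.0947, "token_str": "jambe"},
    {"score": 0.0334, "token_str": "main"},
    {"score": 0.0309, "token_str": "face"},
    {"score": 0.0125, "token_str": "joue"},
]


def confident_tokens(preds, min_score=0.05):
    """Return token strings scoring at least `min_score`, best first.

    `min_score` is an arbitrary illustrative threshold.
    """
    ranked = sorted(preds, key=lambda p: p["score"], reverse=True)
    return [p["token_str"] for p in ranked if p["score"] >= min_score]


best = max(predictions, key=lambda p: p["score"])
print(best["token_str"])              # most likely filler for [MASK]
print(confident_tokens(predictions))  # candidates above the threshold
```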

Metrics and results

The model has been evaluated on the following downstream tasks:

Biomedical Named Entity Recognition (NER)

The model is evaluated on two publicly available French biomedical datasets, CAS and QUAERO.

CAS dataset

Models	CamemBERT			AliBERT			DrBERT
Entities	P	R	F1	P	R	F1	P	R	F1
Substance	0.96	0.87	0.91	0.96	0.91	0.93	0.83	0.83	0.82
Symptom	0.89	0.91	0.90	0.96	0.98	0.97	0.93	0.90	0.91
Anatomy	0.94	0.91	0.88	0.97	0.97	0.98	0.92	0.93	0.93
Value	0.88	0.46	0.60	0.98	0.99	0.98	0.91	0.91	0.91
Pathology	0.79	0.70	0.74	0.81	0.39	0.52	0.85	0.57	0.68
Macro Avg	0.89	0.79	0.81	0.94	0.85	0.88	0.92	0.87	0.89

Table 1: NER performances on CAS dataset
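
The model card does not state how the "Macro Avg" row is computed. A minimal sketch, assuming it is the unweighted mean of the per-entity scores, reproduces the AliBERT column of Table 1 from the values listed there:

```python
from statistics import mean

# AliBERT per-entity scores copied from Table 1 (CAS): (precision, recall, F1)
alibert_cas = {
    "Substance": (0.96, 0.91, 0.93),
    "Symptom": (0.96, 0.98, 0.97),
    "Anatomy": (0.97, 0.97, 0.98),
    "Value": (0.98, 0.99, 0.98),
    "Pathology": (0.81, 0.39, 0.52),
}


def macro_average(per_entity):
    """Unweighted mean of each metric over all entity types, rounded to 2 decimals."""
    columns = zip(*per_entity.values())  # one tuple per metric (P, R, F1)
    return tuple(round(mean(col), 2) for col in columns)


macro_p, macro_r, macro_f1 = macro_average(alibert_cas)
print(macro_p, macro_r, macro_f1)
```

Because every entity type carries the same weight, the low Pathology recall (0.39) pulls the macro recall down noticeably even though the other four types score above 0.90.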

QUAERO dataset

Models	CamemBERT			AliBERT			DrBERT
Entity	P	R	F1	P	R	F1	P	R	F1
Anatomy	0.649	0.641	0.645	0.795	0.811	0.803	0.736	0.844	0.824
Chemical	0.844	0.847	0.846	0.878	0.893	0.885	0.505	0.823	0.777
Device	0.000	0.000	0.000	0.506	0.356	0.418	0.939	0.237	0.419
Disorder	0.772	0.818	0.794	0.857	0.843	0.850	0.883	0.809	0.845
Procedure	0.880	0.894	0.887	0.969	0.967	0.968	0.944	0.976	0.960
Macro Avg	0.655	0.656	0.655	0.807	0.783	0.793	0.818	0.755	0.782

Table 2: NER performances on QUAERO dataset
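
In both tables, F1 is the harmonic mean of precision (P) and recall (R). As a small worked example, the sketch below recomputes the F1 column from the AliBERT precision and recall values of Table 2; the function and variable names are illustrative.

```python
# AliBERT precision and recall copied from Table 2 (QUAERO)
alibert_quaero = {
    "Anatomy": (0.795, 0.811),
    "Chemical": (0.878, 0.893),
    "Device": (0.506, 0.356),
    "Disorder": (0.857, 0.843),
    "Procedure": (0.969, 0.967),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_by_entity = {
    entity: round(f1_score(p, r), 3) for entity, (p, r) in alibert_quaero.items()
}
print(f1_by_entity)
```

The zero guard covers rows such as CamemBERT on Device, where precision and recall are both 0.000 and the harmonic mean would otherwise divide by zero.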

AliBERT: A Pre-trained Language Model for French Biomedical Text