	---
	license: mit
	language:
	- fr
	library_name: transformers
	tags:
	- Biomedical
	- Medical
	- French-Biomedical
	Mask token:
	- [MASK]
	widget:
	- text: "A l’admission, l’examen clinique mettait en évidence : - une hypotension artérielle avec une pression [MASK] à 6 mmHg."
	  example_title: "Example 1"
	- text: "Le patient a été diagnostiqué avec une [MASK] lobaire aiguë et a été traité avec des antibiotiques appropriés"
	  example_title: "Example 2"
	- text: "En mars 2001, le malade fut opéré, mais vu le caractère hémorragique de la tumeur, une simple biopsie surrénalienne a été réalisée ayant montré l’aspect de [MASK] malin non Hodgkinien de haut grade de malignité."
	  example_title: "Example 3"
	- text: "La cytologie urinaire n’a mis en évidence que des cellules [MASK] normales et l’examen cyto-bactériologique des urines était stérile."
	  example_title: "Example 4"
	- text: "La prise de greffe a été systématiquement réalisée au niveau de la face interne de la [MASK] afin de limiter la plaie cicatricielle."
	  example_title: "Example 5"
	---

	# quinten-datalab/AliBERT-7GB: a pre-trained language model for French biomedical text


	# Introduction

	AliBERT is a pre-trained language model for French biomedical text. It is trained with a masked language modeling objective, like RoBERTa.

	Here are the main contributions of our work:
	<ul>
	<li>
	A French biomedical language model, a language-specific and domain-specific PLM, which can be used to represent French biomedical text for different downstream tasks.
	</li>
	<li>
	A normalization of the Unigram sub-word tokenization of French biomedical text, which improves the vocabulary and the overall performance of the trained models.
	</li>
	<li>
	A foundation model that achieves state-of-the-art results on French biomedical text.
	</li>
	</ul>
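
As an illustration of the masked language modeling objective mentioned in the introduction, here is a minimal sketch of BERT/RoBERTa-style token masking. The 15% masking rate, the 80/10/10 replacement split, the whitespace tokenization, and the `mask_tokens` helper are assumptions borrowed from the standard recipe for illustration only. They are not confirmed AliBERT training settings, and the actual model works on Unigram sub-word tokens.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token when position i was selected for
    prediction, and None otherwise. Hypothetical helper, not AliBERT's code.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)  # usually: replace with the mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # sometimes: random token
            else:
                masked.append(tok)  # sometimes: keep the token unchanged
        else:
            labels.append(None)  # position not used in the loss
            masked.append(tok)
    return masked, labels


# Whitespace split is a stand-in for the real sub-word tokenizer.
tokens = "Le patient a été traité avec des antibiotiques appropriés".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.3)
print(masked)
print(labels)
```

During pre-training, the loss is computed only at positions where `labels` is not `None`.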

	The paper can be found here: https://aclanthology.org/2023.bionlp-1.19/

	# Data
	The pre-training corpus was gathered from several sub-corpora. It is composed of 7GB of French biomedical text documents. The sources are listed below.

	| Dataset name | Quantity | Size |
	|----|---|---|
	| Drug leaflets (Base de données publique des médicaments) | 23K | 550 MB |
	| RCP (a French equivalent of the Physician’s Desk Reference) | 35K | 2200 MB |
	| Articles (biomedical articles from ScienceDirect) | 500K | 4300 MB |
	| Thesis (thesis manuscripts in French) | 300K | 300 MB |
	| Cochrane (articles from the Cochrane database) | 7.6K | 27 MB |

	Table 1: Pretraining dataset
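
As a quick sanity check on Table 1, the short sketch below adds up the per-source sizes and prints each source's share of the corpus. The sizes are read as megabytes, which is an assumption about the units in the table, and the short source names used as dictionary keys are only labels for this example.

```python
# Sizes copied from Table 1, interpreted as megabytes (assumption).
corpus_mb = {
    "Drug leaflets": 550,
    "RCP": 2200,
    "Articles": 4300,
    "Thesis": 300,
    "Cochrane": 27,
}

total_mb = sum(corpus_mb.values())

# Fraction of the whole corpus contributed by each source.
shares = {name: size / total_mb for name, size in corpus_mb.items()}

for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {corpus_mb[name]:>5} MB  {share:6.1%}")
print(f"Total: {total_mb} MB")
```

The total comes to 7377 MB, in line with the roughly 7GB stated above, and the ScienceDirect articles account for more than half of it.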

	# How to use quinten-datalab/AliBERT-7GB with HuggingFace

	Load the quinten-datalab/AliBERT-7GB fill-mask model and the tokenizer used to train AliBERT:

	```python
	from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

	# Load the tokenizer and the masked language model from the Hugging Face Hub.
	tokenizer = AutoTokenizer.from_pretrained("quinten-datalab/AliBERT-7GB")
	model = AutoModelForMaskedLM.from_pretrained("quinten-datalab/AliBERT-7GB")

	# Build a fill-mask pipeline and predict the token hidden behind [MASK].
	fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
	nlp_AliBERT = fill_mask("La prise de greffe a été systématiquement réalisée au niveau de la face interne de la [MASK] afin de limiter la plaie cicatricielle.")
	print(nlp_AliBERT)

	# Output:

	[{'score': 0.7724128365516663,
	'token': 6749,
	'token_str': 'cuisse',
	'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la cuisse afin de limiter la plaie cicatricielle.'},
	{'score': 0.09472355246543884,
	'token': 4915,
	'token_str': 'jambe',
	'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la jambe afin de limiter la plaie cicatricielle.'},
	{'score': 0.03340734913945198,
	'token': 2050,
	'token_str': 'main',
	'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la main afin de limiter la plaie cicatricielle.'},
	{'score': 0.030924487859010696,
	'token': 844,
	'token_str': 'face',
	'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la face afin de limiter la plaie cicatricielle.'},
	{'score': 0.012518334202468395,
	'token': 3448,
	'token_str': 'joue',
	'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la joue afin de limiter la plaie cicatricielle.'}]
	```
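
The pipeline returns a list of dictionaries, one per candidate token. The sketch below shows one possible way to post-process that list by keeping only candidates above a score threshold. It works on an abbreviated copy of the output shown above (scores rounded), so it runs without downloading the model. The `confident_fills` helper and the `min_score=0.05` threshold are hypothetical choices for illustration.

```python
# Abbreviated copy of the fill-mask output shown above (scores rounded).
predictions = [
    {"score": 0.7724, "token_str": "cuisse"},
    {"score": 0.0947, "token_str": "jambe"},
    {"score": 0.0334, "token_str": "main"},
    {"score": 0.0309, "token_str": "face"},
    {"score": 0.0125, "token_str": "joue"},
]


def confident_fills(preds, min_score=0.05):
    """Return (token, score) pairs with score >= min_score, best first."""
    kept = [(p["token_str"], p["score"]) for p in preds if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


fills = confident_fills(predictions)
best_token, best_score = fills[0]
print(fills)
print(best_token, best_score)
```

With this threshold only `cuisse` and `jambe` are kept. A suitable cut-off depends on the downstream use.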

	# Metrics and results
	The model has been evaluated on the following downstream tasks.

	## Biomedical Named Entity Recognition (NER)
	The model is evaluated on two publicly available French biomedical datasets, CAS and QUAERO.
	#### CAS dataset

	<style type="text/css">
	.tg {border-collapse:collapse;border-spacing:0;}
	.tg .tg-baqh{text-align:center;vertical-align:top}
	.tg .tg-0lax{text-align:center;vertical-align:top}
	</style>
	<table class="tg">
	<thead>
	<tr>
	<th>Models</th>
	<th class="tg-0lax" colspan="3">CamemBERT</th>
	<th class="tg-0lax" colspan="3">AliBERT</th>
	<th class="tg-0lax" colspan="3">DrBERT</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>Entities</td>
	<td>P<br></td>
	<td>R</td>
	<td>F1</td>
	<td>P<br></td>
	<td>R</td>
	<td>F1</td>
	<td>P<br></td>
	<td>R</td>
	<td>F1</td>
	</tr>
	<tr>
	<td>Substance</td>
	<td>0.96</td>
	<td>0.87</td>
	<td>0.91</td>
	<td>0.96</td>
	<td>0.91</td>
	<td>0.93</td>
	<td>0.83</td>
	<td>0.83</td>
	<td>0.82</td>
	</tr>
	<tr>
	<td>Symptom</td> <td>0.89</td> <td>0.91</td> <td>0.90</td> <td>0.96</td> <td>0.98</td> <td>0.97</td> <td>0.93</td> <td>0.90</td> <td>0.91</td>
	</tr>
	<tr>
	<td>Anatomy</td> <td>0.94</td> <td>0.91</td> <td>0.88</td> <td>0.97</td> <td>0.97</td> <td>0.98</td> <td>0.92</td> <td>0.93</td> <td>0.93</td>
	</tr>
	<tr>
	<td>Value</td> <td>0.88</td> <td>0.46</td> <td>0.60</td> <td>0.98</td> <td>0.99</td> <td>0.98</td> <td>0.91</td> <td>0.91</td> <td>0.91</td>
	</tr>
	<tr>
	<td>Pathology</td> <td>0.79</td> <td>0.70</td> <td>0.74</td> <td>0.81</td> <td>0.39</td> <td>0.52</td> <td>0.85</td> <td>0.57</td> <td>0.68</td>
	</tr>
	<tr>
	<td>Macro Avg</td> <td>0.89 </td> <td>0.79</td> <td>0.81</td> <td> 0.94</td> <td>0.85</td> <td>0.88</td> <td> 0.92</td> <td> 0.87</td> <td>0.89</td>
	</tr>
	</tbody>
	</table>
	Table 2: NER performance on the CAS dataset

	#### QUAERO dataset

	<table class="tg">
	<thead>
	<tr>
	<th>Models</th>
	<th class="tg-0lax" colspan="3">CamemBERT</th>
	<th class="tg-0lax" colspan="3">AliBERT</th>
	<th class="tg-0lax" colspan="3">DrBERT</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>Entity </td> <td> P </td> <td> R </td> <td> F1 </td> <td> P </td> <td> R </td> <td> F1 </td> <td> P </td> <td> R </td> <td> F1 </td>
	</tr>
	<tr>
	<td>Anatomy </td> <td> 0.649 </td> <td> 0.641 </td> <td> 0.645 </td> <td> 0.795 </td> <td> 0.811 </td> <td> 0.803 </td> <td> 0.736 </td> <td> 0.844 </td> <td> 0.824 </td>
	</tr>
	<tr>
	<td>Chemical </td> <td> 0.844 </td> <td> 0.847 </td> <td> 0.846 </td> <td> 0.878 </td> <td> 0.893 </td> <td> 0.885 </td> <td> 0.505 </td> <td> 0.823 </td> <td> 0.777 </td>
	</tr>
	<tr>
	<td>Device </td> <td> 0.000 </td> <td> 0.000 </td> <td> 0.000 </td> <td> 0.506 </td> <td> 0.356 </td> <td> 0.418 </td> <td> 0.939 </td> <td> 0.237 </td> <td> 0.419 </td>
	</tr>
	<tr>
	<td>Disorder </td> <td> 0.772 </td> <td> 0.818 </td> <td> 0.794 </td> <td> 0.857 </td> <td> 0.843 </td> <td> 0.850 </td> <td> 0.883 </td> <td> 0.809 </td> <td> 0.845 </td>
	</tr>
	<tr>
	<td>Procedure </td> <td> 0.880 </td> <td> 0.894 </td> <td> 0.887 </td> <td> 0.969 </td> <td> 0.967 </td> <td> 0.968 </td> <td> 0.944 </td> <td> 0.976 </td> <td> 0.960 </td>
	</tr>
	<tr>
	<td>Macro Avg </td> <td> 0.655 </td> <td> 0.656 </td> <td> 0.655 </td> <td> 0.807 </td> <td> 0.783 </td> <td> 0.793 </td> <td> 0.818 </td> <td> 0.755 </td> <td> 0.782 </td>
	</tr>
	</tbody>
	</table>
	Table 3: NER performance on the QUAERO dataset
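
For readers less familiar with the P, R and F1 columns, F1 is the harmonic mean of precision and recall. The sketch below recomputes F1 for the AliBERT columns of the QUAERO table from the reported precision and recall. The `f1_score` helper is written for this example and is not the authors' evaluation code. Because the table values are rounded to three decimals, small differences from the reported F1 are possible.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for AliBERT, copied from the QUAERO table.
alibert_quaero = {
    "Anatomy": (0.795, 0.811, 0.803),
    "Chemical": (0.878, 0.893, 0.885),
    "Device": (0.506, 0.356, 0.418),
    "Disorder": (0.857, 0.843, 0.850),
    "Procedure": (0.969, 0.967, 0.968),
}

for entity, (p, r, reported) in alibert_quaero.items():
    print(f"{entity:<10} recomputed={f1_score(p, r):.3f} reported={reported:.3f}")
```

The zero guard matters for rows such as CamemBERT on Device, where precision and recall are both 0.000.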

	## AliBERT: A Pre-trained Language Model for French Biomedical Text