---
license: mit
language:
- fr
library_name: transformers
tags:
- linformer
- medical
- RoBERTa
- pytorch
---
# Jargon-general-biomed
[Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder language model for French, combining the Linformer attention mechanism with the RoBERTa model architecture.
Jargon is available in several versions with different context sizes and types of pre-training corpora.
| **Model** | **Initialised from...** |
|-------------------------------------------------------------------------------------|:-----------------------:|
| [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base) | scratch |
| [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed) | jargon-general-base |
| jargon-general-legal | jargon-general-base |
| [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) | jargon-general-base |
| jargon-legal | scratch |
| [jargon-legal-4096](https://huggingface.co/PantagrueLLM/jargon-legal-4096) | scratch |
| [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed) | scratch |
| [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096) | scratch |
| [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS) | scratch |
| [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096) | scratch |
## Evaluation
The Jargon models were evaluated on a range of specialized downstream tasks.
## Biomedical Benchmark
Results are averaged across five runs with varying random seeds.
| |[**FrenchMedMCQA**](https://huggingface.co/datasets/qanastek/frenchmedmcqa)|[**MQC**](https://aclanthology.org/2020.lrec-1.72/)|[**CAS-POS**](https://clementdalloux.fr/?page_id=28)|[**ESSAI-POS**](https://clementdalloux.fr/?page_id=28)|[**CAS-SG**](https://aclanthology.org/W18-5614/)|[**MEDLINE**](https://huggingface.co/datasets/mnaguib/QuaeroFrenchMed)|[**EMEA**](https://huggingface.co/datasets/mnaguib/QuaeroFrenchMed)|[**E3C-NER**](https://live.european-language-grid.eu/catalogue/corpus/7618)|[**CLISTER**](https://aclanthology.org/2022.lrec-1.459/)|
|-------------------------|:-----------------------:|:-----------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| **Task Type** | Sequence Classification | Sequence Classification | Token Classification | Token Classification | Token Classification | Token Classification | Token Classification | Token Classification | STS |
| **Metric** | EMR | Accuracy | Macro-F1 | Macro-F1 | Weighted F1 | Weighted F1 | Weighted F1 | Weighted F1 | Spearman Correlation |
| jargon-general-base | 12.9 | 76.7 | 96.6 | 96.0 | 69.4 | 81.7 | 96.5 | 91.9 | 78.0 |
| jargon-biomed | 15.3 | 91.1 | 96.5 | 95.6 | 75.1 | 83.7 | 96.5 | 93.5 | 74.6 |
| jargon-biomed-4096 | 14.4 | 78.9 | 96.6 | 95.9 | 73.3 | 82.3 | 96.3 | 92.5 | 65.3 |
| jargon-general-biomed | 16.1 | 69.7 | 95.1 | 95.1 | 67.8 | 78.2 | 96.6 | 91.3 | 59.7 |
| jargon-multidomain-base | 14.9 | 86.9 | 96.3 | 96.0 | 70.6 | 82.4 | 96.6 | 92.6 | 74.8 |
| jargon-NACHOS | 13.3 | 90.7 | 96.3 | 96.2 | 75.0 | 83.4 | 96.8 | 93.1 | 70.9 |
| jargon-NACHOS-4096 | 18.4 | 93.2 | 96.2 | 95.9 | 74.9 | 83.8 | 96.8 | 93.2 | 74.9 |
For more information, see the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).
## Using Jargon models with Hugging Face `transformers`
You can get started with `jargon-general-biomed` using the code snippet below:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# trust_remote_code=True is required: Jargon's Linformer-based architecture is
# provided as custom code hosted alongside the model weights
tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-biomed", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-general-biomed", trust_remote_code=True)

# Fill in the <mask> token ("Il est allé au <mask> hier" = "He went to the <mask> yesterday")
jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
output = jargon_maskfiller("Il est allé au <mask> hier")
```
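Each entry of `output` follows the standard fill-mask pipeline format (a dictionary with `sequence`, `score`, `token`, and `token_str` fields), so the top candidates can be inspected with, for example:

```python
# Print each predicted token alongside its probability score
for candidate in output:
    print(f"{candidate['token_str']}\t{candidate['score']:.3f}")
```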
You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task in question.
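As a minimal sketch, a fine-tuning setup for a token-level task such as NER might start as follows. Note that the `num_labels` value below is an illustrative placeholder, and the newly added classification head is randomly initialized, so the model must be fine-tuned before it produces meaningful predictions:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-biomed", trust_remote_code=True)
# num_labels is task-specific; 9 here is a placeholder (e.g. a BIO scheme with 4 entity types)
model = AutoModelForTokenClassification.from_pretrained(
    "PantagrueLLM/jargon-general-biomed",
    num_labels=9,
    trust_remote_code=True,
)
# Fine-tune on a labelled corpus (e.g. with the transformers Trainer API)
# before running inference with this model.
```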
- **Language(s):** French
- **License:** MIT
- **Developed by:** Vincent Segonne
- **Funded by:**
- GENCI-IDRIS (Grant 2022 A0131013801)
- French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
- MIAI@Grenoble Alpes ANR-19-P3IA-0003
- PROPICTO ANR-20-CE93-0005
- Lawbot ANR-20-CE38-0013
- Swiss National Science Foundation (grant PROPICTO N°197864)
- **Authors:**
- Vincent Segonne
- Aidan Mannion
- Laura Cristina Alonzo Canul
- Alexandre Audibert
- Xingyu Liu
- Cécile Macaire
- Adrien Pupier
- Yongxin Zhou
- Mathilde Aguiar
- Felix Herron
- Magali Norré
- Massih-Reza Amini
- Pierrette Bouillon
- Iris Eshkol-Taravella
- Emmanuelle Esperança-Rodier
- Thomas François
- Lorraine Goeuriot
- Jérôme Goulian
- Mathieu Lafourcade
- Benjamin Lecouteux
- François Portet
- Fabien Ringeval
- Vincent Vandeghinste
- Maximin Coavoux
- Marco Dinarelli
- Didier Schwab
## Citation
If you use this model for your own research work, please cite as follows:
```bibtex
@inproceedings{segonne:hal-04535557,
TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
URL = {https://hal.science/hal-04535557},
BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
ADDRESS = {Turin, Italy},
YEAR = {2024},
MONTH = May,
KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
HAL_ID = {hal-04535557},
HAL_VERSION = {v1},
}
```