---
license: mit
language:
- fr
library_name: transformers
tags:
- linformer
- medical
- RoBERTa
- pytorch
---
# Jargon-general-biomed
[Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder language model for French, combining the Linformer attention mechanism with the RoBERTa model architecture.
Jargon is available in several versions with different context sizes and types of pre-training corpora.
| **Model** | **Initialised from...** |
|-------------------------------------------------------------------------------------|:-----------------------:|
| [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base) | scratch |
| [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed) | jargon-general-base |
| jargon-general-legal | jargon-general-base |
| [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) | jargon-general-base |
| jargon-legal | scratch |
| [jargon-legal-4096](https://huggingface.co/PantagrueLLM/jargon-legal-4096) | scratch |
| [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed) | scratch |
| [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096) | scratch |
| [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS) | scratch |
| [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096) | scratch |
## Evaluation
The Jargon models were evaluated on a range of specialized downstream tasks.
## Biomedical Benchmark
Results are averaged across five runs with varying random seeds.
| |[**FrenchMedMCQA**](https://huggingface.co/datasets/qanastek/frenchmedmcqa)|[**MQC**](https://aclanthology.org/2020.lrec-1.72/)|[**CAS-POS**](https://clementdalloux.fr/?page_id=28)|[**ESSAI-POS**](https://clementdalloux.fr/?page_id=28)|[**CAS-SG**](https://aclanthology.org/W18-5614/)|[**MEDLINE**](https://huggingface.co/datasets/mnaguib/QuaeroFrenchMed)|[**EMEA**](https://huggingface.co/datasets/mnaguib/QuaeroFrenchMed)|[**E3C-NER**](https://live.european-language-grid.eu/catalogue/corpus/7618)|[**CLISTER**](https://aclanthology.org/2022.lrec-1.459/)|
|-------------------------|:-----------------------:|:-----------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| **Task Type** | Sequence Classification | Sequence Classification | Token Classification | Token Classification | Token Classification | Token Classification | Token Classification | Token Classification | STS |
| **Metric** | EMR | Accuracy | Macro-F1 | Macro-F1 | Weighted F1 | Weighted F1 | Weighted F1 | Weighted F1 | Spearman Correlation |
| jargon-general-base | 12.9 | 76.7 | 96.6 | 96.0 | 69.4 | 81.7 | 96.5 | 91.9 | 78.0 |
| jargon-biomed | 15.3 | 91.1 | 96.5 | 95.6 | 75.1 | 83.7 | 96.5 | 93.5 | 74.6 |
| jargon-biomed-4096 | 14.4 | 78.9 | 96.6 | 95.9 | 73.3 | 82.3 | 96.3 | 92.5 | 65.3 |
| jargon-general-biomed | 16.1 | 69.7 | 95.1 | 95.1 | 67.8 | 78.2 | 96.6 | 91.3 | 59.7 |
| jargon-multidomain-base | 14.9 | 86.9 | 96.3 | 96.0 | 70.6 | 82.4 | 96.6 | 92.6 | 74.8 |
| jargon-NACHOS | 13.3 | 90.7 | 96.3 | 96.2 | 75.0 | 83.4 | 96.8 | 93.1 | 70.9 |
| jargon-NACHOS-4096 | 18.4 | 93.2 | 96.2 | 95.9 | 74.9 | 83.8 | 96.8 | 93.2 | 74.9 |
For more information, see the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).
## Using Jargon models with Hugging Face `transformers`
You can get started with `jargon-general-biomed` using the code snippet below:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# trust_remote_code=True is required: Jargon's Linformer-based architecture is
# provided as custom code hosted alongside the model weights
tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-biomed", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-general-biomed", trust_remote_code=True)

# Fill in the <mask> token ("Il est allé au <mask> hier" = "He went to the <mask> yesterday")
jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
output = jargon_maskfiller("Il est allé au <mask> hier")
```
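Each entry of `output` follows the standard fill-mask pipeline format (a dictionary with `sequence`, `score`, `token`, and `token_str` fields), so the top candidates can be inspected with, for example:

```python
# Print each predicted token alongside its probability score
for candidate in output:
    print(f"{candidate['token_str']}\t{candidate['score']:.3f}")
```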
You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task in question.
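As a minimal sketch, a fine-tuning setup for a token-level task such as NER might start as follows. Note that the `num_labels` value below is an illustrative placeholder, and the newly added classification head is randomly initialized, so the model must be fine-tuned before it produces meaningful predictions:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-biomed", trust_remote_code=True)
# num_labels is task-specific; 9 here is a placeholder (e.g. a BIO scheme with 4 entity types)
model = AutoModelForTokenClassification.from_pretrained(
    "PantagrueLLM/jargon-general-biomed",
    num_labels=9,
    trust_remote_code=True,
)
# Fine-tune on a labelled corpus (e.g. with the transformers Trainer API)
# before running inference with this model.
```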
- **Language(s):** French
- **License:** MIT
- **Developed by:** Vincent Segonne
- **Funded by:**
- GENCI-IDRIS (Grant 2022 A0131013801)
- French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
- MIAI@Grenoble Alpes ANR-19-P3IA-0003
- PROPICTO ANR-20-CE93-0005
- Lawbot ANR-20-CE38-0013
- Swiss National Science Foundation (grant PROPICTO N°197864)
- **Authors:**
- Vincent Segonne
- Aidan Mannion
- Laura Cristina Alonzo Canul
- Alexandre Audibert
- Xingyu Liu
- Cécile Macaire
- Adrien Pupier
- Yongxin Zhou
- Mathilde Aguiar
- Felix Herron
- Magali Norré
- Massih-Reza Amini
- Pierrette Bouillon
- Iris Eshkol-Taravella
- Emmanuelle Esperança-Rodier
- Thomas François
- Lorraine Goeuriot
- Jérôme Goulian
- Mathieu Lafourcade
- Benjamin Lecouteux
- François Portet
- Fabien Ringeval
- Vincent Vandeghinste
- Maximin Coavoux
- Marco Dinarelli
- Didier Schwab
## Citation
If you use this model for your own research work, please cite as follows:
```bibtex
@inproceedings{segonne:hal-04535557,
TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
URL = {https://hal.science/hal-04535557},
BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
ADDRESS = {Turin, Italy},
YEAR = {2024},
MONTH = May,
KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
HAL_ID = {hal-04535557},
HAL_VERSION = {v1},
}
```