---
license: mit
language:
- fr
library_name: transformers
tags:
- linformer
- medical
- RoBERTa
- pytorch
---

# Jargon-general-biomed

[Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder LM for French, combining the LinFormer attention mechanism with the RoBERTa model architecture.

Jargon is available in several versions with different context sizes and types of pre-training corpora.


| **Model**                                                                           | **Initialised from...** |
|-------------------------------------------------------------------------------------|:-----------------------:|
| [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base)        |         scratch         |
| [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed)    |   jargon-general-base   |
| jargon-general-legal                                                                |   jargon-general-base   |
| [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) |   jargon-general-base   |
| jargon-legal                                                                        |         scratch         |
| [jargon-legal-4096](https://huggingface.co/PantagrueLLM/jargon-legal-4096)         |         scratch         |
| [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed)                    |         scratch         |
| [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096)          |         scratch         |
| [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS)                    |         scratch         |
| [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096)        |         scratch         |


## Evaluation

The Jargon models were evaluated on a range of specialized downstream tasks.

### Biomedical Benchmark

Results are averaged across five runs with varying random seeds.

| |[**FrenchMedMCQA**](https://huggingface.co/datasets/qanastek/frenchmedmcqa)|[**MQC**](https://aclanthology.org/2020.lrec-1.72/)|[**CAS-POS**](https://clementdalloux.fr/?page_id=28)|[**ESSAI-POS**](https://clementdalloux.fr/?page_id=28)|[**CAS-SG**](https://aclanthology.org/W18-5614/)|[**MEDLINE**](https://huggingface.co/datasets/mnaguib/QuaeroFrenchMed)|[**EMEA**](https://huggingface.co/datasets/mnaguib/QuaeroFrenchMed)|[**E3C-NER**](https://live.european-language-grid.eu/catalogue/corpus/7618)|[**CLISTER**](https://aclanthology.org/2022.lrec-1.459/)|
|-------------------------|:-----------------------:|:-----------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| **Task Type**           | Sequence Classification | Sequence Classification | Token Classification | Token Classification | Token Classification | Token Classification | Token Classification | Token Classification |          STS         |
| **Metric**              |           EMR           |         Accuracy        |       Macro-F1       |       Macro-F1       |      Weighted F1     |      Weighted F1     |      Weighted F1     |      Weighted F1     | Spearman Correlation |
| jargon-general-base     |           12.9          |           76.7          |         96.6         |         96.0         |         69.4         |         81.7         |         96.5         |         91.9         |         78.0         |
| jargon-biomed           |           15.3          |           91.1          |         96.5         |         95.6         |         75.1         |         83.7         |         96.5         |         93.5         |         74.6         |
| jargon-biomed-4096      |           14.4          |           78.9          |         96.6         |         95.9         |         73.3         |         82.3         |         96.3         |         92.5         |         65.3         |
| jargon-general-biomed   |           16.1          |           69.7          |         95.1         |         95.1         |         67.8         |         78.2         |         96.6         |         91.3         |         59.7         |
| jargon-multidomain-base |           14.9          |           86.9          |         96.3         |         96.0         |         70.6         |         82.4         |         96.6         |         92.6         |         74.8         |
| jargon-NACHOS           |           13.3          |           90.7          |         96.3         |         96.2         |         75.0         |         83.4         |         96.8         |         93.1         |         70.9         |
| jargon-NACHOS-4096      |           18.4          |           93.2          |         96.2         |         95.9         |         74.9         |         83.8         |         96.8         |         93.2         |         74.9         |

For more info please check out the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).


## Using Jargon models with HuggingFace transformers

You can get started with `jargon-general-biomed` using the code snippet below:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-biomed", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-general-biomed", trust_remote_code=True)

jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
output = jargon_maskfiller("Il est allé au <mask> hier")
```
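
The pipeline returns a ranked list of candidate completions. Assuming the standard `fill-mask` output format (a list of dictionaries with `score`, `token_str`, and `sequence` keys), the top predictions can be inspected as follows:

```python
# Print each candidate token and its score; the output structure follows
# the standard transformers fill-mask pipeline.
for prediction in output:
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```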

You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task in question.
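
As a minimal sketch of the latter, the snippet below attaches a freshly initialised token-classification head on top of the pretrained encoder; the label count and the example sentence are illustrative placeholders, not part of the released model:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-biomed", trust_remote_code=True)

# num_labels is a hypothetical tag-set size for an NER-style task.
model = AutoModelForTokenClassification.from_pretrained(
    "PantagrueLLM/jargon-general-biomed",
    num_labels=5,
    trust_remote_code=True,
)

# Example French sentence: "The patient presents with persistent fever."
inputs = tokenizer("Le patient présente une fièvre persistante.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, num_labels)
```

The classification head is untrained at this point, so the model still needs to be fine-tuned on labelled data (for example with the `Trainer` API) before it produces meaningful predictions.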

- **Language(s):** French
- **License:** MIT
- **Developed by:** Vincent Segonne
- **Funded by**
  - GENCI-IDRIS (Grant 2022 A0131013801)
  - French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
  - MIAI@Grenoble Alpes ANR-19-P3IA-0003
  - PROPICTO ANR-20-CE93-0005
  - Lawbot ANR-20-CE38-0013
  - Swiss National Science Foundation (grant PROPICTO N°197864)
- **Authors**
  - Vincent Segonne
  - Aidan Mannion
  - Laura Cristina Alonzo Canul
  - Alexandre Audibert
  - Xingyu Liu
  - Cécile Macaire
  - Adrien Pupier
  - Yongxin Zhou
  - Mathilde Aguiar
  - Felix Herron
  - Magali Norré
  - Massih-Reza Amini
  - Pierrette Bouillon
  - Iris Eshkol-Taravella
  - Emmanuelle Esperança-Rodier
  - Thomas François
  - Lorraine Goeuriot
  - Jérôme Goulian
  - Mathieu Lafourcade
  - Benjamin Lecouteux
  - François Portet
  - Fabien Ringeval
  - Vincent Vandeghinste
  - Maximin Coavoux
  - Marco Dinarelli
  - Didier Schwab



## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{segonne:hal-04535557,
  TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
  AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
  URL = {https://hal.science/hal-04535557},
  BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
  ADDRESS = {Turin, Italy},
  YEAR = {2024},
  MONTH = May,
  KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
  PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
  HAL_ID = {hal-04535557},
  HAL_VERSION = {v1},
}
```


