DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized ones have emerged to handle specific domains more effectively. In this paper, we propose an original study of PLMs in the medical domain for the French language. We compare, for the first time, the performance of PLMs trained on public data from the web and on private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks. Finally, we release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under a free license on which these models are trained.

CAS: French Corpus with Clinical Cases

            Train    Dev   Test
Documents   5,306  1,137  1,137

The ESSAIS (Dalloux et al., 2021) and CAS (Grabar et al., 2018) corpora contain 13,848 and 7,580 clinical cases in French, respectively. Some clinical cases are associated with discussions. A subset of the cases is enriched with morpho-syntactic annotations (part-of-speech (POS) tagging, lemmatization) and semantic annotations (UMLS concepts, negation, uncertainty). Here, we focus only on the POS tagging task.
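As a quick consistency check, the split sizes listed for CAS can be compared against the corpus size quoted in the text. This is a minimal sketch in plain Python using only the numbers given on this card; the variable names are illustrative and not part of any released code.

```python
# Split sizes as listed in the Train/Dev/Test table on this card.
splits = {"train": 5306, "dev": 1137, "test": 1137}

# Number of CAS clinical cases quoted in the description.
CAS_TOTAL = 7580

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

# Print each split with its share of the whole corpus.
for name, count in splits.items():
    print(f"{name:<5} {count:>5} {shares[name]:.1%}")

print("splits cover the whole corpus:", total == CAS_TOTAL)
```

Under these figures the three splits add up to the full CAS corpus, with roughly a 70/15/15 train/dev/test partition.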

Model Metric

              precision    recall  f1-score   support

         ABR     0.8683    0.8480    0.8580       171
         ADJ     0.9634    0.9751    0.9692      4018
         ADV     0.9935    0.9849    0.9892       926
     DET:ART     0.9982    0.9997    0.9989      3308
     DET:POS     1.0000    1.0000    1.0000       133
         INT     1.0000    0.7000    0.8235        10
         KON     0.9883    0.9976    0.9929       845
         NAM     0.9144    0.9353    0.9247       834
         NOM     0.9827    0.9803    0.9815      7980
         NUM     0.9825    0.9845    0.9835      1422
     PRO:DEM     0.9924    1.0000    0.9962       131
     PRO:IND     0.9630    1.0000    0.9811        78
     PRO:PER     0.9948    0.9931    0.9939       579
     PRO:REL     1.0000    0.9908    0.9954       109
         PRP     0.9989    0.9982    0.9985      3785
     PRP:det     1.0000    0.9985    0.9993       681
         PUN     0.9996    0.9958    0.9977      2376
     PUN:cit     0.9756    0.9524    0.9639        84
        SENT     1.0000    0.9974    0.9987      1174
         SYM     0.9495    1.0000    0.9741        94
    VER:cond     1.0000    1.0000    1.0000        11
    VER:futu     1.0000    0.9444    0.9714        18
    VER:impf     1.0000    0.9963    0.9981       804
    VER:infi     1.0000    0.9585    0.9788       193
    VER:pper     0.9742    0.9564    0.9652      1261
    VER:ppre     0.9617    0.9901    0.9757       203
    VER:pres     0.9833    0.9904    0.9868       830
    VER:simp     0.9123    0.7761    0.8387        67
    VER:subi     1.0000    0.7000    0.8235        10
    VER:subp     1.0000    0.8333    0.9091        18

    accuracy                         0.9842     32153
   macro avg     0.9799    0.9492    0.9623     32153
weighted avg     0.9843    0.9842    0.9842     32153
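The summary rows of the report can be re-derived from the per-tag rows. The following is a minimal sketch in plain Python, not the authors' evaluation script. It assumes the usual conventions: F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over tags, and the weighted average is the support-weighted mean. The names `ROWS` and `f1_from` are illustrative.

```python
# Per-tag rows copied from the report: (tag, precision, recall, f1, support).
ROWS = [
    ("ABR", 0.8683, 0.8480, 0.8580, 171),
    ("ADJ", 0.9634, 0.9751, 0.9692, 4018),
    ("ADV", 0.9935, 0.9849, 0.9892, 926),
    ("DET:ART", 0.9982, 0.9997, 0.9989, 3308),
    ("DET:POS", 1.0000, 1.0000, 1.0000, 133),
    ("INT", 1.0000, 0.7000, 0.8235, 10),
    ("KON", 0.9883, 0.9976, 0.9929, 845),
    ("NAM", 0.9144, 0.9353, 0.9247, 834),
    ("NOM", 0.9827, 0.9803, 0.9815, 7980),
    ("NUM", 0.9825, 0.9845, 0.9835, 1422),
    ("PRO:DEM", 0.9924, 1.0000, 0.9962, 131),
    ("PRO:IND", 0.9630, 1.0000, 0.9811, 78),
    ("PRO:PER", 0.9948, 0.9931, 0.9939, 579),
    ("PRO:REL", 1.0000, 0.9908, 0.9954, 109),
    ("PRP", 0.9989, 0.9982, 0.9985, 3785),
    ("PRP:det", 1.0000, 0.9985, 0.9993, 681),
    ("PUN", 0.9996, 0.9958, 0.9977, 2376),
    ("PUN:cit", 0.9756, 0.9524, 0.9639, 84),
    ("SENT", 1.0000, 0.9974, 0.9987, 1174),
    ("SYM", 0.9495, 1.0000, 0.9741, 94),
    ("VER:cond", 1.0000, 1.0000, 1.0000, 11),
    ("VER:futu", 1.0000, 0.9444, 0.9714, 18),
    ("VER:impf", 1.0000, 0.9963, 0.9981, 804),
    ("VER:infi", 1.0000, 0.9585, 0.9788, 193),
    ("VER:pper", 0.9742, 0.9564, 0.9652, 1261),
    ("VER:ppre", 0.9617, 0.9901, 0.9757, 203),
    ("VER:pres", 0.9833, 0.9904, 0.9868, 830),
    ("VER:simp", 0.9123, 0.7761, 0.8387, 67),
    ("VER:subi", 1.0000, 0.7000, 0.8235, 10),
    ("VER:subp", 1.0000, 0.8333, 0.9091, 18),
]


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


total_support = sum(support for *_, support in ROWS)

# Macro average: every tag counts equally, however rare it is.
macro_f1 = sum(f1 for _, _, _, f1, _ in ROWS) / len(ROWS)

# Weighted average: each tag counts in proportion to its support.
weighted_f1 = sum(f1 * support for _, _, _, f1, support in ROWS) / total_support

# Largest gap between a reported F1 and the one recomputed from the
# rounded precision/recall values (expected to be rounding noise only).
max_f1_gap = max(abs(f1_from(p, r) - f1) for _, p, r, f1, _ in ROWS)

print(f"support={total_support} macro_f1={macro_f1:.4f} "
      f"weighted_f1={weighted_f1:.4f} max_f1_gap={max_f1_gap:.5f}")
```

The gap between the macro and weighted F1 scores comes from low-support tags such as INT, VER:subi and VER:simp: their lower scores pull the macro average down but barely affect the weighted one.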

Citation BibTeX

@inproceedings{labrak2023drbert,
    title = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
    author = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
    booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
    month = jul,
    year = 2023,
    address = {Toronto, Canada},
    publisher = {Association for Computational Linguistics}
}