Camila Arias committed on
Commit
4d14d9c
1 Parent(s): 92d8f05

Update README.md

Files changed (1)
  1. README.md +128 -0
README.md CHANGED
@@ -1,3 +1,131 @@
  ---
  license: mit
+ language:
+ - fr
+ metrics:
+ - seqeval
+ library_name: transformers
+ pipeline_tag: token-classification
+ tags:
+ - medical
+ - biomedical
  ---
+
+
+ # DrBERT-CASM2
+
+ ## Model description
+
+ **DrBERT-CASM2** is a French Named Entity Recognition model that was fine-tuned from
+ [DrBERT](https://huggingface.co/Dr-BERT/DrBERT-4GB-CP-PubMedBERT), a pre-trained model in French for the biomedical and clinical domains.
+ It was trained with the medkit Trainer to detect the following entity types: **problem**, **treatment** and **test**.
+
+ - **Fine-tuned using** medkit [GitHub Repo](https://github.com/TeamHeka/medkit)
+ - **Developed by** @camila-ud, medkit, HeKA Research team
+ - **Dataset from** @aneuraz, CASM2
+
+ # Intended uses & limitations
+
+ ## Limitations and bias
+
+ This model was trained for **development and test phases**.
+ It is limited by its training dataset and should be used with caution.
+ Results are not guaranteed, and the model should be used only in data exploration stages.
+ It may help detect entities in the early stages of the analysis of medical documents in French.
+
+ The maximum sequence length was reduced to **128 tokens** to shorten training time.
+
+ # How to use
+
+ ## Install medkit
+
+ First, install medkit with the following command:
+
+ ```
+ pip install 'medkit-lib[optional]'
+ ```
+
+ Please check the [documentation](https://medkit.readthedocs.io/en/latest/user_guide/install.html) for more info and examples.
+
+ ## Using the model
+
+ ```python
+ from medkit.core.text import TextDocument
+ from medkit.text.ner.hf_entity_matcher import HFEntityMatcher
+
+ matcher = HFEntityMatcher(model="dcariasvi/DrBERT-CASM2")
+ test_doc = TextDocument("Elle souffre d'asthme mais n'a pas besoin d'Allegra")
+
+ # detect entities in the raw segment
+ detected_entities = matcher.run([test_doc.raw_segment])
+ msg = "|".join(f"'{entity.label}':{entity.text}" for entity in detected_entities)
+ print(f"Text: '{test_doc.text}'\n{msg}")
+ ```
+ ```
+ Text: 'Elle souffre d'asthme mais n'a pas besoin d'Allegra'
+ 'problem':asthme|'treatment':Allegra
+ ```
+
+ # Training data
+
+ This model was fine-tuned on **CASM2**, an internal corpus of clinical cases (in French) annotated by master's students.
+ The corpus contains more than 5000 medkit documents (~ phrases) with entities to detect.
+
+ **Number of documents (~ phrases) by split**
+
+ | Split | # medkit docs |
+ | ---------- | ------------- |
+ | Train | 5824 |
+ | Validation | 1457 |
+ | Test | 1821 |
+
+
+ **Number of examples per entity type**
+
+ | Split | treatment | test | problem |
+ | ---------- | --------- | ---- | ------- |
+ | Train | 3258 | 3990 | 6808 |
+ | Validation | 842 | 1007 | 1745 |
+ | Test | 994 | 1289 | 2113 |
+
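As a quick sanity check on the two tables above, the counts can be summed to get the overall corpus size, the share of each split, and the average number of annotated entities per document. This is a minimal sketch that uses only the numbers reported here; the variable names are illustrative and not part of medkit.

```python
# Document counts per split, copied from the first table above.
docs_per_split = {"Train": 5824, "Validation": 1457, "Test": 1821}

# Entity counts per split (treatment + test + problem), from the second table.
entities_per_split = {
    "Train": 3258 + 3990 + 6808,
    "Validation": 842 + 1007 + 1745,
    "Test": 994 + 1289 + 2113,
}

total_docs = sum(docs_per_split.values())
print(f"Total documents: {total_docs}")

for split, n_docs in docs_per_split.items():
    share = n_docs / total_docs                     # fraction of the corpus
    density = entities_per_split[split] / n_docs    # entities per document
    print(f"{split}: {n_docs} docs ({share:.0%}), {density:.2f} entities per doc")
```

With these counts the corpus has 9102 documents in total, split roughly 64% / 16% / 20% between train, validation and test.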
+ ## Training procedure
+
+ This model was fine-tuned using the medkit trainer on CPU; training takes about 3 hours.
+
+ # Model performance
+
+ Model performance computed on the CASM2 test dataset (using the medkit seqeval evaluator):
+
+ | Entity | precision | recall | f1 |
+ | --------- | --------- | ------ | ------ |
+ | treatment | 0.7492 | 0.7666 | 0.7578 |
+ | test | 0.7449 | 0.8240 | 0.7824 |
+ | problem | 0.6884 | 0.7304 | 0.7088 |
+ | Overall | 0.7188 | 0.7660 | 0.7416 |
+
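The f1 column is the harmonic mean of precision and recall, so the reported values can be cross-checked from the other two columns. The sketch below does that in plain Python with the numbers from the table; `f1_score` is a small helper written for this illustration, not a medkit or seqeval function.

```python
# (precision, recall, reported f1) per entity type, copied from the table above.
reported = {
    "treatment": (0.7492, 0.7666, 0.7578),
    "test": (0.7449, 0.8240, 0.7824),
    "problem": (0.6884, 0.7304, 0.7088),
    "Overall": (0.7188, 0.7660, 0.7416),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recompute f1 from the rounded precision and recall values.
recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}

for name, (_, _, f1) in reported.items():
    print(f"{name}: reported f1={f1:.4f}, recomputed f1={recomputed[name]:.4f}")
```

Because precision and recall are themselves rounded to four decimals, the recomputed values can differ from the reported ones in the last digit.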
+ ## How to evaluate using medkit
+ ```python
+ from medkit.text.metrics.ner import SeqEvalEvaluator
+ from medkit.text.ner.hf_entity_matcher import HFEntityMatcher
+
+ # test_documents: a list of annotated medkit TextDocument objects (not defined here)
+ # load the matcher and get predicted entities by document
+ matcher = HFEntityMatcher(model="dcariasvi/DrBERT-CASM2")
+ predicted_entities = [matcher.run([doc.raw_segment]) for doc in test_documents]
+
+ # define seqeval evaluator
+ evaluator = SeqEvalEvaluator(tagging_scheme="iob2")
+ evaluator.compute(test_documents, predicted_entities=predicted_entities)
+ ```
+
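The `tagging_scheme="iob2"` argument refers to the IOB2 convention: the first token of an entity is tagged `B-<label>`, the following tokens of the same entity `I-<label>`, and every other token `O`. The sketch below illustrates the convention in plain Python. The helper `to_iob2`, the hand-made token list and the token-index spans are hypothetical, and they do not reflect how medkit tokenizes text or aligns entities internally.

```python
def to_iob2(n_tokens, spans):
    """Build IOB2 tags for a sequence of n_tokens tokens.

    spans is a list of (start, end, label) token-index triples, end exclusive.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


# Hand-made tokenization of the example sentence used earlier (an assumption).
tokens = ["Elle", "souffre", "d'", "asthme", "mais",
          "n'", "a", "pas", "besoin", "d'", "Allegra"]
spans = [(3, 4, "problem"), (10, 11, "treatment")]

tags = to_iob2(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A multi-token entity gets one `B-` tag followed by `I-` tags, for example `to_iob2(4, [(1, 3, "problem")])` yields `["O", "B-problem", "I-problem", "O"]`.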
+ # Citation
+
+ ```
+ @online{medkit-lib,
+ author={HeKA Research Team},
+ title={medkit, A Python library for a learning health system},
+ url={https://pypi.org/project/medkit-lib/},
+ urldate = {2023-07-24},
+ }
+ ```
+ ```
+ HeKA Research Team, “medkit, a Python library for a learning health system.” https://pypi.org/project/medkit-lib/ (accessed Jul. 24, 2023).
+ ```