from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str
  

# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    # task0 = Task("anli_r1", "acc", "ANLI")
    # task1 = Task("logiqa", "acc_norm", "LogiQA")
    task0 = Task("ncbi", "f1", "NCBI")
    task1 = Task("bc5cdr", "f1", "BC5CD")
    task3 = Task("chia", "f1", "CHIA")
    task4 = Task("biored", "f1", "BIORED")
    # task5 = Task("", "f1", "")
    # task6 = Task("", "f1", "")

@dataclass
class ClinicalType:
    benchmark: str
    metric: str
    col_name: str

class ClinicalTypes(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    type0 = ClinicalType("condition", "f1", "CONDITION") 
    type1 = ClinicalType("measurement", "f1", "MEASUREMENT") 
    type2 = ClinicalType("drug", "f1", "DRUG") 
    type3 = ClinicalType("procedure", "f1", "PROCEDURE") 
    type4 = ClinicalType("gene", "f1", "GENE")
    type5 = ClinicalType("gene variant", "f1", "GENE VARIANT")
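

# Minimal usage sketch (an assumption about how downstream code consumes these enums;
# the actual leaderboard code may differ). Running this module directly prints the
# benchmark/metric/column mapping defined above.
if __name__ == "__main__":
    for task in Tasks:
        t = task.value
        print(f"{t.col_name}: benchmark={t.benchmark}, metric={t.metric}")
    for ctype in ClinicalTypes:
        c = ctype.value
        print(f"{c.col_name}: benchmark={c.benchmark}, metric={c.metric}")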


NUM_FEWSHOT = 0  # Change with your few shot
# ---------------------------------------------------


# Your leaderboard name
TITLE = """""" #<h1 align="center" id="space-title"> NER Leaderboard</h1>"""
# LOGO = """<img src="file/assets/image.png" alt="Clinical X HF" width="500" height="333">"""
LOGO = """<img src="https://huggingface.co/spaces/m42-health/clinical_ner_leaderboard/resolve/main/assets/image.png" alt="Clinical X HF" width="500" height="333">"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
The main goal of the Named Clinical Entity Recognition Leaderboard is to evaluate and benchmark the performance of various language models in accurately identifying and classifying named clinical entities across diverse medical domains. This task is crucial for advancing natural language processing (NLP) applications in healthcare, as accurate entity recognition is foundational for tasks such as information extraction, clinical decision support, and automated documentation.  

The datasets used for this evaluation encompass a wide range of medical entities, including diseases, symptoms, medications, procedures and anatomical terms. These datasets are sourced from openly available clinical data (including annotations) to ensure comprehensive coverage and reflect the complexity of real-world medical language. The evaluation metrics used in this leaderboard focus primarily on the F1-score, a widely recognized measure of a model's accuracy. 
More details about the datasets and metrics can be found below in the 'About' section and the [NCER paper](https://arxiv.org/abs/2410.05046).  

Disclaimer: It is important to note that the purpose of this evaluation is purely academic and exploratory. The models assessed here have not been approved for clinical use, and their results should not be interpreted as clinically validated. The leaderboard serves as a platform for researchers to compare models, understand their strengths and limitations, and drive further advancements in the field of clinical NLP.
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT_1 = f"""

## About

The Named Clinical Entity Recognition Leaderboard is aimed at advancing the field of natural language processing in healthcare. It provides a standardized platform for evaluating and comparing the performance of various language models in recognizing named clinical entities, a critical task for applications such as clinical documentation, decision support, and information extraction. By fostering transparency and facilitating benchmarking, the leaderboard's goal is to drive innovation and improvement in NLP models. It also helps researchers identify the strengths and weaknesses of different approaches, ultimately contributing to the development of more accurate and reliable tools for clinical use. Despite its exploratory nature, the leaderboard aims to play a role in guiding research and ensuring that advancements are grounded in rigorous and comprehensive evaluations. 

## Evaluation method and metrics
When training a Named Entity Recognition (NER) system, the most common evaluation methods measure precision, recall, and F1-score at the token level. While these metrics are useful for fine-tuning the NER system, evaluating the predicted named entities for downstream tasks requires metrics at the full named-entity level. We therefore include both evaluation methods: token-based and span-based. The example below illustrates the difference between the two.
Example Sentence: "The patient was diagnosed with a skin cancer disease."
For simplicity, assume the example sentence contains 10 tokens, with a single two-token disease entity (as shown in the figure below).
"""
EVALUATION_EXAMPLE_IMG = """<img src="https://huggingface.co/spaces/m42-health/clinical_ner_leaderboard/resolve/main/assets/ner_evaluation_example.png" alt="Clinical X HF" width="750" height="500">"""
LLM_BENCHMARKS_TEXT_2 = """
Token-based evaluation involves obtaining the set of token labels (ground-truth annotations) for the annotated entities and the set of token predictions, comparing these sets, and computing a classification report. The results for the example above are shown below.
**Token-based metrics:**

  
  
| Model   | TP  | FP  | FN  | Precision | Recall | F1-Score |
| ------- | --- | --- | --- | --------- | ------ | -------- |
| Model D | 0   | 1   | 0   | 0.00      | 0.00   | 0.00     |
| Model C | 1   | 1   | 1   | 0.50      | 0.50   | 0.50     |
| Model B | 2   | 2   | 0   | 0.50      | 1.00   | 0.67     |
| Model A | 2   | 1   | 0   | 0.67      | 1.00   | 0.80     |


Where,
$$ Precision = TP / (TP + FP) $$
$$ Recall = TP / (TP + FN) $$
$$ F1 = 2 * (Precision * Recall) / (Precision + Recall) $$
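
As a rough illustration (a simplified sketch, not the leaderboard's actual implementation), the snippet below computes these token-level metrics for a single entity type; the tag name and the specific "Model A" prediction are assumptions for this example.

```
def token_metrics(gold_tags, pred_tags, positive_label="DISEASE"):
    # Count token-level TP, FP, FN for one entity type.
    tp = fp = fn = 0
    for gold, pred in zip(gold_tags, pred_tags):
        if pred == positive_label and gold == positive_label:
            tp += 1
        elif pred == positive_label:
            fp += 1
        elif gold == positive_label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical "Model A" prediction: it tags "skin cancer disease" while the gold entity is "skin cancer".
gold = ["O"] * 6 + ["DISEASE", "DISEASE", "O", "O"]
pred = ["O"] * 6 + ["DISEASE", "DISEASE", "DISEASE", "O"]
print(token_metrics(gold, pred))  # approx. (0.67, 1.00, 0.80): TP=2, FP=1, FN=0
```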
  
  

With this token-based approach, we get a broad idea of the model's performance at the token level. However, it may misrepresent performance at the entity level when an entity spans more than one token (which may be more relevant for certain applications). In addition, depending on how certain datasets are annotated, we may not want to penalize a model for a "partial" match with an entity.  
The span-based method addresses some of these issues by determining full or partial matches at the entity level and classifying each prediction as correct, incorrect, missed, or spurious. These counts are then used to calculate precision, recall, and F1-score. The results for the same example are shown below.  
 
**Span-based metrics:**


| Model   | Correct | Incorrect | Missed | Spurious | Precision | Recall | F1-Score |
| ------- | ------- | --------- | ------ | -------- | --------- | ------ | -------- |
| Model A | 1       | 0         | 0      | 0        | 1.00      | 1.00   | 1.00     |
| Model B | 1       | 0         | 0      | 0        | 1.00      | 1.00   | 1.00     |
| Model C | 1       | 0         | 0      | 0        | 1.00      | 1.00   | 1.00     |
| Model D | 0       | 0         | 1      | 1        | 0.00      | 0.00   | 0.00     |


Where,
$$ Precision = COR / (COR + INC + SPU) $$
$$ Recall = COR / (COR + INC + MIS) $$
$$ F1 = 2 * (Precision * Recall) / (Precision + Recall) $$
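
For intuition, here is a minimal sketch of this span-based scheme with partial overlap (again a simplification rather than the exact evaluation code; the span representation and the entity-type name are assumptions for the example):

```
def overlaps(pred, gold):
    # Two (start, end, type) spans overlap if their token ranges intersect.
    return pred[0] < gold[1] and gold[0] < pred[1]

def span_metrics(gold_spans, pred_spans):
    correct = incorrect = spurious = 0
    matched = set()
    for pred in pred_spans:
        hits = [g for g in gold_spans if overlaps(pred, g)]
        if not hits:
            spurious += 1   # no overlapping gold entity
        elif any(g[2] == pred[2] for g in hits):
            correct += 1    # overlapping gold entity of the same type (full or partial match)
            matched.update(h for h in hits if h[2] == pred[2])
        else:
            incorrect += 1  # overlapping gold entity, but the type is wrong
            matched.update(hits)
    missed = sum(1 for g in gold_spans if g not in matched)
    precision = correct / (correct + incorrect + spurious) if pred_spans else 0.0
    recall = correct / (correct + incorrect + missed) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct, incorrect, missed, spurious, precision, recall, f1

# Gold entity "skin cancer" (tokens 6-8); the model predicts the partially overlapping span "skin cancer disease" (tokens 6-9).
gold = [(6, 8, "condition")]
pred = [(6, 9, "condition")]
print(span_metrics(gold, pred))  # -> (1, 0, 0, 0, 1.0, 1.0, 1.0)
```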

Note:
1. The span-based approach here is equivalent to the 'Span Based Evaluation with Partial Overlap' in [NER Metrics Showdown!](https://huggingface.co/spaces/wadood/ner_evaluation_metrics) and to the Partial Match ("Type") evaluation in the nervaluate python package.
2. The token-based approach here is equivalent to the 'Token Based Evaluation With Macro Average' in [NER Metrics Showdown!](https://huggingface.co/spaces/wadood/ner_evaluation_metrics).

Additional examples can be tested on the [NER Metrics Showdown!](https://huggingface.co/spaces/wadood/ner_evaluation_metrics) huggingface space.

## Datasets
The following datasets (test splits only) have been included in the evaluation. 

### [NCBI Disease](https://huggingface.co/datasets/m42-health/clinical_ncbi)
The NCBI Disease corpus includes mention and concept level annotations on PubMed abstracts. It covers annotations of diseases. 

|            | Counts |
| ---------- | ------ |
| Samples    | 100    |
| Annotation | 960    |


### [CHIA](https://huggingface.co/datasets/m42-health/clinical_chia)
This is a large, annotated corpus of patient eligibility criteria extracted from registered clinical trials (ClinicalTrials.gov). Annotations cover 15 different entity types, including conditions, drugs, procedures, and measurements.


|            | Counts |
| ---------- | ------ |
| Samples    | 194    |
| Annotation | 3981   |


### [BC5CDR](https://huggingface.co/datasets/m42-health/clinical_bc5cdr)
The BC5CDR corpus consists of 1500 PubMed articles with annotated chemicals and diseases.


|            | Counts |
| ---------- | ------ |
| Samples    | 500    |
| Annotation | 9928   |


### [BIORED](https://huggingface.co/datasets/m42-health/clinical_biored)
The BIORED corpus includes a set of PubMed abstracts with annotations of multiple entity types (e.g., gene/protein, disease, chemical).


|            | Counts |
| ---------- | ------ |
| Samples    | 100    |
| Annotation | 3535   |


### Datasets summary

A summary of the datasets used in the evaluation is given below.


| Dataset | # samples | # annotations | # original entities | # clinical entities |
| ------- | --------- | ------------- | ------------------- | ------------------- |
| NCBI    | 100       | 960           | 4                   | 1                   |
| CHIA    | 194       | 3981          | 16                  | 4                   |
| BIORED  | 100       | 3535          | 6                   | 4                   |
| BC5CDR  | 500       | 9928          | 2                   | 2                   |


## Clinical Entity Types

The above datasets are modified to suit the clinical setting: only the clinically relevant entity types are retained and the rest are dropped. The entity type names are then standardized across datasets, yielding the 6 clinical entity types shown below.


| Clinical Entity | Combined Annotation |
| --------------- | ------------------- |
| Condition       | 7514                |
| Drug            | 6443                |
| Procedure       | 300                 |
| Measurement     | 258                 |
| Gene            | 1180                |
| Gene Variant    | 241                 |


"""

ENTITY_DISTRIBUTION_IMG = """<img src="file/assets/entity_distribution.png" alt="Clinical X HF" width="750" height="500">"""
LLM_BENCHMARKS_TEXT_3 = """
## Decoder Model Evaluation
Evaluating encoder models, such as BERT, for token classification tasks (e.g., NER) is straightforward given that these models process the entire input sequence simultaneously. This allows them to generate token-level classifications by leveraging bidirectional context, facilitating a direct comparison of predicted tags against the gold standard labels for each token in the input sequence.  

In contrast, decoder-only models, like GPT, generate responses sequentially, predicting one token at a time based on the preceding context. Evaluating the performance of these models on token classification tasks therefore requires a different approach. First, we prompt the decoder-only LLM with the specific task of tagging the entity types within a given text. The task is clearly defined to the model, ensuring it understands which types of entities to identify (e.g., conditions, drugs, procedures).  
An example of the task prompt is shown below.
```
## Instruction
Your task is to generate an HTML version of an input text, marking up specific entities related to healthcare. The entities to be identified are: symptom, disorder. Use HTML <span > tags to highlight these entities. Each <span > should have a class attribute indicating the type of the entity. Do NOT provide further examples and just consider the input provided below. Do NOT provide an explanation nor notes about the reasoning. Do NOT reformat nor summarize the input text. Follow the instruction and the format of the example below.
 
## Entity markup guide
Use <span class='symptom' > to denote a symptom.
Use <span class='disorder' > to denote a disorder.
```
To ensure deterministic and consistent outputs, the temperature for generation is kept at 0.0. The model then generates a sequential response that includes the tagged entities, as shown in the example below.
```
## Input:
He had been diagnosed with osteoarthritis of the knees and had undergone arthroscopy years prior to admission.
## Output:
He had been diagnosed with <span class="disease" >osteoarthritis of the knees</span > and had undergone <span class="procedure" >arthroscopy</span > years prior to admission.
```
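
For concreteness, a deterministic generation call of this kind could look like the sketch below, here using the Hugging Face transformers text-generation pipeline with greedy decoding (the model name is a placeholder, and the actual evaluation harness may use a different client):
```
from transformers import pipeline

# Placeholder model id; the leaderboard evaluates many different decoder models.
generator = pipeline("text-generation", model="your-decoder-model")

prompt = "..."  # the instruction, entity markup guide, and input text shown above
result = generator(prompt, max_new_tokens=512, do_sample=False, return_full_text=False)
tagged_output = result[0]["generated_text"]  # expected to contain the <span > markup
```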

After the tagged output is generated, it is parsed to extract the tagged entities. The parsed data are then compared against the gold standard labels, and performance metrics are computed as above. This evaluation method ensures a consistent and objective assessment of a decoder-only LLM's performance on NER tasks, despite the architectural differences compared to encoder models. 
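
As an illustration of the parsing step (a simplified sketch rather than the exact pipeline code; the regular expression below assumes the double-quoted markup format shown in the example above), the tagged output can be converted back into (entity type, entity text) pairs as follows:
```
import re

# Matches markup of the form <span class="disease" >osteoarthritis</span >
TAG_PATTERN = re.compile(r'<span class="(.+?)" *>(.+?)</span *>')

def parse_tagged_output(tagged_text):
    # Returns a list of (entity_type, entity_text) pairs extracted from the model output.
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(tagged_text)]

output = ('He had been diagnosed with <span class="disease" >osteoarthritis of the knees</span > '
          'and had undergone <span class="procedure" >arthroscopy</span > years prior to admission.')
print(parse_tagged_output(output))
# -> [('disease', 'osteoarthritis of the knees'), ('procedure', 'arthroscopy')]
```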

# Reproducibility
To reproduce our results, follow the steps detailed [here](https://github.com/WadoodAbdul/clinical_ner_benchmark/blob/master/docs/reproducing_results.md).

# Disclaimer and Advisory
The Leaderboard is maintained by the authors and affiliated entity as part of our ongoing contribution to open research in the field of NLP in healthcare. The leaderboard is intended for academic and exploratory purposes only. The language models evaluated on this platform (to the best knowledge of the authors) have not been approved for clinical use, and their performance should not be interpreted as clinically validated or suitable for real-world medical applications.  

Users are advised to approach the results with an understanding of the inherent limitations and the experimental nature of this evaluation. The authors and affiliated entity do not endorse any specific model or approach, and the leaderboard is provided without any warranties or guarantees. Researchers and practitioners are encouraged to use the leaderboard as a resource to guide further research and development, keeping in mind the necessity for rigorous testing and validation in clinical settings before any practical application.




"""

EVALUATION_QUEUE_TEXT = """

Currently, the benchmark supports evaluation of encoder, decoder, and GLiNER models hosted on the Hugging Face Hub.
If your model needs a custom implementation, follow the steps outlined in the [clinical_ner_benchmark](https://github.com/WadoodAbdul/clinical_ner_benchmark/blob/e66eb566f34e33c4b6c3e5258ac85aba42ec7894/docs/custom_model_implementation.md) repo or reach out to our team!


### Fields Explanation

#### Model Type:
- Fine-Tuned: If the training data consisted of any split/variation of the datasets on the leaderboard.  
- Zero-Shot: If the model did not have any exposure to the datasets on the leaderboard while training.

#### Model Architecture:
- Encoder: The standard transformer encoder architecture with a token classification head on top.  
- Decoder: Transformer based autoregressive token generation model.  
- GLiNER: Architecture outlined in the [GLiNER Paper](https://arxiv.org/abs/2311.08526)

#### Label Normalization Map:
Not all models have been tuned to output the NER label names used in the clinical datasets on this leaderboard; some models emit synonyms of the expected entity names.
The normalization map can be used to ensure that a model's outputs are aligned with the labels expected in the datasets.

Note: Multiple model labels can be mapped to a single entity type in the leaderboard datasets, e.g., 'disease' and its synonyms can all be mapped to 'condition'. An illustrative example is shown below.
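
For illustration only (the model-side label names here are made up, and the exact format expected by the submission form may differ), a normalization map could look like:
```
{
    "disease": "condition",
    "disorder": "condition",
    "medication": "drug",
    "mutation": "gene variant"
}
```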


Upon successful submission of your request, your model's results will be updated on the leaderboard within 5 working days!
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{abdul2024namedclinicalentityrecognition,
      title={Named Clinical Entity Recognition Benchmark}, 
      author={Wadood M Abdul and Marco AF Pimentel and Muhammad Umar Salman and Tathagata Raha and Clément Christophe and Praveen K Kanithi and Nasir Hayat and Ronnie Rajan and Shadab Khan},
      year={2024},
      eprint={2410.05046},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.05046}, 
}
 
"""