APP_TITLE = "📐 NER Metrics Comparison ⚖️"
APP_INTRO = "The NER task is performed over a piece of text and involves recognizing entities belonging to a desired entity set and classifying them. The various metrics are explained in the explanation tab. Once you go through them, head to the comparison tab to test out some examples."
### EXPLANATION TAB ###
EVAL_FUNCTION_INTRO = "An evaluation function tells us how well a model is performing. At its core, any evaluation function compares the model's output with the ground truth and produces a score of correctness."
EVAL_FUNCTION_PROPERTIES = """
Some basic properties of an evaluation function are:
1. Give an output score equal to the upper bound when the prediction is completely correct (in some tasks, multiple variations of a prediction can be considered correct).
2. Give an output score equal to the lower bound when the prediction is completely wrong.
3. Give an output score between the upper and lower bounds in other cases, corresponding to the degree of correctness.
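As an illustrative sketch of these properties (a toy example, not part of this app), a scoring function based on token overlap could look like this:
```
# Toy example only: a score in [0, 1] based on token overlap.
def toy_overlap_score(prediction, ground_truth):
    pred_tokens = set(prediction.lower().split())
    gold_tokens = set(ground_truth.lower().split())
    if pred_tokens == gold_tokens:
        return 1.0  # completely correct -> upper bound
    if not pred_tokens & gold_tokens:
        return 0.0  # completely wrong -> lower bound
    # partial overlap -> score strictly between the bounds
    return len(pred_tokens & gold_tokens) / len(pred_tokens | gold_tokens)
```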
""" | |
NER_TASK_EXPLAINER = """ | |
The output of the NER task can be represented in either token format or span format. | |
""" | |
SPAN_BASED_METRICS_EXPLANATION = """
Span based metrics use the offsets & labels of the NER spans to compare the ground truths and predictions. These are present in the NER span representation object, which looks like this:
```
span_ner_object = {"start_offset": 3, "end_offset": 5, "label": "label_name"}
```
Comparing the ground truth and predicted span objects, we get the following broad categories of cases (a detailed explanation can be found [here](https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/)):
##### Comparison Cases
| Category        | Explanation                                                              |
| --------------- | ------------------------------------------------------------------------ |
| Correct (COR)   | both are the same                                                        |
| Incorrect (INC) | the output of a system and the golden annotation don’t match             |
| Partial (PAR)   | system and the golden annotation are somewhat “similar” but not the same |
| Missing (MIS)   | a golden annotation is not captured by a system                          |
| Spurious (SPU)  | system produces a response which doesn’t exist in the golden annotation  |
The specifics of this categorization are defined based on our metric of choice. For example, in some cases we might want to consider a partial overlap of offsets correct, and in other cases incorrect.
Based on this we have the Partial & Exact span based criteria. The categorization under these two schemas is shown below:
| Ground Truth Entity | Ground Truth String | Pred Entity | Pred String         | Partial | Exact |
| ------------------- | ------------------- | ----------- | ------------------- | ------- | ----- |
| BRAND               | tikosyn             | -           | -                   | MIS     | MIS   |
| -                   | -                   | BRAND       | healthy             | SPU     | SPU   |
| DRUG                | warfarin            | DRUG        | of warfarin         | COR     | INC   |
| DRUG                | propranolol         | BRAND       | propranolol         | INC     | INC   |
| DRUG                | phenytoin           | DRUG        | phenytoin           | COR     | COR   |
| GROUP               | contraceptives      | DRUG        | oral contraceptives | INC     | INC   |
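One way the pairwise categorization in the table above could be expressed in code is sketched below. This is illustrative, not necessarily how this app implements it, and it assumes the ground truth and predicted spans have already been paired (unmatched spans become MIS or SPU).
```
# Illustrative sketch: categorize one already-paired ground truth / predicted
# span under the Exact or Partial criteria shown in the table above.
def categorize_pair(gold, pred, criteria="exact"):
    if gold is None:
        return "SPU"  # prediction with no corresponding annotation
    if pred is None:
        return "MIS"  # annotation not captured by the system
    same_label = gold["label"] == pred["label"]
    same_offsets = (gold["start_offset"] == pred["start_offset"]
                    and gold["end_offset"] == pred["end_offset"])
    overlaps = (gold["start_offset"] <= pred["end_offset"]
                and pred["start_offset"] <= gold["end_offset"])
    if criteria == "exact":
        return "COR" if (same_label and same_offsets) else "INC"
    # partial criteria: any offset overlap with the right label counts as correct
    return "COR" if (same_label and overlaps) else "INC"
```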
To compute precision, recall and F1-score from these cases:
$$ Precision = TP / (TP + FP) = COR / (COR + INC + PAR + SPU) $$
$$ Recall = TP / (TP + FN) = COR / (COR + INC + PAR + MIS) $$
The F1-score is then computed as the harmonic mean of precision and recall:
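$$ F1 = 2 * Precision * Recall / (Precision + Recall) $$
As an illustrative sketch (not necessarily how this app implements it), the three scores can be computed from the category counts like so:
```
# Illustrative sketch: span based precision, recall and F1 from category counts.
def span_metrics(cor, inc, par, mis, spu):
    possible = cor + inc + par + mis   # entities in the ground truth
    actual = cor + inc + par + spu     # entities produced by the system
    precision = cor / actual if actual else 0.0
    recall = cor / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# The Partial column of the table above gives COR=2, INC=2, PAR=0, MIS=1, SPU=1,
# which yields precision = recall = F1 = 0.4.
```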
""" | |
TOKEN_BASED_METRICS_EXPLANATION = """ | |
Token based metrics use the NER token based representation object, which tokenizes the input text and assigns a label to each token. This essentially transforms the evaluation/modelling task into a classification task.
The token based representation object is shown below:
```
# Here, 'O' represents the null label
token_ner_object = [('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'NAME'), ('.', 'O')]
```
Once we have the token objects for the ground truth and the predictions, we compute a classification report comparing the labels.
The final evaluation score is calculated based on the token averaging metric of choice.
###### Macro Average
Calculates the metrics for each label, and finds their unweighted mean. This does not take label imbalance into account.
###### Micro Average
Calculates the metrics globally by counting the total true positives, false negatives and false positives.
###### Weighted Average
Calculates the metrics for each label, and finds their average weighted by support (the number of true instances for each label).
This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
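For example, with the ground truth and predicted token labels flattened into two aligned lists, these averages can be computed with scikit-learn (the library choice and label names here are illustrative, not necessarily what this app uses):
```
# Illustrative example: token level averages via scikit-learn.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["O", "O", "O", "NAME", "O"]     # ground truth token labels
y_pred = ["O", "NAME", "O", "NAME", "O"]  # predicted token labels

for avg in ("macro", "micro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(avg, round(p, 2), round(r, 2), round(f1, 2))
```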
""" | |
### COMPARISION TAB ### | |
PREDICTION_ADDITION_INSTRUCTION = """
Add predictions to the list of predictions on which the evaluation metric will be calculated.
- Select the entity type/label name and then highlight the span in the text below.
- To remove a span, double-click on the highlighted text.
- Once you have your desired prediction, click on the 'Add' button. (The prediction created is shown as a JSON below.)
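For illustration, a prediction entry might look like the span representation from the explanation tab (the values below are made up):
```
{"start_offset": 23, "end_offset": 31, "label": "DRUG"}
```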
""" | |