APP_TITLE = "📐 NER Metrics Comparison ⚖️" APP_INTRO = "The NER task is performed over a piece of text and involves recognition of entities belonging to a desired entity set and classifying them. The various metrics are explained in the explanation tab. Once you go through them, head to the comparision tab to test out some examples." ### EXPLANATION TAB ### EVAL_FUNCTION_INTRO = "An evaluation function tells us how well a model is performing. The basic working of any evaluation function involves comparing the model's output with the ground truth to give a score of correctness." EVAL_FUNCTION_PROPERTIES = """ Some basic properties of an evaluation function are - 1. Give an output score equivalent to the upper bound when the prediction is completely correct(in some tasks, multiple variations of a predictions can be considered correct) 2. Give an output score equivalent to the lower bound when the prediction is completely wrong. 3. GIve an output score between upper and lower bound in other cases, corresponding to the degree of correctness. """ NER_TASK_EXPLAINER = """ The output of the NER task can be represented in either token format or span format. """ SPAN_BASED_METRICS_EXPLANATION = """ Span based metrics use the offsets & labels of the NER spans to compare the ground truths and predictions. These are present in the NER Span representation object, which looks like this ``` span_ner_object = {"start_offset": 3, "end_offset":5, "label":"label_name"} ``` Comparing the ground truth and predicted span objects we get the following broad categories of cases (detailed explanation can be found [here](https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/)) ##### Comparison Cases | Category | Explanation | | --------------- | --------------------------------------------------------------------------------------- | | Correct (COR) | both are the same | | Incorrect (INC) | the output of a system and the golden annotation don’t match | | Partial (PAR) | system and the golden annotation are somewhat “similar” but not the same | | Missing (MIS) | a golden annotation is not captured by a system | | Spurious (SPU) | system produces a response which doesn’t exist in the golden annotation (Hallucination) | The specifics of this categorization is defined based on our metric of choice. For example, in some cases we might want to consider a partial overlap of offsets correct and in other cases incorrect. Based on this we have the Partial & Exact span based criterias. These categorization of these schemas are shown below | Ground Truth Entity | Ground Truth String | Pred Entity | Pred String | Partial | Exact | | ------------------- | ------------------- | ----------- | ------------------- | ------- | ----- | | BRAND | tikosyn | - | - | MIS | MIS | | - | - | BRAND | healthy | SPU | SPU | | DRUG | warfarin | DRUG | of warfarin | COR | INC | | DRUG | propranolol | BRAND | propranolol | INC | INC | | DRUG | phenytoin | DRUG | phenytoin | COR | COR | | GROUP | contraceptives | DRUG | oral contraceptives | INC | INC | To compute precision, recall and f1-score from these cases, $$ Precision = TP / (TP + FP) = COR / (COR + INC + PAR + SPU) $$ $$ Recall = TP / (TP+FN) = COR / (COR + INC + PAR + MIS) $$ This f1-score is then computed using the harmonic mean of precision and recall. """ TOKEN_BASED_METRICS_EXPLANATION = """ Token based metrics use the NER token based representation object, which tokenized the input text and assigns a label to each of the token. 
"""

TOKEN_BASED_METRICS_EXPLANATION = """
Token based metrics use the NER token based representation object, which tokenizes the input text and assigns a label to each token.
This essentially transforms the evaluation/modelling task into a classification task. The token based representation object is shown below
```
# Here, O represents the null label
token_ner_object = [('My', O), ('name', O), ('is', O), ('John', NAME), ('.', O)]
```
Once we have the token objects for the ground truth and predictions, we compute a classification report comparing the labels. The final evaluation score is calculated using the averaging scheme of choice, described below.

###### Macro Average
Calculates the metrics for each label, and finds their unweighted mean. This does not take label imbalance into account.

###### Micro Average
Calculates the metrics globally by counting the total true positives, false negatives and false positives.

###### Weighted Average
Calculates the metrics for each label, and finds their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
"""

### COMPARISON TAB ###
PREDICTION_ADDITION_INSTRUCTION = """
Add predictions to the list of predictions on which the evaluation metric will be calculated.
- Select the entity type/label name and then highlight the span in the text below.
- To remove a span, double click on the highlighted text.
- Once you have your desired prediction, click on the 'Add' button. (The prediction created is shown in a JSON below.)
"""
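
# Purely illustrative sketch, not used by the app: how the macro / micro / weighted averages
# described above could be computed over token-level labels. Assumes scikit-learn is installed;
# the labels below are just the example token object flattened by hand.
if __name__ == "__main__":
    from sklearn.metrics import precision_recall_fscore_support

    y_true = ["O", "O", "O", "NAME", "O"]     # ground-truth token labels
    y_pred = ["O", "NAME", "O", "NAME", "O"]  # predicted token labels

    for avg in ("macro", "micro", "weighted"):
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average=avg, zero_division=0
        )
        print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")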