updated explanations
- app.py +19 -1
- constants.py +69 -0
app.py
CHANGED
@@ -12,6 +12,8 @@ from constants import (
     EVAL_FUNCTION_PROPERTIES,
     NER_TASK_EXPLAINER,
     PREDICTION_ADDITION_INSTRUCTION,
+    SPAN_BASED_METRICS_EXPLANATION,
+    TOKEN_BASED_METRICS_EXPLANATION,
 )
 from evaluation_metrics import EVALUATION_METRICS
 from predefined_example import EXAMPLES
@@ -35,7 +37,11 @@ def get_examples_attributes(selected_example):

 if __name__ == "__main__":
     st.set_page_config(layout="wide")
-    st.title(APP_TITLE)
+    # st.title(APP_TITLE)
+    st.markdown(
+        f"<h1 style='text-align: center; color: grey;'>{APP_TITLE}</h1>",
+        unsafe_allow_html=True,
+    )

     st.write(APP_INTRO)
     explanation_tab, comparision_tab = st.tabs(["📙 Explanation", "⚖️ Comparision"])
@@ -57,7 +63,19 @@ if __name__ == "__main__":
             "\n"
             f"{metric_names}"
         )
+        st.markdown(
+            "These metrics can be broadly classified as 'Span Based' and 'Token Based' metrics."
+        )
+        st.markdown("### Span Based Metrics")
+        st.markdown(SPAN_BASED_METRICS_EXPLANATION)
+
+        st.markdown("### Token Based Metrics")
+        st.markdown(TOKEN_BASED_METRICS_EXPLANATION)

+        st.divider()
+        st.markdown(
+            "Now that you have read the basics of the metrics calculation, head to the comparision section to try out some examples!"
+        )
     with comparision_tab:
         # with st.container():
         st.subheader("Ground Truth & Predictions")  # , divider='rainbow')

constants.py
CHANGED
@@ -15,8 +15,77 @@ Some basic properties of an evaluation function are -
 NER_TASK_EXPLAINER = """
 The output of the NER task can be represented in either token format or span format.
 """
+
+SPAN_BASED_METRICS_EXPLANATION = """
+Span based metrics use the offsets & labels of the NER spans to compare the ground truths and predictions. These are present in the NER span representation object, which looks like this:
+
+```
+span_ner_object = {"start_offset": 3, "end_offset": 5, "label": "label_name"}
+```
+
+Comparing the ground truth and predicted span objects, we get the following broad categories of cases (a detailed explanation can be found [here](https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/)).
+
+##### Comparison Categories
+
+| Category        | Explanation                                                              |
+| --------------- | ------------------------------------------------------------------------ |
+| Correct (COR)   | both are the same                                                        |
+| Incorrect (INC) | the output of a system and the golden annotation don't match             |
+| Partial (PAR)   | system and the golden annotation are somewhat "similar" but not the same |
+| Missing (MIS)   | a golden annotation is not captured by a system                          |
+| Spurious (SPU)  | system produces a response which doesn't exist in the golden annotation  |
+
+The specifics of this categorization are defined by the metric of choice. For example, in some cases we might want to consider a partial overlap of offsets correct, and in other cases incorrect.
+Based on this we have the Partial & Exact span based criteria. The categorization under each schema is shown below.
+
+| Ground Truth Entity | Ground Truth String | Pred Entity | Pred String         | Partial | Exact |
+| ------------------- | ------------------- | ----------- | ------------------- | ------- | ----- |
+| BRAND               | tikosyn             | -           | -                   | MIS     | MIS   |
+| -                   | -                   | BRAND       | healthy             | SPU     | SPU   |
+| DRUG                | warfarin            | DRUG        | of warfarin         | COR     | INC   |
+| DRUG                | propranolol         | BRAND       | propranolol         | INC     | INC   |
+| DRUG                | phenytoin           | DRUG        | phenytoin           | COR     | COR   |
+| GROUP               | contraceptives      | DRUG        | oral contraceptives | INC     | INC   |
+
+To compute precision, recall and f1-score from these cases,
+
+$$ Precision = TP / (TP + FP) = COR / (COR + INC + PAR + SPU) $$
+
+$$ Recall = TP / (TP + FN) = COR / (COR + INC + PAR + MIS) $$
+
+The f1-score is then computed as the harmonic mean of precision and recall.
+"""
+
+TOKEN_BASED_METRICS_EXPLANATION = """
+Token based metrics use the NER token based representation object, which tokenizes the input text and assigns a label to each token. This essentially transforms the evaluation/modelling task into a classification task.
+The token based representation object is shown below.
+
+```
+# Here, O represents the null label
+token_ner_object = [('My', O), ('name', O), ('is', O), ('John', NAME), ('.', O)]
+```
+Once we have the token objects for the ground truth and predictions, we compute a classification report comparing the labels.
+The final evaluation score is then calculated using the token metric of choice:
+
+###### Macro Average
+Calculates the metrics for each label, and finds their unweighted mean. This does not take label imbalance into account.
+
+###### Micro Average
+Calculates the metrics globally by counting the total true positives, false negatives and false positives.
+
+###### Weighted Average
+Calculates the metrics for each label, and finds their average weighted by support (the number of true instances for each label).
+This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
+"""
+
+
 ### COMPARISION TAB ###

+
 PREDICTION_ADDITION_INSTRUCTION = """
 Add predictions to the list of predictions on which the evaluation metric will be calculated.
 - Select the entity type/label name and then highlight the span in the text below.
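
To make the added SPAN_BASED_METRICS_EXPLANATION concrete, here is a minimal runnable sketch of the Partial and Exact schemas, assuming the `span_ner_object` format above. The helper names (`overlaps`, `categorize_spans`, `span_prf`) are hypothetical and not part of this app's code, and the matcher is deliberately simplified: greedy first-overlap matching, with PAR folded into COR/INC as in the table.

```
# Minimal sketch, not the app's implementation: categorize ground truth and
# predicted spans under the Partial/Exact schemas, then compute P/R/F1.

def overlaps(a, b):
    # Two spans overlap if they share at least one character offset.
    return a["start_offset"] < b["end_offset"] and b["start_offset"] < a["end_offset"]

def categorize_spans(truths, preds, schema="exact"):
    # Count COR/INC/PAR/MIS/SPU; PAR is kept for generality but this
    # simplified matcher folds partial overlaps into COR or INC.
    counts = {"COR": 0, "INC": 0, "PAR": 0, "MIS": 0, "SPU": 0}
    used = set()  # indices of predictions already matched to a gold span
    for t in truths:
        match = next((i for i, p in enumerate(preds)
                      if i not in used and overlaps(t, p)), None)
        if match is None:
            counts["MIS"] += 1  # gold span not captured by the system
            continue
        used.add(match)
        p = preds[match]
        same_offsets = (t["start_offset"], t["end_offset"]) == (p["start_offset"], p["end_offset"])
        if t["label"] != p["label"]:
            counts["INC"] += 1  # wrong label is incorrect under both schemas
        elif same_offsets:
            counts["COR"] += 1  # exact offsets and label: correct everywhere
        else:
            # Same label, overlapping but unequal offsets: COR under the
            # Partial schema ("of warfarin" vs "warfarin"), INC under Exact.
            counts["COR" if schema == "partial" else "INC"] += 1
    counts["SPU"] += len(preds) - len(used)  # predictions with no gold match
    return counts

def span_prf(c):
    # Precision = COR / (COR + INC + PAR + SPU), Recall = COR / (COR + INC + PAR + MIS)
    denom_p = c["COR"] + c["INC"] + c["PAR"] + c["SPU"]
    denom_r = c["COR"] + c["INC"] + c["PAR"] + c["MIS"]
    precision = c["COR"] / denom_p if denom_p else 0.0
    recall = c["COR"] / denom_r if denom_r else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

truths = [{"start_offset": 3, "end_offset": 11, "label": "DRUG"}]   # "warfarin"
preds = [{"start_offset": 0, "end_offset": 11, "label": "DRUG"}]    # "of warfarin"
print(span_prf(categorize_spans(truths, preds, schema="partial")))  # (1.0, 1.0, 1.0)
print(span_prf(categorize_spans(truths, preds, schema="exact")))    # (0.0, 0.0, 0.0)
```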
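Similarly, the token-based report described in TOKEN_BASED_METRICS_EXPLANATION can be sketched with scikit-learn standing in for whatever the app actually uses to build its classification report (an assumption; the toy tokens are invented).

```
# Sketch only: score token_ner_object-style label sequences with scikit-learn.
# The tokens below are made up; "O" is the null label.
from sklearn.metrics import classification_report, precision_recall_fscore_support

truth_tokens = [("My", "O"), ("name", "O"), ("is", "O"), ("John", "NAME"), (".", "O")]
pred_tokens = [("My", "O"), ("name", "O"), ("is", "NAME"), ("John", "NAME"), (".", "O")]

# The classification task compares the label assigned to each token.
y_true = [label for _, label in truth_tokens]
y_pred = [label for _, label in pred_tokens]

# Per-label precision/recall/f1 plus the averaged rows.
print(classification_report(y_true, y_pred, zero_division=0))

# Macro, micro and weighted averages, matching the definitions above.
for avg in ("macro", "micro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```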
|