metadata
title: metric_for_TP_FP_samples
datasets:
  - null
tags:
  - evaluate
  - metric
description: >-
  This metric is specially designed to measure the performance of sentence
  classification models over multiclass test datasets containing both True
  Positive samples, meaning that the label associated to the sentence in the
  sample is correctly assigned, and False Positive samples, meaning that the
  label associated to the sentence in the sample is incorrectly assigned.
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false

Metric Card for metric_for_TP_FP_samples

Metric Description

This metric is designed to measure the performance of sentence classification models on multiclass test datasets that contain both True Positive samples, in which the label associated with the sentence is correctly assigned, and False Positive samples, in which the label associated with the sentence is incorrectly assigned.

How to Use

In addition to the usual predictions and references inputs, this metric takes a kwarg named prediction_strategies (list(list(str, int(optional)))), which selects the prediction strategies to apply from the family of strategies the metric can handle.

Add predictions, references and prediction_strategies as follows:

    import evaluate

    metric = evaluate.load(metric_selector)  # metric_selector: path or Hub id of this metric
    metric.add_batch(predictions=predictions, references=references)
    results = metric.compute(prediction_strategies=prediction_strategies)

The minimum fields required by this metric for the test datasets are the following (not necessarily with these names):

  • title containing the first sentence to be compared with different queries representing each class.
  • label_ids containing the id of the class the sample refers to. Including samples of all the classes is advised.
  • nli_label which is '0' if the sample represents a True Positive or '2' if the sample represents a False Positive, meaning that the label_ids is incorrectly assigned to the title. Including both True Positive and False Positive samples for all classes is advised.

Example:

| title | label_ids | nli_label |
|---|---|---|
| 'Together we can save the arctic': celebrity advocacy and the Rio Earth Summit 2012 | 8 | 0 |
| Tuple-based semantic and structural mapping for a sustainable interoperability | 16 | 2 |
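As a rough illustration of how such rows could be turned into the two arrays the metric expects, here is a minimal sketch. The `rows` list, the choice of 17 classes (picked only so that class id 16 fits) and the random placeholder scores are assumptions for illustration, not part of this metric; in practice the scores would come from an NLI model.

```python
import numpy as np

# Hypothetical rows mirroring the example table: (title, label_ids, nli_label).
rows = [
    ("'Together we can save the arctic': celebrity advocacy and the Rio Earth Summit 2012", 8, 0),
    ("Tuple-based semantic and structural mapping for a sustainable interoperability", 16, 2),
]

# references: one row per sentence, columns are [label_ids, nli_label].
references = np.array(
    [[label_id, nli_label] for _, label_id, nli_label in rows],
    dtype=np.int32,
)

# predictions: one row per sentence, one column per class.
# Random placeholder scores stand in for real NLI entailment scores.
num_classes = 17  # assumption, chosen so that class id 16 is a valid column
rng = np.random.default_rng(0)
predictions = rng.random((len(rows), num_classes), dtype=np.float32)

print(references.shape, predictions.shape)
```

Both arrays share the same first dimension (one row per sentence), matching the shapes given in the Inputs section.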

Inputs

  • predictions (numpy.array(float32)[sentences to classify, number of classes]): numpy array with the softmax logits of the entailment dimension of the NLI inference on the sentences to be classified, for each class.

  • references (numpy.array(int32)[sentences to classify, 2]): numpy array with the reference label_ids and nli_label of the sentences to be classified, as given in the test dataset.

  • kwarg named prediction_strategies = list(list(str, int(optional))), each inner list(str, int(optional)) describing a desired prediction strategy. The prediction strategies implemented in this metric are:

    • argmax, which takes the highest value of the softmax inference logits to select the prediction. Syntax: ["argmax_max"]
    • threshold, which takes all softmax inference logits above a certain value to select the predictions. Syntax: ["threshold", desired value]
    • topk, which takes the highest k softmax inference logits to select the predictions. Syntax: ["topk", desired value]

    Example:

    prediction_strategies = [['argmax_max'], ['threshold', 0.5], ['topk', 3]]
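To make the three strategies concrete, the following is a minimal NumPy sketch of how each one could select classes from a score matrix. The helper `select_classes` and the example scores are hypothetical; this is not the metric's actual implementation, only an illustration of the selection rules described above.

```python
import numpy as np

def select_classes(scores, strategy):
    """Return a boolean mask [sentences, classes] marking the selected classes.

    Hypothetical helper: `strategy` follows the syntax described above,
    e.g. ["argmax_max"], ["threshold", 0.5] or ["topk", 3].
    """
    name = strategy[0]
    mask = np.zeros(scores.shape, dtype=bool)
    if name == "argmax_max":
        # Keep only the single highest-scoring class per sentence.
        mask[np.arange(scores.shape[0]), scores.argmax(axis=1)] = True
        return mask
    if name == "threshold":
        # Keep every class whose score is above the given value.
        return scores > strategy[1]
    if name == "topk":
        # Keep the k highest-scoring classes per sentence.
        top = np.argsort(scores, axis=1)[:, -strategy[1]:]
        np.put_along_axis(mask, top, True, axis=1)
        return mask
    raise ValueError(f"unknown prediction strategy: {name}")

# Made-up scores for two sentences and three classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.55, 0.3]], dtype=np.float32)

argmax_mask = select_classes(scores, ["argmax_max"])
threshold_mask = select_classes(scores, ["threshold", 0.5])
topk_mask = select_classes(scores, ["topk", 2])
```

With these scores, argmax selects one class per sentence, the 0.5 threshold selects one class for the first sentence and two for the second, and topk with k = 2 always selects exactly two.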

Output Values

  • dict with the names of the prediction_strategies used as keys and, as values, a pandas.DataFrame with a detailed table of metrics, including recall, precision, f1-score and accuracy of the predictions for each class, plus overall micro and macro averages.
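For readers unfamiliar with the difference between micro and macro averages, here is a small sketch that builds a comparable table from per-class counts. The counts, the class names and the `prf` helper are invented for illustration; how this metric derives its counts from True Positive and False Positive samples is not shown here, and accuracy is left out because it would also need true-negative counts.

```python
import pandas as pd

# Hypothetical per-class counts of true positives, false positives, false negatives.
counts = pd.DataFrame(
    {"tp": [8, 2], "fp": [2, 2], "fn": [2, 6]},
    index=["class_0", "class_1"],
)

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1-score": f1}

# One row of metrics per class.
per_class = pd.DataFrame.from_dict(
    {name: prf(row.tp, row.fp, row.fn) for name, row in counts.iterrows()},
    orient="index",
)

# Macro average: unweighted mean of the per-class metrics.
macro = per_class.mean()
# Micro average: metrics computed once from the counts summed over all classes.
micro = pd.Series(prf(*counts.sum()))

table = pd.concat([per_class, pd.DataFrame({"macro avg": macro, "micro avg": micro}).T])
print(table)
```

The macro average treats every class equally, while the micro average is dominated by the classes with the most samples, which is why the two can differ noticeably on unbalanced test sets.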

Citation

BibLaTeX

@online{TP_FP_metric,
  author = {Gorka Artola},
  title = {Metric for True Positive and False Positive Samples},
  year = 2022,
  url = {https://huggingface.co/spaces/gorkaartola/metric_for_tp_fp_samples},
  urldate = {2022-08-11}
}