metadata

title: SemnCG
datasets:
  - null
tags:
  - evaluate
  - metric
description: >-
  Sem-nCG (Semantic Normalized Cumulative Gain) Metric evaluates the quality of
  predicted sentences (abstractive/extractive) in relation to reference
  sentences and documents using Semantic Normalized Cumulative Gain (NCG). It
  computes gain values and NCG scores based on cosine similarity between
  sentence embeddings, leveraging a Sentence-BERT encoder. This metric is
  designed to assess the relevance and ranking of predicted sentences, making
  it useful for tasks such as summarization and information retrieval.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false

Metric Card for Sem-nCG

Metric Description

The Sem-nCG (Semantic Normalized Cumulative Gain) metric evaluates system-generated summaries (predictions) against ground-truth reference summaries (references) and the input documents (documents). It embeds sentences with a Sentence-BERT encoder, uses the cosine similarity between embeddings to measure how relevant each predicted sentence is to the reference and document sentences, and reports the result as a normalized cumulative gain (NCG) score.
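The scoring idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code behind this metric: the function names (cosine, rank_by_similarity, sem_ncg_at_k) are hypothetical, the embeddings are hand-written toy vectors standing in for sentence-encoder output, each document sentence is scored by its best cosine match in the summary, and gains are assumed to decrease linearly with rank.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(doc_embs, summary_embs):
    """Return document sentence indices, most similar to the summary first."""
    # Assumption: a document sentence is scored by its best match in the summary.
    scores = [max(cosine(d, s) for s in summary_embs) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)


def sem_ncg_at_k(doc_embs, ref_embs, pred_embs, k=3):
    """Normalized cumulative gain of the prediction's top-k document sentences."""
    n = len(doc_embs)
    k = min(k, n)
    # Ground-truth ranking: how relevant each document sentence is to the reference.
    ref_rank = rank_by_similarity(doc_embs, ref_embs)
    # Assumed linear gains: the top-ranked sentence gets n, the last gets 1.
    gain = {idx: n - pos for pos, idx in enumerate(ref_rank)}
    # Document sentences that the prediction is closest to.
    pred_top = rank_by_similarity(doc_embs, pred_embs)[:k]
    cumulative_gain = sum(gain[i] for i in pred_top)
    ideal_gain = sum(n - pos for pos in range(k))
    return cumulative_gain / ideal_gain


# Toy 3-dimensional "embeddings" (made up for illustration).
doc_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
ref_embs = [[1.0, 0.1, 0.0]]
pred_good = [[0.9, 0.2, 0.0]]  # points the same way as the reference
pred_bad = [[0.0, 0.1, 1.0]]   # points at a sentence the reference ignores

print(sem_ncg_at_k(doc_embs, ref_embs, pred_good, k=2))            # 1.0
print(round(sem_ncg_at_k(doc_embs, ref_embs, pred_bad, k=2), 3))   # 0.429
```

The prediction that selects the same document sentences as the reference reaches the ideal gain, while the one that points elsewhere is penalized in proportion to how low its sentences sit in the reference ranking.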

How to Use

Before using this metric, you need to install the dependencies:

pip install -U sentence-transformers nltk

Sem-nCG takes three mandatory arguments:

predictions - List of predictions
references - List of references
documents - List of input documents

from evaluate import load
predictions = [
    "This is a prediction1 sentence 1. This is a prediction1 sentence 2.", 
    "This is a prediction2 sentence 1."
]
references = [
    "This is a reference1 sentence 1. This is a reference1 sentence 2.",
    "This is a reference2 sentence 1. This is a reference2 sentence 2."
]
documents = [
    "This is a document1 sentence 1. This is a document1 sentence 2. This is a document1 sentence 3.",
    "This is a document2 sentence 1. This is a document2 sentence 2."
]
model_name = "all-MiniLM-L6-v2"
metric = load("nbansal/semncg", model_name=model_name)  # model_name is optional. Default=all-MiniLM-L6-v2
mean_score, scores = metric.compute(predictions=predictions, references=references, documents=documents)
print(f"Mean SemnCG: {mean_score}")

Sem-nCG also accepts several optional arguments:

tokenize_sentences (bool): Flag to indicate whether to tokenize the sentences in the input documents. Default is True.
pre_compute_embeddings (bool): Flag to indicate whether to pre-compute embeddings for all sentences. Default is False.
k (int): The rank threshold used for evaluating gains (typically top-k sentences). Default is 3.
gpu (Union[bool, str, int, List[Union[str, int]]]): Whether to use GPU, CPU, or multiple processes for computation.
batch_size (int): Batch size for encoding. Default is 32.
verbose (bool): Flag to indicate verbose output. Default is False.
debug (bool): Flag to return detailed debug information including ranked gains. Default is False.

For more detailed usage, refer to the inputs description, which you can print as follows:

import evaluate
metric = evaluate.load("nbansal/semncg")
print(metric.inputs_description)

Output Values

The output is a tuple containing:

Mean Sem-nCG score (float): The average Sem-nCG score.
scores (List[Union[float, RankedGains]]): List of Sem-nCG scores or RankedGains objects for each document.
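As a small sketch of how the tuple described above might be consumed, the helper below unpacks it and finds the lowest-scoring input. The helper name summarize_scores is hypothetical, the values are made up rather than real metric output, and the sketch assumes the scores are plain floats rather than RankedGains objects.

```python
def summarize_scores(result):
    """Unpack a (mean_score, scores) tuple and locate the weakest example."""
    mean_score, scores = result
    # Index of the input with the lowest score (assumes plain float scores).
    worst_index = min(range(len(scores)), key=lambda i: scores[i])
    return {
        "mean": mean_score,
        "worst_index": worst_index,
        "worst_score": scores[worst_index],
    }


# Made-up values shaped like the output tuple; not real metric results.
example_result = (0.75, [0.9, 0.6])
summary = summarize_scores(example_result)
print(summary["worst_index"], summary["worst_score"])  # 1 0.6
```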

Extensions

The current implementation supports any Hugging Face model that is compatible with SentenceTransformer, such as all-mpnet-base-v2 or roberta-base. To support other kinds of models, extend the Encoder base class in the encoder_models.py file.

Deviations from Published Methodology

Citation

@inproceedings{akter-etal-2022-revisiting,
    title = "Revisiting Automatic Evaluation of Extractive Summarization Task: Can We Do Better than {ROUGE}?",
    author = "Akter, Mousumi  and
      Bansal, Naman  and
      Karmaker, Shubhra Kanti",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-acl.122",
    doi = "10.18653/v1/2022.findings-acl.122",
    pages = "1547--1560",
    abstract = "It has been the norm for a long time to evaluate automated summarization tasks using the popular ROUGE metric. Although several studies in the past have highlighted the limitations of ROUGE, researchers have struggled to reach a consensus on a better alternative until today. One major limitation of the traditional ROUGE metric is the lack of semantic understanding (relies on direct overlap of n-grams). In this paper, we exclusively focus on the extractive summarization task and propose a semantic-aware nCG (normalized cumulative gain)-based evaluation metric (called Sem-nCG) for evaluating this task. One fundamental contribution of the paper is that it demonstrates how we can generate more reliable semantic-aware ground truths for evaluating extractive summarization tasks without any additional human intervention. To the best of our knowledge, this work is the first of its kind. We have conducted extensive experiments with this new metric using the widely used CNN/DailyMail dataset. Experimental results show that the new Sem-nCG metric is indeed semantic-aware, shows higher correlation with human judgement (more reliable) and yields a large number of disagreements with the original ROUGE metric (suggesting that ROUGE often leads to inaccurate conclusions also verified by humans).",
}

Further References

Paper
Video