---
title: Sem-nCG
tags:
- evaluate
- metric
description: "Sem-nCG (Semantic Normalized Cumulative Gain) Metric evaluates the quality of predicted sentences
(abstractive/extractive) in relation to reference sentences and documents using Semantic Normalized Cumulative Gain
(NCG). It computes gain values and NCG scores based on cosine similarity between sentence embeddings, leveraging a
Sentence-BERT encoder. This metric is designed to assess the relevance and ranking of predicted sentences, making it
useful for tasks such as summarization and information retrieval."
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
authors:
- user: nbansal
---
# Metric Card for Sem-nCG
## Metric Description
The Sem-nCG (Semantic Normalized Cumulative Gain) metric evaluates system-generated summaries (`predictions`) by
comparing them with ground-truth reference summaries (`references`) and input documents (`documents`). It computes
Semantic Normalized Cumulative Gain (NCG) scores from sentence embeddings, assessing summary quality by measuring how
relevant the predicted sentences are to the reference and input document sentences.
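The ranking idea behind the metric can be illustrated with a minimal sketch. It assumes the gain of a document sentence is its cosine similarity to a target embedding, and it uses hand-written 3-dimensional vectors in place of real Sentence-BERT embeddings. The helper names `cosine` and `rank_by_similarity` are hypothetical and are not part of this metric's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_by_similarity(doc_embeddings, target_embedding):
    """Return (sentence position, gain) pairs sorted by descending gain."""
    gains = [
        (position, cosine(embedding, target_embedding))
        for position, embedding in enumerate(doc_embeddings)
    ]
    return sorted(gains, key=lambda item: item[1], reverse=True)

# Toy embeddings (assumed values, for illustration only).
doc_embeddings = [
    [1.0, 0.0, 0.0],  # document sentence 0
    [0.0, 1.0, 0.0],  # document sentence 1
    [1.0, 1.0, 0.0],  # document sentence 2
]
reference_embedding = [1.0, 0.2, 0.0]

ranked = rank_by_similarity(doc_embeddings, reference_embedding)
for position, gain in ranked:
    print(f"sentence {position}: gain {gain:.3f}")
```

In this toy setup, sentence 0 points in nearly the same direction as the reference embedding, so it is ranked first. The real metric obtains embeddings from the configured sentence encoder instead of fixed vectors.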
## How to Use
Before using this metric, you need to install the dependencies:
```bash
pip install -U evaluate sentence-transformers nltk
```
#### Python Usage
```python
from evaluate import load
predictions = [
    "This is a prediction1 sentence 1. This is a prediction1 sentence 2.",
    "This is a prediction2 sentence 1."
]
references = [
    "This is a reference1 sentence 1. This is a reference1 sentence 2.",
    "This is a reference2 sentence 1. This is a reference2 sentence 2."
]
documents = [
    "This is a document1 sentence 1. This is a document1 sentence 2. This is a document1 sentence 3.",
    "This is a document2 sentence 1. This is a document2 sentence 2."
]
model_name = "all-MiniLM-L6-v2"
metric = load("nbansal/semncg", model_name=model_name) # model_name is optional. Default=all-MiniLM-L6-v2
mean_score, scores = metric.compute(predictions=predictions, references=references, documents=documents)
print(f"Mean SemnCG: {mean_score}")
```
The first step is to initialize the metric with `metric = load("nbansal/semncg", model_name=model_name)`, where
`model_name` is the sentence embedding model. The default value is `all-MiniLM-L6-v2`.
To `compute` the Sem-nCG scores, you need to provide three mandatory arguments:
- `predictions` - List of predictions
- `references` - List of references
- `documents` - List of input documents
Sem-nCG also accepts several optional arguments:
- `tokenize_sentences (bool)`: Whether to tokenize the sentences in the input documents. Default: `True`.
- `pre_compute_embeddings (bool)`: Whether to pre-compute embeddings for all sentences. Default: `False`.
- `k (int)`: The rank threshold used for evaluating gains (typically the top-k sentences). Default: `3`.
- `gpu (Union[bool, str, int, List[Union[str, int]]])`: Whether to use GPU, CPU, or multiple processes for computation.
- `batch_size (int)`: Batch size for encoding. Default: `32`.
- `verbose (bool)`: Whether to produce verbose output. Default: `False`.
- `debug (bool)`: Whether to return detailed debug information, including ranked gains. Default: `False`.
For more detailed usage, print the metric's input description:
```python
import evaluate
metric = evaluate.load("nbansal/semncg")
print(metric.inputs_description)
```
### Output Values
The output is a tuple containing:
- Mean Sem-nCG score (`float`): The average Sem-nCG score.
- `scores` (`List[Union[float, RankedGains]]`): A list of Sem-nCG scores or `RankedGains` objects, one per document.
## Extensions
The current implementation supports any Hugging Face model that is compatible with SentenceTransformer, such as
`all-mpnet-base-v2` or `roberta-base`. You can add support for more models by extending the `Encoder` base class in
the `encoder_models.py` file.
## Deviations from Published Methodology
In our implementation, we expand upon the methodology presented in the original paper, which focused solely on
extractive model summaries. The primary approach in the paper involved ranking sentences in the source document based on
ground-truth reference sentences. The Normalized Cumulative Gain (NCG) score was computed using the formula:
$$\text{NCG} = \frac{\text{cumulative gain}}{\text{ideal cumulative gain}}$$
Key deviations in our implementation from the paper include:
1. **Inclusion of Abstractive Model Summaries:** Unlike the paper, which exclusively considered extractive model
summaries, our implementation supports both extractive and abstractive summarization models.
2. **Enhanced Calculation of NCG Scores:** For both extractive and abstractive summaries, we compute rankings from the
reference/ground truth (`gt_gains`) and from the predicted summaries (`pred_gains`). The NCG score is calculated using
the method shown below:
```python
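def compute_ncg(pred_gains, gt_gains, k: int) -> float:
    # Both inputs are expected to be lists of (position, gain) pairs sorted by descending gain.
    gt_dict = dict(gt_gains)
    # Ideal cumulative gain: ground-truth gains of the top-k ground-truth sentences.
    gt_rel = [v for _, v in gt_gains[:k]]
    # Cumulative gain: ground-truth gains of the top-k sentences in the predicted ranking.
    model_rel = [gt_dict[position] for position, _ in pred_gains[:k]]
    return sum(model_rel) / sum(gt_rel)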
```
This approach allows us to evaluate summarization quality across both extractive and abstractive methods, providing a
more comprehensive assessment than the original methodology.
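To make the computation concrete, here is a small worked example that applies the same `compute_ncg` logic to made-up gains. The positions and gain values are illustrative assumptions, not outputs of the metric. Each list holds `(sentence position, gain)` pairs that are assumed to be sorted by descending gain.

```python
def compute_ncg(pred_gains, gt_gains, k: int) -> float:
    """NCG@k: ground-truth gain collected by the predicted top-k over the ideal top-k gain."""
    gt_dict = dict(gt_gains)
    gt_rel = [v for _, v in gt_gains[:k]]
    model_rel = [gt_dict[position] for position, _ in pred_gains[:k]]
    return sum(model_rel) / sum(gt_rel)

# Hypothetical ranking of four document sentences against the reference summary.
gt_gains = [(2, 0.9), (0, 0.7), (1, 0.4), (3, 0.1)]
# Hypothetical ranking of the same sentences against the predicted summary.
pred_gains = [(0, 0.8), (2, 0.6), (3, 0.5), (1, 0.2)]

# k=2: the prediction's top-2 sentences {0, 2} match the ideal top-2 {2, 0}.
ncg_at_2 = compute_ncg(pred_gains, gt_gains, k=2)
# k=3: the prediction picks sentence 3 (gain 0.1) where the ideal ranking has sentence 1 (gain 0.4).
ncg_at_3 = compute_ncg(pred_gains, gt_gains, k=3)

print(f"NCG@2 = {ncg_at_2:.2f}")  # 1.00
print(f"NCG@3 = {ncg_at_3:.2f}")  # 0.85
```

Only the positions from `pred_gains` are used; the gains credited to the prediction always come from `gt_gains`. This means the score measures how much ground-truth relevance the predicted ranking recovers within the top `k` sentences.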
## Citation
```bibtex
@inproceedings{akter-etal-2022-revisiting,
title = "Revisiting Automatic Evaluation of Extractive Summarization Task: Can We Do Better than {ROUGE}?",
author = "Akter, Mousumi and
Bansal, Naman and
Karmaker, Shubhra Kanti",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-acl.122",
doi = "10.18653/v1/2022.findings-acl.122",
pages = "1547--1560",
abstract = "It has been the norm for a long time to evaluate automated summarization tasks using the popular ROUGE metric. Although several studies in the past have highlighted the limitations of ROUGE, researchers have struggled to reach a consensus on a better alternative until today. One major limitation of the traditional ROUGE metric is the lack of semantic understanding (relies on direct overlap of n-grams). In this paper, we exclusively focus on the extractive summarization task and propose a semantic-aware nCG (normalized cumulative gain)-based evaluation metric (called Sem-nCG) for evaluating this task. One fundamental contribution of the paper is that it demonstrates how we can generate more reliable semantic-aware ground truths for evaluating extractive summarization tasks without any additional human intervention. To the best of our knowledge, this work is the first of its kind. We have conducted extensive experiments with this new metric using the widely used CNN/DailyMail dataset. Experimental results show that the new Sem-nCG metric is indeed semantic-aware, shows higher correlation with human judgement (more reliable) and yields a large number of disagreements with the original ROUGE metric (suggesting that ROUGE often leads to inaccurate conclusions also verified by humans).",
}
```
## Further References
- [Paper](https://aclanthology.org/2022.findings-acl.122/)
- [Video](https://underline.io/lecture/50182-findings-revisiting-automatic-evaluation-of-extractive-summarization-task-can-we-do-better-than-rougequestion)