nbansal commited on
Commit
a54024a
1 Parent(s): 668c6f3

Added SemNCG metric

Browse files
Files changed (9) hide show
  1. .gitignore +1 -0
  2. README.md +85 -20
  3. __init__.py +0 -0
  4. encoder_models.py +129 -0
  5. requirements.txt +3 -1
  6. semncg.py +475 -45
  7. tests.py +418 -17
  8. type_aliases.py +11 -0
  9. utils.py +280 -0
.gitignore ADDED
@@ -0,0 +1 @@
1
+ __pycache__/
README.md CHANGED
@@ -5,46 +5,111 @@ datasets:
5
  tags:
6
  - evaluate
7
  - metric
8
- description: "TODO: add a description here"
9
  sdk: gradio
10
  sdk_version: 3.19.1
11
  app_file: app.py
12
  pinned: false
13
  ---
14
 
15
- # Metric Card for SemnCG
16
-
17
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
18
 
19
  ## Metric Description
20
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
21
 
22
  ## How to Use
23
- *Give general statement of how to use the metric*
24
 
25
- *Provide simplest possible example for using the metric*
26
 
27
- ### Inputs
28
- *List all input arguments in the format below*
29
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
30
 
31
  ### Output Values
32
 
33
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
34
 
35
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
 
36
 
37
- #### Values from Popular Papers
38
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
39
 
40
- ### Examples
41
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
42
 
43
- ## Limitations and Bias
44
- *Note any known limitations or biases that the metric has, with links and references if possible.*
45
 
46
  ## Citation
47
- *Cite the source where this metric was introduced.*
48
 
49
  ## Further References
50
- *Add any useful further references.*
5
  tags:
6
  - evaluate
7
  - metric
8
+ description: "Sem-nCG (Semantic Normalized Cumulative Gain) Metric evaluates the quality of predicted sentences
9
+ (abstractive/extractive) in relation to reference sentences and documents using Semantic Normalized Cumulative Gain
10
+ (NCG). It computes gain values and NCG scores based on cosine similarity between sentence embeddings, leveraging a
11
+ Sentence-BERT encoder. This metric is designed to assess the relevance and ranking of predicted sentences, making it
12
+ useful for tasks such as summarization and information retrieval."
13
  sdk: gradio
14
  sdk_version: 3.19.1
15
  app_file: app.py
16
  pinned: false
17
  ---
18
 
19
+ # Metric Card for Sem-nCG
20
 
21
  ## Metric Description
22
+ The Sem-nCG (Semantic Normalized Cumulative Gain) metric evaluates system-generated summaries (`predictions`) by comparing
23
+ them with ground truth reference summaries (`references`) and input documents (`documents`). It computes the Semantic
24
+ Normalized Cumulative Gain (NCG) scores based on sentence embeddings, which assess the quality of summaries by
25
+ evaluating the relevance of predicted sentences to the reference and input document sentences.
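+ Concretely, the implementation ranks the sentences of each input document by their mean cosine similarity to the
+ reference sentences (the ideal ranking) and to the predicted sentences (the model ranking), converts the ideal ranking
+ into per-sentence gain values, and then scores a prediction as
+
+ $$\text{Sem-nCG@}k = \frac{\sum_{s \in \text{top-}k \text{ (model ranking)}} g(s)}{\sum_{s \in \text{top-}k \text{ (ideal ranking)}} g(s)}$$
+
+ where $g(s)$ is the gain assigned to document sentence $s$ under the ideal ranking.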
26
+
27
 
28
  ## How to Use
 
29
 
30
+ Before using this metric, you need to install the dependencies:
31
+ ```bash
32
+ pip install -U sentence-transformers nltk
33
+ ```
34
+
35
+ Sem-nCG takes three mandatory arguments:
36
+ - `predictions` - List of predictions
37
+ - `references` - List of references
38
+ - `documents` - List of input documents
39
+
40
+ ```python
41
+ from evaluate import load
42
+ predictions = [
43
+ "This is a prediction1 sentence 1. This is a prediction1 sentence 2.",
44
+ "This is a prediction2 sentence 1."
45
+ ]
46
+ references = [
47
+ "This is a reference1 sentence 1. This is a reference1 sentence 2.",
48
+ "This is a reference2 sentence 1. This is a reference2 sentence 2."
49
+ ]
50
+ documents = [
51
+ "This is a document1 sentence 1. This is a document1 sentence 2. This is a document1 sentence 3.",
52
+ "This is a document2 sentence 1. This is a document2 sentence 2."
53
+ ]
54
+ model_name = "all-MiniLM-L6-v2"
55
+ metric = load("nbansal/semncg", model_name=model_name) # model_name is optional. Default=all-MiniLM-L6-v2
56
+ mean_score, scores = metric.compute(predictions=predictions, references=references, documents=documents)
57
+ print(f"Mean SemnCG: {mean_score}")
58
+ ```
59
+
60
+ Sem-nCG also accepts several optional arguments (a usage sketch follows this list):
+ - `tokenize_sentences (bool)`: Whether to sentence-tokenize the inputs. Default is True.
+ - `pre_compute_embeddings (bool)`: Whether to pre-compute embeddings for all sentences at once (faster, but uses more memory). Default is False.
+ - `k (int)`: The rank threshold used for evaluating gains (typically top-k sentences). Default is 3.
+ - `gpu (Union[bool, str, int, List[Union[str, int]]])`: Whether to use GPU, CPU, or multiple processes for computation. Default is False (CPU).
+ - `batch_size (int)`: Batch size for encoding. Default is 32.
+ - `verbose (bool)`: Whether to print verbose output. Default is False.
+ - `debug (bool)`: Whether to return detailed debug information, including ranked gains. Default is False.
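+
+ For example, here is a minimal sketch with illustrative toy inputs that passes pre-tokenized sentences and requests
+ debug output (it reuses the `metric` object loaded above):
+ ```python
+ predictions = [["This is a prediction1 sentence 1.", "This is a prediction1 sentence 2."]]
+ references = [["This is a reference1 sentence 1.", "This is a reference1 sentence 2."]]
+ documents = [["This is a document1 sentence 1.", "This is a document1 sentence 2."]]
+ mean_score, ranked_gains = metric.compute(
+     predictions=predictions,
+     references=references,
+     documents=documents,
+     tokenize_sentences=False,  # inputs are already split into sentences
+     k=2,
+     debug=True,  # each entry of ranked_gains is a RankedGains object
+ )
+ print(f"Mean SemnCG: {mean_score}")
+ ```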
68
 
69
+ For more detailed usage information, refer to the metric's inputs description:
70
+ ```python
71
+ import evaluate
72
+ metric = evaluate.load("nbansal/semncg")
73
+ print(metric.inputs_description)
74
+ ```
75
 
76
  ### Output Values
77
 
78
+ The output is a tuple containing:
+ - `mean_score` *(float)*: The average Sem-nCG score over all input documents. Scores range from 0 to 1; higher is better.
+ - `scores` *(List[Union[float, RankedGains]])*: The per-document Sem-nCG scores, or `RankedGains` objects when `debug=True`.
82
83
 
84
+ ## Extensions
85
+ The current implementation supports any Hugging Face or Sentence-Transformers model that can be loaded with the
+ `SentenceTransformer` class, such as `all-mpnet-base-v2` or `roberta-base`. To support additional encoders, extend the
+ `Encoder` base class in `encoder_models.py`, as sketched below.
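+
+ A minimal sketch of such an extension (the `MyEncoder` class and its `embed` call are purely illustrative and not part
+ of this module):
+ ```python
+ from typing import List
+
+ import numpy as np
+ from numpy.typing import NDArray
+
+ from encoder_models import Encoder  # exact import path depends on how the module is installed/loaded
+
+
+ class MyEncoder(Encoder):
+     def __init__(self, model):
+         self.model = model  # any embedding model you want to wrap
+
+     def encode(self, prediction: List[str]) -> NDArray:
+         # Must return an array of shape (num_sentences, embedding_dim)
+         return np.asarray(self.model.embed(prediction))
+ ```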
88
 
89
+ ## Deviations from Published Methodology
 
90
 
91
  ## Citation
92
+ ```bibtex
93
+ @inproceedings{akter-etal-2022-revisiting,
94
+ title = "Revisiting Automatic Evaluation of Extractive Summarization Task: Can We Do Better than {ROUGE}?",
95
+ author = "Akter, Mousumi and
96
+ Bansal, Naman and
97
+ Karmaker, Shubhra Kanti",
98
+ editor = "Muresan, Smaranda and
99
+ Nakov, Preslav and
100
+ Villavicencio, Aline",
101
+ booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
102
+ month = may,
103
+ year = "2022",
104
+ address = "Dublin, Ireland",
105
+ publisher = "Association for Computational Linguistics",
106
+ url = "https://aclanthology.org/2022.findings-acl.122",
107
+ doi = "10.18653/v1/2022.findings-acl.122",
108
+ pages = "1547--1560",
109
+ abstract = "It has been the norm for a long time to evaluate automated summarization tasks using the popular ROUGE metric. Although several studies in the past have highlighted the limitations of ROUGE, researchers have struggled to reach a consensus on a better alternative until today. One major limitation of the traditional ROUGE metric is the lack of semantic understanding (relies on direct overlap of n-grams). In this paper, we exclusively focus on the extractive summarization task and propose a semantic-aware nCG (normalized cumulative gain)-based evaluation metric (called Sem-nCG) for evaluating this task. One fundamental contribution of the paper is that it demonstrates how we can generate more reliable semantic-aware ground truths for evaluating extractive summarization tasks without any additional human intervention. To the best of our knowledge, this work is the first of its kind. We have conducted extensive experiments with this new metric using the widely used CNN/DailyMail dataset. Experimental results show that the new Sem-nCG metric is indeed semantic-aware, shows higher correlation with human judgement (more reliable) and yields a large number of disagreements with the original ROUGE metric (suggesting that ROUGE often leads to inaccurate conclusions also verified by humans).",
110
+ }
111
+ ```
112
 
113
  ## Further References
114
+ - [Paper](https://aclanthology.org/2022.findings-acl.122/)
115
+ - [Video](https://underline.io/lecture/50182-findings-revisiting-automatic-evaluation-of-extractive-summarization-task-can-we-do-better-than-rougequestion)
__init__.py ADDED
File without changes
encoder_models.py ADDED
@@ -0,0 +1,129 @@
1
+ import abc
2
+ from typing import List, Union
3
+
4
+ from numpy.typing import NDArray
5
+ from sentence_transformers import SentenceTransformer
6
+
7
+ from .type_aliases import ENCODER_DEVICE_TYPE
8
+
9
+
10
+ class Encoder(abc.ABC):
11
+ @abc.abstractmethod
12
+ def encode(self, prediction: List[str]) -> NDArray:
13
+ """
14
+ Abstract method to encode a list of sentences into sentence embeddings.
15
+
16
+ Args:
17
+ prediction (List[str]): List of sentences to encode.
18
+
19
+ Returns:
20
+ NDArray: Array of sentence embeddings with shape (num_sentences, embedding_dim).
21
+
22
+ Raises:
23
+ NotImplementedError: If the method is not implemented in the subclass.
24
+ """
25
+ raise NotImplementedError("Method 'encode' must be implemented in subclass.")
26
+
27
+
28
+ class SBertEncoder(Encoder):
29
+ def __init__(self, model: SentenceTransformer, device: ENCODER_DEVICE_TYPE, batch_size: int, verbose: bool):
30
+ """
31
+ Initialize SBertEncoder instance.
32
+
33
+ Args:
34
+ model (SentenceTransformer): The Sentence Transformer model instance to use for encoding.
35
+ device (Union[str, int, List[Union[str, int]]]): Device specification for encoding
36
+ batch_size (int): Batch size for encoding.
37
+ verbose (bool): Whether to print verbose information during encoding.
38
+ """
39
+ self.model = model
40
+ self.device = device
41
+ self.batch_size = batch_size
42
+ self.verbose = verbose
43
+
44
+ def encode(self, prediction: List[str]) -> NDArray:
45
+ """
46
+ Encode a list of sentences into sentence embeddings.
47
+
48
+ Args:
49
+ prediction (List[str]): List of sentences to encode.
50
+
51
+ Returns:
52
+ NDArray: Array of sentence embeddings with shape (num_sentences, embedding_dim).
53
+ """
54
+
55
+ # SBert output is always Batch x Dim
56
+ if isinstance(self.device, list):
57
+ # Use multiprocess encoding for list of devices
58
+ pool = self.model.start_multi_process_pool(target_devices=self.device)
59
+ embeddings = self.model.encode_multi_process(prediction, pool=pool, batch_size=self.batch_size)
60
+ self.model.stop_multi_process_pool(pool)
61
+ else:
62
+ # Single device encoding
63
+ embeddings = self.model.encode(
64
+ prediction,
65
+ device=self.device,
66
+ batch_size=self.batch_size,
67
+ show_progress_bar=self.verbose,
68
+ )
69
+
70
+ return embeddings
71
+
72
+
73
+ def get_encoder(
74
+ sbert_model: SentenceTransformer,
75
+ device: ENCODER_DEVICE_TYPE,
76
+ batch_size: int,
77
+ verbose: bool,
78
+ ) -> Encoder:
79
+ """
80
+ Get an instance of SBertEncoder using the provided parameters.
81
+
82
+ Args:
83
+ sbert_model (SentenceTransformer): An instance of SentenceTransformer model to use for encoding.
84
+ device (Union[str, int, List[Union[str, int]]]): Device specification for the encoder
85
+ (e.g., "cuda", 0 for GPU, "cpu").
86
+ batch_size (int): Batch size to use for encoding.
87
+ verbose (bool): Whether to print verbose information during encoding.
88
+
89
+ Returns:
90
+ SBertEncoder: Encoder instance wrapping the provided SentenceTransformer model.
91
+
92
+ Example:
93
+ >>> model_name = "paraphrase-distilroberta-base-v1"
94
+ >>> sbert_model = get_sbert_encoder(model_name)
95
+ >>> device = get_gpu("cuda")
96
+ >>> batch_size = 32
97
+ >>> verbose = True
98
+ >>> encoder = get_encoder(sbert_model, device, batch_size, verbose)
99
+ """
100
+ encoder = SBertEncoder(sbert_model, device, batch_size, verbose)
101
+ return encoder
102
+
103
+
104
+ def get_sbert_encoder(model_name: str) -> SentenceTransformer:
105
+ """
106
+ Get an instance of SentenceTransformer encoder based on the specified model name.
107
+
108
+ Args:
109
+ model_name (str): Name of the model to instantiate. You can use any model on Huggingface/SentenceTransformer
110
+ that is supported by SentenceTransformer.
111
+
112
+ Returns:
113
+ SentenceTransformer: Instance of the selected encoder based on the model_name.
114
+
115
+ Raises:
116
+ EnvironmentError: If an unsupported model_name is provided.
117
+ RuntimeError: If there's an issue during instantiation of the encoder.
118
+ """
119
+
120
+ try:
121
+ encoder = SentenceTransformer(model_name, trust_remote_code=True)
122
+ except EnvironmentError as err:
123
+ raise EnvironmentError(str(err)) from None
124
+ except Exception as err:
125
+ raise RuntimeError(str(err)) from None
126
+
127
+ return encoder
128
+
129
+
requirements.txt CHANGED
@@ -1 +1,3 @@
1
- git+https://github.com/huggingface/evaluate@main
1
+ git+https://github.com/huggingface/evaluate@main
2
+ nltk
3
+ sentence-transformers
semncg.py CHANGED
@@ -11,55 +11,340 @@
11
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
  # See the License for the specific language governing permissions and
13
  # limitations under the License.
14
- """TODO: Add a description here."""
15
16
  import evaluate
17
  import datasets
18
19
 
20
- # TODO: Add BibTeX citation
21
  _CITATION = """\
22
- @InProceedings{huggingface:module,
23
- title = {A great new module},
24
- authors={huggingface, Inc.},
25
- year={2020}
26
  }
27
  """
28
 
29
- # TODO: Add description of the module here
30
  _DESCRIPTION = """\
31
- This new module is designed to solve this great ML task and is crafted with a lot of care.
32
  """
33
 
34
-
35
- # TODO: Add description of the arguments of the module here
36
  _KWARGS_DESCRIPTION = """
37
- Calculates how good are predictions given some references, using certain scores
38
  Args:
39
- predictions: list of predictions to score. Each predictions
40
- should be a string with tokens separated by spaces.
41
- references: list of reference for each prediction. Each
42
- reference should be a string with tokens separated by spaces.
43
  Returns:
44
- accuracy: description of the first score,
45
- another_score: description of the second score,
46
  Examples:
47
- Examples should be written in doctest format, and should illustrate how
48
- to use the function.
49
 
50
- >>> my_new_module = evaluate.load("my_new_module")
51
- >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
52
- >>> print(results)
53
- {'accuracy': 1.0}
 
  """
55
 
56
- # TODO: Define external resources urls if needed
57
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
58
 
59
 
60
  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
61
  class SemnCG(evaluate.Metric):
62
- """TODO: Short description of my evaluation module."""
63
 
64
  def _info(self):
65
  # TODO: Specifies the evaluate.EvaluationModuleInfo object
@@ -70,26 +355,171 @@ class SemnCG(evaluate.Metric):
70
  citation=_CITATION,
71
  inputs_description=_KWARGS_DESCRIPTION,
72
  # This defines the format of each prediction and reference
73
- features=datasets.Features({
74
- 'predictions': datasets.Value('int64'),
75
- 'references': datasets.Value('int64'),
76
- }),
77
- # Homepage of the module for documentation
78
- homepage="http://module.homepage",
79
- # Additional links to the codebase or references
80
- codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
81
- reference_urls=["http://path.to.reference.url/new_module"]
82
  )
83
 
84
  def _download_and_prepare(self, dl_manager):
85
  """Optional: download external resources useful to compute the scores"""
86
- # TODO: Download external resources if needed
87
- pass
88
-
89
- def _compute(self, predictions, references):
90
- """Returns the scores"""
91
- # TODO: Compute the different scores of the module
92
- accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
93
- return {
94
- "accuracy": accuracy,
95
- }
11
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
  # See the License for the specific language governing permissions and
13
  # limitations under the License.
14
+ """Sem-NCG metric"""
15
 
16
+ from dataclasses import dataclass
17
  import evaluate
18
  import datasets
19
+ import re
20
+ import statistics
21
+ from typing import Dict, List, Tuple, Union
22
 
23
+ import nltk
24
+ import numpy as np
25
+ from sklearn.metrics.pairwise import cosine_similarity
26
+ from tqdm import tqdm
27
+
28
+ from .encoder_models import get_sbert_encoder, get_encoder
29
+ from .type_aliases import DEVICE_TYPE, NDArray, DOCUMENT_TYPE
30
+ from .utils import get_gpu, prep_sentences, flatten_list, slice_embeddings, is_nested_list_of_type, tokenize_and_prep_document
31
 
 
32
  _CITATION = """\
33
+ @inproceedings{akter-etal-2022-revisiting,
34
+ title = "Revisiting Automatic Evaluation of Extractive Summarization Task: Can We Do Better than {ROUGE}?",
35
+ author = "Akter, Mousumi and
36
+ Bansal, Naman and
37
+ Karmaker, Shubhra Kanti",
38
+ editor = "Muresan, Smaranda and
39
+ Nakov, Preslav and
40
+ Villavicencio, Aline",
41
+ booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
42
+ month = may,
43
+ year = "2022",
44
+ address = "Dublin, Ireland",
45
+ publisher = "Association for Computational Linguistics",
46
+ url = "https://aclanthology.org/2022.findings-acl.122",
47
+ doi = "10.18653/v1/2022.findings-acl.122",
48
+ pages = "1547--1560",
49
+ abstract = "It has been the norm for a long time to evaluate automated summarization tasks using the popular ROUGE metric. Although several studies in the past have highlighted the limitations of ROUGE, researchers have struggled to reach a consensus on a better alternative until today. One major limitation of the traditional ROUGE metric is the lack of semantic understanding (relies on direct overlap of n-grams). In this paper, we exclusively focus on the extractive summarization task and propose a semantic-aware nCG (normalized cumulative gain)-based evaluation metric (called Sem-nCG) for evaluating this task. One fundamental contribution of the paper is that it demonstrates how we can generate more reliable semantic-aware ground truths for evaluating extractive summarization tasks without any additional human intervention. To the best of our knowledge, this work is the first of its kind. We have conducted extensive experiments with this new metric using the widely used CNN/DailyMail dataset. Experimental results show that the new Sem-nCG metric is indeed semantic-aware, shows higher correlation with human judgement (more reliable) and yields a large number of disagreements with the original ROUGE metric (suggesting that ROUGE often leads to inaccurate conclusions also verified by humans).",
50
  }
51
  """
52
 
 
53
  _DESCRIPTION = """\
54
+ Sem-nCG (Semantic Normalized Cumulative Gain) Metric evaluates the quality of predicted sentences
55
+ (abstractive/extractive) in relation to reference sentences and documents using Semantic Normalized Cumulative Gain
56
+ (NCG). It computes gain values and NCG scores based on cosine similarity between sentence embeddings, leveraging a
57
+ Sentence-BERT encoder. This metric is designed to assess the relevance and ranking of predicted sentences, making it
58
+ useful for tasks such as summarization and information retrieval.
59
  """
60
 
 
 
61
  _KWARGS_DESCRIPTION = """
62
+ Sem-nCG (Semantic Normalized Cumulative Gain) compares system-generated summaries (predictions) with ground-truth
+ reference summaries (references) and the corresponding input documents (documents).
64
+ It computes gain values and NCG scores based on sentence embeddings.
65
+
66
  Args:
67
+ predictions (DOCUMENT_TYPE): The predicted sentences.
68
+ `tokenize_sentences`=True -> predictions: List[str]
69
+ `tokenize_sentences`=False -> predictions: List[List[str]]
70
+ references (DOCUMENT_TYPE): The reference sentences.
71
+ `tokenize_sentences`=True -> references: List[str]
72
+ `tokenize_sentences`=False -> references: List[List[str]]
73
+ documents (DOCUMENT_TYPE): Input documents.
74
+ `tokenize_sentences`=True -> documents: List[str]
75
+ `tokenize_sentences`=False -> documents: List[List[str]]
76
+ k (int): The rank threshold used for evaluating gains (typically top-k sentences). Default is 3.
77
+ gpu (Union[bool, str, int, List[Union[str, int]]]): Whether to use GPU or CPU for computation.
78
+ bool -
79
+ False - CPU (Default)
80
+ True - GPU (device 0) if gpu is available else CPU
81
+ int -
82
+ n - GPU, device index n
83
+ str -
84
+ 'cuda', 'gpu', 'cpu'
85
+ List[Union[str, int]] - Multiple GPUs/cpus i.e. use multiple processes when computing embeddings
86
+ batch_size (int): Batch size for encoding. Default is 32.
87
+ verbose (bool): Flag to indicate verbose output. Default is False.
88
+ tokenize_sentences (bool): Flag to indicate whether to tokenize the sentences in the input documents. Default: True.
89
+ pre_compute_embeddings (bool): Flag to indicate whether to pre-compute embeddings for all sentences. This speeds up
90
+ computation but requires more memory. Default is False.
91
+ debug (bool): Flag to return detailed debug information including ranked gains. Default is False.
92
+
93
  Returns:
94
+ Union[Tuple[float, List[float]], Tuple[float, List[RankedGains]]]:
95
+ If `debug` is False, returns a tuple containing the mean SemnCG score and a list of SemnCG scores for each document.
96
+ If `debug` is True, returns a tuple containing the mean SemnCG score and a list of `RankedGains` objects with
97
+ detailed gain information for each document.
98
+
99
+ Examples of input formats:
100
+
101
+ Case 1: tokenize_sentences = True
102
+ predictions: List[str] - List of predictions where each prediction is a document.
103
+ references: List[str] - List of references where each reference is a document.
104
+ documents: List[str] - List of input documents where each document is a document.
105
+ Example:
106
+ predictions = ["This is a prediction sentence 1. This is a prediction sentence 2."]
107
+ references = ["This is a reference sentence 1. This is a reference sentence 2."]
108
+ documents = ["This is a document sentence 1. This is a document sentence 2."]
109
+
110
+ Case 2: tokenize_sentences = False
111
+ predictions: List[List[str]] - List of predictions where each prediction is a list of sentences.
112
+ references: List[List[str]] - List of references where each reference is a list of sentences.
113
+ documents: List[List[str]] - List of input documents where each document is a list of sentences.
114
+ Example:
115
+ predictions = [["This is a prediction sentence 1.", "This is a prediction sentence 2."]]
116
+ references = [["This is a reference sentence 1.", "This is a reference sentence 2."]]
117
+ documents = [["This is a document sentence 1.", "This is a document sentence 2."]]
118
+
119
  Examples:
 
 
120
 
121
+ >>> import evaluate
122
+ >>> predictions = ["This is a prediction sentence 1. This is a prediction sentence 2."]
123
+ >>> references = ["This is a reference sentence 1. This is a reference sentence 2."]
124
+ >>> documents = ["This is a document sentence 1. This is a document sentence 2."]
125
+ >>> metric = evaluate.load("nbansal/semncg", model_name="all-MiniLM-L6-v2")
126
+ >>> mean_score, scores = metric.compute(predictions=predictions, references=references, documents=documents)
127
+ >>> print(f"Mean SemnCG: {mean_score}")
128
  """
129
 
130
+
131
+
132
+
133
+ @dataclass
134
+ class RankedGains:
135
+ """
136
+ Dataclass to store ranked gains and associated metadata.
137
+
138
+ Attributes:
139
+ gt_gains (List[Tuple[str, float]]): List of tuples representing ground truth (ideal) gains,
140
+ where each tuple contains a document sentence and its corresponding gain value.
141
+ pred_gains (List[Tuple[str, float]]): List of tuples representing predicted gains by the model,
142
+ where each tuple contains a document identifier and its corresponding gain value.
143
+ k (int): The rank threshold used for evaluating gains (typically top-k documents).
144
+ ncg (float): Normalized Cumulative Gain (NCG) score calculated based on the predicted gains
145
+ compared to the ground truth gains.
146
+
147
+ Notes:
148
+ - `gt_gains` and `pred_gains` are typically sorted in descending order
149
+ - `k` specifies the top-k threshold used for evaluating the gains.
150
+ - `ncg` provides a normalized measure of the model's performance.
151
+ """
152
+ gt_gains: List[Tuple[str, float]]
153
+ pred_gains: List[Tuple[str, float]]
154
+ k: int
155
+ ncg: float
156
+
157
+
158
+ def compute_cosine_similarity(doc_embeds: NDArray, ref_embeds: NDArray) -> List[float]:
159
+ """
160
+ Compute cosine similarity scores between each document embedding and reference embeddings.
161
+
162
+ Args:
163
+ doc_embeds (NDArray): 2D array of shape (#Docs, Embedding_dim) containing document embeddings.
164
+ ref_embeds (NDArray): 2D array of shape (#Refs, Embedding_dim) containing reference embeddings.
165
+
166
+ Returns:
167
+ List[float]: A list of mean cosine similarity scores between each document and reference embeddings.
168
+ The length of the list is equal to the number of documents (#Docs).
169
+
170
+ Notes:
171
+ - Uses cosine_similarity function from sklearn.metrics.pairwise to compute pairwise cosine similarities.
172
+ - Returns the mean cosine similarity scores across reference embeddings for each document embedding.
173
+ """
174
+ # Compute cosine similarity between predicted and reference embeddings
175
+ cosine_scores = cosine_similarity(doc_embeds, ref_embeds) # [#Docs, #Refs]
176
+ return np.mean(cosine_scores, axis=1).tolist()
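+ # Example (matching the unit tests): doc_embeds = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]] and
+ # ref_embeds = [[0.2, 0.3, 0.4], [0.5, 0.6, 0.7]] yield per-document mean cosine
+ # similarities of approximately [0.980, 0.997].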
177
+
178
+
179
+ def compute_gain(sim_scores: List[float]) -> List[Tuple[int, float]]:
180
+ """
181
+ Compute gain values for ranked similarity scores.
182
+
183
+ Args:
184
+ sim_scores (List[float]): List of similarity scores for documents (`compute_cosine_similarity(doc_embeds, ref_embeds)`)
185
+
186
+ Returns:
187
+ List[Tuple[int, float]]: A list of tuples where each tuple contains a document index and its corresponding gain
188
+ value. The list is sorted by descending order of gain values.
189
+
190
+ Notes:
191
+ - Computes gain values based on the rank order of similarity scores, where higher scores indicate higher gains.
192
+ - Uses the formula: gain = rank_position / sum of ranks, where rank_position starts from 1 for the highest score
193
+ - Returns a list sorted by descending gain values.
194
+ """
195
+ count = len(sim_scores)
196
+ sim_scores = np.array(sim_scores).argsort()[::-1] # Reverse Sorted Order of doc sentence indices
197
+ denominator = count * (count + 1) / 2 # (n * (n+1))/2
198
+ return [(s_idx, val / denominator) for s_idx, val in zip(sim_scores, range(count, 0, -1))]
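+ # Example (matching the unit tests): for sim_scores = [0.8, 0.6, 0.7], the document indices sorted by
+ # descending similarity are [0, 2, 1], the assigned ranks are [3, 2, 1], and the denominator is
+ # 3 * 4 / 2 = 6, so the returned gains are [(0, 3/6), (2, 2/6), (1, 1/6)].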
199
+
200
+
201
+ def score_ncg(model_relevance: List[float], gt_relevance: List[float]) -> float:
202
+ """
203
+ Calculate the Normalized Cumulative Gain (NCG) score based on model relevance and ground truth relevance.
204
+
205
+ Args:
206
+ model_relevance (List[float]): List of gain values representing the relevance scores predicted by the model.
207
+ gt_relevance (List[float]): List of gain values representing the ground truth (ideal) relevance scores.
208
+
209
+ Returns:
210
+ float: Normalized Cumulative Gain (NCG) score, which measures the effectiveness of the model's relevance
211
+ predictions compared to the ideal relevance scores. The score ranges from 0 to 1, where higher values
212
+ indicate better performance.
213
+
214
+ Notes:
215
+ - Calculates Cumulative Gain (CG) for both model and ground truth relevance lists.
216
+ - Normalizes CG scores by dividing model CG by ground truth CG to get the NCG score.
217
+ - Returns 0 if the ground truth CG (icg) is 0 to avoid division by zero.
218
+ """
219
+
220
+ # CG score
221
+ cg = sum(model_relevance)
222
+
223
+ # ICG score
224
+ icg = sum(gt_relevance)
225
+
226
+ # Normalized CG score
227
+ return cg / icg if icg != 0 else 0
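+ # Example (matching the unit tests): model_relevance = [0.8, 0.7, 0.6] and gt_relevance = [1.0, 0.9, 0.8]
+ # give cg = 2.1 and icg = 2.7, so the returned NCG score is 2.1 / 2.7 ≈ 0.778.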
228
+
229
+
230
+ def compute_ncg(pred_gains: List[Tuple[int, float]], gt_gains: List[Tuple[int, float]], k: int) -> float:
231
+ """
232
+ Compute the Normalized Cumulative Gain (NCG) score based on predicted and ground truth gains up to rank k.
233
+
234
+ Args:
235
+ pred_gains (List[Tuple[int, float]]): List of tuples representing predicted gains by the model,
236
+ where each tuple contains a document position (or index) and its corresponding gain value.
237
+ (Sorted in Descending Order)
238
+ gt_gains (List[Tuple[int, float]]): List of tuples representing ground truth gains (ideal gains),
239
+ where each tuple contains a document position (or index) and its corresponding gain value.
240
+ (Sorted in Descending Order)
241
+ k (int): The rank threshold used for evaluating gains (typically top-k documents).
242
+
243
+ Returns:
244
+ float: Normalized Cumulative Gain (NCG) score based on the predicted gains compared to the ground truth gains.
245
+
246
+ Notes:
247
+ - Both `pred_gains` and `gt_gains` should be sorted lists (in descending order) where higher gain values indicate
248
+ higher relevance.
249
+ - The function calculates NCG up to rank `k`, considering only the top-k documents.
250
+ - Uses the `score_ncg` function to compute the NCG score based on the model's predicted gains and the ground
251
+ truth.
252
+ """
253
+ gt_dict = dict(gt_gains)
254
+ gt_rel = [v for _, v in gt_gains[:k]]
255
+ model_rel = [gt_dict[position] for position, _ in pred_gains[:k]]
256
+ return score_ncg(model_rel, gt_rel)
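+ # Example (matching the unit tests): with gt_gains = [(0, 1.0), (1, 0.9), (2, 0.8)],
+ # pred_gains = [(0, 0.8), (2, 0.7), (1, 0.6)] and k = 3, the model's top-3 sentences are the same set
+ # as the ideal top-3, so both sums equal 2.7 and the NCG score is 1.0.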
257
+
258
+
259
+ def _validate_input_format(
260
+ tokenize_sentences: bool,
261
+ predictions: DOCUMENT_TYPE,
262
+ references: DOCUMENT_TYPE,
263
+ documents: DOCUMENT_TYPE
264
+ ):
265
+ """
266
+ Validate the format of predictions, references, and documents based on specified criteria.
267
+
268
+ Args:
269
+ tokenize_sentences (bool): Flag indicating whether sentences should be tokenized.
270
+ predictions (DOCUMENT_TYPE): Predictions to validate.
271
+ references (DOCUMENT_TYPE): References to validate.
272
+ documents (DOCUMENT_TYPE): Documents to validate.
273
+
274
+ Raises:
275
+ ValueError: If the format of predictions, references, or documents does not meet the specified criteria.
276
+
277
+ Validation Criteria:
278
+ The function validates predictions, references, and documents based on the following conditions:
279
+ 1. If `tokenize_sentences` is True:
280
+ - Predictions, references, and documents must all be lists of strings (`is_list_of_strings_at_depth(obj, 1)`).
281
+
282
+ 2. If `tokenize_sentences` is False:
283
+ - Predictions, references, and documents must all be lists of lists of strings
284
+ (`is_list_of_strings_at_depth(obj, 2)`).
285
+
286
+ The function checks these conditions and raises a ValueError if any condition is not met,
287
+ indicating that predictions, references, or documents are not in the valid input format.
288
+
289
+ Notes:
290
+ - `DOCUMENT_TYPE`: Union[List[str], List[List[str]]]
291
+ - Uses helper function `is_list_of_strings_at_depth` to validate the format of lists of strings.
292
+
293
+ Example:
294
+ >>> tokenize_sentences = True
295
+ >>> predictions = ["This is prediction 1.", "This is prediction 2."]
296
+ >>> references = ["Reference for prediction 1.", "Reference for prediction 2."]
297
+ >>> documents = ["Document 1 content.", "Document 2 content."]
298
+ >>> _validate_input_format(tokenize_sentences, predictions, references, documents)
299
+
300
+ Example:
301
+ >>> tokenize_sentences = False
302
+ >>> predictions = [["Sentence 1 in prediction 1.", "Sentence 2 in prediction 1."],
303
+ >>> ["Sentence 1 in prediction 2.", "Sentence 2 in prediction 2."]]
304
+ >>> references = [["Sentences in reference 1."], ["Sentences in reference 2."]]
305
+ >>> documents = [["Sentence 1 in document 1.", "Sentence 2 in document 1."],
306
+ >>> ["Sentence 1 in document 2.", "Sentence 2 in document 2."]]
307
+ >>> _validate_input_format(tokenize_sentences, predictions, references, documents)
308
+ """
309
+ if not (len(predictions) == len(references) == len(documents)):
310
+ raise ValueError("Predictions, References and Documents must have the same length.")
311
+
312
+ if len(predictions) == 0:
313
+ raise ValueError("Can't have empty inputs")
314
+
315
+ def is_list_of_strings_at_depth(lst_obj, depth: int):
316
+ return is_nested_list_of_type(lst_obj, element_type=str, depth=depth)
317
+
318
+ if tokenize_sentences:
319
+ condition = (
320
+ is_list_of_strings_at_depth(predictions, 1) and
321
+ is_list_of_strings_at_depth(references, 1) and
322
+ is_list_of_strings_at_depth(documents, 1)
323
+ )
324
+ else:
325
+ condition = (
326
+ is_list_of_strings_at_depth(predictions, 2) and
327
+ is_list_of_strings_at_depth(references, 2) and
328
+ is_list_of_strings_at_depth(documents, 2)
329
+ )
330
+
331
+ if not condition:
332
+ raise ValueError("Predictions, References and Documents are not valid input format. Refer to documentation.")
333
 
334
 
335
  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
336
  class SemnCG(evaluate.Metric):
337
+ """
338
+ SemnCG (Semantic Normalized Cumulative Gain) Metric.
339
+
340
+ This metric evaluates the quality of predicted sentences in relation to reference sentences and documents
341
+ using Semantic Normalized Cumulative Gain (NCG). It computes the gain values and NCG scores based on
342
+ cosine similarity between sentence embeddings, leveraging a Sentence-BERT encoder.
343
+ """
344
+
345
+ def __init__(self, model_name: str = "all-MiniLM-L6-v2", **kwargs):
346
+ self.sbert_encoder = get_sbert_encoder(model_name)
347
+ super().__init__(**kwargs)
348
 
349
  def _info(self):
350
  # TODO: Specifies the evaluate.EvaluationModuleInfo object
 
355
  citation=_CITATION,
356
  inputs_description=_KWARGS_DESCRIPTION,
357
  # This defines the format of each prediction and reference
358
+ features=[
359
+ # Tokenize_Sentences = True
360
+ datasets.Features(
361
+ {
362
+ "predictions": datasets.Value("string"),
363
+ "references": datasets.Value("string"),
364
+ "documents": datasets.Value("string"),
365
+ }
366
+ ),
367
+ # Tokenize_Sentences = False
368
+ datasets.Features(
369
+ {
370
+ "predictions": datasets.Sequence(datasets.Value("string", id="sequence"), id="predictions"),
371
+ "references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
372
+ "documents": datasets.Sequence(datasets.Value("string", id="sequence"), id="documents"),
373
+ }
374
+ ),
375
+ ],
376
+ # # Homepage of the module for documentation
377
+ # homepage="http://module.homepage",
378
+ # # Additional links to the codebase or references
379
+ # codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
380
+ reference_urls=["https://aclanthology.org/2022.findings-acl.122/"]
381
  )
382
 
383
  def _download_and_prepare(self, dl_manager):
384
  """Optional: download external resources useful to compute the scores"""
385
+ nltk.download("punkt", quiet=True)
386
+
387
+ def _compute(
388
+ self,
389
+ predictions: DOCUMENT_TYPE,
390
+ references: DOCUMENT_TYPE,
391
+ documents: DOCUMENT_TYPE,
392
+ k: int = 3,
393
+ gpu: DEVICE_TYPE = False,
394
+ verbose: bool = False,
395
+ batch_size: int = 32,
396
+ tokenize_sentences: bool = True,
397
+ pre_compute_embeddings: bool = False,
398
+ debug: bool = False,
399
+ ) -> Union[Tuple[float, List[float]], Tuple[float, List[RankedGains]]]:
400
+ """
401
+ Compute the Semantic Normalized Cumulative Gain (SemnCG) score.
402
+
403
+ Args:
404
+ predictions (DOCUMENT_TYPE): The predicted sentences.
405
+ `tokenize_sentences`=True -> predictions: List[str]
406
+ `tokenize_sentences`=False -> predictions: List[List[str]]
407
+ references (DOCUMENT_TYPE): The reference sentences.
408
+ `tokenize_sentences`=True -> references: List[str]
409
+ `tokenize_sentences`=False -> references: List[List[str]]
410
+ documents (DOCUMENT_TYPE): Input documents.
411
+ `tokenize_sentences`=True -> documents: List[str]
412
+ `tokenize_sentences`=False -> documents: List[List[str]]
413
+ k (int, optional): The rank threshold used for evaluating gains (typically top-k sentences). Default is 3.
414
+ gpu (DEVICE_TYPE, optional): Whether to use GPU for computation. Default is False.
415
+ verbose (bool, optional): Whether to print verbose logs. Default is False.
416
+ batch_size (int, optional): The batch size for encoding sentences. Default is 32.
417
+ tokenize_sentences (bool, optional): Whether to tokenize sentences. If True, sentences are tokenized before
418
+ processing. Default is True.
419
+ pre_compute_embeddings (bool, optional): Whether to pre-compute embeddings for all sentences. This speeds up
420
+ computation but requires more memory. Default is False.
421
+ debug (bool, optional): Whether to return detailed debug information including ranked gains. Default=False.
422
+
423
+ Returns:
424
+ Union[Tuple[float, List[float]], Tuple[float, List[RankedGains]]]:
425
+ If `debug` is False, returns a tuple containing the mean SemnCG score and a list of SemnCG scores for each document.
426
+ If `debug` is True, returns a tuple containing the mean SemnCG score and a list of `RankedGains` objects with detailed gain information for each document.
427
+
428
+ Raises:
429
+ ValueError: If the format of predictions, references, or documents does not meet the specified criteria.
430
+
431
+ Notes:
432
+ - Validates the format of predictions, references, and documents based on `tokenize_sentences`.
433
+ - Computes embeddings using a Sentence-BERT encoder.
434
+ - Computes cosine similarity between document, reference, and prediction embeddings.
435
+ - Calculates gain values and Normalized Cumulative Gain (NCG) scores.
436
+ - Optionally returns detailed debug information for each document if `debug` is True.
437
+ """
438
+
439
+ # Validate inputs corresponding to flags
440
+ _validate_input_format(tokenize_sentences, predictions, references, documents)
441
+
442
+ # Get GPU
443
+ device = get_gpu(gpu)
444
+ if verbose:
445
+ print(f"Using devices: {device}")
446
+
447
+ # Get model
448
+ encoder = get_encoder(self.sbert_encoder, device=device, batch_size=batch_size, verbose=verbose)
449
+
450
+ if pre_compute_embeddings: # fast but takes more memory
451
+ predictions = [tokenize_and_prep_document(pred, tokenize_sentences) for pred in predictions]
452
+ references = [tokenize_and_prep_document(ref, tokenize_sentences) for ref in references]
453
+ documents = [tokenize_and_prep_document(doc, tokenize_sentences) for doc in documents]
454
+
455
+ # This is only done for debug case
456
+ sent_tokenized_documents = documents
457
+
458
+ # Compute All Embeddings
459
+ all_sentences = flatten_list(documents) + flatten_list(references) + flatten_list(predictions)
460
+ embeddings = encoder.encode(all_sentences)
461
+
462
+ prediction_sentences_count = [len(pred) for pred in predictions]
463
+ reference_sentences_count = [len(ref) for ref in references]
464
+ document_sentences_count = [len(doc) for doc in documents]
465
+
466
+ # Get embeddings corresponding to documents, references and predictions (IN ORDER)
467
+ doc_embeddings = slice_embeddings(embeddings, document_sentences_count)
468
+ ref_embeddings = slice_embeddings(embeddings[sum(document_sentences_count):], reference_sentences_count)
469
+ pred_embeddings = slice_embeddings(
470
+ embeddings[sum(document_sentences_count+reference_sentences_count):], prediction_sentences_count
471
+ )
472
+
473
+ iterable_obj = zip(pred_embeddings, ref_embeddings, doc_embeddings)
474
+
475
+ else:
476
+ iterable_obj = zip(predictions, references, documents)
477
+
478
+ out = []
479
+ for idx, (pred, ref, doc) in enumerate(tqdm(iterable_obj)):
480
+
481
+ if not pre_compute_embeddings: # Compute embeddings
482
+ ref_sentences = tokenize_and_prep_document(ref, tokenize_sentences)
483
+ pred_sentences = tokenize_and_prep_document(pred, tokenize_sentences)
484
+ doc_sentences = tokenize_and_prep_document(doc, tokenize_sentences)
485
+
486
+ # Compute Embeddings
487
+ doc_sentence_count = len(doc_sentences)
488
+ ref_sentence_count = len(ref_sentences)
489
+ all_sentences = doc_sentences + ref_sentences + pred_sentences
490
+ embeddings = encoder.encode(all_sentences)
491
+ doc_embeddings = embeddings[:doc_sentence_count]
492
+ ref_embeddings = embeddings[doc_sentence_count:doc_sentence_count + ref_sentence_count]
493
+ pred_embeddings = embeddings[doc_sentence_count + ref_sentence_count:]
494
+ else: # we already have embeddings
495
+ doc_embeddings = doc
496
+ ref_embeddings = ref
497
+ pred_embeddings = pred
498
+
499
+ doc_sentences = sent_tokenized_documents[idx]
500
+
501
+ # Compute Pair-Wise Cosine Similarity
502
+ ref_sim_scores = compute_cosine_similarity(doc_embeddings, ref_embeddings)
503
+ pred_sim_scores = compute_cosine_similarity(doc_embeddings, pred_embeddings)
504
+
505
+ # Compute Gains
506
+ ground_truth_gain = compute_gain(ref_sim_scores)
507
+
508
+ # this is used to compute top-predicted sentence indices
509
+ pred_gain = compute_gain(pred_sim_scores)
510
+ real_k = min(len(pred_gain), k)
511
+
512
+ # Compute NCG Scores
513
+ ncg_score = compute_ncg(pred_gain, ground_truth_gain, real_k)
514
+
515
+ if debug:
516
+ ground_truth_gain = [(doc_sentences[sent_idx], gain_val) for sent_idx, gain_val in ground_truth_gain]
517
+ pred_gain = [(doc_sentences[sent_idx], gain_val) for sent_idx, gain_val in pred_gain]
518
+ out.append(RankedGains(ground_truth_gain, pred_gain, k=real_k, ncg=ncg_score))
519
+ else:
520
+ out.append(ncg_score)
521
+
522
+ if debug:
523
+ return statistics.mean([ele.ncg for ele in out]), out
524
+
525
+ return statistics.mean(out), out
tests.py CHANGED
@@ -1,17 +1,418 @@
1
- test_cases = [
2
- {
3
- "predictions": [0, 0],
4
- "references": [1, 1],
5
- "result": {"metric_score": 0}
6
- },
7
- {
8
- "predictions": [1, 1],
9
- "references": [1, 1],
10
- "result": {"metric_score": 1}
11
- },
12
- {
13
- "predictions": [1, 0],
14
- "references": [1, 1],
15
- "result": {"metric_score": 0.5}
16
- }
17
- ]
1
+ import statistics
2
+ import unittest
3
+ from unittest.mock import patch, MagicMock
4
+
5
+ import numpy as np
6
+ import torch
7
+ from numpy.testing import assert_almost_equal
8
+ from sentence_transformers import SentenceTransformer
9
+ from sklearn.metrics.pairwise import cosine_similarity
10
+
11
+ from .encoder_models import SBertEncoder, get_encoder, get_sbert_encoder
12
+ from .semncg import RankedGains, compute_cosine_similarity, compute_gain, score_ncg, compute_ncg, _validate_input_format, SemnCG
13
+ from .utils import get_gpu, slice_embeddings, is_nested_list_of_type, flatten_list, prep_sentences, tokenize_and_prep_document
14
+
15
+
16
+ class TestUtils(unittest.TestCase):
17
+ def test_get_gpu(self):
18
+ gpu_count = torch.cuda.device_count()
19
+ gpu_available = torch.cuda.is_available()
20
+
21
+ # Test single boolean input
22
+ self.assertEqual(get_gpu(True), 0 if gpu_available else "cpu")
23
+ self.assertEqual(get_gpu(False), "cpu")
24
+
25
+ # Test single string input
26
+ self.assertEqual(get_gpu("cpu"), "cpu")
27
+ self.assertEqual(get_gpu("gpu"), 0 if gpu_available else "cpu")
28
+ self.assertEqual(get_gpu("cuda"), 0 if gpu_available else "cpu")
29
+
30
+ # Test single integer input
31
+ self.assertEqual(get_gpu(0), 0 if gpu_available else "cpu")
32
+ self.assertEqual(get_gpu(1), 1 if gpu_available else "cpu")
33
+
34
+ # Test list input with unique elements
35
+ self.assertEqual(get_gpu([True, "cpu", 0]), [0, "cpu"] if gpu_available else ["cpu", "cpu", "cpu"])
36
+
37
+ # Test list input with duplicate elements
38
+ self.assertEqual(get_gpu([0, 0, "gpu"]), 0 if gpu_available else ["cpu", "cpu", "cpu"])
39
+
40
+ # Test list input with duplicate elements of different types
41
+ self.assertEqual(get_gpu([True, 0, "gpu"]), 0 if gpu_available else ["cpu", "cpu", "cpu"])
42
+
43
+ # Test list input but only one element
44
+ self.assertEqual(get_gpu([True]), 0 if gpu_available else "cpu")
45
+
46
+ # Test list input with all integers
47
+ self.assertEqual(get_gpu(list(range(gpu_count))),
48
+ list(range(gpu_count)) if gpu_available else gpu_count * ["cpu"])
49
+
50
+ with self.assertRaises(ValueError):
51
+ get_gpu("invalid")
52
+
53
+ with self.assertRaises(ValueError):
54
+ get_gpu(torch.cuda.device_count())
55
+
56
+ def test_prep_sentences(self):
57
+ # Test normal case
58
+ self.assertEqual(prep_sentences(["Hello, world!", " This is a test. ", "!!!"]),
59
+ ['Hello, world!', 'This is a test.'])
60
+
61
+ # Test case with only punctuations
62
+ with self.assertRaises(ValueError):
63
+ prep_sentences(["!!!", "..."])
64
+
65
+ # Test case with empty list
66
+ with self.assertRaises(ValueError):
67
+ prep_sentences([])
68
+
69
+ def test_tokenize_and_prep_document(self):
70
+ # Test tokenize=True with string input
71
+ self.assertEqual(tokenize_and_prep_document("Hello, world! This is a test.", True),
72
+ ['Hello, world!', 'This is a test.'])
73
+
74
+ # Test tokenize=False with list of strings input
75
+ self.assertEqual(tokenize_and_prep_document(["Hello, world!", "This is a test."], False),
76
+ ['Hello, world!', 'This is a test.'])
77
+
78
+ # Test tokenize=True with empty document
79
+ with self.assertRaises(ValueError):
80
+ tokenize_and_prep_document("!!! ...", True)
81
+
82
+ def test_slice_embeddings(self):
83
+ # Case 1
84
+ embeddings = np.random.rand(10, 5)
85
+ num_sentences = [3, 2, 5]
86
+ expected_output = [embeddings[:3], embeddings[3:5], embeddings[5:]]
87
+ self.assertTrue(
88
+ all(np.array_equal(a, b) for a, b in zip(slice_embeddings(embeddings, num_sentences),
89
+ expected_output))
90
+ )
91
+
92
+ # Case 2
93
+ num_sentences_nested = [[2, 1], [3, 4]]
94
+ expected_output_nested = [[embeddings[:2], embeddings[2:3]], [embeddings[3:6], embeddings[6:]]]
95
+ self.assertTrue(
96
+ slice_embeddings(embeddings, num_sentences_nested), expected_output_nested
97
+ )
98
+
99
+ # Case 3
100
+ document_sentences_count = [10, 8, 7]
101
+ reference_sentences_count = [5, 3, 2]
102
+ pred_sentences_count = [2, 2, 1]
103
+ all_embeddings = np.random.rand(
104
+ sum(document_sentences_count + reference_sentences_count + pred_sentences_count), 5,
105
+ )
106
+
107
+ embeddings = all_embeddings
108
+ expected_doc_embeddings = [embeddings[:10], embeddings[10:18], embeddings[18:25]]
109
+
110
+ embeddings = all_embeddings[25:]
111
+ expected_ref_embeddings = [embeddings[:5], embeddings[5:8], embeddings[8:10]]
112
+
113
+ embeddings = all_embeddings[35:]
114
+ expected_pred_embeddings = [embeddings[:2], embeddings[2:4], embeddings[4:5]]
115
+
116
+ doc_embeddings = slice_embeddings(all_embeddings, document_sentences_count)
117
+ ref_embeddings = slice_embeddings(all_embeddings[sum(document_sentences_count):], reference_sentences_count)
118
+ pred_embeddings = slice_embeddings(
119
+ all_embeddings[sum(document_sentences_count+reference_sentences_count):], pred_sentences_count
120
+ )
121
+
122
+ self.assertTrue(doc_embeddings, expected_doc_embeddings)
123
+ self.assertTrue(ref_embeddings, expected_ref_embeddings)
124
+ self.assertTrue(pred_embeddings, expected_pred_embeddings)
125
+
126
+ with self.assertRaises(TypeError):
127
+ slice_embeddings(embeddings, "invalid")
128
+
129
+ def test_is_nested_list_of_type(self):
130
+ # Test case: Depth 0, single element matching element_type
131
+ self.assertTrue(is_nested_list_of_type("test", str, 0))
132
+
133
+ # Test case: Depth 0, single element not matching element_type
134
+ self.assertFalse(is_nested_list_of_type("test", int, 0))
135
+
136
+ # Test case: Depth 1, list of elements matching element_type
137
+ self.assertTrue(is_nested_list_of_type(["apple", "banana"], str, 1))
138
+
139
+ # Test case: Depth 1, list of elements not matching element_type
140
+ self.assertFalse(is_nested_list_of_type([1, 2, 3], str, 1))
141
+
142
+ # Test case: Depth 0 (Wrong), list of elements matching element_type
143
+ self.assertFalse(is_nested_list_of_type([1, 2, 3], str, 0))
144
+
145
+ # Depth 2
146
+ self.assertTrue(is_nested_list_of_type([[1, 2], [3, 4]], int, 2))
147
+ self.assertTrue(is_nested_list_of_type([['1', '2'], ['3', '4']], str, 2))
148
+ self.assertFalse(is_nested_list_of_type([[1, 2], ["a", "b"]], int, 2))
149
+
150
+ # Depth 3
151
+ self.assertFalse(is_nested_list_of_type([[[1], [2]], [[3], [4]]], list, 3))
152
+ self.assertTrue(is_nested_list_of_type([[[1], [2]], [[3], [4]]], int, 3))
153
+
154
+ with self.assertRaises(ValueError):
155
+ is_nested_list_of_type([1, 2], int, -1)
156
+
157
+ def test_flatten_list(self):
158
+ self.assertEqual(flatten_list([1, [2, 3], [[4], 5]]), [1, 2, 3, 4, 5])
159
+ self.assertEqual(flatten_list([]), [])
160
+ self.assertEqual(flatten_list([1, 2, 3]), [1, 2, 3])
161
+ self.assertEqual(flatten_list([[[[1]]]]), [1])
162
+
163
+
164
+ class TestSBertEncoder(unittest.TestCase):
165
+
166
+ def setUp(self) -> None:
167
+ # Set up a test SentenceTransformer model
168
+ self.model_name = "paraphrase-distilroberta-base-v1"
169
+ self.sbert_model = get_sbert_encoder(self.model_name)
170
+ self.device = "cpu" # For testing on CPU
171
+ self.batch_size = 32
172
+ self.verbose = False
173
+ self.encoder = SBertEncoder(self.sbert_model, self.device, self.batch_size, self.verbose)
174
+
175
+ def test_encode_single_sentence(self):
176
+ sentence = "Hello, world!"
177
+ embeddings = self.encoder.encode([sentence])
178
+ self.assertEqual(embeddings.shape, (1, 768)) # Adjust shape based on your model's embedding dimension
179
+
180
+ def test_encode_multiple_sentences(self):
181
+ sentences = ["Hello, world!", "This is a test."]
182
+ embeddings = self.encoder.encode(sentences)
183
+ self.assertEqual(embeddings.shape, (2, 768)) # Adjust shape based on your model's embedding dimension
184
+
185
+ def test_get_sbert_encoder(self):
186
+ model_name = "paraphrase-distilroberta-base-v1"
187
+ sbert_model = get_sbert_encoder(model_name)
188
+ self.assertIsInstance(sbert_model, SentenceTransformer)
189
+
190
+ def test_encode_with_gpu(self):
191
+ if torch.cuda.is_available():
192
+ device = "cuda"
193
+ encoder = get_encoder(self.sbert_model, device, self.batch_size, self.verbose)
194
+ sentences = ["Hello, world!", "This is a test."]
195
+ embeddings = encoder.encode(sentences)
196
+ self.assertEqual(embeddings.shape, (2, 768)) # Adjust shape based on your model's embedding dimension
197
+ else:
198
+ self.skipTest("CUDA not available, skipping GPU test.")
199
+
200
+ def test_encode_multi_device(self):
201
+ if torch.cuda.device_count() < 2:
202
+ self.skipTest("Multi-GPU test requires at least 2 GPUs.")
203
+ else:
204
+ devices = ["cuda:0", "cuda:1"]
205
+ encoder = get_encoder(self.sbert_model, devices, self.batch_size, self.verbose)
206
+ sentences = ["This is a test sentence.", "Here is another sentence.", "This is a test sentence."]
207
+ embeddings = encoder.encode(sentences)
208
+ self.assertIsInstance(embeddings, np.ndarray)
209
+ self.assertEqual(embeddings.shape[0], 3)
210
+ self.assertEqual(embeddings.shape[1], self.encoder.model.get_sentence_embedding_dimension())
211
+
212
+
213
+ class TestGetEncoder(unittest.TestCase):
214
+ def setUp(self):
215
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
216
+ self.batch_size = 8
217
+ self.verbose = False
218
+
219
+ def _base_test(self, model_name):
220
+ sbert_model = get_sbert_encoder(model_name)
221
+ encoder = get_encoder(sbert_model, self.device, self.batch_size, self.verbose)
222
+
223
+ # Assert
224
+ self.assertIsInstance(encoder, SBertEncoder)
225
+ self.assertEqual(encoder.device, self.device)
226
+ self.assertEqual(encoder.batch_size, self.batch_size)
227
+ self.assertEqual(encoder.verbose, self.verbose)
228
+
229
+ def test_get_sbert_encoder(self):
230
+ model_name = "stsb-roberta-large"
231
+ self._base_test(model_name)
232
+
233
+ def test_sbert_model(self):
234
+ model_name = "all-mpnet-base-v2"
235
+ self._base_test(model_name)
236
+
237
+ def test_huggingface_model(self):
238
+ """Test Huggingface models which work with SBert library"""
239
+ model_name = "roberta-base"
240
+ self._base_test(model_name)
241
+
242
+ def test_get_encoder_environment_error(self):
243
+ model_name = "abc" # Wrong model_name
244
+ with self.assertRaises(EnvironmentError):
245
+ get_sbert_encoder(model_name)
246
+
247
+ def test_get_encoder_other_exception(self):
248
+ model_name = "apple/OpenELM-270M" # This model is not supported by SentenceTransformer lib
249
+ with self.assertRaises(RuntimeError):
250
+ get_sbert_encoder(model_name)
251
+
252
+
253
+ class TestRankedGainsDataclass(unittest.TestCase):
254
+ def test_ranked_gains_dataclass(self):
255
+ # Test initialization and attribute access
256
+ gt_gains = [("doc1", 0.8), ("doc2", 0.6)]
257
+ pred_gains = [("doc2", 0.7), ("doc1", 0.5)]
258
+ k = 2
259
+ ncg = 0.75
260
+ ranked_gains = RankedGains(gt_gains, pred_gains, k, ncg)
261
+
262
+ self.assertEqual(ranked_gains.gt_gains, gt_gains)
263
+ self.assertEqual(ranked_gains.pred_gains, pred_gains)
264
+ self.assertEqual(ranked_gains.k, k)
265
+ self.assertEqual(ranked_gains.ncg, ncg)
266
+
267
+
268
+ class TestComputeCosineSimilarity(unittest.TestCase):
269
+ def test_compute_cosine_similarity(self):
270
+ doc_embeds = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
271
+ ref_embeds = np.array([[0.2, 0.3, 0.4], [0.5, 0.6, 0.7]])
272
+ # Test compute_cosine_similarity function
273
+ similarity_scores = compute_cosine_similarity(doc_embeds, ref_embeds)
274
+ print(similarity_scores)
275
+
276
+ # Example values, change as per actual function output
277
+ expected_scores = [0.980, 0.997]
278
+
279
+ self.assertAlmostEqual(similarity_scores[0], expected_scores[0], places=3)
280
+ self.assertAlmostEqual(similarity_scores[1], expected_scores[1], places=3)
281
+
282
+
283
+ class TestComputeGain(unittest.TestCase):
284
+ def test_compute_gain(self):
285
+ # Test compute_gain function
286
+ sim_scores = [0.8, 0.6, 0.7]
287
+ gains = compute_gain(sim_scores)
288
+ print(gains)
289
+
290
+ # Expected (sentence index, gain) pairs, ranked by similarity score
291
+ expected_gains = [(0, 0.5), (2, 0.3333333333333333), (1, 0.16666666666666666)]
292
+
293
+ self.assertEqual(gains, expected_gains)
294
+
295
+
296
+ class TestScoreNcg(unittest.TestCase):
297
+ def test_score_ncg(self):
298
+ # Test score_ncg function
299
+ model_relevance = [0.8, 0.7, 0.6]
300
+ gt_relevance = [1.0, 0.9, 0.8]
301
+ ncg_score = score_ncg(model_relevance, gt_relevance)
302
+ expected_ncg = 0.778 # Expected NCG for the relevance lists above
303
+
304
+ self.assertAlmostEqual(ncg_score, expected_ncg, places=3)
305
+
306
+
307
+ class TestComputeNcg(unittest.TestCase):
308
+ def test_compute_ncg(self):
309
+ # Test compute_ncg function
310
+ pred_gains = [(0, 0.8), (2, 0.7), (1, 0.6)]
311
+ gt_gains = [(0, 1.0), (1, 0.9), (2, 0.8)]
312
+ k = 3
313
+ ncg_score = compute_ncg(pred_gains, gt_gains, k)
314
+ expected_ncg = 1.0 # TODO: Confirm this with Dr. Santu
315
+
316
+ self.assertAlmostEqual(ncg_score, expected_ncg, places=6)
317
+
318
+
319
+ class TestValidateInputFormat(unittest.TestCase):
320
+ def test_validate_input_format(self):
321
+ # Test _validate_input_format function
322
+ tokenize_sentences = True
323
+ predictions = ["Prediction 1", "Prediction 2"]
324
+ references = ["Reference 1", "Reference 2"]
325
+ documents = ["Document 1", "Document 2"]
326
+
327
+ # No exception should be raised for valid input
328
+ try:
329
+ _validate_input_format(tokenize_sentences, predictions, references, documents)
330
+ except ValueError as e:
331
+ self.fail(f"_validate_input_format raised ValueError unexpectedly: {str(e)}")
332
+
333
+ # Test invalid input format
334
+ predictions_invalid = [["Sentence 1 in prediction 1.", "Sentence 2 in prediction 1."],
335
+ ["Sentence 1 in prediction 2.", "Sentence 2 in prediction 2."]]
336
+ references_invalid = [["Sentences in reference 1."], ["Sentences in reference 2."]]
337
+ documents_invalid = [["Sentence 1 in document 1.", "Sentence 2 in document 1."],
338
+ ["Sentence 1 in document 2.", "Sentence 2 in document 2."]]
339
+
340
+ with self.assertRaises(ValueError):
341
+ _validate_input_format(tokenize_sentences, predictions_invalid, references, documents)
342
+
343
+ with self.assertRaises(ValueError):
344
+ _validate_input_format(tokenize_sentences, predictions, references_invalid, documents)
345
+
346
+ with self.assertRaises(ValueError):
347
+ _validate_input_format(tokenize_sentences, predictions, references, documents_invalid)
348
+
349
+
350
+ class TestSemnCG(unittest.TestCase):
351
+ def setUp(self):
352
+ self.model_name = "stsb-distilbert-base"
353
+ self.metric = SemnCG(self.model_name)
354
+
355
+ def _basic_assertion(self, result, debug: bool = False):
356
+ self.assertIsInstance(result, tuple)
357
+ self.assertEqual(len(result), 2)
358
+ self.assertIsInstance(result[0], float)
359
+ self.assertTrue(0.0 <= result[0] <= 1.0)
360
+ self.assertIsInstance(result[1], list)
361
+ if debug:
362
+ for ranked_gain in result[1]:
363
+ self.assertTrue(isinstance(ranked_gain, RankedGains))
364
+ self.assertTrue(0.0 <= ranked_gain.ncg <= 1.0)
365
+ else:
366
+ for gain in result[1]:
367
+ self.assertTrue(isinstance(gain, float))
368
+ self.assertTrue(0.0 <= gain <= 1.0)
369
+
370
+ def test_compute_basic(self):
371
+ predictions = ["The cat sat on the mat.", "The quick brown fox jumps over the lazy dog."]
372
+ references = ["A cat was sitting on a mat.", "A quick brown fox jumped over a lazy dog."]
373
+ documents = ["There was a cat on a mat.", "The quick brown fox jumped over the lazy dog."]
374
+
375
+ result = self.metric.compute(predictions=predictions, references=references, documents=documents)
376
+ self._basic_assertion(result)
377
+
378
+ def test_compute_with_tokenization(self):
379
+ predictions = [["The cat sat on the mat."], ["The quick brown fox jumps over the lazy dog."]]
380
+ references = [["A cat was sitting on a mat."], ["A quick brown fox jumped over a lazy dog."]]
381
+ documents = [["There was a cat on a mat."], ["The quick brown fox jumped over the lazy dog."]]
382
+
383
+ result = self.metric.compute(
384
+ predictions=predictions, references=references, documents=documents, tokenize_sentences=False
385
+ )
386
+ self._basic_assertion(result)
387
+
388
+ def test_compute_with_pre_compute_embeddings(self):
389
+ predictions = ["The cat sat on the mat.", "The quick brown fox jumps over the lazy dog."]
390
+ references = ["A cat was sitting on a mat.", "A quick brown fox jumped over a lazy dog."]
391
+ documents = ["There was a cat on a mat.", "The quick brown fox jumped over the lazy dog."]
392
+
393
+ result = self.metric.compute(
394
+ predictions=predictions, references=references, documents=documents, pre_compute_embeddings=True
395
+ )
396
+ self._basic_assertion(result)
397
+
398
+ def test_compute_with_debug(self):
399
+ predictions = ["The cat sat on the mat.", "The quick brown fox jumps over the lazy dog."]
400
+ references = ["A cat was sitting on a mat.", "A quick brown fox jumped over a lazy dog."]
401
+ documents = ["There was a cat on a mat.", "The quick brown fox jumped over the lazy dog."]
402
+
403
+ result = self.metric.compute(
404
+ predictions=predictions, references=references, documents=documents, debug=True
405
+ )
406
+ self._basic_assertion(result, debug=True)
407
+
408
+ def test_compute_invalid_input_format(self):
409
+ predictions = "The cat sat on the mat."
410
+ references = ["A cat was sitting on a mat."]
411
+ documents = ["There was a cat on a mat."]
412
+
413
+ with self.assertRaises(ValueError):
414
+ self.metric.compute(predictions=predictions, references=references, documents=documents)
415
+
416
+
417
+ if __name__ == '__main__':
418
+ unittest.main(verbosity=2)
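For orientation, the snippet below mirrors the call pattern exercised by `TestSemnCG` above. It is a minimal sketch rather than part of the commit: it assumes the `SemnCG` class defined in `semncg.py` is importable from the package root and that the `stsb-distilbert-base` model used in the tests can be downloaded.

```python
# Minimal sketch of the API exercised by TestSemnCG (import path assumed).
from semncg import SemnCG

metric = SemnCG("stsb-distilbert-base")

# compute() returns a (mean score, per-example scores) tuple, as the tests assert.
mean_score, per_example = metric.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on a mat."],
    documents=["There was a cat on a mat."],
)
print(mean_score)   # float in [0.0, 1.0]
print(per_example)  # list of per-example NCG values
```

As `test_compute_with_debug` shows, passing `debug=True` makes the second element a list of `RankedGains` objects instead of plain floats.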
type_aliases.py ADDED
@@ -0,0 +1,11 @@
1
+
2
+ from typing import List, Union, Tuple
3
+
4
+ from numpy.typing import NDArray
5
+
6
+
7
+ NumSentencesType = Union[List[int], List[List[int]]]
8
+ EmbeddingSlicesType = Union[List[NDArray], List[List[NDArray]]]
9
+ DEVICE_TYPE = Union[bool, str, int, List[Union[str, int]]]
10
+ ENCODER_DEVICE_TYPE = Union[str, int, List[Union[str, int]]]
11
+ DOCUMENT_TYPE = Union[List[str], List[List[str]]]
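The aliases above are plain type annotations; the hypothetical function below (not part of the repository) only illustrates how they might appear in a signature.

```python
# Hypothetical illustration -- describe_inputs does not exist in this repo.
from type_aliases import DEVICE_TYPE, DOCUMENT_TYPE

def describe_inputs(documents: DOCUMENT_TYPE, device: DEVICE_TYPE) -> str:
    """Accepts whole documents or pre-split sentences plus any supported device spec."""
    return f"{len(documents)} documents on device(s): {device}"
```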
utils.py ADDED
@@ -0,0 +1,280 @@
1
+
2
+ import string
3
+ from typing import List, Tuple, Union
4
+
5
+ import nltk
6
+ import numpy as np
7
+ from numpy.typing import NDArray
8
+ import torch
9
+
10
+ from .type_aliases import DEVICE_TYPE, ENCODER_DEVICE_TYPE, NumSentencesType, EmbeddingSlicesType
11
+
12
+
13
+ def get_gpu(gpu: DEVICE_TYPE) -> ENCODER_DEVICE_TYPE:
14
+ """
15
+ Determine the correct GPU device based on the provided input. In the following, output 0 means CUDA device 0.
16
+
17
+ Args:
18
+ gpu (Union[bool, str, int, List[Union[str, int]]]): Input specifying the GPU device(s):
19
+ - bool: If True, returns 0 if CUDA is available, otherwise returns "cpu".
20
+ - str: Can be "cpu", "gpu", or "cuda" (case-insensitive). Returns 0 if CUDA is available
21
+ and the input is not "cpu", otherwise returns "cpu".
22
+ - int: Should be a valid GPU index. Returns the index if CUDA is available and valid,
23
+ otherwise returns "cpu".
24
+ - List[Union[str, int]]: List containing combinations of the str/int. Processes each
25
+ element and returns a list of corresponding results.
26
+
27
+ Returns:
28
+ Union[str, int, List[Union[str, int]]]: Depending on the input type:
29
+ - str: Returns "cpu" if no GPU is available or the input is "cpu".
30
+ - int: Returns the GPU index if valid and CUDA is available.
31
+ - List[Union[str, int]]: Returns a list of strings and/or integers based on the input list.
32
+
33
+ Raises:
34
+ ValueError: If the input gpu type is not recognized or invalid.
35
+ ValueError: If a string input is not one of ["cpu", "gpu", "cuda"].
36
+ ValueError: If an integer input is outside the valid range of GPU indices.
37
+
38
+ Notes:
39
+ - This function checks CUDA availability using torch.cuda.is_available() and counts
40
+ available GPUs using torch.cuda.device_count().
41
+ - Case insensitivity is maintained for string inputs ("cpu", "gpu", "cuda").
42
+ - The function ensures robust error handling for invalid input types or out-of-range indices.
43
+ """
44
+
45
+ # Ensure gpu index is within the range of total available gpus
46
+ gpu_available = torch.cuda.is_available()
47
+ gpu_count = torch.cuda.device_count()
48
+ correct_strs = ["cpu", "gpu", "cuda"]
49
+
50
+ def _get_single_device(gpu_item):
51
+ if isinstance(gpu_item, bool):
52
+ return 0 if gpu_item and gpu_available else "cpu"
53
+ elif isinstance(gpu_item, str):
54
+ if gpu_item.lower() not in correct_strs:
55
+ raise ValueError(f"Wrong gpu type: {gpu_item}. Should be one of {correct_strs}")
56
+ return 0 if (gpu_item.lower() != "cpu") and gpu_available else "cpu"
57
+ elif isinstance(gpu_item, int):
58
+ if gpu_item >= gpu_count:
59
+ raise ValueError(
60
+ f"There are {gpu_count} GPUs available. Provide a valid GPU index. You provided: {gpu_item}"
61
+ )
62
+ return gpu_item if gpu_available else "cpu"
63
+ else:
64
+ raise ValueError(f"Invalid gpu type: {type(gpu_item)}. Must be bool, str, or int.")
65
+
66
+ if isinstance(gpu, list):
67
+ seen_indices = set()
68
+ result = []
69
+ for item in gpu:
70
+ device = _get_single_device(item)
71
+ if isinstance(device, int):
72
+ if device not in seen_indices:
73
+ seen_indices.add(device)
74
+ result.append(device)
75
+ else:
76
+ result.append(device)
77
+ return result[0] if len(result) == 1 else result
78
+ else:
79
+ return _get_single_device(gpu)
80
+
81
+
82
+ def prep_sentences(sentences: List[str]) -> List[str]:
83
+ """
84
+ Processes a list of sentences by stripping leading and trailing whitespace and
85
+ filtering out sentences that are empty or contain only punctuation.
86
+
87
+ Args:
88
+ sentences (List[str]): A list of sentences to be processed.
89
+
90
+ Returns:
91
+ List[str]: A list of cleaned sentences
92
+
93
+ Raises:
94
+ ValueError: If the resulting list of sentences is empty.
95
+
96
+ Example:
97
+ >>> prep_sentences(["Hello, world!", " This is a test. ", "!!!"])
98
+ ['Hello, world!', 'This is a test.']
99
+
100
+ >>> prep_sentences(["!!!", "..."])
101
+ ValueError: Document can't be empty.
102
+ """
103
+ out = []
104
+ for sent in sentences:
105
+ sent = sent.strip()
106
+ sent_wo_punctuation = (
107
+ sent.translate(str.maketrans("", "", string.punctuation))
108
+ ).strip()
109
+ if sent_wo_punctuation:
110
+ out.append(sent)
111
+
112
+ if len(out) == 0:
113
+ raise ValueError("Document can't be empty.")
114
+ return out
115
+
116
+
117
+ def tokenize_and_prep_document(document: Union[str, List[str]], tokenize: bool) -> List[str]:
118
+ """
119
+ Tokenizes and prepares a document by either tokenizing it into sentences and processing each sentence,
120
+ or directly processing each element if `tokenize` is False.
121
+
122
+ Args:
123
+ document (Union[str, List[str]]): The document to be processed. It can be a single string (entire document) or a
124
+ list of strings (list of sentences).
125
+ tokenize (bool): If True, tokenizes `document` into sentences using NLTK's sentence tokenizer before processing.
126
+ If False, processes each element of `document` directly as sentences.
127
+
128
+ Returns:
129
+ List[str]: A list of cleaned sentences.
130
+
131
+ Raises:
132
+ ValueError: If the resulting list of sentences is empty after processing.
133
+
134
+ Example:
135
+ >>> tokenize_and_prep_document("Hello, world! This is a test.", True)
136
+ ['Hello, world!', 'This is a test.']
137
+
138
+ >>> tokenize_and_prep_document(["Hello, world!", "This is a test."], False)
139
+ ['Hello, world!', 'This is a test.']
140
+
141
+ >>> tokenize_and_prep_document("!!! ...", True)
142
+ ValueError: Document can't be empty.
143
+
144
+ Note: Only the following two cases are possible.
145
+ tokenize=True -> document: str
146
+ tokenize=False -> document: List[str].
147
+ """
148
+ if tokenize:
149
+ return prep_sentences(nltk.tokenize.sent_tokenize(document))
150
+ return prep_sentences(document)
151
+
152
+
153
+ def flatten_list(nested_list: list) -> list:
154
+ """
155
+ Recursively flattens a nested list of any depth.
156
+
157
+ Parameters:
158
+ nested_list (list): The nested list to flatten.
159
+
160
+ Returns:
161
+ list: A flat list containing all the elements of the nested list.
162
+ """
163
+ flat_list = []
164
+ for item in nested_list:
165
+ if isinstance(item, list):
166
+ flat_list.extend(flatten_list(item))
167
+ else:
168
+ flat_list.append(item)
169
+ return flat_list
170
+
171
+
172
+ def is_nested_list_of_type(lst_obj, element_type, depth: int) -> bool:
173
+ """
174
+ Check if the given object is a nested list of a specific type up to a specified depth.
175
+
176
+ Args:
177
+ - lst_obj: The object to check, expected to be a list or a single element.
178
+ - element_type: The type that each element in the nested list should match.
179
+ - depth (int): The depth of nesting to check. Must be non-negative.
180
+
181
+ Returns:
182
+ - bool: True if lst_obj is a nested list of the specified type up to the given depth, False otherwise.
183
+
184
+ Raises:
185
+ - ValueError: If depth is negative.
186
+
187
+ Example:
188
+ ```python
189
+ # Test cases
190
+ is_nested_list_of_type("test", str, 0) # Returns True
191
+ is_nested_list_of_type([1, 2, 3], str, 0) # Returns False
192
+ is_nested_list_of_type(["apple", "banana"], str, 1) # Returns True
193
+ is_nested_list_of_type([[1, 2], [3, 4]], int, 2) # Returns True
194
+ is_nested_list_of_type([[1, 2], ["a", "b"]], int, 2) # Returns False
195
+ is_nested_list_of_type([[[1], [2]], [[3], [4]]], int, 3) # Returns True
196
+ ```
197
+
198
+ Explanation:
199
+ - The function checks if `lst_obj` is a nested list of elements of type `element_type` up to `depth` levels deep.
200
+ - If `depth` is 0, it checks if `lst_obj` itself is of type `element_type`.
201
+ - If `depth` is greater than 0, it recursively checks each level of nesting to ensure all elements match `element_type`.
202
+ - Raises a `ValueError` if `depth` is negative, as depth must be a non-negative integer.
203
+ """
204
+ if depth == 0:
205
+ return isinstance(lst_obj, element_type)
206
+ elif depth > 0:
207
+ return isinstance(lst_obj, list) and all(is_nested_list_of_type(item, element_type, depth - 1) for item in lst_obj)
208
+ else:
209
+ raise ValueError("Depth can't be negative")
210
+
211
+
212
+ def slice_embeddings(embeddings: NDArray, num_sentences: NumSentencesType) -> EmbeddingSlicesType:
213
+ """
214
+ Slice embeddings into segments based on the provided number of sentences per segment.
215
+
216
+ Args:
217
+ - embeddings (np.ndarray): The array of embeddings to be sliced.
218
+ - num_sentences (Union[List[int], List[List[int]]]):
219
+ - If a list of integers: Specifies the number of embeddings to take in each slice.
220
+ - If a list of lists of integers: Specifies multiple nested levels of slicing.
221
+
222
+ Returns:
223
+ - List[np.ndarray]: A list of numpy arrays where each array represents a slice of embeddings.
224
+
225
+ Raises:
226
+ - TypeError: If `num_sentences` is not of type List[int] or List[List[int]].
227
+
228
+ Example Usage:
229
+
230
+ ```python
231
+ embeddings = np.random.rand(10, 5)
232
+ num_sentences = [3, 2, 5]
233
+ result = slice_embeddings(embeddings, num_sentences)
234
+ # `result` will be a list of numpy arrays:
235
+ # [embeddings[:3], embeddings[3:5], embeddings[5:]]
236
+
237
+ num_sentences_nested = [[2, 1], [3, 4]]
238
+ result_nested = slice_embeddings(embeddings, num_sentences_nested)
239
+ # `result_nested` will be a nested list of numpy arrays:
240
+ # [[embeddings[:2], embeddings[2:3]], [embeddings[3:6], embeddings[6:]]]
241
+
242
+ slice_embeddings(embeddings, "invalid") # Raises a TypeError
243
+ ```
244
+ """
245
+
246
+ def _slice_embeddings(s_idx: int, n_sentences: List[int]):
247
+ """
248
+ Helper function to slice embeddings starting from index `s_idx`.
249
+
250
+ Args:
251
+ - s_idx (int): Starting index for slicing.
252
+ - n_sentences (List[int]): List specifying number of sentences in each slice.
253
+
254
+ Returns:
255
+ - Tuple[List[np.ndarray], int]: A tuple containing a list of sliced embeddings and the next starting index.
256
+ """
257
+ _result = []
258
+ for count in n_sentences:
259
+ _result.append(embeddings[s_idx:s_idx + count])
260
+ s_idx += count
261
+ return _result, s_idx
262
+
263
+ if isinstance(num_sentences, list) and all(isinstance(item, int) for item in num_sentences):
264
+ result, _ = _slice_embeddings(0, num_sentences)
265
+ return result
266
+ elif isinstance(num_sentences, list) and all(
267
+ isinstance(sublist, list) and all(
268
+ isinstance(item, int) for item in sublist
269
+ )
270
+ for sublist in num_sentences
271
+ ):
272
+ nested_result = []
273
+ start_idx = 0
274
+ for nested_num_sentences in num_sentences:
275
+ embedding_slice, start_idx = _slice_embeddings(start_idx, nested_num_sentences)
276
+ nested_result.append(embedding_slice)
277
+
278
+ return nested_result
279
+ else:
280
+ raise TypeError(f"Incorrect Type for {num_sentences=}")
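Taken together, these helpers cover the steps around encoding: resolve a device, clean and sentence-split each document, then slice the flat embedding matrix back into per-document pieces. The sketch below is illustrative only; it stands in a random matrix for real encoder output, assumes the helpers are importable as shown, and requires NLTK's punkt tokenizer data.

```python
import numpy as np

from utils import get_gpu, slice_embeddings, tokenize_and_prep_document  # import path assumed

device = get_gpu("cpu")  # -> "cpu"; get_gpu(True) returns 0 when CUDA is available

docs = ["Hello, world! This is a test.", "Another short document."]
sentences = [tokenize_and_prep_document(d, tokenize=True) for d in docs]  # needs nltk punkt data
num_sentences = [len(s) for s in sentences]  # [2, 1]

# Stand-in for encoder output: one row per sentence, embedding dimension 8.
embeddings = np.random.rand(sum(num_sentences), 8)
per_doc = slice_embeddings(embeddings, num_sentences)
print([e.shape for e in per_doc])  # [(2, 8), (1, 8)]
```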