tybrs committed
Commit 2f9157a
1 Parent(s): 2d2a68f

Update Space (evaluate main: e5933120)

Files changed (3)
  1. README.md +59 -1
  2. adversarial_glue.py +202 -0
  3. requirements.txt +1 -0
README.md CHANGED
@@ -8,4 +8,62 @@ pinned: false
  license: apache-2.0
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Adversarial GLUE Evaluation Suite
+
+ ## Description
+
+ This evaluation suite compares GLUE results with Adversarial GLUE (AdvGLUE), a multi-task benchmark that evaluates the robustness of modern large-scale language models against various types of adversarial attacks.
+
+ ## How to use
+
+ This suite requires installation of the following fork of `evaluate`: [IntelAI/evaluate](https://github.com/IntelAI/evaluate/tree/develop).
+
+ After installation, there are two steps: (1) loading the Adversarial GLUE suite; and (2) calculating the metric.
+
+ 1. **Loading the relevant GLUE metric**: this suite loads evaluation subtasks for the following tasks on both the AdvGLUE and GLUE datasets: `sst2`, `mnli`, `qnli`, `rte`, and `qqp`.
+
+ More information about the different subsets of the GLUE dataset can be found on the [GLUE dataset page](https://huggingface.co/datasets/glue).
+
+ 2. **Calculating the metric**: the metric takes one input, the name of the model or pipeline to evaluate.
+
+ ```python
+ from evaluate import EvaluationSuite
+
+ suite = EvaluationSuite.load('intel/adversarial_glue')
+ mc_results = suite.run("gpt2")
+ ```
+
+ ## Output results
+
+ The output depends on the GLUE subset chosen: each subtask returns a dictionary containing one or several of the following metrics:
+
+ `accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).
+
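As an aside (not part of the committed README): assuming `suite.run` returns one result dictionary per subtask, carrying the `task_name` and `accuracy` keys that the suite's `process_results` method reads, the scores from the usage example above could be inspected with a sketch like this.

```python
# Sketch only: print the accuracy reported for each GLUE/AdvGLUE subtask.
# Assumes `mc_results` is the list returned by suite.run("gpt2") above and
# that each entry carries "task_name" and "accuracy" keys.
for result in mc_results:
    name = result["task_name"].split("/")[-1]  # e.g. "sst2" or "adv_sst2"
    print(f"{name}: accuracy = {result['accuracy']:.3f}")
```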
+ ### Values from popular papers
+
+ The [original GLUE paper](https://huggingface.co/datasets/glue) reported average scores ranging from 58% to 64%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).
+
+ For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/glue).
+
+ ## Examples
+
+ For a full example, see [HF Evaluate Adversarial Attacks.ipynb](https://github.com/IntelAI/evaluate/blob/develop/notebooks/HF%20Evaluate%20Adversarial%20Attacks.ipynb).
+
+ ## Limitations and bias
+ This metric works only with datasets that have the same format as the [GLUE dataset](https://huggingface.co/datasets/glue).
+
+ While the GLUE dataset is meant to represent "General Language Understanding", the tasks represented in it are not necessarily representative of language understanding, and should not be interpreted as such.
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{wang2021adversarial,
+   title={Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models},
+   author={Wang, Boxin and Xu, Chejian and Wang, Shuohang and Gan, Zhe and Cheng, Yu and Gao, Jianfeng and Awadallah, Ahmed Hassan and Li, Bo},
+   booktitle={Advances in Neural Information Processing Systems},
+   year={2021}
+ }
+ ```
+
adversarial_glue.py ADDED
@@ -0,0 +1,202 @@
+ from evaluate.evaluation_suite import SubTask
+ from evaluate.visualization import radar_plot
+
+ from intel_evaluate_extension.evaluation_suite.model_card_suite import ModelCardSuiteResults
+
+ _HEADER = "GLUE/AdvGlue Evaluation Results"
+
+ _DESCRIPTION = """
+ The suite compares the GLUE results with Adversarial GLUE (AdvGLUE), a
+ multi-task benchmark that tests the vulnerability of modern large-scale
+ language models against various adversarial attacks."""
+
+
+ class Suite(ModelCardSuiteResults):
+     def __init__(self, name):
+         super().__init__(name)
+         self.result_keys = ["accuracy", "f1"]
+         self.preprocessor = lambda x: {"text": x["text"].lower()}
+         self.suite = [
+             SubTask(
+                 task_type="text-classification",
+                 data="glue",
+                 subset="sst2",
+                 split="validation[:5]",
+                 args_for_task={
+                     "metric": "glue",
+                     "input_column": "sentence",
+                     "label_column": "label",
+                     "config_name": "sst2",
+                     "label_mapping": {
+                         "LABEL_0": 0.0,
+                         "LABEL_1": 1.0
+                     }
+                 }
+             ),
+             SubTask(
+                 task_type="text-classification",
+                 data="adv_glue",
+                 subset="adv_sst2",
+                 split="validation[:5]",
+                 args_for_task={
+                     "metric": "glue",
+                     "input_column": "sentence",
+                     "label_column": "label",
+                     "config_name": "sst2",
+                     "label_mapping": {
+                         "LABEL_0": 0.0,
+                         "LABEL_1": 1.0
+                     }
+                 }
+             ),
+             SubTask(
+                 task_type="text-classification",
+                 data="glue",
+                 subset="qqp",
+                 split="validation[:5]",
+                 args_for_task={
+                     "metric": "glue",
+                     "input_column": "question1",
+                     "second_input_column": "question2",
+                     "label_column": "label",
+                     "config_name": "qqp",
+                     "label_mapping": {
+                         "LABEL_0": 0,
+                         "LABEL_1": 1
+                     }
+                 }
+             ),
+             SubTask(
+                 task_type="text-classification",
+                 data="adv_glue",
+                 subset="adv_qqp",
+                 split="validation[:5]",
+                 args_for_task={
+                     "metric": "glue",
+                     "input_column": "question1",
+                     "second_input_column": "question2",
+                     "label_column": "label",
+                     "config_name": "qqp",
+                     "label_mapping": {
+                         "LABEL_0": 0,
+                         "LABEL_1": 1
+                     }
+                 }
+             ),
+             SubTask(
+                 task_type="text-classification",
+                 data="glue",
+                 subset="qnli",
+                 split="validation[:5]",
+                 args_for_task={
+                     "metric": "glue",
+                     "input_column": "question",
+                     "second_input_column": "sentence",
+                     "label_column": "label",
+                     "config_name": "qnli",
+                     "label_mapping": {
+                         "LABEL_0": 0,
+                         "LABEL_1": 1
+                     }
+                 }
+             ),
+             SubTask(
+                 task_type="text-classification",
+                 data="adv_glue",
+                 subset="adv_qnli",
+                 split="validation[:5]",
+                 args_for_task={
+                     "metric": "glue",
+                     "input_column": "question",
+                     "second_input_column": "sentence",
+                     "label_column": "label",
+                     "config_name": "qnli",
+                     "label_mapping": {
+                         "LABEL_0": 0,
+                         "LABEL_1": 1
+                     }
+                 }
+             ),
+             SubTask(
+                 task_type="text-classification",
+                 data="glue",
+                 subset="rte",
+                 split="validation[:5]",
+                 args_for_task={
+                     "metric": "glue",
+                     "input_column": "sentence1",
+                     "second_input_column": "sentence2",
+                     "label_column": "label",
+                     "config_name": "rte",
+                     "label_mapping": {
+                         "LABEL_0": 0,
+                         "LABEL_1": 1
+                     }
+                 }
+             ),
+             SubTask(
+                 task_type="text-classification",
+                 data="adv_glue",
+                 subset="adv_rte",
+                 split="validation[:5]",
+                 args_for_task={
+                     "metric": "glue",
+                     "input_column": "sentence1",
+                     "second_input_column": "sentence2",
+                     "label_column": "label",
+                     "config_name": "rte",
+                     "label_mapping": {
+                         "LABEL_0": 0,
+                         "LABEL_1": 1
+                     }
+                 }
+             ),
+             SubTask(
+                 task_type="text-classification",
+                 data="glue",
+                 subset="mnli",
+                 split="validation_mismatched[:5]",
+                 args_for_task={
+                     "metric": "glue",
+                     "input_column": "premise",
+                     "second_input_column": "hypothesis",
+                     "config_name": "mnli",
+                     "label_mapping": {
+                         "LABEL_0": 0,
+                         "LABEL_1": 1,
+                         "LABEL_2": 2
+                     }
+                 }
+             ),
+             SubTask(
+                 task_type="text-classification",
+                 data="adv_glue",
+                 subset="adv_mnli",
+                 split="validation[:5]",
+                 args_for_task={
+                     "metric": "glue",
+                     "input_column": "premise",
+                     "second_input_column": "hypothesis",
+                     "config_name": "mnli",
+                     "label_mapping": {
+                         "LABEL_0": 0,
+                         "LABEL_1": 1,
+                         "LABEL_2": 2
+                     }
+                 }
+             ),
+         ]
+
+     def process_results(self, results):
+         radar_data = [
+             {"accuracy " + result["task_name"].split("/")[-1]:
+                 result["accuracy"] for result in results[::2]},
+             {"accuracy " + result["task_name"].replace("adv_", "").split("/")[-1]:
+                 result["accuracy"] for result in results[1::2]}]
+         return radar_data
+
+     def plot_results(self, results, model_or_pipeline):
+         radar_data = self.process_results(results)
+         graphic = radar_plot(radar_data, ['GLUE ' + model_or_pipeline, 'AdvGLUE ' + model_or_pipeline])
+         return graphic
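A note on the pairing logic above: `process_results` assumes the GLUE and AdvGLUE results alternate in the order the subtasks are declared, pairing `results[::2]` with `results[1::2]`. Under that same assumption (an assumption about the fork's return format, not something stated in this commit), a per-task robustness gap could be derived with a sketch like the following; the `robustness_gaps` helper is illustrative and not part of the committed file.

```python
# Illustrative sketch, not part of this commit: compute how much accuracy
# drops from each GLUE subtask to its AdvGLUE counterpart, assuming the
# results list alternates GLUE/AdvGLUE entries and each entry carries
# "task_name" and "accuracy" keys, as process_results above expects.
def robustness_gaps(results):
    gaps = {}
    for glue_res, adv_res in zip(results[::2], results[1::2]):
        task = adv_res["task_name"].replace("adv_", "").split("/")[-1]
        gaps[task] = glue_res["accuracy"] - adv_res["accuracy"]
    return gaps
```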
requirements.txt ADDED
@@ -0,0 +1 @@
+ git+https://github.com/IntelAI/evaluate@develop