<h1 id="adversarial-glue-evaluation-suite">Adversarial GLUE Evaluation Suite</h1>
<h2 id="description">Description</h2>
<p>This evaluation suite compares GLUE results with Adversarial GLUE (AdvGLUE), a multi-task benchmark that evaluates the robustness of modern large-scale language models against various types of adversarial attacks.</p>
<h2 id="how-to-use">How to use</h2>
<p>This suite requires an installation of the following fork: <a href="https://github.com/IntelAI/evaluate/tree/develop">IntelAI/evaluate</a>.</p>
<p>After installation, there are two steps: (1) loading the Adversarial GLUE suite; and (2) calculating the metric.</p>
<ol type="1">
<li><strong>Loading the relevant GLUE metric</strong>: this suite loads evaluation subtasks for the following tasks on both the AdvGLUE and GLUE datasets: <code>sst2</code>, <code>mnli</code>, <code>qnli</code>, <code>rte</code>, and <code>qqp</code>.</li>
</ol>
<p>More information about the different subsets of the GLUE dataset can be found on the <a href="https://huggingface.co/datasets/glue">GLUE dataset page</a>.</p>
<ol start="2" type="1">
<li><strong>Calculating the metric</strong>: the metric takes one input, the name of the model or pipeline to evaluate.</li>
</ol>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> evaluate <span class="im">import</span> EvaluationSuite</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>suite <span class="op">=</span> EvaluationSuite.load(<span class="st">'intel/adversarial_glue'</span>)</span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>results <span class="op">=</span> suite.run(<span class="st">"gpt2"</span>)</span></code></pre></div>
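<p>A minimal sketch of post-processing the suite's output, assuming each entry is a dict carrying a task name and an accuracy score (the exact keys and task names here are hypothetical; the real return format depends on the fork):</p>

```python
# Hypothetical per-task results, as the suite might return them:
# one entry per subtask, for both the clean and the adversarial dataset.
results = [
    {"task_name": "glue/sst2", "accuracy": 0.91},
    {"task_name": "adv_glue/adv_sst2", "accuracy": 0.48},
]

def robustness_gap(results):
    """Accuracy drop between a clean task and its adversarial counterpart."""
    scores = {r["task_name"]: r["accuracy"] for r in results}
    return scores["glue/sst2"] - scores["adv_glue/adv_sst2"]

gap = robustness_gap(results)  # e.g. a drop of 0.43
```

<p>A large gap between the clean and adversarial scores is the signal the benchmark is designed to surface: the model's accuracy is not robust to adversarial perturbations.</p>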
<h2 id="output-results">Output results</h2>
<p>The output of the metric depends on the GLUE subset chosen, consisting of a dictionary that contains one or several of the following metrics:</p>
<p><code>accuracy</code>: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see <a href="https://huggingface.co/metrics/accuracy">accuracy</a> for more information).</p>
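<p>The accuracy computation described above can be sketched in plain Python (this stands in for the <code>accuracy</code> metric; the real implementation lives in the <code>evaluate</code> library):</p>

```python
def accuracy(predictions, references):
    """Proportion of predictions that match the reference labels (0 to 1)."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct -> 0.75
```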
<h3 id="values-from-popular-papers">Values from popular papers</h3>
<p>The <a href="https://huggingface.co/datasets/glue">original GLUE paper</a> reported average scores ranging from 58% to 64%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).</p>
<p>For more recent model performance, see the <a href="https://paperswithcode.com/dataset/glue">dataset leaderboard</a>.</p>
<h2 id="examples">Examples</h2>
<p>For a full example, see <a href="https://github.com/IntelAI/evaluate/blob/develop/notebooks/HF%20Evaluate%20Adversarial%20Attacks.ipynb">HF Evaluate Adversarial Attacks.ipynb</a>.</p>
<h2 id="limitations-and-bias">Limitations and bias</h2>
<p>This metric works only with datasets that have the same format as the <a href="https://huggingface.co/datasets/glue">GLUE dataset</a>.</p>
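<p>As an illustration of the expected format, here is a hypothetical pair of rows shaped like the GLUE <code>sst2</code> subset (a <code>sentence</code> text column and an integer <code>label</code> column; the helper below is ours, not part of the library):</p>

```python
# Example rows in the GLUE/sst2 format: a "sentence" text column
# and an integer "label" column (0 = negative, 1 = positive).
examples = [
    {"sentence": "a gripping, well-acted thriller.", "label": 1},
    {"sentence": "flat and lifeless from start to finish.", "label": 0},
]

def looks_like_sst2(row):
    """Check that a row carries the columns this metric expects."""
    return isinstance(row.get("sentence"), str) and row.get("label") in (0, 1)

ok = all(looks_like_sst2(r) for r in examples)  # True
```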
<p>While the GLUE dataset is meant to represent “General Language Understanding”, the tasks represented in it are not necessarily representative of language understanding, and should not be interpreted as such.</p>
<h2 id="citation">Citation</h2>
<div class="sourceCode" id="cb2"><pre class="sourceCode bibtex"><code class="sourceCode bibtex"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"> </span><span class="va">@inproceedings</span>{<span class="ot">wang2021adversarial</span>,</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a> <span class="dt">title</span>={Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models},</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> <span class="dt">author</span>={Wang, Boxin and Xu, Chejian and Wang, Shuohang and Gan, Zhe and Cheng, Yu and Gao, Jianfeng and Awadallah, Ahmed Hassan and Li, Bo},</span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a> <span class="dt">booktitle</span>={Advances in Neural Information Processing Systems},</span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a> <span class="dt">year</span>={2021}</span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>}</span></code></pre></div>