## Evaluation

### Gold-standard benchmarks

Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from [SpanishBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2157), [CatalanBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2154), [BasqueBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2153) and [GalicianBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2155). These benchmarks include both new and existing tasks and datasets. Given that this is an instruction-tuned model, we use the LM Evaluation Harness's native `chat-template` feature in the setup. In the tables below, we include results on a selection of evaluation datasets that represent the model's performance across a variety of tasks within these benchmarks.
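
As a concrete illustration, a 0-shot run of a few of the Spanish tasks through the Harness's Python API might look like the sketch below. This is not the exact setup used to produce the reported results: the model identifier is a placeholder, and the sketch assumes a recent Harness version that exposes the `apply_chat_template` option and includes the benchmark tasks linked above.

```python
# Minimal sketch (not the exact setup used): evaluate a few Spanish tasks
# with the model's chat template applied, in a 0-shot setting.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<model-id>,dtype=bfloat16",  # placeholder model identifier
    tasks=["xstorycloze_es", "paws_es", "xnli_es"],      # task names from the benchmarks above
    num_fewshot=0,
    apply_chat_template=True,  # assumes a Harness version with chat-template support
)
print(results["results"])
```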

We only use tasks that are either human-generated, human-translated, or built with a strong human-in-the-loop (i.e., machine translation followed by professional revision, or machine generation followed by human revision and annotation). This is why the number of tasks reported varies across languages. As more tasks that fulfill these requirements are published, we will update the results presented here. We also intend to expand the evaluation to other languages, provided the datasets meet our quality standards.

During the implementation of the evaluation, we observed a series of issues worth considering when replicating and interpreting the results presented here. These include performance variations of ≈1.5% on some tasks depending on the version of the `transformers` library used and on whether tensor parallelism is used when loading the model. When implementing existing tasks, we carry out a comprehensive quality review of the dataset, the Harness task itself, and the kind of input models see during evaluation. Our implementation (see links above) addresses several existing problems, such as errors in datasets and prompts and a lack of pre-processing. Consequently, results will differ when using other Harness implementations and may vary slightly depending on the replication setup.

It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the model's capabilities and potential. We thus advise caution when reading and interpreting the results.

A full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details of the problems encountered during task implementation and how they were addressed will soon be available in the technical report.

All results reported below are in a 0-shot setting.

#### Spanish

<table>
  <thead>
    <tr><th>Category</th><th>Task</th><th>Metric</th><th>Result</th></tr>
  </thead>
  <tbody>
    <tr><td>Commonsense Reasoning</td><td>xstorycloze_es</td><td>acc</td><td>00</td></tr>
    <tr><td>Math</td><td>mgsm_direct_es</td><td>em</td><td>00</td></tr>
    <tr><td rowspan="2">NLI</td><td>wnli_es</td><td>acc</td><td>45.07</td></tr>
    <tr><td>xnli_es</td><td>acc</td><td>51.49</td></tr>
    <tr><td>Paraphrasing</td><td>paws_es</td><td>acc</td><td>59.4</td></tr>
    <tr><td>QA</td><td>xquad_es</td><td>acc</td><td>43.82</td></tr>
    <tr><td>Translation</td><td>flores_es</td><td>bleu</td><td>22.98</td></tr>
  </tbody>
</table>

#### Catalan

<table>
  <thead>
    <tr><th>Category</th><th>Task</th><th>Metric</th><th>Result</th></tr>
  </thead>
  <tbody>
    <tr><td rowspan="2">Commonsense Reasoning</td><td>copa_ca</td><td>acc</td><td>81.2</td></tr>
    <tr><td>xstorycloze_ca</td><td>acc</td><td>70.68</td></tr>
    <tr><td>Math</td><td>mgsm_direct_ca</td><td>em</td><td>0</td></tr>
    <tr><td rowspan="2">NLI</td><td>wnli_ca</td><td>acc</td><td>50.7</td></tr>
    <tr><td>xnli_ca</td><td>acc</td><td>55.14</td></tr>
    <tr><td rowspan="2">Paraphrasing</td><td>parafraseja</td><td>acc</td><td>65.18</td></tr>
    <tr><td>paws_ca</td><td>acc</td><td>62.95</td></tr>
    <tr><td rowspan="5">QA</td><td>arc_ca_easy</td><td>acc</td><td>64.98</td></tr>
    <tr><td>arc_ca_challenge</td><td>acc</td><td>41.89</td></tr>
    <tr><td>openbookqa_ca</td><td>acc</td><td>35.2</td></tr>
    <tr><td>piqa_ca</td><td>acc</td><td>69.53</td></tr>
    <tr><td>siqa_ca</td><td>acc</td><td>48.62</td></tr>
    <tr><td>Translation</td><td>flores_ca</td><td>bleu</td><td>28.65</td></tr>
  </tbody>
</table>

#### Basque

<table>
  <thead>
    <tr><th>Category</th><th>Task</th><th>Metric</th><th>Result</th></tr>
  </thead>
  <tbody>
    <tr><td rowspan="2">Commonsense Reasoning</td><td>xcopa_eu</td><td>acc</td><td>61.6</td></tr>
    <tr><td>xstorycloze_eu</td><td>acc</td><td>61.15</td></tr>
    <tr><td>Math</td><td>mgsm_direct_eu</td><td>em</td><td>1</td></tr>
    <tr><td rowspan="2">NLI</td><td>wnli_eu</td><td>acc</td><td>45.07</td></tr>
    <tr><td>xnli_eu</td><td>acc</td><td>46.81</td></tr>
    <tr><td rowspan="3">QA</td><td>eus_exams</td><td>acc</td><td>39.09</td></tr>
    <tr><td>eus_proficiency</td><td>acc</td><td>36.93</td></tr>
    <tr><td>eus_trivia</td><td>acc</td><td>46.94</td></tr>
    <tr><td>Reading Comprehension</td><td>eus_reading</td><td>acc</td><td>00</td></tr>
    <tr><td>Translation</td><td>flores_eu</td><td>bleu</td><td>14.89</td></tr>
  </tbody>
</table>

#### Galician

<table>
  <thead>
    <tr><th>Category</th><th>Task</th><th>Metric</th><th>Result</th></tr>
  </thead>
  <tbody>
    <tr><td>Math</td><td>mgsm_direct_gl</td><td>em</td><td>0</td></tr>
    <tr><td rowspan="2">Paraphrasing</td><td>parafrases_gl</td><td>acc</td><td>55.44</td></tr>
    <tr><td>paws_gl</td><td>acc</td><td>56.55</td></tr>
    <tr><td>QA</td><td>openbookqa_gl</td><td>acc</td><td>38.4</td></tr>
    <tr><td>Translation</td><td>flores_gl</td><td>bleu</td><td>27.03</td></tr>
  </tbody>
</table>

### LLM-as-a-judge

We use [Prometheus-2 8x7B](https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0) as a judge to evaluate our model's responses. Tasks are created from existing multilingual evaluation datasets covering the same categories as those measured in our gold-standard benchmarks. We randomly select a subset of 250 instances per language from the `test` set of each source dataset. To evaluate the responses of our model, we use task-specific criteria developed in-house for the _LLM-judge_ to apply. Each criterion is scored either on a 5-point Likert scale or as a binary judgment, depending on the nature of the task and criterion.
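
Purely as an illustration of the subset selection (the dataset name and seed below are placeholders, not the exact ones used), the 250-instance sample per language could be drawn as follows:

```python
# Illustrative only: draw a fixed-size random sample from a source dataset's test split.
from datasets import load_dataset

source = load_dataset("juletxara/mgsm", "es", split="test")  # example source dataset
subset = source.shuffle(seed=42).select(range(250))          # 250 instances for this language
```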

In addition to these criteria, prompts for each task are created in several variants to measure the model's robustness. This is done by presenting the same source instance within three different prompts. We then calculate the variance of the scores assigned by the _LLM-judge_ to our model's responses to the three prompt styles and average it across all instances (see the sketch below). Prompts are human-translated into all measured languages. We do not provide the _LLM-judge_ with a reference answer.
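
The sketch below shows one way to read this aggregation; the data layout and the choice of population variance are assumptions for illustration, not an excerpt from the evaluation code.

```python
# Per-instance variance of the judge's scores across the three prompt variants,
# averaged over all instances of a task/criterion.
from statistics import mean, pvariance

# One row per source instance, one judge score per prompt variant.
scores = [
    [4, 4, 5],
    [3, 4, 3],
    [5, 5, 5],
]

average_score = mean(mean(row) for row in scores)    # overall score for the criterion
robustness = mean(pvariance(row) for row in scores)  # closer to 0 = more consistent responses
```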

The _judge_ prompt we use during evaluation is the same one used to fine-tune the Prometheus-2 family. The _judge_ prompt and the criteria used to present the task prompts and model responses to the _LLM-judge_ are kept in English for evaluation across all languages. The _judge_ prompt used is:

```python
"You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between {a} and {b}. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between {a} and {b})\"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{input}

###Response to evaluate:
{prediction}

###Score Rubrics:
{criteria}

###Feedback:"
```

As an example, prompts for the Math task in English are based on instances from [MGSM](https://huggingface.co/datasets/juletxara/mgsm), and each instance is presented within these prompts:

```python
"en": [
    ("I need help with this math problem: \"", "\" Give me the answer step by step and also the final result separately."),
    ("Can you please help me answer this? \"", "\" Explain the answer and give me the final result as well. Thanks."),
    ("Help me with this problem: \"", "\" I need the answer explained and the final result separately.")
]
```
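
Each pair above is a prefix/suffix that wraps a source problem. A minimal illustration (the variable names and the problem text are made up, not taken from MGSM):

```python
# Illustrative only: wrap a source problem with the first English prompt variant.
math_prompts = {
    "en": [
        ("I need help with this math problem: \"",
         "\" Give me the answer step by step and also the final result separately."),
    ],
}

problem = "A notebook costs 3 euros and a pen costs 2 euros. How much do 2 notebooks and 3 pens cost?"
prefix, suffix = math_prompts["en"][0]
task_prompt = f"{prefix}{problem}{suffix}"
```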

This task is then evaluated by the _LLM-judge_ using two criteria: reasoning capability (5-point Likert) and mathematical correctness (binary):

```python
reasoning_capability_criteria = {
    "reasoning_capability": """
[Does the model's answer demonstrate reasoning capability?]
Score 1: The answer demonstrates poor reasoning, with illogical arguments or conclusions that do not follow from the provided information.
Score 2: The answer shows weak reasoning, with some logical connections but also contains significant flaws or gaps in the argumentation.
Score 3: The answer demonstrates adequate reasoning, with generally logical arguments, but may have minor flaws or a lack of depth in the reasoning process.
Score 4: The answer shows strong reasoning, with well-structured arguments and conclusions that logically follow from the information provided.
Score 5: The answer demonstrates exceptional reasoning, with clear, coherent, and insightful arguments that are logically sound and well-supported by the information provided."""
}

mathematical_correctness_binary_criteria = {
    "mathematical_correctness_binary": """
[Is the model's answer mathematically correct?]
Score 0: The answer contains mathematical errors that render the solution incorrect or unreliable.
Score 1: The answer is mathematically correct, with accurate calculations and appropriate use of mathematical concepts."""
}
```
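
To tie the pieces together, the sketch below shows how a criterion could be slotted into the judge prompt shown earlier and how a score could be read back from the judge's output. The shortened template, the example strings, and the hard-coded judge output are illustrative assumptions, not an excerpt from the evaluation pipeline.

```python
# Illustrative sketch: fill the judge template with one criterion and parse the score.
import re

# Stand-in for the full judge prompt shown above; only the placeholders matter here.
JUDGE_TEMPLATE = (
    "###The instruction to evaluate:\n{input}\n\n"
    "###Response to evaluate:\n{prediction}\n\n"
    "###Score Rubrics:\n{criteria}\n\n###Feedback:"
)

criterion = reasoning_capability_criteria["reasoning_capability"]  # defined above (5-point Likert)

judge_prompt = JUDGE_TEMPLATE.format(
    input="I need help with this math problem: ...",   # the wrapped source instance
    prediction="Step 1: ... The final result is 12.",  # the evaluated model's answer
    criteria=criterion,
)
# For Likert criteria the {a}/{b} score bounds would be 1/5; for binary criteria, 0/1.

judge_output = "Feedback: ... [RESULT] 4"  # in practice, the response of Prometheus-2
match = re.search(r"\[RESULT\]\s*(\d+)", judge_output)
score = int(match.group(1)) if match else None
```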

#### Multilingual results

Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are given for each task, criterion, and language. Criteria with a `(B)` after their name are binary (i.e., scores range from 0 to 1, where 1 is best); the remaining criteria are measured on a 5-point Likert scale, where 5 is best. In each pair of numbers separated by `/`, the first number is the average score for that criterion (and language), and the second is the robustness score, where values closer to 0 mean that the model generates similar responses across the three prompt varieties of a single instance.

Further details on all tasks and criteria, a full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details of the problems encountered during task implementation and how they were addressed will soon be available in the technical report.

![LLM-as-a-judge results per task, criterion, and language](./images/results_eval_7b_judge.png)

## Ethical Considerations and Limitations