nicholasKluge/Aira-2-124M

@@ -115,10 +115,16 @@ The model will output something like:
 ## Evaluation
-| Model|Average|[ARC](https://arxiv.org/abs/1803.05457)|[HellaSwag](https://arxiv.org/abs/1905.07830)|[MMLU](https://arxiv.org/abs/2009.03300)|[TruthfulQA](https://arxiv.org/abs/2109.07958)|[ToxiGen](https://arxiv.org/abs/2203.09509)|
-|---|---|---|---|---|---|---|
-| [Aira-2-124M](https://huggingface.co/nicholasKluge/Aira-2-124M) |**34.15**|**24.57**|31.29|25.29|**41.02**|**48.62**|
-| GPT-2 | 32.71 | 21.84 | **31.6** | **25.86** | 40.67 | 43.62 |
+| Model (GPT-2)                                                   | Average   | [ARC](https://arxiv.org/abs/1803.05457) | [TruthfulQA](https://arxiv.org/abs/2109.07958) | [ToxiGen](https://arxiv.org/abs/2203.09509) |
+|-----------------------------------------------------------------|-----------|-----------------------------------------|------------------------------------------------|---------------------------------------------|
+| [Aira-2-124M](https://huggingface.co/nicholasKluge/Aira-2-124M) | **38.07** | **24.57**                               | **41.02**                                      | **48.62**                                   |
+| GPT-2                                                           | 35.37     | 21.84                                   | 40.67                                          | 43.62                                       |
+| [Aira-2-355M](https://huggingface.co/nicholasKluge/Aira-2-355M) | **39.68** | **27.56**                               | 38.53                                          | **53.19**                                   |
+| GPT-2-medium                                                    | 36.43     | 27.05                                   | **40.76**                                      | 41.49                                       |
+| [Aira-2-774M](https://huggingface.co/nicholasKluge/Aira-2-774M) | **42.26** | **28.75**                               | **41.33**                                      | **56.70**                                   |
+| GPT-2-large                                                     | 35.16     | 25.94                                   | 38.71                                          | 40.85                                       |
+| [Aira-2-1B5](https://huggingface.co/nicholasKluge/Aira-2-1B5)   | **42.22** | 28.92                                   | **41.16**                                      | **56.60**                                   |
+| GPT-2-xl                                                        | 36.84     | **30.29**                               | 38.54                                          | 41.70                                       |
 * Evaluations were performed using the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) (by [EleutherAI](https://www.eleuther.ai/)). The notebook used to make these evaluations is available in [this repo](lm_evaluation_harness.ipynb).