nicholasKluge/TeenyTinyLlama-460m-Chat-awq

@@ -160,10 +160,10 @@ Evaluations on benchmarks were performed using the [Language Model Evaluation Ha
 |                  | **ARC**   | **HellaSwag** | **MMLU**  | **TruthfulQA** | **Average** |
 |------------------|-----------|---------------|-----------|----------------|-------------|
-| Pythia-410m      | 24.83*    | **41.29***    | 25.99*    | 40.95*         | 33.26       |
-| **TTL-460m**     | **29.40** | 33.00         | **28.55** | 41.10          | 33.01       |
 | Bloom-560m       | 24.74*    | 37.15*        | 24.22*    | 42.44*         | 32.13       |
-| Xglm-564M        | 25.56     | 34.64*        | 25.18*    | **42.53**      | 31.97       |
 | OPT-350m         | 23.55*    | 36.73*        | 26.02*    | 40.83*         | 31.78       |
 | **TTL-160m**     | 26.15     | 29.29         | 28.11     | 41.12          | 31.16       |
 | Pythia-160m      | 24.06*    | 31.39*        | 24.86*    | 44.34*         | 31.16       |
@@ -172,6 +172,26 @@ Evaluations on benchmarks were performed using the [Language Model Evaluation Ha
 | Gpt2-small       | 21.48*    | 31.60*        | 25.79*    | 40.65*         | 29.97       |
 | Multilingual GPT | 23.81     | 26.37*        | 25.17*    | 39.62          | 28.73       |
 ## Fine-Tuning Comparisons
 To further evaluate the downstream capabilities of our models, we applied a basic fine-tuning procedure to our TTL pair on a subset of tasks from the Poeta benchmark. For comparison, we applied the same procedure to both [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) models, given that they are also LLMs trained from scratch in Brazilian Portuguese and fall in a similar size range to our models. We used these comparisons to assess whether our pre-training runs produced LLMs that achieve good results ("good" here means "close to BERTimbau") when used in downstream applications.

 |                  | **ARC**   | **HellaSwag** | **MMLU**  | **TruthfulQA** | **Average** |
 |------------------|-----------|---------------|-----------|----------------|-------------|
+| Pythia-410m      | 24.83*    | 41.29*        | 25.99*    | 40.95*         | 33.26       |
+| **TTL-460m**     | 29.40     | 33.00         | 28.55     | 41.10          | 33.01       |
 | Bloom-560m       | 24.74*    | 37.15*        | 24.22*    | 42.44*         | 32.13       |
+| Xglm-564M        | 25.56     | 34.64*        | 25.18*    | 42.53          | 31.97       |
 | OPT-350m         | 23.55*    | 36.73*        | 26.02*    | 40.83*         | 31.78       |
 | **TTL-160m**     | 26.15     | 29.29         | 28.11     | 41.12          | 31.16       |
 | Pythia-160m      | 24.06*    | 31.39*        | 24.86*    | 44.34*         | 31.16       |
 | Gpt2-small       | 21.48*    | 31.60*        | 25.79*    | 40.65*         | 29.97       |
 | Multilingual GPT | 23.81     | 26.37*        | 25.17*    | 39.62          | 28.73       |
+Evaluations on Brazilian Portuguese benchmarks were performed using a [Portuguese implementation of the EleutherAI LM Evaluation Harness](https://github.com/eduagarcia/lm-evaluation-harness-pt) (created by [Eduardo Garcia](https://github.com/eduagarcia)).
+|                | **ASSIN2 RTE** | **ASSIN2 STS** | **BLUEX** | **ENEM** | **FAQUAD NLI** | **HateBR** | **OAB Exams** | **Average** |
+|----------------|----------------|----------------|-----------|----------|----------------|------------|---------------|-------------|
+| Qwen-1.8B      | 64.83          | 19.53          | 26.15     | 30.23    | 43.97          | 33.33      | 27.20         | 35.03       |
+| TinyLlama-1.1B | 58.93          | 13.57          | 22.81     | 22.25    | 43.97          | 36.92      | 23.64         | 31.72       |
+| **TTL-460m**   | 53.93          | 12.66          | 22.81     | 19.87    | 49.01          | 33.59      | 27.06         | 31.27       |
+| XGLM-564m      | 49.61          | 22.91          | 19.61     | 19.38    | 43.97          | 33.99      | 23.42         | 30.41       |
+| Bloom-1b7      | 53.60          | 4.81           | 21.42     | 18.96    | 43.97          | 34.89      | 23.05         | 28.67       |
+| **TTL-160m**   | 53.36          | 2.58           | 21.84     | 18.75    | 43.97          | 36.88      | 22.60         | 28.56       |
+| OPT-125m       | 39.77          | 2.00           | 21.84     | 17.42    | 43.97          | 47.04      | 22.78         | 27.83       |
+| Pythia-160m    | 33.33          | 12.81          | 16.13     | 16.66    | 50.36          | 41.09      | 22.82         | 27.60       |
+| OLMo-1b        | 34.12          | 9.28           | 18.92     | 20.29    | 43.97          | 41.33      | 22.96         | 27.26       |
+| Bloom-560m     | 33.33          | 8.48           | 18.92     | 19.03    | 43.97          | 37.07      | 23.05         | 26.26       |
+| Pythia-410m    | 33.33          | 4.80           | 19.47     | 19.45    | 43.97          | 33.33      | 23.01         | 25.33       |
+| OPT-350m       | 33.33          | 3.65           | 20.72     | 17.35    | 44.71          | 33.33      | 23.01         | 25.15       |
+| GPT-2 small    | 33.26          | 0.00           | 10.43     | 11.20    | 43.52          | 33.68      | 13.12         | 20.74       |
+| GPorTuguese    | 33.33          | 3.85           | 14.74     | 3.01     | 28.81          | 33.33      | 21.23         | 19.75       |
+| Samba-1.1B     | 33.33          | 1.30           | 8.07      | 10.22    | 17.72          | 35.79      | 15.03         | 17.35       |
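
The **Average** column in the table above appears to be the unweighted mean of the seven task scores. As a quick sanity check, the sketch below recomputes it for three rows copied from the table. The dictionary layout and the helper name `mean_score` are illustrative assumptions and are not part of the evaluation harness.

```python
# Scores copied from the table above, in column order:
# ASSIN2 RTE, ASSIN2 STS, BLUEX, ENEM, FAQUAD NLI, HateBR, OAB Exams.
scores = {
    "Qwen-1.8B": [64.83, 19.53, 26.15, 30.23, 43.97, 33.33, 27.20],
    "TinyLlama-1.1B": [58.93, 13.57, 22.81, 22.25, 43.97, 36.92, 23.64],
    "TTL-460m": [53.93, 12.66, 22.81, 19.87, 49.01, 33.59, 27.06],
}

# Averages as listed in the table.
listed_average = {
    "Qwen-1.8B": 35.03,
    "TinyLlama-1.1B": 31.72,
    "TTL-460m": 31.27,
}


def mean_score(values):
    """Unweighted arithmetic mean (hypothetical helper, for illustration only)."""
    return sum(values) / len(values)


for model, values in scores.items():
    recomputed = mean_score(values)
    # Allow a 0.01 tolerance, since the listed values carry two decimals.
    assert abs(recomputed - listed_average[model]) < 0.01
    print(f"{model}: recomputed {recomputed:.3f}, listed {listed_average[model]:.2f}")
```

Under this reading, the listed averages look truncated rather than rounded to two decimals (for example, TTL-460m recomputes to roughly 31.276 but is listed as 31.27), which would account for small last-digit differences.
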
 ## Fine-Tuning Comparisons
 To further evaluate the downstream capabilities of our models, we applied a basic fine-tuning procedure to our TTL pair on a subset of tasks from the Poeta benchmark. For comparison, we applied the same procedure to both [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) models, given that they are also LLMs trained from scratch in Brazilian Portuguese and fall in a similar size range to our models. We used these comparisons to assess whether our pre-training runs produced LLMs that achieve good results ("good" here means "close to BERTimbau") when used in downstream applications.