nicholasKluge committed on
Commit d902840
1 parent: 5a66b9f

Update README.md

Files changed (1)
  1. README.md +15 -15
README.md CHANGED
@@ -141,26 +141,26 @@ trainer.train()
 
 ## Fine-Tuning Comparisons
 
-| Models                                                                                      | [FaQuAD-NLI](https://huggingface.co/datasets/ruanchaves/faquad-nli) |
-|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------|
-| [Bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased)  | 93.07                                                               |
-| [Bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 92.26                                                               |
-| [Teeny Tiny Llama 460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m)           | 91.18                                                               |
-| [Teeny Tiny Llama 160m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m)           | 90.00                                                               |
-| [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese)         | 86.46                                                               |
+To further evaluate the downstream capabilities of our models, we employed a basic fine-tuning procedure for our TTL pair on a subset of tasks from the Poeta benchmark. For comparison, we applied the same procedure to both [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) models, given that they are also LLMs trained from scratch in Brazilian Portuguese and fall in a similar size range to our models. We used these comparisons to assess whether our pre-training runs produced LLMs capable of good results ("good" here meaning "close to BERTimbau") when utilized for downstream applications.
+
+| Models          | IMDB      | FaQuAD-NLI | HateBr    | Assin2    | AgNews    | Average |
+|-----------------|-----------|------------|-----------|-----------|-----------|---------|
+| BERTimbau-large | **93.58** | 92.26      | 91.57     | **88.97** | 94.11     | 92.10   |
+| BERTimbau-small | 92.22     | **93.07**  | 91.28     | 87.45     | 94.19     | 91.64   |
+| **TTL-460m**    | 91.64     | 91.18      | **92.28** | 86.43     | **94.42** | 91.19   |
+| **TTL-160m**    | 91.14     | 90.00      | 90.71     | 85.78     | 94.05     | 90.34   |
+
+All reported results are the highest accuracy scores achieved on the respective task test sets after fine-tuning the models on the training sets. All fine-tuning runs used the same hyperparameters, and the code implementation can be found in the [model cards](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m-HateBR) of our fine-tuned models.
 
 ## Cite as 🤗
 
 ```latex
 
-@misc{nicholas22llama,
-  doi = {10.5281/zenodo.6989727},
-  url = {https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m},
-  author = {Nicholas Kluge Corrêa},
-  title = {TeenyTinyLlama},
-  year = {2023},
-  publisher = {HuggingFace},
-  journal = {HuggingFace repository},
+@misc{correa24ttllama,
+  title = {TeenyTinyLlama: a pair of open-source tiny language models trained in Brazilian Portuguese},
+  author = {Corr{\^e}a, Nicholas Kluge and Falk, Sophia and Fatimah, Shiza and Sen, Aniket and De Oliveira, Nythamar},
+  journal = {arXiv},
+  year = {2024},
 }
 
 ```
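As a quick sanity check on the comparison table this commit adds, the Average column should be the mean of the five per-task accuracy scores. A minimal Python sketch, with the score values copied from the table above:

```python
# Per-task accuracy scores (IMDB, FaQuAD-NLI, HateBr, Assin2, AgNews),
# copied from the fine-tuning comparison table in the README diff.
scores = {
    "BERTimbau-large": [93.58, 92.26, 91.57, 88.97, 94.11],
    "BERTimbau-small": [92.22, 93.07, 91.28, 87.45, 94.19],
    "TTL-460m":        [91.64, 91.18, 92.28, 86.43, 94.42],
    "TTL-160m":        [91.14, 90.00, 90.71, 85.78, 94.05],
}

# Recompute each model's average, rounded to two decimals as in the table.
averages = {model: round(sum(vals) / len(vals), 2) for model, vals in scores.items()}
print(averages)
# → {'BERTimbau-large': 92.1, 'BERTimbau-small': 91.64, 'TTL-460m': 91.19, 'TTL-160m': 90.34}
```

The recomputed values match the Average column, which also confirms the ranking the table implies (BERTimbau-large > BERTimbau-small > TTL-460m > TTL-160m on average, even though TTL-460m wins on HateBr and AgNews).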