nicholasKluge committed on
Commit
444939e
1 Parent(s): eabd984

Update README.md

Files changed (1)
  1. README.md +17 -11
README.md CHANGED
@@ -33,26 +33,24 @@ co2_eq_emissions:
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB
  ---
- # Teeny Tiny Llama 162m (Portuguese)
+ # TeenyTinyLlama-162m

  <img src="./logo-round.png" alt="A little llama wearing a mushroom hat and a monocle." height="200">

- Teeny-tiny-llama-162m is a compact language model based on the Llama 2 architecture ([Tiny-llama implementation](https://huggingface.co/TinyLlama)). This model is designed to deliver efficient natural language processing capabilities (in Portuguese-BR) while being resource-conscious.
+ ## Model Summary

- Teeny-tiny-llama has been trained by leveraging scaling laws to determine the optimal number of tokens per parameter while incorporating preference pre-training.
+ Given the lack of monolingual foundational models in non-English languages, and the fact that some of the most downloaded community models are those small enough for individual researchers and hobbyists to run in low-resource environments, we developed TeenyTinyLlama: _a series of small foundational models trained on Portuguese._

- - **Compact Design:** Teeny-tiny-llama is a downsized version of the Llama 2 architecture, making it suitable for applications with limited computational resources.
+ TeenyTinyLlama is a compact language model based on the Llama 2 architecture ([TinyLlama implementation](https://huggingface.co/TinyLlama)). This model is designed to deliver efficient natural language processing capabilities while being resource-conscious.

- - **Optimized Scaling:** The model has been pre-trained using scaling laws to identify the ideal token-to-parameter ratio.
-
- - **Custom Portuguese Dataset:** Teeny-tiny-llama has been trained on a custom Portuguese dataset. This dataset includes diverse linguistic contexts and preference pre-training, allowing the model to better cater to Portuguese language nuances and be better suited for fine-tuning tasks like instruction-tuning.
-
- This repository has 21 checkpoints, saved as revisions, that were logged during the model's training.
+ Also, these models were trained by leveraging [scaling laws](https://arxiv.org/abs/2203.15556) to determine the optimal number of tokens per parameter while incorporating [preference pre-training](https://arxiv.org/abs/2112.00861).

  ## Details

+ - **Architecture:** a Transformer-based model pre-trained via causal language modeling
  - **Size:** 162,417,408 parameters
+ - **Context length:** 2048 tokens
- - **Dataset:** [Portuguese-Corpus-v3](https://huggingface.co/datasets/nicholasKluge/portuguese-corpus-v3)
+ - **Dataset:** [Portuguese-Corpus-v3](https://huggingface.co/datasets/nicholasKluge/portuguese-corpus-v3) (6.2B tokens)
  - **Language:** Portuguese
  - **Number of steps:** 457,969 (3.7B tokens)
  - **GPU:** 1 NVIDIA A100-SXM4-40GB
@@ -60,7 +58,15 @@ This repository has 21 checkpoints, saved as revisions, that were logged during
  - **Emissions:** 5.6 KgCO2 (Germany)
  - **Total energy consumption:** 15.5 kWh

- This repository has the [source code](https://github.com/Nkluge-correa/Aira) used to train this model.
+ This repository has the [source code](https://github.com/Nkluge-correa/Aira) used to train this model. The main libraries used are:
+
+ - Transformers
+ - PyTorch
+ - Datasets
+ - Tokenizers
+ - Accelerate, codecarbon, sentencepiece

  ## Training Set-up
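As a rough sanity check of the scaling-law claim in the new Model Summary, the figures from the Details section work out to roughly 23 training tokens per parameter, in the neighborhood of the ~20 tokens-per-parameter rule of thumb commonly associated with the Chinchilla paper linked above. A minimal sketch of the arithmetic (the ~20 figure is a heuristic for comparison, not something stated in this card):

```python
# Back-of-the-envelope check of the tokens-per-parameter ratio, using the
# numbers reported in the Details section. The ~20 figure is the commonly
# cited Chinchilla heuristic, used here only as a point of comparison.
n_params = 162_417_408   # "Size" entry
n_tokens = 3.7e9         # tokens seen over 457,969 training steps

ratio = n_tokens / n_params
print(f"{ratio:.1f} tokens per parameter")  # ~22.8, close to the ~20 heuristic
```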
 
 
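The emissions and energy figures above were presumably logged with codecarbon, which appears in the library list. A minimal sketch of how such tracking is typically wrapped around a training loop; this is illustrative only, not the exact setup behind the reported 5.6 KgCO2:

```python
# Illustrative codecarbon usage; the real training loop lives in the linked
# source-code repository, and the dummy workload below only stands in for it.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()   # writes an emissions.csv report by default
tracker.start()
try:
    total = sum(i * i for i in range(10_000_000))  # placeholder for training
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kg CO2eq
    print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```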
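Since Transformers and Tokenizers head the library list, a minimal inference sketch follows. The Hub repository id is an assumption based on the author and model name shown on this page, and the `revision` argument of `from_pretrained` can be used to pull one of the intermediate checkpoints saved as repository revisions:

```python
# Minimal inference sketch with the Transformers library listed above.
# The repository id below is assumed from the author and model name on
# this page; adjust it if the actual Hub id differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "nicholasKluge/TeenyTinyLlama-162m"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)  # add revision="..." for a specific checkpoint

inputs = tokenizer("A capital do Brasil é", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```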