nicholasKluge committed on
Commit
444939e
1 Parent(s): eabd984

Update README.md

Files changed (1)
  1. README.md +17 -11
README.md CHANGED
@@ -33,26 +33,24 @@ co2_eq_emissions:
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB
  ---
- # Teeny Tiny Llama 162m (Portuguese)
+ # TeenyTinyLlama-162m

  <img src="./logo-round.png" alt="A little llama wearing a mushroom hat and a monocle." height="200">

- Teeny-tiny-llama-162m is a compact language model based on the Llama 2 architecture ([Tiny-llama implementation](https://huggingface.co/TinyLlama)). This model is designed to deliver efficient natural language processing capabilities (in Portuguese-BR) while being resource-conscious.
+ ## Model Summary

- Teeny-tiny-llama has been trained by leveraging scaling laws to determine the optimal number of tokens per parameter while incorporating preference pre-training.
+ Given the lack of monolingual foundational models in non-English languages, and the fact that some of the most downloaded community models are those small enough for individual researchers and hobbyists to run in low-resource environments, we developed TeenyTinyLlama: _a series of small foundational models trained on Portuguese._

- - **Compact Design:** Teeny-tiny-llama is a downsized version of the Llama 2 architecture, making it suitable for applications with limited computational resources.
+ TeenyTinyLlama is a compact language model based on the Llama 2 architecture ([TinyLlama implementation](https://huggingface.co/TinyLlama)). This model is designed to deliver efficient natural language processing capabilities while being resource-conscious.

- - **Optimized Scaling:** The model has been pre-trained using scaling laws to identify the ideal token-to-parameter ratio.
-
- - **Custom Portuguese Dataset:** Teeny-tiny-llama has been trained on a custom Portuguese dataset. This dataset includes diverse linguistic contexts and preference pre-training, allowing the model to better cater to Portuguese language nuances and be better suited for fine-tuning tasks like instruction-tuning.
-
- This repository has 21 checkpoints, saved as revisions, that were logged during the model's training.
+ Also, these models were trained by leveraging [scaling laws](https://arxiv.org/abs/2203.15556) to determine the optimal number of tokens per parameter while incorporating [preference pre-training](https://arxiv.org/abs/2112.00861).

  ## Details

+ - **Architecture:** a Transformer-based model pre-trained via causal language modeling
  - **Size:** 162,417,408 parameters
+ - **Context length:** 2048 tokens
- - **Dataset:** [Portuguese-Corpus-v3](https://huggingface.co/datasets/nicholasKluge/portuguese-corpus-v3)
+ - **Dataset:** [Portuguese-Corpus-v3](https://huggingface.co/datasets/nicholasKluge/portuguese-corpus-v3) (6.2B tokens)
  - **Language:** Portuguese
  - **Number of steps:** 457,969 (3.7B tokens)
  - **GPU:** 1 NVIDIA A100-SXM4-40GB
@@ -60,7 +58,15 @@ This repository has 21 checkpoints, saved as revisions, that were logged during
  - **Emissions:** 5.6 KgCO2 (Germany)
  - **Total energy consumption:** 15.5 kWh

- This repository has the [source code](https://github.com/Nkluge-correa/Aira) used to train this model.
+ This repository has the [source code](https://github.com/Nkluge-correa/Aira) used to train this model. The main libraries used are:
+
+ - Transformers
+ - PyTorch
+ - Datasets
+ - Tokenizers
+ - Accelerate, codecarbon, sentencepiece

  ## Training Set-up
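As a rough sanity check of the scaling-law claim in the new Model Summary, the figures from the Details section work out to roughly 23 training tokens per parameter, in the neighborhood of the ~20 tokens-per-parameter rule of thumb commonly associated with the Chinchilla paper linked above. A minimal sketch of the arithmetic (the ~20 figure is a heuristic for comparison, not something stated in this card):

```python
# Back-of-the-envelope check of the tokens-per-parameter ratio, using the
# numbers reported in the Details section. The ~20 figure is the commonly
# cited Chinchilla heuristic, used here only as a point of comparison.
n_params = 162_417_408   # "Size" entry
n_tokens = 3.7e9         # tokens seen over 457,969 training steps

ratio = n_tokens / n_params
print(f"{ratio:.1f} tokens per parameter")  # ~22.8, close to the ~20 heuristic
```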
 
 
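The emissions and energy figures above were presumably logged with codecarbon, which appears in the library list. A minimal sketch of how such tracking is typically wrapped around a training loop; this is illustrative only, not the exact setup behind the reported 5.6 KgCO2:

```python
# Illustrative codecarbon usage; the real training loop lives in the linked
# source-code repository, and the dummy workload below only stands in for it.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()   # writes an emissions.csv report by default
tracker.start()
try:
    total = sum(i * i for i in range(10_000_000))  # placeholder for training
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kg CO2eq
    print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```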
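Since Transformers and Tokenizers head the library list, a minimal inference sketch follows. The Hub repository id is an assumption based on the author and model name shown on this page, and the `revision` argument of `from_pretrained` can be used to pull one of the intermediate checkpoints saved as repository revisions:

```python
# Minimal inference sketch with the Transformers library listed above.
# The repository id below is assumed from the author and model name on
# this page; adjust it if the actual Hub id differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "nicholasKluge/TeenyTinyLlama-162m"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)  # add revision="..." for a specific checkpoint

inputs = tokenizer("A capital do Brasil é", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```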