---
license: apache-2.0
datasets:
- nicholasKluge/Pt-Corpus-Instruct
language:
- pt
metrics:
- perplexity
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation-inference
widget:
  - text: "A PUCRS é uma universidade "
    example_title: Exemplo
  - text: "A muitos anos atrás, em uma galáxia muito distante, vivia uma raça de"
    example_title: Exemplo
  - text: "Em meio a um escândalo, a frente parlamentar pediu ao Senador Silva para"
    example_title: Exemplo
inference:
  parameters:
    repetition_penalty: 1.2
    temperature: 0.2
    top_k: 20
    top_p: 0.2
    max_new_tokens: 150
co2_eq_emissions:
  emissions: 5.6
  source: CodeCarbon
  training_type: pre-training
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB
---
# TeenyTinyLlama-160m
<img src="./logo.png" alt="A little llama wearing a mushroom hat and a monocle." height="200">
## Model Summary
Monolingual foundational models for non-English languages remain scarce, and some of the most used and downloaded models in the community are those small enough for individual researchers and hobbyists to run in low-resource environments. To address both points, we developed TeenyTinyLlama: _a pair of small foundational models trained in Brazilian Portuguese._
## Details
- **Architecture:** a Transformer-based model pre-trained via causal language modeling
- **Size:** 162,417,408 parameters
- **Context length:** 2048 tokens
- **Dataset:** [Pt-Corpus Instruct](https://huggingface.co/datasets/nicholasKluge/Pt-Corpus-Instruct) (6.2B tokens)
- **Language:** Portuguese
- **Number of steps:** 458,000
- **GPU:** 1 NVIDIA A100-SXM4-40GB
- **Training time:** ~36 hours
- **Emissions:** 5.6 KgCO2eq (Germany)
- **Total energy consumption:** 15.5 kWh
This repository has the [source code](https://github.com/Nkluge-correa/TeenyTinyLlama) used to train this model. The main libraries used are:
- [Transformers](https://github.com/huggingface/transformers)
- [PyTorch](https://github.com/pytorch/pytorch)
- [Datasets](https://github.com/huggingface/datasets)
- [Tokenizers](https://github.com/huggingface/tokenizers)
- [Sentencepiece](https://github.com/google/sentencepiece)
- [Accelerate](https://github.com/huggingface/accelerate)
- [Codecarbon](https://github.com/mlco2/codecarbon)
Check out the training logs in [Weights and Biases](https://api.wandb.ai/links/nkluge-correa/vws4g032).
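The parameter count reported above can be reproduced by hand from the dimensions listed in the Training Set-up table. The sketch below assumes a Llama-style decoder layout (no bias terms, two normalization vectors per layer plus a final one, a gated MLP with three projection matrices, and an output head that is not tied to the input embeddings). That layout is an assumption based on the model's name rather than something this card states explicitly, but the total it yields matches the reported size.

```python
# Dimensions taken from the "Training Set-up" table of this model card.
vocab_size = 32_000
hidden_size = 768
intermediate_size = 3_072
num_layers = 12
num_heads = 12
num_kv_heads = 12

# Assumption: Llama-style decoder (no biases, untied input/output embeddings).
head_dim = hidden_size // num_heads
embed = vocab_size * hidden_size      # input token embeddings
lm_head = vocab_size * hidden_size    # separate output projection (assumed untied)

q_and_o = 2 * hidden_size * hidden_size                # query and output projections
k_and_v = 2 * hidden_size * (num_kv_heads * head_dim)  # key and value projections
attention = q_and_o + k_and_v
mlp = 3 * hidden_size * intermediate_size              # gate, up and down projections
norms = 2 * hidden_size                                # two norm vectors per layer
per_layer = attention + mlp + norms

# Embeddings + all layers + final norm vector + output head.
total = embed + num_layers * per_layer + hidden_size + lm_head
print(f"{total:,}")  # 162,417,408
```

Roughly 30% of the parameters sit in the two vocabulary-sized matrices, which is typical for a model this small with a 32,000-token vocabulary.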
## Training Set-up
These are the main arguments used in the training of this model:
| Arguments | Value |
|-------------------------------|--------------------------------------|
| vocabulary size | 32000 |
| hidden dimension size | 768 |
| intermediate dimension size | 3072 |
| context length | 2048 |
| nº attention heads | 12 |
| nº hidden layers | 12 |
| nº key value heads | 12 |
| nº training samples | 1831873 |
| nº validation samples | 18000 |
| nº epochs | 1 |
| evaluation steps | 100000 |
| train batch size | 4 |
| eval batch size | 4 |
| gradient accumulation steps | 1 |
| optimizer | torch.optim.AdamW |
| learning rate | 0.0006 |
| adam epsilon | 0.00000001 |
| weight decay | 0.01 |
| scheduler type | "cosine" |
| warmup steps | 5000 |
| gradient checkpointing | false |
| seed | 42 |
| mixed precision | 'no' |
| torch dtype | "float32" |
| tf32 | true |
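As a rough cross-check of the table, the number of optimizer steps in one epoch follows from the sample count and the batch size, and it lands close to the 458,000 steps reported in the Details section. The sketch below also illustrates what a "cosine" schedule with 5,000 warmup steps could look like. The exact shape used here (linear warmup from zero, then a half-cosine decay to zero) and the helper name `lr_at` are assumptions for illustration, not the training script itself.

```python
import math

# Values from the table above; a single GPU is assumed, as listed under Details.
train_samples = 1_831_873
batch_size = 4
grad_accum = 1
peak_lr = 6e-4
warmup_steps = 5_000

# One optimizer step consumes batch_size * grad_accum samples.
total_steps = math.ceil(train_samples / (batch_size * grad_accum))


def lr_at(step: int) -> float:
    """Learning rate at `step`: linear warmup, then half-cosine decay (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_steps)  # 457969
for step in (0, 2_500, 5_000, total_steps // 2, total_steps):
    print(f"step {step:>7,}: lr = {lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at 0.0006 at step 5,000 and decays smoothly to zero by the end of the single epoch.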
## Intended Uses
The primary intended use of TeenyTinyLlama is to research the behavior, functionality, and limitations of large language models. Checkpoints saved during training are intended to provide a controlled setting for performing scientific experiments. You may also further fine-tune and adapt TeenyTinyLlama-160m for deployment, as long as your use is in accordance with the Apache 2.0 license. If you decide to use pre-trained TeenyTinyLlama-160m as a basis for your fine-tuned model, please conduct your own risk and bias assessment.
## Basic usage
Using the `pipeline`:
```python
from transformers import pipeline

# Build a text-generation pipeline around the model
generator = pipeline("text-generation", model="nicholasKluge/TeenyTinyLlama-160m")

# Generate two completions for the same prompt
completions = generator("Astronomia é a ciência", num_return_sequences=2, max_new_tokens=100)

for comp in completions:
    print(f"🤖 {comp['generated_text']}")
```
Using the `AutoTokenizer` and `AutoModelForCausalLM`:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model and the tokenizer
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-160m", revision='main')
model = AutoModelForCausalLM.from_pretrained("nicholasKluge/TeenyTinyLlama-160m", revision='main')
# Pass the model to your device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.eval()
model.to(device)
# Tokenize the inputs and pass them to the device
inputs = tokenizer("Astronomia é a ciência", return_tensors="pt").to(device)
# Generate some text
completions = model.generate(**inputs, num_return_sequences=2, max_new_tokens=100)
# Print the generated text
for i, completion in enumerate(completions):
    print(f'🤖 {tokenizer.decode(completion)}')
```
## Limitations
- **Hallucinations:** This model can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, i.e., hallucination.
- **Biases and Toxicity:** This model inherits the social and historical stereotypes from the data used to train it. Given these biases, the model can produce toxic content, i.e., harmful, offensive, or detrimental to individuals, groups, or communities.
- **Unreliable Code:** The model may produce incorrect code snippets and statements. These code generations should not be treated as suggestions or accurate solutions.
- **Language Limitations:** The model is primarily designed to understand standard Brazilian Portuguese. Other languages might challenge its comprehension, leading to potential misinterpretations or errors in response.
- **Repetition and Verbosity:** The model may get stuck in repetition loops (especially if the repetition penalty during generation is set too low) or produce verbose responses unrelated to the prompt it was given.
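To make the repetition point concrete, here is a minimal pure-Python sketch of how a multiplicative repetition penalty is commonly applied to next-token scores before sampling. The function name and the toy scores are hypothetical. The value 1.2 is the one set in this card's inference parameters, and a penalty of 1.0 disables the effect.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it
        # when penalty > 1; penalty == 1.0 leaves the scores untouched.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [2.4, -1.0, 0.6, 3.0]  # toy scores for a 4-token vocabulary
history = [0, 1, 0]             # token ids generated so far
print(apply_repetition_penalty(logits, history, penalty=1.2))
```

Tokens 0 and 1 have already appeared, so their scores drop, while tokens 2 and 3 keep their original scores. With a penalty close to 1.0 the adjustment is small, which is why loops become more likely.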
## Evaluations
| Steps | Evaluation Loss | Perplexity | Total Energy Consumption | Emissions |
|---------|-----------------|------------|--------------------------|--------------|
| 100,000 | 3.19 | 24.52 | 3.75 kWh | 1.28 KgCO2eq |
| 200,000 | 3.02 | 20.58 | 7.51 kWh | 2.56 KgCO2eq |
| 300,000 | 2.83 | 16.98 | 11.25 kWh | 3.84 KgCO2eq |
| 400,000 | 2.79 | 16.41 | 14.52 kWh | 5.11 KgCO2eq |
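Perplexity is conventionally the exponential of the mean cross-entropy loss, so the two columns can be checked against each other. The sketch below recomputes perplexity from the losses in the table. The recomputed values land close to, but not exactly on, the reported ones, presumably because the losses are shown with only two decimals.

```python
import math

# (steps, evaluation loss, reported perplexity) copied from the table above.
rows = [
    (100_000, 3.19, 24.52),
    (200_000, 3.02, 20.58),
    (300_000, 2.83, 16.98),
    (400_000, 2.79, 16.41),
]

for steps, loss, reported in rows:
    recomputed = math.exp(loss)  # perplexity = exp(cross-entropy loss)
    print(f"{steps:>7,} steps: exp({loss}) = {recomputed:.2f} (reported {reported})")
```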
## Benchmarks
Evaluations on benchmarks were performed using the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) (by [EleutherAI](https://www.eleuther.ai/)). Thanks to [Laiviet](https://github.com/laiviet/lm-evaluation-harness) for translating some of the tasks in the LM-Evaluation-Harness. The results of models marked with an "*" were extracted from the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
| Models | Average | [ARC](https://arxiv.org/abs/1803.05457) | [Hellaswag](https://arxiv.org/abs/1905.07830) | [MMLU](https://arxiv.org/abs/2009.03300) | [TruthfulQA](https://arxiv.org/abs/2109.07958) |
|-------------------------------------------------------------------------------------|---------|-----------------------------------------|-----------------------------------------------|------------------------------------------|------------------------------------------------|
| [Pythia-410m](https://huggingface.co/EleutherAI/pythia-410m-deduped) | 33.26 | 24.83* | 41.29* | 25.99* | 40.95* |
| [TeenyTinyLlama-460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m) | 33.01 | 29.40 | 33.00 | 28.55 | 41.10 |
| [Bloom-560m](https://huggingface.co/bigscience/bloom-560m) | 32.13 | 24.74* | 37.15* | 24.22* | 42.44* |
| [Xglm-564M](https://huggingface.co/facebook/xglm-564M) | 31.97 | 25.56 | 34.64* | 25.18* | 42.53 |
| [OPT-350m](https://huggingface.co/facebook/opt-350m) | 31.78 | 23.55* | 36.73* | 26.02* | 40.83* |
| [TeenyTinyLlama-160m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m) | 31.16 | 26.15 | 29.29 | 28.11 | 41.12 |
| [Pythia-160m](https://huggingface.co/EleutherAI/pythia-160m-deduped) | 31.16 | 24.06* | 31.39* | 24.86* | 44.34* |
| [OPT-125m](https://huggingface.co/facebook/opt-125m) | 30.80 | 22.87 | 31.47 | 26.02 | 42.87 |
| [Gpt2-portuguese-small](https://huggingface.co/pierreguillou/gpt2-small-portuguese) | 30.22 | 22.48* | 29.62* | 27.36* | 41.44* |
| [Gpt2-small](https://huggingface.co/gpt2) | 29.97 | 21.48* | 31.60* | 25.79* | 40.65* |
| [Multilingual GPT](https://huggingface.co/ai-forever/mGPT) | 29.45 | 24.79 | 26.37* | 25.17* | 41.50 |
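The Average column appears to be the arithmetic mean of the four benchmark scores in each row. This is an inference from the numbers rather than something the card states, and a quick check on a few rows agrees with it to within the last displayed digit:

```python
# (model, reported average, [ARC, Hellaswag, MMLU, TruthfulQA]) copied from the table above.
rows = [
    ("TeenyTinyLlama-460m", 33.01, [29.40, 33.00, 28.55, 41.10]),
    ("TeenyTinyLlama-160m", 31.16, [26.15, 29.29, 28.11, 41.12]),
    ("Pythia-160m", 31.16, [24.06, 31.39, 24.86, 44.34]),
]

for model, reported, scores in rows:
    mean = sum(scores) / len(scores)
    print(f"{model}: mean = {mean:.4f}, reported = {reported}")
```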
## Fine-Tuning Comparisons
| Models | Average | [IMDB](https://huggingface.co/datasets/christykoh/imdb_pt) | [FaQuAD-NLI](https://huggingface.co/datasets/ruanchaves/faquad-nli) | [HateBr](https://huggingface.co/datasets/ruanchaves/hatebr) | [Assin2](https://huggingface.co/datasets/assin2) | [AgNews](https://huggingface.co/datasets/maritaca-ai/ag_news_pt) |
|---------------------------------------------------------------------------------------------|---------|------------------------------------------------------------|---------------------------------------------------------------------|-------------------------------------------------------------|--------------------------------------------------|------------------------------------------------------------------|
| [Bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 92.09 | 93.58 | 92.26 | 91.57 | 88.97 | 94.11 |
| [Bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) | 91.64 | 92.22 | 93.07 | 91.28 | 87.45 | 94.19 |
| [TeenyTinyLlama-460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m) | 91.19 | 91.64 | 91.18 | 92.28 | 86.43 | 94.42 |
| [TeenyTinyLlama-160m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m) | 90.33 | 91.14 | 90.00 | 90.71 | 85.78 | 94.05 |
| [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese) | 89.13 | 91.60 | 86.46 | 87.42 | 86.11 | 94.07 |
## Cite as 🤗
```latex
@misc{nicholas22llama,
doi = {10.5281/zenodo.6989727},
url = {https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m},
author = {Nicholas Kluge Corrêa},
title = {TeenyTinyLlama},
year = {2023},
publisher = {HuggingFace},
journal = {HuggingFace repository},
}
```
## Funding
This repository was built as part of the RAIES ([Rede de Inteligência Artificial Ética e Segura](https://www.raies.org/)) initiative, a project supported by FAPERGS - ([Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul](https://fapergs.rs.gov.br/inicial)), Brazil.
## License
TeenyTinyLlama-160m is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.