nicholasKluge committed on
Commit f643bcf
1 Parent(s): c91e1f8

Update README.md

Files changed (1)
  1. README.md +20 -4
README.md CHANGED
@@ -31,16 +31,19 @@ co2_eq_emissions:
  geographical_location: Germany
  hardware_used: NVIDIA A100-SXM4-40GB
  ---
- # TeenyTinyLlama-460m
+ # TeenyTinyLlama-460m-awq

  <img src="./logo.png" alt="A curious llama exploring a mushroom forest." height="200">

  ## Model Summary

+ **Note: This model is a quantized version of [TeenyTinyLlama-460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m). Quantization was performed using [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), making this version 80% lighter, with almost no performance loss.**
+
  Given the lack of available monolingual foundational models in non-English languages, and the fact that some of the most used and downloaded models by the community are those small enough to allow individual researchers and hobbyists to use them in low-resource environments, we developed TeenyTinyLlama: _a pair of small foundational models trained in Brazilian Portuguese._

  TeenyTinyLlama is a compact language model based on the Llama 2 architecture ([TinyLlama implementation](https://huggingface.co/TinyLlama)). This model is designed to deliver efficient natural language processing capabilities while being resource-conscious. These models were trained by leveraging [scaling laws](https://arxiv.org/abs/2203.15556) to determine the optimal number of tokens per parameter while incorporating [preference pre-training](https://arxiv.org/abs/2112.00861).

+
  ## Details

  - **Architecture:** a Transformer-based model pre-trained via causal language modeling
@@ -53,6 +56,12 @@ TeenyTinyLlama is a compact language model based on the Llama 2 architecture ([T
  - **Training time**: ~ 280 hours
  - **Emissions:** 41.1 KgCO2 (Germany)
  - **Total energy consumption:** 115.69 kWh
+ - **Quantization Configuration:**
+   - `bits`: 4
+   - `group_size`: 128
+   - `quant_method`: "awq"
+   - `version`: "gemm"
+   - `zero_point`: True

  This repository has the [source code](https://github.com/Nkluge-correa/Aira) used to train this model. The main libraries used are:

@@ -63,6 +72,7 @@ This repository has the [source code](https://github.com/Nkluge-correa/Aira) use
  - [Sentencepiece](https://github.com/google/sentencepiece)
  - [Accelerate](https://github.com/huggingface/accelerate)
  - [Codecarbon](https://github.com/mlco2/codecarbon)
+ - [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)

  Check out the training logs in [Weights and Biases](https://api.wandb.ai/links/nkluge-correa/vws4g032).

@@ -104,12 +114,16 @@ The primary intended use of TeenyTinyLlama is to research the behavior, function

  ## Basic usage

+ **Note: Using quantized models requires installing `autoawq==0.1.7`.**
+
  Using the `pipeline`:

  ```python
+ !pip install autoawq==0.1.7 -q
+
  from transformers import pipeline

- generator = pipeline("text-generation", model="nicholasKluge/TeenyTinyLlama-460m")
+ generator = pipeline("text-generation", model="nicholasKluge/TeenyTinyLlama-460m-awq")

  completions = generator("Astronomia é a ciência", num_return_sequences=2, max_new_tokens=100)

@@ -120,12 +134,14 @@ for comp in completions:
  Using the `AutoTokenizer` and `AutoModelForCausalLM`:

  ```python
+ !pip install autoawq==0.1.7 -q
+
  from transformers import AutoTokenizer, AutoModelForCausalLM
  import torch

  # Load model and the tokenizer
- tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-460m", revision='main')
- model = AutoModelForCausalLM.from_pretrained("nicholasKluge/TeenyTinyLlama-460m", revision='main')
+ tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-460m-awq", revision='main')
+ model = AutoModelForCausalLM.from_pretrained("nicholasKluge/TeenyTinyLlama-460m-awq", revision='main')

  # Pass the model to your device
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
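For context on the quantization configuration added in this commit (`bits`: 4, `group_size`: 128, `quant_method`: "awq", `version`: "gemm", `zero_point`: True), the snippet below is a minimal sketch of the standard AutoAWQ quantization flow that such a configuration implies. It is an illustration only: the actual script used to produce this checkpoint is not part of this commit, the output path name is an assumption, and AutoAWQ's default calibration dataset is assumed.

```python
# Illustrative sketch only: a standard AutoAWQ quantization run matching the
# configuration listed in the diff (4-bit weights, group size 128, GEMM
# kernels, zero-point enabled). Not the authors' actual script.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_model = "nicholasKluge/TeenyTinyLlama-460m"  # full-precision source model
quant_path = "TeenyTinyLlama-460m-awq"            # output directory (assumed name)

quant_config = {
    "w_bit": 4,           # stored as `bits`: 4
    "q_group_size": 128,  # stored as `group_size`: 128
    "version": "GEMM",    # stored as `version`: "gemm"
    "zero_point": True,   # stored as `zero_point`: True
}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Quantize; AutoAWQ falls back to its default calibration set when none is given
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and the tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The `version: "gemm"` entry selects AutoAWQ's GEMM kernels rather than the GEMV variant; the remaining keys are written into the checkpoint's `quantization_config` and read back automatically at load time.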
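The second usage hunk ends at the device setup; the rest of the README's generation example falls outside this diff's context. The sketch below shows one way the snippet could continue, reusing the `model`, `tokenizer`, and `device` defined above. The prompt mirrors the `pipeline` example, while the sampling settings are illustrative assumptions rather than the card's actual values.

```python
# Continuation sketch only: depends on `model`, `tokenizer`, `device`, and
# `torch` from the snippet above; sampling settings are illustrative.
model.eval()
model.to(device)

inputs = tokenizer("Astronomia é a ciência", return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,      # mirrors the pipeline example
        num_return_sequences=2,  # mirrors the pipeline example
        do_sample=True,          # needed for multiple distinct samples
    )

for i, output in enumerate(outputs):
    print(f"Completion {i}: {tokenizer.decode(output, skip_special_tokens=True)}")
```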