nicholasKluge committed
Commit 748c8f7 · 1 parent: 69794df

Update README.md

Files changed (1): README.md (+14 -3)
README.md CHANGED
@@ -43,7 +43,9 @@ co2_eq_emissions:
   geographical_location: United States of America
   hardware_used: NVIDIA A100-SXM4-40GB
   ---
-# TeenyTinyLlama-460m-Chat
+# TeenyTinyLlama-460m-Chat-awq
+
+**Note: This model is a quantized version of [TeenyTinyLlama-460m-Chat](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m-Chat). Quantization was performed using [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), allowing this version to be 80% lighter with almost no performance loss. A GPU is required to run the AWQ-quantized models.**
 
 TeenyTinyLlama is a pair of small foundational models trained in Brazilian Portuguese.
 
@@ -55,17 +57,26 @@ This repository contains a version of [TeenyTinyLlama-460m](https://huggingface.
 - **Batch size:** 4
 - **Optimizer:** `torch.optim.AdamW` (warmup_steps = 1e3, learning_rate = 1e-5, epsilon = 1e-8)
 - **GPU:** 1 NVIDIA A100-SXM4-40GB
-- **Carbon emissions** stats are logged in this [file](emissions.csv).
+- **Quantization Configuration:**
+  - `bits`: 4
+  - `group_size`: 128
+  - `quant_method`: "awq"
+  - `version`: "gemm"
+  - `zero_point`: True
 
-This repository has the [source code](https://github.com/Nkluge-correa/Aira) used to train this model.
+This repository has the [source code](https://github.com/Nkluge-correa/TeenyTinyLlama) used to train this model.
 
 ## Usage
 
+**Note: Using quantized models requires the installation of `autoawq==0.1.7`. A GPU is required to run the AWQ-quantized models.**
+
 The following special tokens are used to mark the user side of the interaction and the model's response:
 
 `<instruction>`What is a language model?`</instruction>`A language model is a probability distribution over a vocabulary.`</s>`
 
 ```python
+!pip install autoawq==0.1.7 -q
+
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
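The diff's "80% lighter" claim can be sanity-checked with some back-of-envelope arithmetic of our own (not part of the commit), using the quantization configuration it adds: 4 bits per weight, with a per-group scale and zero-point every `group_size = 128` weights.

```python
# Back-of-envelope size estimate for a 460M-parameter model quantized to
# 4-bit AWQ with group_size = 128. The parameter count and config values
# come from the README diff; the overhead model below is our own rough
# assumption (an fp16 scale plus a 4-bit zero-point per group of 128).
PARAMS = 460_000_000

fp32_bytes = PARAMS * 4                      # full-precision checkpoint
weight_bytes = PARAMS * 4 / 8                # 4 bits per weight
overhead_bytes = (PARAMS / 128) * (2 + 0.5)  # per-group scale + zero-point
awq_bytes = weight_bytes + overhead_bytes

reduction = 1 - awq_bytes / fp32_bytes
print(f"~{reduction:.0%} smaller")  # ~87% smaller
```

This idealized figure lands above the README's "80% lighter" because, in practice, some tensors (e.g. embeddings) are typically kept in higher precision, pulling the real on-disk saving down.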
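The special-token format the README describes can be sketched as a small helper. The marker strings (`<instruction>`, `</instruction>`, `</s>`) come from the README; the helper name `format_prompt` is hypothetical, ours for illustration.

```python
# Minimal sketch of TeenyTinyLlama's chat format. The special tokens are
# taken from the README; `format_prompt` is a hypothetical helper name.
def format_prompt(question: str) -> str:
    """Wrap a user turn in the model's instruction markers."""
    return f"<instruction>{question}</instruction>"

prompt = format_prompt("What is a language model?")
print(prompt)  # <instruction>What is a language model?</instruction>
```

The model is then expected to continue the sequence with its answer and close it with `</s>`, as in the README's example.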