---
base_model: nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF
library_name: transformers
language:
- en
tags:
- nvidia
- llama-3
- pytorch
license: other
license_name: nvidia-open-model-license
license_link: >-
https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
pipeline_tag: text-generation
quantized_by: ymcki
---
Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF
## Prompt Template
```
### System:
{system_prompt}
### User:
{user_prompt}
### Assistant:
```
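If you build prompts in your own code instead of relying on a chat front end, the template above can be filled in with a small helper. This is only a minimal sketch: `build_prompt` is a hypothetical name, and the exact newline placement (one newline after each header and a trailing newline after `### Assistant:`) is an assumption read off the template as displayed.

```python
def build_prompt(system_prompt: str, user_prompt: str) -> str:
    """Fill in the prompt template; the model's reply is expected to follow the last header.

    The newline layout mirrors the template as displayed and is an assumption.
    """
    return (
        "### System:\n"
        f"{system_prompt}\n"
        "### User:\n"
        f"{user_prompt}\n"
        "### Assistant:\n"
    )


if __name__ == "__main__":
    # Example values are placeholders.
    print(build_prompt("You are a helpful assistant.", "What is a GGUF file?"))
```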
***Important*** for people who want to do their own quantization: there is a typo in tokenizer_config.json of the original model that mistakenly sets eos_token to '<|eot_id|>' when it should be '<|end_of_text|>'. Please fix it, or overwrite it with the [tokenizer_config.json](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/tokenizer_config.json) in this repository, before you do the gguf conversion yourself.
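If you prefer to patch the file in place rather than overwrite it, something like the following should do. It is a sketch that assumes `eos_token` is stored as a plain string at the top level of tokenizer_config.json; `fix_eos_token` is a hypothetical helper, not part of any library.

```python
import json
from pathlib import Path


def fix_eos_token(config_path: str, eos_token: str = "<|end_of_text|>") -> bool:
    """Set eos_token in a tokenizer_config.json file.

    Returns True if the file was modified, False if it already had the value.
    Assumes eos_token is a top-level plain string (an assumption).
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("eos_token") == eos_token:
        return False
    config["eos_token"] = eos_token
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return True
```

Call it on the tokenizer_config.json inside your local copy of the original model before running the conversion script.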
Starting from [b4380](https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4380.tar.gz) of llama.cpp, DeciLMForCausalLM's variable Grouped Query Attention is supported. Please download and compile it to run the GGUFs in this repository.
This modification should fully support Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have no_op or linear ffn layers. Support for those can be added once there are actual models using those layer types.
Since I am a free user, for the time being I only upload the models that are likely to interest most people.
## Download a file (not the whole branch) from below:
Perplexity for f16 gguf is 6.646565 ± 0.040986.
| Quant Type | imatrix | File Size | Delta Perplexity | KL Divergence | Description |
| ---------- | ------- | ----------| ---------------- | ------------- | ----------- |
| [Q6_K](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q6_K.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 42.26GB | -0.002436 ± 0.001565 | 0.003332 ± 0.000014 | Good for Nvidia cards or Apple Silicon with 48GB RAM. Should perform very close to the original |
| [Q5_K_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q5_K_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 36.47GB | 0.020310 ± 0.002052 | 0.005642 ± 0.000024 | Good for A100 40GB or dual 3090. Better than Q4_K_M but larger and slower. |
| [Q4_K_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_K_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 31.04GB | 0.055444 ± 0.002982 | 0.012021 ± 0.000052 | Good for A100 40GB or dual 3090. Higher cost performance ratio than Q5_K_M. |
| IQ4_NL | calibration_datav3 | 29.30GB | 0.088279 ± 0.003944 | 0.020314 ± 0.000093 | For 32GB cards, e.g. 5090. Minor performance gain doesn't justify its use over IQ4_XS |
| [IQ4_XS](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ4_XS.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 27.74GB | 0.095486 ± 0.004039 | 0.020962 ± 0.000097 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. Recommended. |
| Q4_0 | calibration_datav3 | 29.34GB | 0.543042 ± 0.009290 | 0.077602 ± 0.000389 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. |
| [Q4_0_4_8](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0_4_8.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 29.25GB | Same as Q4_0 assumed | Same as Q4_0 assumed | For Apple Silicon |
| [IQ3_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 23.5GB | 0.313812 ± 0.006299 | 0.054266 ± 0.000205 | Largest model that can fit a single 3090 at 4k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
| [IQ3_S](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_S.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 22.7GB | 0.434774 ± 0.007162 | 0.069264 ± 0.000242 | Largest model that can fit a single 3090 at 8k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
| Q3_K_S | calibration_datav3 | 22.7GB | 0.698971 ± 0.010387 | 0.089605 ± 0.000443 | Largest model that can fit a single 3090 that performs well in all platforms |
| Q3_K_S | none | 22.7GB | 2.224537 ± 0.024868 | 0.283028 ± 0.001220 | Largest model that can fit a single 3090 without imatrix |
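To put the Delta Perplexity column in context, you can add it back to the f16 baseline. The sketch below assumes the delta is defined as PPL(quant) minus PPL(f16), which is how I read the llama-perplexity output; the values are copied from a few rows of the table and the function names are made up for illustration.

```python
F16_PERPLEXITY = 6.646565  # f16 baseline quoted above the table

# Delta perplexity values copied from the table (imatrix quants only).
DELTA_PERPLEXITY = {
    "Q6_K": -0.002436,
    "Q5_K_M": 0.020310,
    "Q4_K_M": 0.055444,
    "IQ4_XS": 0.095486,
    "Q4_0": 0.543042,
    "IQ3_M": 0.313812,
}


def absolute_perplexity(quant: str) -> float:
    """Perplexity of a quant, assuming delta = PPL(quant) - PPL(f16)."""
    return F16_PERPLEXITY + DELTA_PERPLEXITY[quant]


def relative_increase_percent(quant: str) -> float:
    """Delta perplexity as a percentage of the f16 baseline."""
    return 100.0 * DELTA_PERPLEXITY[quant] / F16_PERPLEXITY


if __name__ == "__main__":
    for name in DELTA_PERPLEXITY:
        print(f"{name}: {absolute_perplexity(name):.6f} "
              f"({relative_increase_percent(name):+.2f}%)")
```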
## How to check i8mm support for Apple devices
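ARM i8mm support is required to take advantage of the Q4_0_4_8 gguf. All ARM architectures from ARMv8.6-A onward support i8mm, which means Apple Silicon from the A15 and M2 onward works best with Q4_0_4_8.
For Apple devices, run the following command and look for `hw.optional.arm.FEAT_I8MM` in the output (a value of 1 means i8mm is supported):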
```
sysctl hw
```
On the other hand, inference on an Nvidia 3090 is significantly faster with Q4_0 than with the other ggufs, so for GPU inference you are better off using Q4_0.
## Which Q4_0 model to use for Apple devices
| Brand | Series | Model | i8mm | sve | Quant Type |
| ----- | ------ | ----- | ---- | --- | -----------|
| Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
| Apple | M | M1 | No | No | Q4_0_4_4 |
| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |
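The table above boils down to a simple lookup. Here is a rough sketch of it; `q4_0_variant` is a hypothetical helper that only knows the chips listed in the table and refuses anything else rather than guessing.

```python
import re


def q4_0_variant(chip: str) -> str:
    """Return the Q4_0 variant for an Apple chip name such as 'A15' or 'M2 Pro'.

    Only covers the ranges listed in the table (A4 to A18, M1 to M4).
    """
    match = re.match(r"([AM])(\d+)(?:\s|$)", chip.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized chip name: {chip!r}")
    series, generation = match.group(1), int(match.group(2))
    if series == "A":
        if not 4 <= generation <= 18:
            raise ValueError(f"{chip!r} is not covered by the table")
        has_i8mm = generation >= 15  # A15 and later support i8mm
    else:
        if not 1 <= generation <= 4:
            raise ValueError(f"{chip!r} is not covered by the table")
        has_i8mm = generation >= 2  # M2 and later support i8mm
    return "Q4_0_4_8" if has_i8mm else "Q4_0_4_4"
```

For chips newer than the ones in the table, checking the `sysctl hw` output directly is the safer route.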
## Convert safetensors to f16 gguf
Make sure you have llama.cpp git cloned:
```
python3 convert_hf_to_gguf.py Llama-3_1-Nemotron-51B-Instruct/ --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf --outtype f16
```
## Convert f16 gguf to Q4_0 gguf without imatrix
Make sure you have llama.cpp compiled:
```
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
```
## Convert f16 gguf to Q4_0 gguf with imatrix
Make sure you have llama.cpp compiled. Then create an imatrix with a dataset.
```
./llama-imatrix -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f calibration_datav3.txt -o Llama-3_1-Nemotron-51B-Instruct.imatrix --chunks 32
```
Then convert with the created imatrix.
```
./llama-quantize --imatrix Llama-3_1-Nemotron-51B-Instruct.imatrix Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0.gguf q4_0
```
## Calculate perplexity and KL divergence
First, download wikitext.
```
bash ./scripts/get-wikitext-2.sh
```
Second, find the base values of F16 gguf. Please be warned that the generated base value file is about 10GB. Adjust GPU layers depending on your VRAM.
```
./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 100
```
Finally, calculate the perplexity and KL divergence of Q4_0 gguf. Adjust GPU layers depending on your VRAM.
```
./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld --kl-divergence -m Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf -ngl 100 >& Llama-3_1-Nemotron-51B-Instruct.Q4_0.kld
```
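For anyone wondering what the KL divergence number measures: it compares the next-token probability distribution of the quantized model against that of the f16 model. The toy sketch below computes KL(base || quant) for a single token position from two made-up logit vectors. It is not llama.cpp's implementation, just the textbook formula; as far as I understand, llama-perplexity reports a mean of this quantity over all evaluated tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(base || quant) in nats for one token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    base = [2.0, 1.0, 0.1]    # made-up f16 logits
    quant = [1.9, 1.1, 0.2]   # made-up quantized logits
    print(kl_divergence(base, quant))
```

Identical distributions give 0, and the value grows as the quantized model's predictions drift away from the f16 model's.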
## Downloading using huggingface-cli
First, make sure you have huggingface-cli installed:
```
pip install -U "huggingface_hub[cli]"
```
Then, you can target the specific file you want:
```
huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF --include "Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf" --local-dir ./
```
## Running the model using llama-cli
First, download my [Modified llama.cpp-b4139](https://github.com/ymcki/llama.cpp-b4139) v0.2 (or llama.cpp b4380 or later) and compile it, then run:
```
./llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf -p 'You are a European History Professor named Professor Whitman.' -cnv -ngl 100
```
## Credits
Thank you bartowski for providing a README.md to get me started.