ymcki committed
Commit
3d45f98
1 Parent(s): e97a1c7

Upload README.md

---
base_model: nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF
library_name: transformers
language:
- en
tags:
- nvidia
- llama-3
- pytorch
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
pipeline_tag: text-generation
quantized_by: ymcki
---

Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF

## Prompt Template

```
### System:
{system_prompt}
### User:
{user_prompt}
### Assistant:

```

I [modified llama.cpp](https://github.com/ymcki/llama.cpp-b4139) to support DeciLMForCausalLM's variable Grouped Query Attention. Please download and compile it to run the GGUFs in this repository. I am talking to the llama.cpp maintainers to see if they can merge my code into their codebase.

This modification should fully support Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have no_op or linear ffn layers; support for those can be added once there are actually models using such layers.

Since I am a free user, for the time being I only upload models that are likely to be of interest to most people.

## Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| [Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf) | Q4_K_M | 31GB | Good for A100 40GB or dual 3090 |
| [Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf) | Q4_0 | 29.3GB | For 32GB cards, e.g. 5090 |
| [Llama-3_1-Nemotron-51B-Instruct.Q4_0_4_8.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_0_4_8.gguf) | Q4_0_4_8 | 29.3GB | For Apple Silicon |
| [Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf) | Q3_K_S | 22.7GB | Largest model that can fit a single 3090 |

## How to check i8mm support for Apple devices

ARM i8mm support is necessary to take advantage of the Q4_0_4_8 gguf. All ARM architectures from ARMv8.6-A onwards support i8mm, which means Apple Silicon from the A15 and M2 onwards works best with Q4_0_4_8.

On Apple devices, you can check from the command line:

```
sysctl hw
```

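For example, on an Apple Silicon Mac you can query the flag directly; this is a minimal sketch that assumes your macOS version exposes the `hw.optional.arm.FEAT_I8MM` key:

```
# Prints "hw.optional.arm.FEAT_I8MM: 1" when the CPU supports i8mm
sysctl hw.optional.arm.FEAT_I8MM
```
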
On the other hand, Nvidia 3090 inference is significantly faster with Q4_0 than with the other GGUFs, so for GPU inference you are better off using Q4_0.

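If you want to verify this on your own hardware, llama.cpp ships a `llama-bench` tool; a minimal sketch (the model path is just an example):

```
# Runs the default prompt-processing and token-generation benchmarks with all layers on the GPU
./llama-bench -m Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf -ngl 100
```
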
## Which Q4_0 variant to use for Apple devices

| Brand | Series | Model | i8mm | SVE | Quant Type |
| ----- | ------ | ----- | ---- | --- | ---------- |
| Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
| Apple | M | M1 | No | No | Q4_0_4_4 |
| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |

## Convert safetensors to f16 gguf

Make sure you have llama.cpp git cloned and its Python requirements installed.

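A minimal setup sketch, assuming you are cloning the modified fork linked above (the directory name follows the repository name, and `requirements.txt` is llama.cpp's usual top-level requirements file):

```
# Clone the modified llama.cpp fork and install the converter's Python dependencies
git clone https://github.com/ymcki/llama.cpp-b4139
cd llama.cpp-b4139
pip install -r requirements.txt
```

Then run the conversion script from that directory: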
```
python3 convert_hf_to_gguf.py Llama-3_1-Nemotron-51B-Instruct/ --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf --outtype f16
```

## Convert f16 gguf to Q4_0 gguf without imatrix

Make sure you have llama.cpp compiled.

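A minimal build sketch using llama.cpp's standard CMake flow (platform-specific options such as `-DGGML_CUDA=ON` for NVIDIA offload are assumptions about your setup; add them as needed):

```
# Configure and build; tools such as llama-quantize and llama-cli end up in build/bin/
cmake -B build
cmake --build build --config Release -j
```

Then quantize the f16 gguf: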
```
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
```

## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF --include "Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf" --local-dir ./
```

## Running the model using llama-cli

First, download and compile my [modified llama.cpp-b4139](https://github.com/ymcki/llama.cpp-b4139) (v0.2), then run:

```
./llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf -p 'You are a European History Professor named Professor Whitman.' -cnv -ngl 100
```

Here `-cnv` runs llama-cli in conversation mode with the `-p` prompt used as the system prompt, and `-ngl 100` offloads all layers to the GPU.

## Credits

Thank you bartowski for providing a README.md to get me started.