---
language:
- en
license: mit
tags:
- meta
- pytorch
- llama-3.3
- llama-3.3-instruct
- gguf
- ollama
- text-generation-inference
- Text-generation-webui
- llama
- llama_3.3
- instruct
- LLM
- 70b
model_name: Llama-3.3-70B-Instruct-GGUF
arxiv: 2407.21783
base_model: meta-llama/Llama-3.3-70B-Instruct
inference: false
model_creator: Meta Llama 3.3
model_type: llama
pipeline_tag: text-generation
prompt_template: >
  [INST] <<SYS>>

  You are a helpful, respectful and honest assistant. Always answer as helpfully
  as possible. If a question does not make any sense, or is not factually
  coherent, explain why instead of answering something that is not correct. If
  you don't know the answer to a question, do not answer it with false
  information.

  <</SYS>>

  {prompt}[/INST]
quantized_by: hierholzer
---

[![Hierholzer Banner](https://tvtime.us/static/images/LLAMA3.3.png)](#)

# GGUF Model
-----------------------------------


Here are quantized versions of Llama-3.3-70B-Instruct in the GGUF format.


## 🤔 What Is GGUF
GGUF is a binary model file format designed for use with GGML-based executors.
It was developed by @ggerganov, who is also the author of llama.cpp, a popular C/C++ LLM inference framework.
Models initially developed in frameworks like PyTorch can be converted to GGUF for use with those engines.
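
You do not need to do this yourself for this repository, since the quantized files are already provided, but for reference, here is a minimal sketch of that conversion, assuming a local llama.cpp checkout and the original Hugging Face weights downloaded to `./Llama-3.3-70B-Instruct` (paths and output filenames are illustrative):

```shell
# Convert the original Hugging Face weights to an unquantized (F16) GGUF file.
python convert_hf_to_gguf.py ./Llama-3.3-70B-Instruct \
  --outfile Llama-3.3-70B-Instruct-F16.gguf --outtype f16

# Quantize the F16 GGUF down to one of the smaller types, e.g. Q4_K_M.
./llama-quantize Llama-3.3-70B-Instruct-F16.gguf \
  Llama-3.3-70B-Instruct-Q4_K_M.gguf Q4_K_M
```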


## ☑️Uploaded Quantization Types

Here are the quantized versions that I currently have available:

- [x] Q2_K
- [x] Q3_K_S
- [x] Q3_K_M
- [x] Q3_K_L
- [x] Q4_K_S
- [x] Q4_K_M ~ *Recommended*
- [x] Q5_K_S ~ *Recommended* 
- [x] Q5_K_M ~ *Recommended*
- [ ] Q6_K
- [ ] Q8_0 ~ *NOT Recommended*
- [ ] F16 ~ *NOT Recommended*
- [ ] F32 ~ *NOT Recommended*

*Feel free to reach out to me if you need a specific quantization type that I do not currently offer.*


### 📈All Quantization Types Possible
Below is a table of all the quantization types that llama.cpp supports, along with short descriptions. The **#** column is the numeric type ID that llama.cpp's `llama-quantize` tool accepts as an alternative to the type name.

| **#** | **Type** | _Description_                                                  |
|-------|:--------:|----------------------------------------------------------------|
| 2     | Q4_0     | small, very high quality loss - legacy, prefer using Q3_K_M    |
| 3     | Q4_1     | small, substantial quality loss - legacy, prefer using Q3_K_L  |
| 8     | Q5_0     | medium, balanced quality - legacy, prefer using Q4_K_M         |
| 9     | Q5_1     | medium, low quality loss - legacy, prefer using Q5_K_M         |
| 10    | Q2_K     | smallest, extreme quality loss - *NOT Recommended*             |
| 12    | Q3_K     | alias for Q3_K_M                                               |
| 11    | Q3_K_S   | very small, very high quality loss                             |
| 12    | Q3_K_M   | very small, high quality loss                                  |
| 13    | Q3_K_L   | small, high quality loss                                       |
| 15    | Q4_K     | alias for Q4_K_M                                               |
| 14    | Q4_K_S   | small, some quality loss                                       |
| 15    | Q4_K_M   | medium, balanced quality - *Recommended*                       |
| 17    | Q5_K     | alias for Q5_K_M                                               |
| 16    | Q5_K_S   | large, low quality loss - *Recommended*                        |
| 17    | Q5_K_M   | large, very low quality loss - *Recommended*                   |
| 18    | Q6_K     | very large, very low quality loss                              |
| 7     | Q8_0     | very large, extremely low quality loss                         |
| 1     | F16      | extremely large, virtually no quality loss - *NOT Recommended* |
| 0     | F32      | absolutely huge, lossless - *NOT Recommended*                  |
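
For example, because the name and the numeric ID are interchangeable, the following two `llama-quantize` invocations are equivalent (input and output filenames here are illustrative):

```shell
# Quantize an F16 GGUF to Q4_K_M, selecting the type by name...
./llama-quantize model-F16.gguf model-Q4_K_M.gguf Q4_K_M

# ...or by its numeric ID from the table above (15 = Q4_K_M).
./llama-quantize model-F16.gguf model-Q4_K_M.gguf 15
```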

## 💪 Benefits of using GGUF

By using a quantized GGUF version of Llama-3.3-70B-Instruct, you can run this LLM with significantly fewer resources than the unquantized model requires, making it possible to run this 70B model on a machine with far less memory.
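
As a rough, back-of-the-envelope estimate (approximate figures, not exact file sizes): 70B parameters stored at 16 bits per weight take about 70 × 2 bytes ≈ 140 GB for the weights alone, whereas Q4_K_M averages roughly 4.5 to 5 bits per weight, bringing the weights down to roughly 40 to 45 GB, plus whatever the KV cache and runtime overhead require.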

## ⚙️️Installation
--------------------------------------------
Here are two different methods you can use to run the quantized versions of Llama-3.3-70B-Instruct.

### 1️⃣ Text-generation-webui

Text-generation-webui is a web UI for Large Language Models that you can run locally.

#### ☑️  How to install Text-generation-webui
*If you already have Text-generation-webui then skip this section*

| #  | Download Text-generation-webui                                                                                   |
|----|------------------------------------------------------------------------------------------------------------------|
| 1. | Clone the text-generation-webui repository from GitHub by copying the git clone snippet below:                   |
```shell
git clone https://github.com/oobabooga/text-generation-webui.git
```
| #  | Install Text-generation-webui                                                                                    |
|----|------------------------------------------------------------------------------------------------------------------|
| 1. | Run the `start_linux.sh`, `start_windows.bat`, `start_macos.sh`, or `start_wsl.bat` script depending on your OS. |
| 2. | Select your GPU vendor when asked.                                                                               |
| 3. | Once the installation script ends, browse to `http://localhost:7860`.                                            |
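
Put together, a minimal terminal session for the Linux case might look like the following (on other systems, run the matching start script from step 1 instead):

```shell
# Clone the repository and launch the one-click installer (Linux shown).
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
./start_linux.sh
# When the installation script finishes, open http://localhost:7860 in your browser.
```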

#### ✅Using Llama-3.3-70B-Instruct-GGUF with Text-generation-webui
| #  | Using Llama-3.3-70B-Instruct-GGUF with Text-generation-webui                                                                                                                                             |
|----|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1. | Once you are running text-generation-webui in your browser, click on the 'Model' Tab at the top of your window.                                                                                          |
| 2. | In the Download Model section, enter the model repo: *hierholzer/Llama-3.3-70B-Instruct-GGUF* and, below it, the specific filename to download, such as: *Llama-3.3-70B-Instruct-Q4_K_M.gguf*            |
| 3. | Click Download and wait for the download to complete. NOTE: you can see the download progress back in your terminal window.                                                                              |
| 4. | Once the download is finished, click the blue refresh icon within the Model tab that you are in.                                                                                                         |
| 5. | Select your newly downloaded GGUF file in the Model drop-down. Once selected, change the settings to best match your system.                                                                             |
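
Alternatively, if you prefer to fetch the file from the terminal first, here is a sketch using the Hugging Face CLI (the filename is an assumption based on the repository's naming pattern; adjust it to the quantization you want):

```shell
# Download a single quantized file directly into the web UI's models folder.
huggingface-cli download hierholzer/Llama-3.3-70B-Instruct-GGUF \
  Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --local-dir ./text-generation-webui/models
```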

### 2️⃣ Ollama
Ollama runs as a local service.
Although you can use it entirely from a command-line interface, Ollama's best attribute is its REST API.
Being able to reach your locally running LLMs through this API gives you almost endless possibilities!

*Feel free to reach out to me if you would like to know some examples that I use this API for*
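
As a minimal sketch of what that looks like, here is a request against Ollama's `/api/generate` endpoint (Ollama listens on port 11434 by default; the model tag assumes you have already pulled a model as described below):

```shell
# Ask a locally running Ollama instance for a completion over its REST API.
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/hierholzer/Llama-3.3-70B-Instruct-GGUF:Q4_K_M",
  "prompt": "Explain GGUF quantization in one paragraph.",
  "stream": false
}'
```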

#### ☑️  How to install Ollama
Go to the URL below, and then select which OS you are using:
```shell
https://ollama.com/download
```
On Windows or macOS, you will download an installer file and run it.
On Linux, the site provides a single command that you run in your terminal window.
*That's about it for installing Ollama.*
#### ✅Using Llama-3.3-70B-Instruct-GGUF with Ollama
Ollama does have a Model Library where you can download models:
```shell
https://ollama.com/library
```
The Model Library offers many different LLM versions that you can use.
However, at the time of writing, there is no version of Llama-3.3-Instruct offered in the Ollama library.

If you would like to use Llama-3.3-Instruct (70B), do the following:

| #  | Running the 70B quantized version of Llama 3.3-Instruct with Ollama                          |
|----|----------------------------------------------------------------------------------------------|
| 1. | Open a terminal on the machine where you have Ollama installed.                             |
| 2. | Paste the following command:                                                                 |
```shell
ollama run hf.co/hierholzer/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
```
*Replace Q4_K_M with whatever version you would like to use from this repository.*
| #  | Running the 70B quantized version of Llama 3.3-Instruct with Ollama - *continued* |
|----|-----------------------------------------------------------------------------------|
| 3. | This will download & run the model. It will also be saved for future use.         |
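
After the first run finishes, the model stays cached locally, so you can confirm it is available and start it again later with standard Ollama commands:

```shell
# List locally available models to confirm the download succeeded.
ollama list

# Start the model again later without re-downloading it.
ollama run hf.co/hierholzer/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
```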

-------------------------------------------------

[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-FFD21E?logo=huggingface&logoColor=000)](#)
[![OS](https://img.shields.io/badge/OS-linux%2C%20windows%2C%20macOS-0078D4)](https://docs.abblix.com/docs/technical-requirements)
[![CPU](https://img.shields.io/badge/CPU-x86%2C%20x64%2C%20ARM%2C%20ARM64-FF8C00)](https://docs.abblix.com/docs/technical-requirements)
[![forthebadge](https://forthebadge.com/images/badges/license-mit.svg)](https://forthebadge.com) 
[![forthebadge](https://forthebadge.com/images/badges/made-with-python.svg)](https://forthebadge.com)
[![forthebadge](https://forthebadge.com/images/badges/powered-by-electricity.svg)](https://forthebadge.com)