---
license: cc-by-nc-4.0
inference: false
pipeline_tag: text-generation
tags:
- gguf
- quantized
- text-generation-inference
---

> [!TIP]
> **Credits:** <br>
> Made with love by [**@Lewdiculous**](https://huggingface.co/Lewdiculous), with handy contributions from [**@SolidSnacke**](https://huggingface.co/SolidSnacke) and [**@Virt-io**](https://huggingface.co/Virt-io). <br>
> If this proves useful for you, feel free to credit and share the repository and authors.

<!--
> [!WARNING]
> **[Important] Llama-3:**
> 
> For those converting Llama-3 BPE models, you might need to read [**llama.cpp/#6920**](https://github.com/ggerganov/llama.cpp/pull/6920#issue-2265280504) for more context. <br>
> Try the conversion first and, if you run into issues, try the tips below.
>
> Basically, make sure you're on the latest llama.cpp repo commit, then run the new `convert-hf-to-gguf-update.py` script inside the repo. You will need to provide a Hugging Face read token and have access to the Meta-Llama-3 repositories ([here](https://huggingface.co/meta-llama/Meta-Llama-3-8B) and [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)), so fill in the access request forms right away to be able to fetch the necessary files. You also might need to refresh the token if it stops working after some time. Afterwards, manually copy the config files from `llama.cpp\models\tokenizers\llama-bpe` into your downloaded **model** folder, replacing the existing ones. <br>
> Try again and the conversion process should work as expected.
-->

<!--
> [!WARNING]
> **Experimental:** <br>
> There is a new experimental script, `gguf-imat-lossless-for-BF16.py`, which converts directly from a BF16 GGUF to hopefully generate lossless (or as close to lossless as currently possible) Llama-3 model quantizations, avoiding the recently discussed issues on that topic. It is more resource intensive and generates more drive writes, since there is a whole additional conversion step that isn't performed in the previous version. This should only be necessary until we have GPU support for BF16 to run directly without conversion.
-->

> [!NOTE]
> **Linux support (experimental):** <br>
> There's an experimental script for Linux, `gguf-imat-lossless-for-BF16-linux.py` [**[context]**](https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script/discussions/32#66b476238bbe6a86a0228553). <br>
> While I personally can't vouch for it, it's worth trying, and you can report how well it worked (or didn't) in your case. <br>
> Improvements are very welcome!

Pull Requests with your own features and improvements to this script are always welcome.

# GGUF-IQ-Imatrix-Quantization-Script:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65ddabb9bbffb280f4b45d8e/vwlPdqxrSdILCHM24n_M2.png)

A simple Python script (`gguf-imat.py`; I recommend using the specific "for-FP16" or "for-BF16" scripts) that generates various GGUF-IQ-Imatrix quantizations from a Hugging Face `author/model` input, for Windows and NVIDIA hardware.

This is set up for a Windows machine with an NVIDIA GPU with 8GB of VRAM. To change the `-ngl` (number of GPU layers) amount, edit [**line 124**](https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script/blob/main/gguf-imat.py#L124). This value is only relevant during the `--imatrix` data generation. If you don't have enough VRAM, you can decrease the `-ngl` amount or set it to 0 to use only your system RAM for all layers. This will make the imatrix data generation take longer, so it's a good idea to find the number that gives your own machine the best results.
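
To make the role of `-ngl` concrete, here is a minimal sketch of how an imatrix command line could be assembled. It is not the script's actual code at line 124: the helper name `build_imatrix_command`, its parameters, and the file names in the example are hypothetical, and the executable name depends on your llama.cpp build. The `-m`, `-f`, `-o`, and `-ngl` flags are the ones llama.cpp's imatrix tool accepts.

```python
from pathlib import Path


def build_imatrix_command(
    executable: str,
    model_path: Path,
    calibration_file: Path,
    output_file: Path,
    gpu_layers: int,
) -> list[str]:
    """Assemble an argument list for llama.cpp's imatrix tool.

    Hypothetical helper for illustration; the real script may build
    its command differently.
    """
    if gpu_layers < 0:
        raise ValueError("gpu_layers must be 0 or greater")
    return [
        executable,
        "-m", str(model_path),        # model to compute the imatrix for
        "-f", str(calibration_file),  # calibration text, e.g. imatrix.txt
        "-o", str(output_file),       # where the imatrix data is written
        "-ngl", str(gpu_layers),      # layers offloaded to the GPU; 0 = CPU/RAM only
    ]


if __name__ == "__main__":
    # Example values only; adjust the executable name and paths to your setup.
    command = build_imatrix_command(
        "imatrix",
        Path("model-F16.gguf"),
        Path("imatrix") / "imatrix.txt",
        Path("imatrix.dat"),
        gpu_layers=0,
    )
    print(" ".join(command))
```

Passing the list to `subprocess.run` avoids shell quoting issues with paths that contain spaces. Lowering `gpu_layers` step by step until generation no longer runs out of VRAM is one way to find a workable value.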

Your `imatrix.txt` is expected to be located inside the `imatrix` folder. I have already included a file that is considered a good starting option; it came from [this discussion](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384). If you have suggestions or other imatrix data to recommend, please share them.

Adjust `quantization_options` at [**line 138**](https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script/blob/main/gguf-imat.py#L138).
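
As a rough illustration of what trimming that list means, the sketch below pairs a shortened `quantization_options` list with the output folder convention described under Usage (`models\{model-name}-GGUF`). The list contents are only an example of llama.cpp quantization type names, and both the `planned_outputs` helper and the `{model-name}-{quant}.gguf` file naming are assumptions for illustration, not taken from the script.

```python
from pathlib import Path

# Example selection only; the script ships with its own, longer list.
quantization_options = ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]


def planned_outputs(
    model_name: str,
    options: list[str],
    base_dir: Path = Path("models"),
) -> list[Path]:
    """Return one hypothetical output path per requested quantization."""
    output_dir = base_dir / f"{model_name}-GGUF"
    # File naming here is an assumption; check the script for the real pattern.
    return [output_dir / f"{model_name}-{quant}.gguf" for quant in options]


if __name__ == "__main__":
    for path in planned_outputs("My-Model", quantization_options):
        print(path)
```

Each entry in the list costs one extra quantization pass and one extra file on disk, so keeping only the types you plan to use saves both time and drive writes.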

> [!NOTE]  
> Models downloaded for quantization are cached at `C:\Users\{{User}}\.cache\huggingface\hub`. You can delete these files manually once you're done with your quantizations, or do it directly from your terminal with the `rmdir "C:\Users\{{User}}\.cache\huggingface\hub"` command. You can also put that command into another script or alias it to something convenient. 
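
If you would rather do the cleanup from Python, the following is a minimal sketch using only the standard library. The function names `default_hub_cache` and `clear_hub_cache` are hypothetical, and the default path assumes the cache has not been relocated (for example through the `HF_HOME` or `HF_HUB_CACHE` environment variables).

```python
import shutil
from pathlib import Path


def default_hub_cache() -> Path:
    """Default Hugging Face hub cache location under the user's home folder."""
    return Path.home() / ".cache" / "huggingface" / "hub"


def clear_hub_cache(cache_dir: Path) -> bool:
    """Delete the cache folder and everything in it.

    Returns True if a folder was removed, False if there was nothing to delete.
    """
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)  # irreversible: removes all cached models
    return True


# Usage (uncomment once you are sure you no longer need the cached models):
# clear_hub_cache(default_hub_cache())
```

This removes every cached model, not only the one you just quantized, so anything you still need will be downloaded again on the next run.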


**Hardware:**

- NVIDIA GPU with 8GB of VRAM.
- 32GB of system RAM.

**Software Requirements:**
- Windows 10/11
- Git
- Python 3.11
  - `pip install huggingface_hub`
 
**Usage:**
```
python .\gguf-imat.py 
```
Quantizations are written to the newly created `models\{model-name}-GGUF` folder.
<br><br>