FantasiaFoundry committed (verified)
Commit e947685 · 1 Parent(s): 1f5317e

Update README.md

Files changed (1): README.md (+6 -5)
README.md CHANGED
@@ -17,14 +17,15 @@ tags:
 > [!WARNING]
 > **[Important] Llama-3:**
 >
-> For those converting LLama-3 BPE models, you'll have to read [**llama.cpp/#6920**](https://github.com/ggerganov/llama.cpp/pull/6920#issue-2265280504) for more context. <br>
+> For those converting Llama-3 BPE models, you might need to read [**llama.cpp/#6920**](https://github.com/ggerganov/llama.cpp/pull/6920#issue-2265280504) for more context. <br>
+> Try the conversion first; if you run into issues, try the tips below.
 >
 > Basically, make sure you're on the latest llama.cpp repo commit, then run the new `convert-hf-to-gguf-update.py` script inside the repo. You will need to provide a Hugging Face read token and have access to the Meta-Llama-3 repositories – [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B) and [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) – so fill out the access request forms right away to be able to fetch the necessary files; you may also need to refresh the token if it stops working after some time. Afterwards, manually copy the config files from `llama.cpp\models\tokenizers\llama-bpe` into your downloaded **model** folder, replacing the existing ones. <br>
 > Try again and the conversion process should work as expected.
->
+
+> [!WARNING]
 > **Experimental:**
->
-> There is a new experimental script added, `gguf-imat-llama-3-lossless.py`, which performs the conversions directly from a BF16 GGUF to hopefully generate lossless, or as close to that for now, Llama-3 model quantizations avoiding the recent talked about issues on that topic, it is more resource intensive and will generate more writes in the drive as there's a whole additional conversion step that isn't performed in the previous version. This should only be necessary until we have GPU support for BF16 to run directly without conversion.
+> There is a new experimental script, `gguf-imat-lossless-for-BF16.py`, which performs the conversion directly from a BF16 GGUF to hopefully generate lossless (or as close to lossless as currently possible) Llama-3 quantizations, avoiding the recently discussed issues on that topic. It is more resource intensive and writes more to the drive, since there is an additional conversion step that the previous version does not perform. This should only be necessary until there is GPU support to run BF16 directly without conversion.
 
 
 Pull Requests with your own features and improvements to this script are always welcome.
@@ -33,7 +34,7 @@ Pull Requests with your own features and improvements to this script are always
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65ddabb9bbffb280f4b45d8e/vwlPdqxrSdILCHM24n_M2.png)
 
-Simple python script (`gguf-imat.py`) to generate various GGUF-IQ-Imatrix quantizations from a Hugging Face `author/model` input, for Windows and NVIDIA hardware.
+Simple Python script (`gguf-imat.py`; I recommend using the specific "for-FP16" or "for-BF16" scripts) to generate various GGUF-IQ-Imatrix quantizations from a Hugging Face `author/model` input, for Windows and NVIDIA hardware.
 
 This is set up for a Windows machine with 8GB of VRAM, assuming use with an NVIDIA GPU. If you want to change the `-ngl` (number of GPU layers) amount, you can do so at [**line 124**](https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script/blob/main/gguf-imat.py#L124). This is only relevant during the `--imatrix` data generation. If you don't have enough VRAM, you can decrease the `-ngl` amount or set it to 0 to use only your system RAM for all layers; this will make the imatrix data generation take longer, so it's a good idea to find the number that gives your own machine the best results.

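As a concrete illustration of the copy step in the Llama-3 warning above (after `convert-hf-to-gguf-update.py` has been run with a Hugging Face read token), here is a minimal Python sketch; the paths are assumptions, so point them at your own llama.cpp checkout and downloaded model folder.

```python
# Minimal sketch (not part of the repo's scripts) of the tokenizer-fix step from the
# Llama-3 warning above: copy the files regenerated under llama.cpp\models\tokenizers\llama-bpe
# over the matching files in the downloaded model folder. All paths are assumptions.
import shutil
from pathlib import Path

llama_cpp_dir = Path(r"C:\llama.cpp")                    # assumed llama.cpp checkout
model_dir = Path(r"C:\models\Meta-Llama-3-8B-Instruct")  # assumed downloaded model folder
tokenizer_dir = llama_cpp_dir / "models" / "tokenizers" / "llama-bpe"

# Replace the model's existing tokenizer/config files, as the warning instructs.
for src in tokenizer_dir.iterdir():
    if src.is_file():
        shutil.copy2(src, model_dir / src.name)
        print(f"replaced {src.name}")
```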
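The experimental `gguf-imat-lossless-for-BF16.py` note describes an extra conversion hop: the Hugging Face weights are first converted to a BF16 GGUF, and quantization then runs from that file. The sketch below is only an illustration of that two-step idea, not the script's actual implementation; it assumes a llama.cpp build whose `convert-hf-to-gguf.py` accepts a BF16 output type and whose `quantize` tool accepts `--imatrix`, and all paths are made up.

```python
# Rough illustration of the "convert to BF16 GGUF first, then quantize" flow from the
# Experimental note above. Tool names, flags, and paths are assumptions about a recent
# llama.cpp build; they are NOT taken from gguf-imat-lossless-for-BF16.py itself.
import subprocess

model_dir = r"C:\models\Meta-Llama-3-8B-Instruct"  # assumed downloaded model folder
bf16_gguf = r"C:\gguf\model-bf16.gguf"             # extra intermediate file (more disk writes)
quant_gguf = r"C:\gguf\model-Q4_K_M.gguf"
imatrix_dat = r"C:\gguf\imatrix.dat"               # previously generated imatrix data

# Step 1: Hugging Face weights -> BF16 GGUF (the additional conversion step).
subprocess.run(
    ["python", r"llama.cpp\convert-hf-to-gguf.py", model_dir,
     "--outtype", "bf16", "--outfile", bf16_gguf],
    check=True,
)

# Step 2: quantize directly from the BF16 GGUF, reusing the imatrix data.
subprocess.run(
    [r"llama.cpp\quantize.exe", "--imatrix", imatrix_dat,
     bf16_gguf, quant_gguf, "Q4_K_M"],
    check=True,
)
```

The extra BF16 intermediate file is what causes the additional drive writes mentioned in the note.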
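Finally, the `-ngl` value discussed above only matters while the `--imatrix` data is being generated. Purely as an illustration (this is not the actual call at line 124 of `gguf-imat.py`), the sketch below shows where that value would fit in a llama.cpp `imatrix` invocation; the executable name and flags are assumptions and can differ between llama.cpp versions.

```python
# Illustrative sketch only - not the actual call inside gguf-imat.py (line 124).
# It shows where the -ngl (GPU layers) value fits into imatrix data generation;
# the executable name and flags are assumptions about a Windows llama.cpp build.
import subprocess

def generate_imatrix(model_gguf: str, calibration_txt: str, output_dat: str, ngl: int) -> None:
    """Generate imatrix data, offloading `ngl` layers to the GPU (0 = system RAM only)."""
    subprocess.run(
        [
            r"llama.cpp\imatrix.exe",   # assumed location of the imatrix executable
            "-m", model_gguf,           # source GGUF model
            "-f", calibration_txt,      # calibration text used for the imatrix
            "-o", output_dat,           # output imatrix data file
            "-ngl", str(ngl),           # lower this, or set it to 0, if you run out of VRAM
        ],
        check=True,
    )

# Example: start with a value that fits an 8GB card and adjust until it no longer
# runs out of memory; ngl=0 keeps every layer in system RAM (slower, but always fits).
# generate_imatrix("model-f16.gguf", "imatrix.txt", "imatrix.dat", ngl=8)
```

Lower `ngl` (or set it to 0 for system RAM only) if imatrix generation runs out of VRAM; it will take longer but still completes.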