GGUF conversion

#3
by compilade - opened

For anyone else who wants to try this on CPU, it's possible to convert this to GGUF with the ternary types from https://github.com/ggerganov/llama.cpp/pull/8151 relatively easily:

Starting from https://github.com/ggerganov/llama.cpp/tree/8b836ae731bbb2c5640bc47df5b0a78ffcb129cb, or any version of the master branch of llama.cpp where this patch still applies, apply the following patch to convert_hf_to_gguf.py:

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index ff4c9226..c37dcbcc 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -164,8 +164,19 @@ class Model:
                 for name in model_part.keys():
                     if self.is_safetensors:
                         if self.lazy:
+                            if (name.endswith("_scale") and name.removesuffix("_scale") in model_part.keys()):
+                                continue
                             data = model_part.get_slice(name)
                             data = LazyTorchTensor.from_safetensors_slice(data)
+                            if (name + "_scale" in model_part.keys()):
+                                orig_shape = data.shape
+                                scale = model_part.get_slice(name + "_scale")
+                                shift = torch.tensor([0, 2, 4, 6], dtype=torch.uint8).reshape((4, *(1 for _ in range(len(orig_shape)))))
+                                data = data.unsqueeze(0).expand((4, *orig_shape)) >> shift
+                                data = data & 3
+                                data = (data.float() - 1).reshape((orig_shape[0] * 4, *orig_shape[1:]))
+                                # The scale is inverted
+                                data = data / LazyTorchTensor.from_safetensors_slice(scale).float()
                         else:
                             data = model_part.get_tensor(name)
                     else:

(very ad-hoc patch, probably only works for this model)
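
To make the unpacking step in the patch easier to follow, here is a minimal pure-Python sketch of the same arithmetic, without torch or lazy tensors. It assumes, as the patch does, that each uint8 packs four 2-bit values, that subtracting 1 maps them to -1, 0 or 1, and that the stored scale is inverted (so the values are divided by it). The name `unpack_ternary` and the use of a single scalar scale are illustrative assumptions, not part of llama.cpp.

```python
def unpack_ternary(packed, scale):
    """Unpack rows of uint8 values into 4x as many rows of scaled ternary floats.

    packed: list of rows, each a list of ints in 0..255 (shape R x C).
    scale:  the stored (inverted) scale, assumed here to be one scalar.
    Returns a list of 4*R rows of C floats.
    """
    out = []
    # Same ordering as the reshape in the patch: all rows for shift 0 come
    # first, then all rows for shift 2, then shift 4, then shift 6.
    for shift in (0, 2, 4, 6):
        for row in packed:
            # Take one 2-bit field, map {0, 1, 2} to {-1, 0, 1}, undo the scale.
            out.append([(((byte >> shift) & 3) - 1) / scale for byte in row])
    return out


# One packed byte 0b10_01_00_10 holds the 2-bit fields 2, 0, 1, 2 (low to high).
print(unpack_ternary([[0b10010010]], 2.0))  # [[0.5], [-0.5], [0.0], [0.5]]
```

The first dimension grows by a factor of 4, which is why the patch reshapes to `orig_shape[0] * 4`.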

First, copy tokenizer.json and tokenizer_config.json from https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct (or https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct if you don't have access) to the directory where this repository (the ternarized llama-3 model) was cloned.
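
If you want to double-check that step before converting, a small sketch like the following can report which of the two tokenizer files are still missing from the model directory. The helper name `missing_tokenizer_files` and the example path are only illustrative assumptions; nothing here comes from llama.cpp itself.

```python
from pathlib import Path

# The two files named in the step above.
REQUIRED_TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json")


def missing_tokenizer_files(model_dir):
    """Return the required tokenizer file names not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in REQUIRED_TOKENIZER_FILES
            if not (model_dir / name).is_file()]


# Example path (an assumption; use wherever the model was cloned).
missing = missing_tokenizer_files("models/Llama3-8B-1.58-100B-tokens")
if missing:
    print("still missing:", ", ".join(missing))
else:
    print("tokenizer files are in place")
```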

Then, assuming you have this model cloned in models/Llama3-8B-1.58-100B-tokens/ (but it can be anywhere, use the appropriate path if it differs), and assuming that you have patched convert_hf_to_gguf.py with the above patch, you can run:

$ python3 convert_hf_to_gguf.py models/Llama3-8B-1.58-100B-tokens --outfile models/llama-3-8B-HF1BitLLM-TQ1_0-big.gguf --outtype tq1_0

(note the -big suffix for the output, which I've added (purely decoratively) because by default the conversion uses 16-bit floats for the token embeddings and the output tensor)

This only requires around 4GB of free RAM to convert (at least on Linux).

Then, you can quantize the token embeddings to Q4_K and the output tensor to Q6_K with llama-quantize:

$ ./build/bin/llama-quantize models/llama-3-8B-HF1BitLLM-TQ1_0-big.gguf models/llama-3-8B-HF1BitLLM-TQ1_0.gguf tq1_0

(the resulting TQ1_0-converted model takes 2.06 GiB)

If you want to use (the potentially faster, but slightly bigger) TQ2_0, you can run

$ ./build/bin/llama-quantize --allow-requantize models/llama-3-8B-HF1BitLLM-TQ1_0.gguf models/llama-3-8B-HF1BitLLM-TQ2_0.gguf tq2_0

(the resulting TQ2_0-converted model takes 2.36 GiB)
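
As a rough sanity check on those two sizes, the following sketch estimates the tensor data from block sizes alone. The tensor shapes are the usual Llama-3-8B ones (32 layers, 4096 embedding, 14336 feed-forward, 1024 for K/V, 128256 vocabulary, 65 F32 norm tensors). The bytes-per-block figures are my understanding of the ggml layouts (256 weights per block for all four types) and should be treated as assumptions; tokenizer and other metadata are ignored.

```python
GIB = 1024 ** 3
BLOCK = 256  # weights per block, assumed for all four types below

# Assumed bytes per 256-weight block for each quantization type.
BYTES_PER_BLOCK = {"TQ1_0": 54, "TQ2_0": 66, "Q4_K": 144, "Q6_K": 210}

N_LAYERS = 32
N_EMBD, N_FF, N_KV, N_VOCAB = 4096, 14336, 1024, 128256
N_NORM_TENSORS = 65  # F32 norm vectors of length N_EMBD


def tensor_bytes(n_weights, qtype):
    """Size in bytes of n_weights stored as whole blocks of qtype."""
    return n_weights // BLOCK * BYTES_PER_BLOCK[qtype]


def estimate_gib(ternary_type):
    """Estimate tensor data size in GiB for a given ternary type."""
    # Per layer: attn_q + attn_output, attn_k + attn_v, ffn_gate/up/down.
    per_layer = 2 * N_EMBD * N_EMBD + 2 * N_EMBD * N_KV + 3 * N_EMBD * N_FF
    total = tensor_bytes(N_LAYERS * per_layer, ternary_type)
    total += tensor_bytes(N_VOCAB * N_EMBD, "Q4_K")  # token embeddings
    total += tensor_bytes(N_VOCAB * N_EMBD, "Q6_K")  # output tensor
    total += N_NORM_TENSORS * N_EMBD * 4             # F32 norms
    return total / GIB


for qtype in ("TQ1_0", "TQ2_0"):
    print(f"{qtype}: about {estimate_gib(qtype):.2f} GiB of tensor data")
```

Under these assumptions the estimates come out slightly below the reported 2.06 GiB and 2.36 GiB, which would be consistent with the remainder being metadata such as the tokenizer vocabulary and merges.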

Enjoy!


The blog article was an interesting read. I'm all for reducing hardware requirements to make running local LLMs more accessible. I followed the directions to convert and quantize then ran with llama-cli -cnv --color -ngl 0 -m llama-3-8B-HF1BitLLM-TQ1_0.gguf. It tries to autocomplete what I say like a base model would so I'm looking forward to seeing how an instruct version performs.

I made the patch and can convert to gguf but cannot quantize:

Conversion:

(venv) C:\Users\Admin\Desktop\llama-cpp-python\llama.cpp>python convert_hf_to_gguf.py "models\Llama3-8B-1.58-100B-tokens" --outfile "models\Llama3-8B-1.58-100B-tokens-big.gguf" --outtype tq1_0
INFO:hf-to-gguf:Loading model: Llama3-8B-1.58-100B-tokens
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F16, shape = {4096, 128256}
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> F16, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.1.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.10.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.10.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.10.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.10.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.10.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.10.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.10.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.10.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.10.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.11.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.11.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.11.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.11.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.11.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.11.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.11.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.11.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.11.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.12.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.12.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.12.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.12.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.12.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.12.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.12.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.12.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.12.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.13.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.13.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.13.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.13.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.13.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.13.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.13.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.13.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.13.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.14.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.14.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.14.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.14.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.14.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.14.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.14.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.14.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.14.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.15.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.15.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.15.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.15.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.15.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.15.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.15.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.15.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.15.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.16.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.16.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.16.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.16.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.16.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.16.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.16.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.16.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.16.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.17.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.17.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.17.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.17.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.17.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.17.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.17.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.17.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.17.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.18.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.18.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.18.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.18.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.18.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.18.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.18.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.18.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.18.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.19.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.19.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.19.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.19.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.19.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.19.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.19.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.19.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.19.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.2.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.2.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.2.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.2.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.2.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.2.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.2.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.2.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.2.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.20.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.20.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.20.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.20.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.20.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.20.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.20.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.20.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.20.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.21.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.21.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.21.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.21.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.21.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.21.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.21.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.21.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.21.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.22.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.22.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.22.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.22.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.22.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.22.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.22.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.22.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.22.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.23.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.23.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.23.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.23.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.23.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.23.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.23.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.23.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.23.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.24.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.24.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.24.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.24.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.24.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.24.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.24.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.24.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.24.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.25.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.25.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.25.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.25.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.25.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.25.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.25.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.25.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.25.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.26.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.26.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.26.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.26.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.26.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.26.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.26.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.26.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.26.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.27.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.27.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.27.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.27.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.27.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.27.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.27.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.27.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.27.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.28.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.28.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.28.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.28.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.28.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.28.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.28.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.28.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.28.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.29.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.29.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.29.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.29.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.29.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.29.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.29.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.29.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.29.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.3.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.3.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.3.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.3.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.3.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.3.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.3.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.3.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.3.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.30.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.30.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.30.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.30.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.30.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.30.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.30.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.30.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.30.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.31.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.31.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.31.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.31.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.31.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.31.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.31.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.31.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.31.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.4.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.4.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.4.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.4.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.4.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.4.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.4.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.4.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.4.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.5.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.5.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.5.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.5.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.5.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.5.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.5.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.5.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.5.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.6.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.6.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.6.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.6.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.6.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.6.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.6.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.6.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.6.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.7.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.7.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.7.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.7.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.7.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.7.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.7.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.7.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.7.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.8.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.8.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.8.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.8.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.8.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.8.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.8.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.8.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.8.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.9.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.9.ffn_down.weight, torch.float32 --> TQ1_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.9.ffn_gate.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.9.ffn_up.weight, torch.float32 --> TQ1_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.9.ffn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.9.attn_k.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:blk.9.attn_output.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.9.attn_q.weight, torch.float32 --> TQ1_0, shape = {4096, 4096}
INFO:hf-to-gguf:blk.9.attn_v.weight, torch.float32 --> TQ1_0, shape = {4096, 1024}
INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 8192
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 14336
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 500000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 36
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128009
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:models\Llama3-8B-1.58-100B-tokens-big.gguf: n_tensors = 291, total_size = 3.6G
Writing: 100%|█████████████████████████████████████████████████████████████████| 3.57G/3.57G [04:27<00:00, 13.3Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to models\Llama3-8B-1.58-100B-tokens-big.gguf

Quantize:

(venv) C:\Users\Admin\Desktop\llama-cpp-python\llama.cpp>.\build\bin\Release\llama-quantize models\Llama3-8B-1.58-100B-tokens-big.gguf \models\Llama3-8B-1.58-100B-tokens-tq1_0.gguf tq1_0
main: build = 3786 (6f9d1275)
main: built with MSVC 19.41.34120.0 for x64
main: quantizing 'models\Llama3-8B-1.58-100B-tokens-big.gguf' to '\models\Llama3-8B-1.58-100B-tokens-tq1_0.gguf' as TQ1_0
llama_model_loader: loaded meta data with 31 key-value pairs and 291 tensors from models\Llama3-8B-1.58-100B-tokens-big.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama3 8B 1.58 100B Tokens
llama_model_loader: - kv 3: general.version str = 1.58
llama_model_loader: - kv 4: general.finetune str = 100b-tokens
llama_model_loader: - kv 5: general.basename str = Llama3
llama_model_loader: - kv 6: general.size_label str = 8B
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Meta Llama 3 8B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Met...
llama_model_loader: - kv 11: llama.block_count u32 = 32
llama_model_loader: - kv 12: llama.context_length u32 = 8192
llama_model_loader: - kv 13: llama.embedding_length u32 = 4096
llama_model_loader: - kv 14: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 15: llama.attention.head_count u32 = 32
llama_model_loader: - kv 16: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 18: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 19: general.file_type u32 = 36
llama_model_loader: - kv 20: llama.vocab_size u32 = 128256
llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 29: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 2 tensors
llama_model_loader: - type tq1_0: 224 tensors
llama_model_quantize: failed to quantize: ios_base::failbit set: iostream stream error
main: failed to quantize model from 'models\Llama3-8B-1.58-100B-tokens-big.gguf'

@brunopio you've got an extra slash at the start of your output path given to llama-quantize which makes it try to write to a path which doesn't exist.
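
To illustrate why that leading slash matters, here is a minimal sketch using Python's pathlib (not anything llama-quantize does internally; the variable and function names are just for illustration). A Windows path starting with a backslash is rooted at the current drive, so it no longer points inside the working directory:

```python
from pathlib import PureWindowsPath

# Working directory taken from the prompt in the log above.
cwd = PureWindowsPath(r"C:\Users\Admin\Desktop\llama-cpp-python\llama.cpp")

good = r"models\Llama3-8B-1.58-100B-tokens-tq1_0.gguf"
bad = r"\models\Llama3-8B-1.58-100B-tokens-tq1_0.gguf"


def is_rooted(path_str: str) -> bool:
    """True if a Windows path starts at the drive root (leading slash or backslash)."""
    return PureWindowsPath(path_str).root != ""


print(is_rooted(good))  # False
print(is_rooted(bad))   # True

# Joining shows where each output path would actually land.
print(cwd / good)  # stays under ...\llama.cpp\models
print(cwd / bad)   # C:\models\Llama3-8B-1.58-100B-tokens-tq1_0.gguf
```

Since there is presumably no C:\models directory, opening the output file fails, which would match the iostream error in the log.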

Thank you!
My GGUFs: https://huggingface.co/brunopio/Llama3-8B-1.58-100B-tokens-GGUF

@brunopio I am attempting to use your repo, https://github.com/blap/llama.cpp , and I run into the following issue:

(.venv) musclez@NSA:~/blap-llama.cpp$ python convert_hf_to_gguf.py /home/musclez/.cache/huggingface/hub/models--HF1BitLLM--Llama3-8B-1.58-100B-tokens/snapshots/5c35ae1f2c622b75a9c28e3603074863d74e4792/ --outfile ./100b.gguf --outtype tq1_0
INFO:hf-to-gguf:Loading model: 5c35ae1f2c622b75a9c28e3603074863d74e4792
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,               torch.bfloat16 --> F16, shape = {4096, 128256}
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.uint8 --> TQ1_0, shape = {14336, 1024}
Traceback (most recent call last):
  File "/home/musclez/blap-llama.cpp/convert_hf_to_gguf.py", line 4330, in <module>
    main()
  File "/home/musclez/blap-llama.cpp/convert_hf_to_gguf.py", line 4324, in main
    model_instance.write()
  File "/home/musclez/blap-llama.cpp/convert_hf_to_gguf.py", line 425, in write
    self.prepare_tensors()
  File "/home/musclez/blap-llama.cpp/convert_hf_to_gguf.py", line 1639, in prepare_tensors
    super().prepare_tensors()
  File "/home/musclez/blap-llama.cpp/convert_hf_to_gguf.py", line 294, in prepare_tensors
    for new_name, data in ((n, d.squeeze().numpy()) for n, d in self.modify_tensors(data_torch, name, bid)):
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/blap-llama.cpp/convert_hf_to_gguf.py", line 1607, in modify_tensors
    return [(self.map_tensor_name(name), data_torch)]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/blap-llama.cpp/convert_hf_to_gguf.py", line 214, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.layers.0.mlp.down_proj.weight_scale'

I was attempting to convert this to Ollama format, but I'm not sure whether your conversions depend on a custom version of the llama.cpp executable or of the conversion scripts.

Would you be willing to push this to Ollama if possible? I'd like to test this at different quants: https://github.com/unclemusclez/ollama-toolkit

@unclemusclez The conversion requires a custom version of convert_hf_to_gguf.py (patched with the diff included in the top-level comment of this discussion), but the resulting models should be usable with any llama.cpp build which supports TQ1_0 and TQ2_0 (most likely Ollama supports them, since it has been 3 weeks since these types were merged into llama.cpp's master branch). Note that GPU support is still WIP for TQ1_0 and TQ2_0; for now they only work on CPU (but nothing technically prevents GPU support, so eventually it will work, once I find enough time).
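
For context on the "Can not map tensor '...weight_scale'" error: an unpatched script sees the `_scale` entries as ordinary tensors it has no mapping for. Below is a rough pure-Python sketch of what the patch from the top-level comment does instead (lists stand in for torch tensors, and the helper names are made up for illustration): skip each `_scale` entry, unpack the four 2-bit values stored in every byte of the matching weight, and divide by the (inverted) scale.

```python
def is_folded_scale(name, all_names):
    """A '<weight>_scale' entry is folded into its weight, so it is skipped on its own."""
    return name.endswith("_scale") and name.removesuffix("_scale") in all_names


def unpack_ternary(packed_rows, inv_scale):
    """Expand rows of bytes (four 2-bit values each) into 4x as many rows of floats.

    Mirrors the patch: shifts 0, 2, 4, 6 are stacked along the first axis,
    each 2-bit value is mapped from {0, 1, 2} to {-1, 0, 1}, and the result
    is divided by the stored scale (which is inverted).
    """
    out = []
    for shift in (0, 2, 4, 6):
        for row in packed_rows:
            out.append([(((byte >> shift) & 3) - 1) / inv_scale for byte in row])
    return out


names = {
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.mlp.down_proj.weight_scale",
}
to_convert = sorted(n for n in names if not is_folded_scale(n, names))
print(to_convert)  # ['model.layers.0.mlp.down_proj.weight']

# One packed row of two bytes; 0b10_01_00_10 holds the values 2, 0, 1, 2 (lowest bits first).
packed = [[0b10010010, 0b00000001]]
rows = unpack_ternary(packed, inv_scale=2.0)
print(rows)  # [[0.5, 0.0], [-0.5, -0.5], [0.0, -0.5], [0.5, -0.5]]
```

The first dimension grows by a factor of 4, which is why the packed uint8 tensors in the log look four times smaller than the real weight shapes.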

> would you be willing to push this to ollama if possible?

I won't, but @brunopio might; they have already published their conversions: https://huggingface.co/brunopio/Llama3-8B-1.58-100B-tokens-GGUF

> i'd like to test this at different quants https://github.com/unclemusclez/ollama-toolkit

TQ1_0 and TQ2_0 can encode ternary weights losslessly (but are very bad at anything else), so I don't expect other quants to yield substantially better results for this ternarized model. What could still be tweaked, however, are the quant types of token_embd.weight and output.weight.
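
As a counting argument for why ternary weights can be stored losslessly in well under 2 bits each: 3**5 = 243 fits in one byte, so five values from {-1, 0, 1} can share a byte. The sketch below only demonstrates that idea with hypothetical helper names; it is not llama.cpp's actual TQ1_0 or TQ2_0 block layout, which also stores a scale per block.

```python
from itertools import product


def pack_base3(trits):
    """Pack 5 ternary values in {-1, 0, 1} into one byte as a base-3 number."""
    assert len(trits) == 5
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # shift {-1, 0, 1} to digits {0, 1, 2}
    return value  # at most 3**5 - 1 = 242, so it fits in a byte


def unpack_base3(byte):
    """Recover the 5 ternary values from a byte produced by pack_base3."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits


# Every one of the 243 possible groups of 5 ternary values survives the round trip.
lossless = all(
    unpack_base3(pack_base3(list(group))) == list(group)
    for group in product((-1, 0, 1), repeat=5)
)
print(lossless)  # True
print(8 / 5)     # 1.6 bits per weight for the packed values alone
```

A plain 2-bit encoding (four values per byte, as in the packed safetensors above) is equally lossless at 2 bits per weight; the round trip is exact only because the inputs are already ternary.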

@compilade I appreciate you getting back to me. It's been very difficult to wrap my head around some of this stuff. Any explanation goes a long way.

> TQ1_0 and TQ2_0 can encode ternary weights losslessly (but are very bad at anything else), so I don't expect other quants to yield substantially better results for this ternarized model. What could still be tweaked, however, are the quant types of token_embd.weight and output.weight.

Unfortunately, I run an older version of Ollama due to AMD compatibility at the moment. I will look into compiling the older version of ollama with the newer version of llama.cpp.

> The conversion requires a custom version of convert_hf_to_gguf.py

So, in theory, I would just need the updated script to make my own conversion locally? Then I can run this on my new and improved frankollama?

> Note that GPU support is still WIP for TQ1_0 and TQ2_0; for now they only work on CPU (but nothing technically prevents GPU support, so eventually it will work, once I find enough time).

This would be the most important part for me at the moment. To be honest, I'd really like to try something bigger for personal use, but I would definitely push this to Ollama if I figure it out. I don't want to confuse people with the CPU-only support, though. Does llama.cpp automatically split the model between CPU and GPU? If I understand correctly, it seems like the TQ1_0 and TQ2_0 parts are separate from the rest, hence the multi-layer conversion.

It seems like I might be jumping the gun on Ollama support, but I'm really glad I know about this and I'm following along. This is super interesting. I'm doing a lot of exploring on training and inference with incredibly limited resources.
