Can you please share the code to convert phi 1.5 to the GGUF format?

by diablo98 - opened

First, I'd like to thank you for your work. The quantized phi 1.5 model is the only LLM that I dared to run on my 10-year-old laptop with an i3, 4 GB of DDR3 RAM, and an HDD. Using the candle ML framework, it runs at an impressive ≈6 tokens/s and only eats around 1 GB of RAM. It's beyond impressive.

While the phi 1.5 model is impressive for its size, it was still lacking. After searching and experimenting with some fine-tunes of the model, I found this one: https://huggingface.co/teknium/Puffin-Phi-v2/tree/main, and it was quite frankly astonishing. Sadly, the model seems to only be available in the PyTorch format and unquantized. Can you please share the code to convert phi 1.5 to the GGUF format? I tried the convert.py script from llama.cpp, but no luck.

I don't know whether the model can be quantized on the free tier of Google Colab, but given how small it is, I hope it can. Alternatively, while I honestly hope I can manage it myself, you could, if it's not too much to ask, do the conversion and quantization yourself and share the files on Hugging Face. Thanks in advance.

You can first convert the PyTorch checkpoint to a safetensors file using the following Python script.

import torch
from safetensors.torch import save_file

# Load the state dict from the PyTorch checkpoint and re-save it as safetensors.
tensors = torch.load("pytorch_model.bin")
save_file(tensors, "pytorch_model.safetensors")

And then quantize the resulting safetensors file via the candle tensor-tools utility.

cargo run --release --example tensor-tools -- quantize pytorch_model.safetensors --out-file model-q4k.gguf --quantization q4k

Note that the resulting gguf file will only work with candle and not with llama.cpp.
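You should then be able to point the phi example at the resulting file, roughly like this (a rough command; it assumes the example's --weight-file flag, together with --quantized, accepts a local GGUF file):

cargo run --example phi --release -- --quantized --weight-file model-q4k.gguf --prompt "Write a function that prints prime numbers."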

Thanks for your help. Sadly, after following the steps, when I run the model via candle I get: Error: shape mismatch for weight, got [50304, 2048], expected [51200, 2048]
Here are the resulting files: https://huggingface.co/diabolo96/candle-quantized-Puffin-Phi-v2/tree/main

Looks like the tokenizers are supposed to be different from the phi one? One has 50304 tokens in the vocabulary and the other 51200. Currently we use the tokenizer.json file from the main repo, but maybe you want to tweak this to use a different tokenizer file? Let me know if you need some help with this.
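If it helps, the tweak could look something like this in the phi example (an untested sketch; it just points the existing hf-hub API at the fine-tune's repo to fetch its tokenizer.json):

use hf_hub::api::sync::Api;

// Fetch the tokenizer from the fine-tune's repo instead of the default phi one.
let api = Api::new()?;
let tokenizer_repo = api.model("teknium/Puffin-Phi-v2".to_string());
let tokenizer_filename = tokenizer_repo.get("tokenizer.json")?;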

I looked at the config.json of the two repos, and it seems that it's Microsoft phi 1.5 that has a vocab size of 51200 while the Puffin-Phi-v2 fine-tune has only 50304.
Frankly, I know nothing about Rust and even less about ML, but I downloaded the tokenizer.json file from the https://huggingface.co/teknium/Puffin-Phi-v2 repo, and after looking at the code in /content/candle/candle-examples/examples/phi and asking Bing, I changed line 241 from
let tokenizer_filename = repo.get("tokenizer.json")?;
to:
let tokenizer_filename = std::path::PathBuf::from("path/to/Puffin-Phi-v2 tokenizer.json");
Mirroring the lines just below it:
let filename = match args.weight_file {
    Some(weight_file) => std::path::PathBuf::from(weight_file)

I get this:
Error: unknown magic 0x73726576

This is the limit of what I can do, and I honestly don't know if what I am doing is completely wrong, so I apologize in advance if it's nonsense.

Ah, sorry about that, this turns out to require slightly more changes. I've actually made them and merged them into the main branch, so it should now be easy to run this model. E.g. from the GitHub tip of the candle repo you can run the following:

$ cargo run --example phi --release  -- \
    --prompt "USER: What would you do on a sunny day in Paris?\nASSISTANT:" \
    --sample-len 200 --model puffin-phi-v2 --quantized 
USER: What would you do on a sunny day in Paris?
ASSISTANT: On a sunny day in Paris, you could visit the Musée du Louvre to admire the famous
painting "Mona Lisa" by Leonardo da Vinci. You might also want to stroll along the Champs-Élysées
and enjoy the beautiful architecture of the buildings around you. Don't forget to stop by a café
for a cup of coffee and to soak up the sun!"

Let me know if you run into any issues or see any other model that you would want supported - also don't hesitate to look at the PR if you want to be able to make such changes yourself.
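For the curious, the gist of the change is a new constructor for the model configuration in mixformer.rs, roughly along these lines (a sketch only; it assumes the existing phi-1.5 constructor is named Config::v1_5() and that the remaining hyper-parameters carry over, see the PR for the exact change):

impl Config {
    // Same configuration as phi-1.5 except for the vocabulary size:
    // 50304 for Puffin-Phi-v2 vs 51200 for phi-1.5.
    pub fn puffin_phi_v2() -> Self {
        Self {
            vocab_size: 50304,
            ..Self::v1_5()
        }
    }
}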

First, I would like to apologize for the late response; my internet was cut off.

I am deeply grateful that you took some of your time to adapt the Candle framework to support the model. The PR is very valuable. If I understand correctly, I would need to add the settings from config.json inside the mixformer.rs file. If I find another interesting model, I will definitely give it a try.

diablo98 changed discussion status to closed
