Quantized models of the larger versions?

#7
by Runo888 - opened

Will you create a ggml version of the larger 7B and 10B versions like you did with the 3B version here? I'd try converting it myself but I can't figure out how to do it.

I used this candle tool to generate the quantized files: https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized-t5/README.md#generating-quantized-weight-files

Here are all the commands to run it in a Google Colab (in a high-ram instance):

! wget https://static.rust-lang.org/rustup/dist/x86_64-unknown-linux-gnu/rustup-init
! chmod a+x rustup-init
! ./rustup-init -y

import os
os.environ['PATH'] += ':/root/.cargo/bin'

! rustup toolchain install nightly --component rust-src
!git clone https://github.com/huggingface/candle

model_name = "madlad400-7b-mt"

! git lfs install
! git clone https://huggingface.co/jbochi/{model_name}

files = " ".join([
    f"/content/{model_name}/{f}"
    for f in os.listdir(model_name) if f.endswith(".safetensors")])

quantization_format = "q4k"
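# "q6k" also works here; candle's tensor-tools supports several quantization formats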

!cd candle; cargo run --example tensor-tools --release -- \
  quantize --quantization {quantization_format} \
  {files} \
  --out-file ../{model_name}/model-{quantization_format}.gguf

!ls {model_name}/*.gguf

I'll try to upload some later.

I uploaded q4k and q6k weights for the 7b and 10b models. Let me know if you are interested in a specific quantization type and model size that's missing.
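
In case you only want the quantized weights (without the full-precision safetensors), something like this should fetch just the GGUF file. It's a minimal sketch assuming the files follow the model-{quantization}.gguf naming used above and sit in the same repos; adjust repo_id/filename if the layout on the Hub differs:

from huggingface_hub import hf_hub_download

# Assumed repo and filename, based on the naming used in the commands above.
gguf_path = hf_hub_download(
    repo_id="jbochi/madlad400-7b-mt",
    filename="model-q4k.gguf",
)
print(gguf_path)  # local path to the cached GGUF file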

jbochi changed discussion status to closed

Thanks man, you're a hero!

Hello, thank you for the quantization! Could you maybe show an example/modified version of the original code below, to use the 4-bit quantized version?

from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'jbochi/madlad400-10b-mt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)

text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)

tokenizer.decode(outputs[0], skip_special_tokens=True)

Eu adoro pizza!

Hello,

I'm unsure if you can load the weights directly from the GGUF files with HF transformers, sorry. I have only tested them with candle:

cargo run --example quantized-t5 --release -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2de> How are you, my friend?" \
  --temperature 0

Thank you, and sorry for bothering you, but if I'm not wrong this would load and reload the model with each request. I have no experience with cargo, but is it possible to make an interface that runs iteratively, like running the following code repeatedly in a Jupyter cell, for example?

text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)

tokenizer.decode(outputs[0], skip_special_tokens=True)

Again, thank you for your time; I should probably just google it...

You can definitely do that with candle. You just need to modify the t5 example and reuse the model once it's initialized.

Loading the model is surprisingly cheap relative to the time spent translating long sentences, so the speed-up may not be that big.
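
If you'd rather not touch the Rust code at all, one option is to build the example once (cargo build --example quantized-t5 --release from the candle checkout) and call the resulting binary from Python. This is just a rough sketch: it still reloads the model on every call (which, as noted above, is relatively cheap), the binary path assumes cargo's usual target/release/examples/ layout, and the stdout handling may need adjusting since the example can also print log/timing lines:

import subprocess

# Assumed location of the binary built by `cargo build --example quantized-t5 --release`.
CANDLE_BIN = "candle/target/release/examples/quantized-t5"

def translate(prompt, model_id="jbochi/madlad400-3b-mt", weight_file="model-q4k.gguf"):
    # Each call spawns the candle example, which loads the model and generates once.
    result = subprocess.run(
        [CANDLE_BIN,
         "--model-id", model_id,
         "--weight-file", weight_file,
         "--prompt", prompt,
         "--temperature", "0"],
        capture_output=True, text=True, check=True,
    )
    # The generated translation is printed to stdout (possibly alongside other output).
    return result.stdout.strip()

print(translate("<2pt> I love pizza!"))

For a true load-once setup, you'd move the model/tokenizer construction in candle-examples/examples/quantized-t5/main.rs out of the per-request path and loop over prompts, as suggested above.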
