Quantized models of the larger versions?

#7
by Runo888 - opened

Will you create a ggml version of the larger 7B and 10B versions like you did with the 3B version here? I'd try converting it myself but I can't figure out how to do it.

I used this candle tool to generate the quantized files: https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized-t5/README.md#generating-quantized-weight-files

Here are all the commands to run it in a Google Colab (in a high-ram instance):

! wget https://static.rust-lang.org/rustup/dist/x86_64-unknown-linux-gnu/rustup-init
! chmod a+x rustup-init
! ./rustup-init -y

import os
os.environ['PATH'] += ':/root/.cargo/bin'

! rustup toolchain install nightly --component rust-src
!git clone https://github.com/huggingface/candle

model_name = "madlad400-7b-mt"

! git lfs install
! git clone https://huggingface.co/jbochi/{model_name}

files = " ".join([
    f"/content/{model_name}/{f}"
    for f in os.listdir(model_name) if f.endswith(".safetensors")])

quantization_format = "q4k"
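# "q6k" also works here; candle's tensor-tools supports several quantization formats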

!cd candle; cargo run --example tensor-tools --release -- \
  quantize --quantization {quantization_format} \
  {files} \
  --out-file ../{model_name}/model-{quantization_format}.gguf

!ls {model_name}/*.gguf

I'll try to upload some later.

I uploaded q4k and q6k weights for the 7b and 10b models. Let me know if you are interested in a specific quantization type and model size that's missing.
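
In case you only want the quantized weights (without the full-precision safetensors), something like this should fetch just the GGUF file. It's a minimal sketch assuming the files follow the model-{quantization}.gguf naming used above and sit in the same repos; adjust repo_id/filename if the layout on the Hub differs:

from huggingface_hub import hf_hub_download

# Assumed repo and filename, based on the naming used in the commands above.
gguf_path = hf_hub_download(
    repo_id="jbochi/madlad400-7b-mt",
    filename="model-q4k.gguf",
)
print(gguf_path)  # local path to the cached GGUF file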

jbochi changed discussion status to closed

Thanks man, you're a hero!

Hello, thank you for the quantization! Could you maybe show an example/modified version of the original code below, to use the 4-bit quantized version?

from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'jbochi/madlad400-10b-mt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)

text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)

tokenizer.decode(outputs[0], skip_special_tokens=True)

Eu adoro pizza!

Hello,

I'm unsure if you can load the weights directly from the GGUF files with HF transformers, sorry. I have only tested them with candle:

cargo run --example quantized-t5 --release -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2de> How are you, my friend?" \
  --temperature 0

Thank you, and sorry for bothering you, but if I'm not wrong this would load and reload the model with each request. I have no experience with cargo, but is it possible to make an interface that runs iteratively, like running the following code repeatedly in a Jupyter cell, for example?

text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)

tokenizer.decode(outputs[0], skip_special_tokens=True)

Again, thank you for your time; I should probably just google it...

You can definitely do that with candle. You just need to modify the t5 example and reuse the model once it's initialized.

Loading the model is surprisingly cheap relative to the time spent translating long sentences, so the speed-up may not be that big.
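
If you'd rather not touch the Rust code at all, one option is to build the example once (cargo build --example quantized-t5 --release from the candle checkout) and call the resulting binary from Python. This is just a rough sketch: it still reloads the model on every call (which, as noted above, is relatively cheap), the binary path assumes cargo's usual target/release/examples/ layout, and the stdout handling may need adjusting since the example can also print log/timing lines:

import subprocess

# Assumed location of the binary built by `cargo build --example quantized-t5 --release`.
CANDLE_BIN = "candle/target/release/examples/quantized-t5"

def translate(prompt, model_id="jbochi/madlad400-3b-mt", weight_file="model-q4k.gguf"):
    # Each call spawns the candle example, which loads the model and generates once.
    result = subprocess.run(
        [CANDLE_BIN,
         "--model-id", model_id,
         "--weight-file", weight_file,
         "--prompt", prompt,
         "--temperature", "0"],
        capture_output=True, text=True, check=True,
    )
    # The generated translation is printed to stdout (possibly alongside other output).
    return result.stdout.strip()

print(translate("<2pt> I love pizza!"))

For a true load-once setup, you'd move the model/tokenizer construction in candle-examples/examples/quantized-t5/main.rs out of the per-request path and loop over prompts, as suggested above.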
