This PR contains:

  • config-flan-t5-xl.json
  • model-flan-t5-xl.gguf

quantization: q6k

Looks good! Could you mention the command-line / code change that you needed to be able to test this, and how I can run it to try it out?

  1. Quantization (a variant for a different quantization level is sketched after this list):
cargo run --example tensor-tools --release -- quantize --quantization q6k PATH/TO/T5/model.safetensors /tmp/model.gguf
  2. Testing:
    From Candle, I pointed the example at my repo deepfile/flan-t5-xl-gguf instead of lmz/candle-quantized-t5, because it contains the model-flan-t5-xl.gguf file in its main branch.
cargo run --example quantized-t5 --release -- --prompt "translate to German: I'm living in Paris." --model-id "deepfile/flan-t5-xl-gguf" --which "flan-t5-xl"
...
 Ich wohne in Paris.
8 tokens generated (7.76 token/s)
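
For reference, the same tensor-tools invocation should also work for other quantization levels supported by Candle. A minimal sketch, assuming q4k is an accepted level and keeping the positional input/output syntax from step 1 (paths are placeholders):
cargo run --example tensor-tools --release -- quantize --quantization q4k PATH/TO/T5/model.safetensors /tmp/model-q4k.gguf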

(@lmz But this xl quantized model is worse than the quantized large one on open-domain questions. I haven't tested it yet on context-based QA.)
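
For the untested context-based QA case, one way to try it would be to reuse the same flags with a question-plus-context prompt; the prompt below is only an illustration, not a required format:
cargo run --example quantized-t5 --release -- --prompt "question: Where does the speaker live? context: I moved to Paris last year and have been living there since." --model-id "deepfile/flan-t5-xl-gguf" --which "flan-t5-xl"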

lmz changed pull request status to merged
