How to quantize these models?

#30
by supercharge19 - opened

Any idea for quantization? It would be nice to compare the performance and speed of various quantized versions against the full FP32 original model.

@Locutusque @TheBloke

Can these/such models be quantized? If so, how? Even if you don't want to do it yourself, could you explain how it can be done? Please share the best strategy as well, e.g. Sebastian first converts to FP16 and then quantizes (for newer models; I'm not sure these older ones can even be quantized), or would a different strategy be optimal given that these models have a different architecture (not decoder-only)?

https://github.com/ggerganov/llama.cpp/discussions/2948

I typically only convert to FP16 because I don't really need lower precision; I have enough computational resources to deploy the models I publish for my purposes. This one should be quantizable if llama.cpp supports the BART architecture.
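
In case it helps, a minimal sketch of that FP16 conversion with transformers (the output directory name is just an example):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the original FP32 checkpoint
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")

# Cast the weights to half precision and save the FP16 copy
model.half().save_pretrained("bart-large-mnli-fp16")
tokenizer.save_pretrained("bart-large-mnli-fp16")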

I tried, but I'm getting an error at inference; the model converts fine.


#!pip install torch==2.1.2
#!pip install --upgrade-strategy eager optimum[onnxruntime]


# Export the model to ONNX for the zero-shot-classification task
!optimum-cli export onnx --task zero-shot-classification --model facebook/bart-large-mnli bart-large-mnli_onnx_zs_model/


from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline

# Load the exported ONNX model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bart-large-mnli_onnx_zs_model")
model = ORTModelForQuestionAnswering.from_pretrained("bart-large-mnli_onnx_zs_model")

# Build a zero-shot-classification pipeline on top of the ONNX model
onnx_z0 = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

sequence_to_classify = "Who are you voting for in 2020?"
candidate_labels = ["Europe", "public health", "politics", "elections"]
pred = onnx_z0(sequence_to_classify, candidate_labels)
pred
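
The inference error most likely comes from the model class: a zero-shot-classification pipeline expects a sequence-classification model, and the export above produces plain logits rather than the start/end logits a question-answering head looks for. A minimal sketch of the fix, reusing the export directory from the command above:

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Load the ONNX export with the sequence-classification head the zero-shot pipeline expects
tokenizer = AutoTokenizer.from_pretrained("bart-large-mnli_onnx_zs_model")
model = ORTModelForSequenceClassification.from_pretrained("bart-large-mnli_onnx_zs_model")

onnx_z0 = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

sequence_to_classify = "Who are you voting for in 2020?"
candidate_labels = ["Europe", "public health", "politics", "elections"]
pred = onnx_z0(sequence_to_classify, candidate_labels)
pred

And back to the original question: once the ONNX export exists, Optimum's ORTQuantizer can apply dynamic INT8 quantization to it. A rough sketch, assuming the export produced a single model.onnx (the output directory name is illustrative):

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic INT8 quantization config for AVX512-VNNI CPUs;
# pick the avx2/arm64 variants depending on the target hardware
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer = ORTQuantizer.from_pretrained("bart-large-mnli_onnx_zs_model", file_name="model.onnx")
quantizer.quantize(save_dir="bart-large-mnli_onnx_zs_model_int8", quantization_config=qconfig)

The quantized model (saved with a _quantized suffix, so pass it via file_name when loading) can then be loaded with ORTModelForSequenceClassification and dropped into the same pipeline to compare speed and accuracy against the FP32 original.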
