	---
	license: apache-2.0
	datasets:
	- librispeech_asr
	metrics:
	- wer
	pipeline_tag: automatic-speech-recognition
	tags:
	- automatic-speech-recognition
	- ONNX
	- Intel® Neural Compressor
	- neural-compressor
	library_name: transformers
	---
	## INT4 Whisper small ONNX Model

	Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.

	This INT4 ONNX model is generated by [neural-compressor](https://github.com/intel/neural-compressor).


	| Model Detail | Description |
	| ----------- | ----------- |
	| Model Authors - Company | Intel |
	| Date | October 8, 2023 |
	| Version | 1 |
	| Type | Speech Recognition |
	| Paper or Other Resources | - |
	| License | Apache 2.0 |
	| Questions or Comments | [Community Tab](https://huggingface.co/Intel/whisper-small-onnx-int4/discussions) |

	| Intended Use | Description |
	| ----------- | ----------- |
	| Primary intended uses | You can use the raw model for automatic speech recognition inference |
	| Primary intended users | Anyone doing automatic speech recognition inference |
	| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |

	### Export to ONNX Model

	The FP32 ONNX model is exported from `openai/whisper-small` with `optimum-cli`:

	```shell
	optimum-cli export onnx --model openai/whisper-small whisper-small-with-past/ --task automatic-speech-recognition-with-past --opset 13
	```
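Before moving on, it can help to confirm that the export directory holds the three ONNX files that the quantization and evaluation snippets in this card operate on. The following is a minimal sketch; `missing_export_files` is a hypothetical helper, and the directory name is the one passed to the export command above.

```python
from pathlib import Path

# File names taken from the quantization and evaluation snippets in this card.
EXPECTED_FILES = (
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
)


def missing_export_files(export_dir):
    """Return the expected ONNX files that are absent from export_dir (hypothetical helper)."""
    root = Path(export_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_export_files("whisper-small-with-past")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected ONNX files are present.")
```

An empty list means all three graphs were written; any name that is returned points at an incomplete export.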

	### Install ONNX Runtime

	Install `onnxruntime>=1.16.0`, which is required for the [`MatMulFpQ4`](https://github.com/microsoft/onnxruntime/blob/v1.16.0/docs/ContribOperators.md#com.microsoft.MatMulFpQ4) operator.
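The version requirement can be checked from Python before loading the model. This is a minimal sketch using only the standard library; `parse_version` and `meets_minimum` are hypothetical helpers, and the lookup assumes the distribution is installed under the name `onnxruntime` (a GPU build may be registered under a different name).

```python
from importlib import metadata

MIN_VERSION = (1, 16, 0)


def parse_version(text):
    """Turn '1.16.3' into (1, 16, 3); non-numeric suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version_text, minimum=MIN_VERSION):
    """True when version_text is at least the minimum version."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("onnxruntime")
    except metadata.PackageNotFoundError:
        print("onnxruntime is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old for MatMulFpQ4"
        print(f"onnxruntime {installed}: {status}")
```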

	### Run Quantization

	Run INT4 weight-only quantization with [Intel® Neural Compressor](https://github.com/intel/neural-compressor/tree/master).

	The weight-only quantization configuration is as follows:

	| dtype | group_size | scheme | algorithm |
	| :----- | :---------- | :------ | :--------- |
	| INT4 | 32 | sym | RTN |
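To make the table concrete, here is a pure-Python sketch of what symmetric round-to-nearest (RTN) quantization with a group size of 32 does to a flat list of weights. It is an illustration under stated assumptions, not the Intel® Neural Compressor implementation: the scale is assumed to map the largest absolute weight in each group to code 7, and the function names are hypothetical.

```python
def rtn_quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group.

    Returns (codes, scale). Assumption: the scale maps max |w| to the largest
    positive code (7 for 4 bits); real implementations may choose differently.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def rtn_quantize(weights, group_size=32, bits=4):
    """Quantize a flat list of weights group by group; each group gets its own scale."""
    groups = []
    for start in range(0, len(weights), group_size):
        groups.append(rtn_quantize_group(weights[start:start + group_size], bits))
    return groups


def dequantize(groups):
    """Rebuild approximate float weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


if __name__ == "__main__":
    import random

    random.seed(0)
    w = [random.uniform(-1, 1) for _ in range(64)]
    groups = rtn_quantize(w)
    restored = dequantize(groups)
    worst = max(abs(a - b) for a, b in zip(w, restored))
    print(f"{len(groups)} groups, worst absolute error {worst:.4f}")
```

Because every group of 32 weights carries its own scale, a single outlier only degrades the precision of its own group rather than the whole tensor.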

	We provide the key code below. For the complete script, please refer to [whisper example](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/onnxruntime/speech-recognition/quantization).

	```python
	import os

	from neural_compressor import quantization, PostTrainingQuantConfig

	# The three ONNX files produced by the export step above.
	model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
	for model in model_list:
	    # INT4 weight-only quantization: RTN, symmetric, group size 32.
	    config = PostTrainingQuantConfig(
	        approach="weight_only",
	        calibration_sampling_size=[8],
	        op_type_dict={".*": {"weight": {"bits": 4,
	                                        "algorithm": ["RTN"],
	                                        "scheme": ["sym"],
	                                        "group_size": 32}}},)
	    # `dataloader` is not defined in this excerpt; see the complete script linked above.
	    q_model = quantization.fit(
	        os.path.join("/path/to/whisper-small-with-past", model), # FP32 model path
	        config,
	        calib_dataloader=dataloader)
	    q_model.save(os.path.join("/path/to/whisper-small-onnx-int4", model)) # INT4 model path
	```

	### Evaluation

	#### Operator Statistics

	The table below shows the operator statistics of the INT4 ONNX model:

	| Model | Op Type | Total | INT4 weight | FP32 weight |
	|:-------:|:-------:|:-------:|:-------:|:-------:|
	| encoder_model | MatMul | 96 | 72 | 24 |
	| decoder_model | MatMul | 169 | 121 | 48 |
	| decoder_with_past_model | MatMul | 145 | 97 | 48 |
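The share of `MatMul` operators that ended up with INT4 weights follows directly from these counts. A small sketch of the arithmetic, with the numbers copied from the table (`int4_share` is a hypothetical helper name):

```python
# (total MatMul, INT4-weight MatMul) per model, copied from the table above.
matmul_stats = {
    "encoder_model": (96, 72),
    "decoder_model": (169, 121),
    "decoder_with_past_model": (145, 97),
}


def int4_share(total, int4):
    """Percentage of MatMul operators that carry INT4 weights."""
    return 100.0 * int4 / total


for name, (total, int4) in matmul_stats.items():
    print(f"{name}: {int4}/{total} MatMul ops quantized ({int4_share(total, int4):.1f}%)")
```

This works out to 75.0% for the encoder, about 71.6% for the decoder, and about 66.9% for the decoder with past.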

	#### Evaluation of WER

	Evaluate the model on the `librispeech_asr` dataset with the code below:

	```python
	import os
	from evaluate import load
	from datasets import load_dataset
	from transformers import WhisperProcessor, PretrainedConfig
	from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

	model_name = 'openai/whisper-small'
	model_path = 'whisper-small-onnx-int4'
	processor = WhisperProcessor.from_pretrained(model_name)
	model_config = PretrainedConfig.from_pretrained(model_name)
	wer = load("wer")
	librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

	# Load the three INT4 ONNX graphs and wrap them in a single seq2seq model.
	sessions = ORTModelForSpeechSeq2Seq.load_model(
	    os.path.join(model_path, 'encoder_model.onnx'),
	    os.path.join(model_path, 'decoder_model.onnx'),
	    os.path.join(model_path, 'decoder_with_past_model.onnx'))
	model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

	predictions = []
	references = []
	for idx, batch in enumerate(librispeech_test_clean):
	    audio = batch["audio"]
	    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
	    # Normalize both reference and prediction so WER compares like with like.
	    reference = processor.tokenizer._normalize(batch['text'])
	    references.append(reference)
	    predicted_ids = model.generate(input_features)[0]
	    transcription = processor.decode(predicted_ids)
	    prediction = processor.tokenizer._normalize(transcription)
	    predictions.append(prediction)

	# Compute WER once over the whole test set and report it as a percentage.
	wer_result = wer.compute(references=references, predictions=predictions)
	print(f"Result wer: {wer_result * 100}")
	```
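For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below implements that standard definition in plain Python to show what the number means; it is not the `evaluate` implementation, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping one DP row at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, r in enumerate(ref_words, start=1):
            curr = [i]
            for j, h in enumerate(hyp_words, start=1):
                cost = 0 if r == h else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    hyps = ["the cat sat on mat"]
    print(f"WER: {word_error_rate(refs, hyps) * 100:.2f}%")
```

Multiplying by 100, as the evaluation script does, expresses the result as a percentage.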

	## Metrics (Model Performance):

	| Model | Model Size (GB) | WER (%) |
	|---|:---:|:---:|
	| FP32 | 2.00 | 3.45 |
	| INT8 | 0.53 | 3.57 |