---
license: apache-2.0
datasets:
- librispeech_asr
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- ONNX
- Intel® Neural Compressor
- neural-compressor
library_name: transformers
---
## INT4 Whisper small ONNX Model

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models generalise well to many datasets and domains without fine-tuning. This repository hosts the INT4 weight-only quantized Whisper small model in ONNX format, powered by [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).

This INT4 ONNX model is generated by [Intel® Neural Compressor](https://github.com/intel/neural-compressor)'s weight-only quantization method.


| Model Detail | Description |
| ----------- | ----------- | 
| Model Authors - Company | Intel | 
| Date | October 8, 2023 | 
| Version | 1 | 
| Type | Speech Recognition | 
| Paper or Other Resources | - | 
| License | Apache 2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/whisper-small-onnx-int4/discussions)|

| Intended Use | Description |
| ----------- | ----------- | 
| Primary intended uses | You can use the raw model for automatic speech recognition inference | 
| Primary intended users | Anyone doing automatic speech recognition inference | 
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |

### Export to ONNX Model

The FP32 ONNX model is exported from `openai/whisper-small`:

```shell
optimum-cli export onnx --model openai/whisper-small whisper-small-with-past/ --task automatic-speech-recognition-with-past --opset 13
```

### Install ONNX Runtime

Install `onnxruntime>=1.16.0`, which is required for the [`MatMulFpQ4`](https://github.com/microsoft/onnxruntime/blob/v1.16.0/docs/ContribOperators.md#com.microsoft.MatMulFpQ4) operator.
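
A minimal sketch for checking the installed version before loading the INT4 model. The helper names `parse_release` and `onnxruntime_is_recent_enough` are hypothetical, and the check assumes the package was installed under the distribution name `onnxruntime` (a GPU build published as `onnxruntime-gpu` would need that name instead).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum ONNX Runtime release stated in this model card.
MIN_ORT = (1, 16, 0)


def parse_release(text):
    """Turn a version string such as '1.16.0' into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # A missing patch component (e.g. '1.17') is treated as 0.
    return tuple(int(group or 0) for group in match.groups())


def onnxruntime_is_recent_enough(minimum=MIN_ORT):
    """Return True if the 'onnxruntime' distribution is installed and new enough."""
    try:
        installed = version("onnxruntime")
    except PackageNotFoundError:
        return False
    return parse_release(installed) >= minimum


if __name__ == "__main__":
    print("onnxruntime OK:", onnxruntime_is_recent_enough())
```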

### Run Quantization

Build [Intel® Neural Compressor](https://github.com/intel/neural-compressor/tree/master) from the master branch and run INT4 weight-only quantization.

The weight-only quantization configuration is as follows:

| dtype | group_size | scheme | algorithm |
| :----- | :---------- | :------ | :--------- |
| INT4  | 32        | sym   | RTN       |
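
To illustrate what these settings mean, here is a minimal pure-Python sketch of symmetric round-to-nearest (RTN) quantization with one scale per group of 32 weights. It is not Intel® Neural Compressor's implementation: the function names are hypothetical, and the scale convention (`max|w| / 7`, with values clamped to the signed 4-bit range `[-8, 7]`) is one common choice assumed here for illustration.

```python
def rtn_quantize_sym(weights, bits=4, group_size=32):
    """Symmetric round-to-nearest quantization with one scale per group.

    Assumption: scale = max(|w|) / qmax, values clamped to [qmin, qmax].
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Avoid division by zero for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        q_values.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return q_values, scales


def dequantize(q_values, scales, group_size=32):
    """Map integer codes back to floats using the scale of their group."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


if __name__ == "__main__":
    weights = [0.01 * (i - 40) for i in range(64)]   # two groups of 32
    q, s = rtn_quantize_sym(weights)
    restored = dequantize(q, s)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"groups: {len(s)}, worst absolute error: {worst:.5f}")
```

Because each group gets its own scale, a single large weight only degrades the precision of the 32 values that share its group rather than the whole tensor.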

We provide the key code below. For the complete script, please refer to [whisper example](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/onnxruntime/speech-recognition/quantization).

```python
import os

from neural_compressor import quantization, PostTrainingQuantConfig

model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    # INT4 weight-only quantization: RTN, symmetric, group size 32
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4,
                                        "algorithm": ["RTN"],
                                        "scheme": ["sym"],
                                        "group_size": 32}}},)
    # NOTE: `dataloader` (the calibration dataloader) is not defined in this excerpt
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-small-with-past", model), # FP32 model path
        config,
        calib_dataloader=dataloader)
    q_model.save(os.path.join("/path/to/whisper-small-onnx-int4", model)) # INT4 model path
```

### Evaluation

**Operator Statistics**

The table below shows the operator statistics of the INT4 ONNX model:

|Model| Op Type | Total |  INT4 weight |  FP32 weight |
|:-------:|:-------:|:-------:|:-------:|:-------:|
|encoder_model|  MatMul |  96  |    72    |   24   |
|decoder_model|  MatMul |  169  |    121    |   48   |
|decoder_with_past_model|  MatMul |  145  |    97    |   48   |

**Evaluation of WER**

Evaluate the model on the `librispeech_asr` dataset with the code below:

```python
import os

from datasets import load_dataset
from evaluate import load
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig, WhisperProcessor

model_name = 'openai/whisper-small'
model_path = 'whisper-small-onnx-int4'  # directory containing the INT4 ONNX files

# The processor (feature extractor + tokenizer) comes from the original checkpoint.
processor = WhisperProcessor.from_pretrained(model_name)
model_config = PretrainedConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# Create ONNX Runtime sessions for the encoder, decoder, and decoder-with-past models.
sessions = ORTModelForSpeechSeq2Seq.load_model(
            os.path.join(model_path, 'encoder_model.onnx'),
            os.path.join(model_path, 'decoder_model.onnx'),
            os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

predictions = []
references = []
for batch in librispeech_test_clean:
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # Normalize both reference and prediction so that casing and punctuation do not count as errors.
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)

wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
```
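
For readers unfamiliar with the metric, the sketch below shows what word error rate measures: the word-level edit distance (substitutions, deletions, insertions) between prediction and reference, summed over the corpus and divided by the total number of reference words. This is an illustrative pure-Python version with hypothetical function names, not the implementation behind `load("wer")`.

```python
def word_errors(reference, prediction):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), prediction.split()
    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    errors = total = 0
    for ref, pred in zip(references, predictions):
        e, n = word_errors(ref, pred)
        errors += e
        total += n
    return errors / total


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    preds = ["the cat sit on mat"]
    print(f"WER: {corpus_wer(refs, preds) * 100:.2f}%")
```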

## Metrics (Model Performance):
| Model  | Model Size (GB) | WER (%) |
|---|:---:|:---:|
| FP32 |1.42|3.45|
| INT4 |0.53|3.57|
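
The trade-off in the table can be summarised with a little arithmetic. The sketch below uses only the four numbers reported above; the dictionary keys and the `compare` helper are hypothetical names chosen for illustration.

```python
# Values copied from the metrics table above.
FP32 = {"size_gb": 1.42, "wer": 3.45}
INT4 = {"size_gb": 0.53, "wer": 3.57}


def compare(baseline, quantized):
    """Summarise the size savings and the WER cost of quantization."""
    size_ratio = baseline["size_gb"] / quantized["size_gb"]
    wer_delta = quantized["wer"] - baseline["wer"]
    return {
        "compression_ratio": size_ratio,
        "size_reduction_pct": 100 * (1 - 1 / size_ratio),
        "wer_increase_abs": wer_delta,
        "wer_increase_rel_pct": 100 * wer_delta / baseline["wer"],
    }


summary = compare(FP32, INT4)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In other words, the INT4 model is a bit under 40% of the FP32 size, at a cost of 0.12 WER points on this test set.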