File size: 5,607 Bytes
b9f2078
 
84f398b
 
 
 
 
 
 
 
 
 
 
 
b9f2078
84f398b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
---
license: apache-2.0
datasets:
- librispeech_asr
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- ONNX
- PostTrainingStatic
- Intel® Neural Compressor
- neural-compressor
library_name: transformers
---
## INT4 Whisper tiny ONNX Model

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.

This INT4 ONNX model is generated by [neural-compressor](https://github.com/intel/neural-compressor).


| Model Detail | Description |
| ----------- | ----------- | 
| Model Authors - Company | Intel | 
| Date | October 8, 2023 | 
| Version | 1 | 
| Type | Speech Recognition | 
| Paper or Other Resources | - | 
| License | Apache 2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/whisper-tiny-onnx-int4/discussions)|

| Intended Use | Description |
| ----------- | ----------- | 
| Primary intended uses | You can use the raw model for automatic speech recognition inference | 
| Primary intended users | Anyone doing automatic speech recognition inference | 
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task.  The model should not be used to intentionally create hostile or alienating environments for people.|

### Export to ONNX Model

The FP32 model is exported with openai/whisper-tiny:

```shell
optimum-cli export onnx --model openai/whisper-tiny whisper-tiny-with-past/ --task automatic-speech-recognition-with-past --opset 13
```

### Install ONNX Runtime

Install `onnxruntime>=1.16.0` to support [`MatMulFpQ4`](https://github.com/microsoft/onnxruntime/blob/v1.16.0/docs/ContribOperators.md#com.microsoft.MatMulFpQ4) operator.

### Run Quantization

Run INT4 weight-only quantization with [Intel® Neural Compressor](https://github.com/intel/neural-compressor/tree/master).

The weight-only quantization cofiguration is as below:
| dtype | group_size | scheme | algorithm |
| :----- | :---------- | :------ | :--------- |
| INT4  | 32        | asym   | RTN       |

We provide the key code below. For the complete script, please refer to [whisper example](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/onnxruntime/speech-recognition/quantization).

```python
from neural_compressor import quantization, PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4, 
                                        "algorithm": ["RTN"], 
                                        "scheme": ["asym"], 
                                        "group_size": 32}}},
        op_name_dict={'/proj_out/MatMul': FP32},) # fallback last matmul in decoder to FP32
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-tiny", model), # FP32 model path
        config,
        calib_dataloader=dataloader)
    q_model.save(os.path.join("/path/to/whisper-tiny-onnx-int4", model)) # INT4 model path
```

### Evaluation

**Operator Statistics**
Below shows the operator statistics in the INT4 ONNX model:
|Model| Op Type | Total |  INT4 weight |  FP32 weight |
|:-------:|:-------:|:-------:|:-------:|:-------:|
|encoder_model|  MatMul |  32  |    24    |   8   |
|decoder_model|  MatMul |  57  |    40    |   17   |
|decoder_with_past_model|  MatMul |  49  |    32    |   17   |

**Evaluation of wer**

Evaluate the model on `librispeech_asr` dataset with below code:

```python
import os
from evaluate import load
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor, AutoConfig
model_name = 'openai/whisper-tiny'
model_path = 'whisper-tiny-onnx-int4'
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig
model_config = PretrainedConfig.from_pretrained(model_name)
predictions = []
references = []
sessions = ORTModelForSpeechSeq2Seq.load_model(
            os.path.join(model_path, 'encoder_model.onnx'),
            os.path.join(model_path, 'decoder_model.onnx'),
            os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])
for idx, batch in enumerate(librispeech_test_clean):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)
wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
```

## Metrics (Model Performance):
| Model  | Model Size (GB) | wer |
|---|:---:|:---:|
| FP32 |0.36|7.56|
| INT8 |0.32|9.94|