---
license: apache-2.0
datasets:
  - librispeech_asr
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
tags:
  - automatic-speech-recognition
  - int8
  - ONNX
  - PostTrainingStatic
  - Intel® Neural Compressor
  - neural-compressor
library_name: transformers
---

Model Details: INT8 Whisper large

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.

This INT8 ONNX model was generated with Intel® Neural Compressor (neural-compressor) using post-training static quantization. The FP32 model it was quantized from can be exported with the following command:

optimum-cli export onnx --model openai/whisper-large whisper-large-with-past/ --task automatic-speech-recognition-with-past --opset 13
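
For intuition about what an INT8 model of this kind stores, the sketch below shows a simplified form of post-training static quantization: a scale is fixed ahead of time from calibration data, and values are then rounded to 8-bit integer codes. This is an illustration under assumptions, using a symmetric per-tensor scheme and made-up calibration values; it is not the exact scheme neural-compressor applies to this model.

```python
# Illustrative only: symmetric per-tensor INT8 quantization with a scale
# fixed in advance from calibration data (the "static" part).

def calibrate_scale(calibration_values):
    """Pick a scale so the largest calibration magnitude maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Map floats to integer codes, clamping to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(codes, scale):
    """Map integer codes back to approximate floats."""
    return [c * scale for c in codes]

calibration = [-1.27, -0.5, 0.0, 0.4, 1.0]    # hypothetical calibration values
scale = calibrate_scale(calibration)           # roughly 0.01
codes = quantize([0.25, -1.27, 2.0], scale)    # 2.0 lies outside the calibrated range
restored = dequantize(codes, scale)            # 2.0 comes back clipped to about 1.27
print(codes, restored)
```

Values outside the calibrated range are clipped, which is why the choice of calibration data matters for static quantization.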

Model Detail	Description
Model Authors - Company	Intel
Date	May 15, 2022
Version	1
Type	Speech Recognition
Paper or Other Resources	-
License	Apache 2.0
Questions or Comments	Community Tab

Intended Use	Description
Primary intended uses	You can use the raw model for automatic speech recognition inference
Primary intended users	Anyone doing automatic speech recognition inference
Out-of-scope uses	This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people.

How to use

Download the model by cloning the repository:

git clone https://huggingface.co/Intel/whisper-large-int8-static-inc whisper-large-int8-static
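
Before running the evaluation, it can help to confirm that the clone contains the three ONNX files the evaluation script loads. The helper below is a small sketch; `missing_onnx_files` is a hypothetical name introduced here, and the file names are taken from the evaluation script in this card.

```python
import os

# File names taken from the evaluation script in this card.
EXPECTED_FILES = (
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
)

def missing_onnx_files(model_dir):
    """Return the expected ONNX files that are absent from model_dir."""
    return [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]

# An empty list means all three files were found.
print(missing_onnx_files("whisper-large-int8-static"))
```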

Evaluate the model with the following code:

import os

from datasets import load_dataset
from evaluate import load
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig, WhisperProcessor

model_name = 'openai/whisper-large'         # source of the processor and config
model_path = 'whisper-large-int8-static'    # directory created by the git clone above

# The processor (feature extractor + tokenizer) and config come from the original FP32 checkpoint.
processor = WhisperProcessor.from_pretrained(model_name)
model_config = PretrainedConfig.from_pretrained(model_name)

wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# Create ONNX Runtime sessions for the encoder, decoder, and decoder-with-past graphs.
# Note: load_model and this constructor signature follow the optimum release this card
# was written against and may differ in other releases.
sessions = ORTModelForSpeechSeq2Seq.load_model(
    os.path.join(model_path, 'encoder_model.onnx'),
    os.path.join(model_path, 'decoder_model.onnx'),
    os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

predictions = []
references = []
for batch in librispeech_test_clean:
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # Normalize the reference and the prediction the same way before scoring.
    references.append(processor.tokenizer._normalize(batch['text']))
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    predictions.append(processor.tokenizer._normalize(transcription))

wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
accuracy = 1 - wer_result
print("Accuracy: %.5f" % accuracy)
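
To make the reported numbers easier to interpret, here is a minimal sketch of the standard word error rate definition: the number of word-level substitutions, deletions, and insertions divided by the number of reference words, accumulated over the whole corpus. The function `word_error_rate` and the sample sentences are illustrative assumptions, not the implementation behind `load("wer")`.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, computed one row at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, r in enumerate(ref_words, start=1):
            curr = [i]
            for j, h in enumerate(hyp_words, start=1):
                cost = 0 if r == h else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words

# Made-up sentences: one substitution and one deletion over 8 reference words.
refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sit on mat", "hello world"]
wer_value = word_error_rate(refs, hyps)
print(f"WER: {wer_value * 100:.2f}%, accuracy: {1 - wer_value:.5f}")
```

The "accuracy" printed by the evaluation script is simply `1 - WER`, so it can go negative if a system inserts many extra words.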

Metrics (Model Performance):

Model	Model Size (GB)	WER (%)
FP32	9.4	3.04
INT8	2.4	2.94
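
The trade-off in the table can be summarized with a little arithmetic. The snippet below is a sketch that uses only the four numbers above and assumes the error rates are percentages, as the `wer_result * 100` line in the evaluation script suggests.

```python
# Values copied from the metrics table above.
fp32_size_gb, int8_size_gb = 9.4, 2.4
fp32_wer, int8_wer = 3.04, 2.94   # assumed to be percentages

size_ratio = fp32_size_gb / int8_size_gb        # how many times smaller INT8 is
size_saving = 1 - int8_size_gb / fp32_size_gb   # fraction of disk space saved
wer_delta = int8_wer - fp32_wer                 # absolute change in WER points

print(f"INT8 is {size_ratio:.2f}x smaller ({size_saving:.1%} saved)")
print(f"WER change: {wer_delta:+.2f} points")
```

A negative WER change means the INT8 model scored slightly better than FP32 on this test set.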