INT8 Quantized Whisper model for TensorRT-LLM
This repository contains an INT8 quantized version of the Whisper model from jharshraj/whisper-indian-names, specifically optimized for TensorRT-LLM.
Optimization details
- Original model size: 0.00 MB
- Quantized model size: 0.00 MB
- Size reduction: 0.00%
- Precision: float16 with INT8 weight-only quantization
- Max batch size: 8
- Max beam width: 4
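For reference, INT8 weight-only checkpoints like the one in this repo are typically produced with the convert_checkpoint.py script from TensorRT-LLM's Whisper example. The sketch below is illustrative only: the paths are placeholders and the exact flag names can differ between TensorRT-LLM versions (check python3 convert_checkpoint.py --help). You do not need to repeat this step to use this repository.
# Illustrative sketch of the quantization step (already done for this repo).
# Paths are placeholders; flags may vary by TensorRT-LLM version.
cd TensorRT-LLM/examples/whisper
python3 convert_checkpoint.py \
--model_dir /path/to/jharshraj-whisper-indian-names \
--output_dir /path/to/int8_checkpoint \
--use_weight_only \
--weight_only_precision int8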
Building TensorRT engines
To use this model, you need TensorRT-LLM installed. This repository contains the INT8 quantized checkpoint weights, which you must build into TensorRT engines for your specific hardware:
# Install TensorRT-LLM
# See: https://github.com/NVIDIA/TensorRT-LLM
# Clone TensorRT-LLM repository
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/whisper
# Build the encoder engine
trtllm-build --checkpoint_dir /path/to/this/repo/encoder \
--output_dir /path/to/output/encoder \
--moe_plugin disable \
--max_batch_size 8 \
--gemm_plugin disable \
--bert_attention_plugin float16 \
--max_input_len 3000 \
--max_seq_len 3000
# Build the decoder engine
trtllm-build --checkpoint_dir /path/to/this/repo/decoder \
--output_dir /path/to/output/decoder \
--moe_plugin disable \
--max_beam_width 4 \
--max_batch_size 8 \
--max_seq_len 114 \
--max_input_len 14 \
--max_encoder_input_len 3000 \
--gemm_plugin float16 \
--bert_attention_plugin float16 \
--gpt_attention_plugin float16
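Once both engines are built, you can try them with the run.py script that ships in the same examples/whisper directory. The command below is a minimal sketch assuming single-file transcription; paths are placeholders and flag names may differ across TensorRT-LLM versions (see python3 run.py --help).
# Minimal sketch: transcribe a single audio file with the built engines.
# Paths are placeholders; flag names may vary by TensorRT-LLM version.
cd TensorRT-LLM/examples/whisper
python3 run.py \
--engine_dir /path/to/output \
--name whisper_int8_test \
--input_file /path/to/audio.wav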
Performance Benefits
INT8 quantization typically provides:
- 2-4x faster inference speed
- Reduced memory usage
- Lower latency
These gains come while maintaining accuracy comparable to the full-precision model.
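To verify these numbers on your own hardware, a rough check is to time an end-to-end run of the INT8 engines and compare it against FP16 engines built from the original checkpoint, for example:
# Rough, illustrative timing check (paths and flags as assumed above);
# build FP16 engines from the original model for a fair baseline.
time python3 run.py --engine_dir /path/to/output --name int8_timing --input_file /path/to/audio.wav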
Please refer to the TensorRT-LLM Whisper documentation for more usage instructions.