
INT8 Quantized Whisper model for TensorRT-LLM

This repository contains an INT8 quantized version of the Whisper model from jharshraj/whisper-indian-names, specifically optimized for TensorRT-LLM.

Optimization details

  • Original model size: 0.00 MB
  • Quantized model size: 0.00 MB
  • Size reduction: 0.00%
  • Precision: float16 with INT8 weight-only quantization
  • Max batch size: 8
  • Max beam width: 4
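To make the "INT8 weight-only quantization" entry above concrete, here is a minimal illustrative sketch of symmetric per-tensor INT8 quantization in plain Python. This is a simplification: TensorRT-LLM's actual implementation uses per-channel scales and fused GEMM kernels, and the function names below are hypothetical.

```python
# Minimal sketch of symmetric per-tensor INT8 weight quantization.
# Illustrative only; TensorRT-LLM uses per-channel scales internally.

def quantize_int8(weights):
    """Map float weights to int8 values [-127, 127] with one symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)      # each weight stored in 1 byte
deq = dequantize_int8(q, scale)        # approximation used at compute time
```

The round trip loses at most half a quantization step per weight, which is why accuracy stays close to the full-precision model while weight storage is halved relative to FP16.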

Building TensorRT engines

This repository ships the INT8 quantized weights only; to use them, install TensorRT-LLM and build TensorRT engines for your specific hardware:

# Install TensorRT-LLM
# See: https://github.com/NVIDIA/TensorRT-LLM

# Clone TensorRT-LLM repository
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/whisper

# Build the encoder engine
trtllm-build --checkpoint_dir /path/to/this/repo/encoder \
             --output_dir /path/to/output/encoder \
             --moe_plugin disable \
             --max_batch_size 8 \
             --gemm_plugin disable \
             --bert_attention_plugin float16 \
             --max_input_len 3000 \
             --max_seq_len 3000
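A note on the encoder's --max_input_len 3000: assuming the standard Whisper front end (log-mel spectrogram with a 10 ms hop), a 30-second audio chunk yields exactly 3000 mel frames, which is where that value comes from:

```python
# Why the encoder is built with --max_input_len 3000:
# Whisper processes audio in 30 s chunks, and its log-mel front end
# produces one frame every 10 ms (100 frames per second).
chunk_seconds = 30
frames_per_second = 100                              # 10 ms hop length
max_input_len = chunk_seconds * frames_per_second    # 3000 mel frames
```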

# Build the decoder engine
trtllm-build --checkpoint_dir /path/to/this/repo/decoder \
             --output_dir /path/to/output/decoder \
             --moe_plugin disable \
             --max_beam_width 4 \
             --max_batch_size 8 \
             --max_seq_len 114 \
             --max_input_len 14 \
             --max_encoder_input_len 3000 \
             --gemm_plugin float16 \
             --bert_attention_plugin float16 \
             --gpt_attention_plugin float16
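In TensorRT-LLM, --max_seq_len covers the decoder prompt plus generated tokens, so the decoder flags above imply the following token budget. This is a sketch of the arithmetic; the exact split between prefix tokens and output depends on your decoding setup.

```python
# Decoder token budget implied by the build flags above.
# In TensorRT-LLM, max_seq_len = prompt tokens + generated tokens.
max_seq_len = 114     # --max_seq_len: total decoder sequence length
max_input_len = 14    # --max_input_len: decoder prompt (special/prefix tokens)
max_new_tokens = max_seq_len - max_input_len   # tokens available for output
```

If your transcripts are routinely truncated, rebuild the decoder engine with a larger --max_seq_len.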

Performance Benefits

INT8 quantization typically provides:

  1. 2-4x faster inference speed
  2. Reduced memory usage
  3. Lower latency

These gains come while maintaining accuracy comparable to the full-precision model.
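As a back-of-the-envelope illustration of the memory benefit, compare raw weight storage at FP16 versus INT8 for a model of this size (the 242M parameter count is taken from the model page; real engines additionally need memory for activations and the KV cache):

```python
# Rough weight-storage comparison: FP16 vs INT8 weight-only.
# Excludes activations, KV cache, and engine overhead.
params = 242_000_000

fp16_bytes = params * 2      # 2 bytes per FP16 weight
int8_bytes = params * 1      # 1 byte per INT8 weight

fp16_mb = fp16_bytes / 1024**2   # ~462 MiB
int8_mb = int8_bytes / 1024**2   # ~231 MiB
savings = 1 - int8_bytes / fp16_bytes   # 0.5 -> weights are 50% smaller
```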

Please refer to the TensorRT-LLM Whisper documentation for more usage instructions.
