INT8 Quantized Whisper model for TensorRT-LLM
This repository contains an INT8 quantized version of the Whisper model from jharshraj/whisper-indian-names, specifically optimized for TensorRT-LLM.
Optimization details
- Original model size: 0.00 MB
- Quantized model size: 0.00 MB
- Size reduction: 0.00%
- Precision: float16 with INT8 weight-only quantization
- Max batch size: 8
- Max beam width: 4
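For reference, INT8 weight-only checkpoints like the one in this repo are typically produced with the convert_checkpoint.py script from TensorRT-LLM's Whisper example. The sketch below is illustrative only: the paths are placeholders and the exact flag names can differ between TensorRT-LLM versions (check python3 convert_checkpoint.py --help). You do not need to repeat this step to use this repository.
# Illustrative sketch of the quantization step (already done for this repo).
# Paths are placeholders; flags may vary by TensorRT-LLM version.
cd TensorRT-LLM/examples/whisper
python3 convert_checkpoint.py \
--model_dir /path/to/jharshraj-whisper-indian-names \
--output_dir /path/to/int8_checkpoint \
--use_weight_only \
--weight_only_precision int8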
Building TensorRT engines
To use this model, you need TensorRT-LLM installed. This repository contains the INT8 quantized checkpoint weights, which you must build into TensorRT engines for your specific hardware:
# Install TensorRT-LLM
# See: https://github.com/NVIDIA/TensorRT-LLM
# Clone TensorRT-LLM repository
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/whisper
# Build the encoder engine
trtllm-build --checkpoint_dir /path/to/this/repo/encoder \
--output_dir /path/to/output/encoder \
--moe_plugin disable \
--max_batch_size 8 \
--gemm_plugin disable \
--bert_attention_plugin float16 \
--max_input_len 3000 \
--max_seq_len 3000
# Build the decoder engine
trtllm-build --checkpoint_dir /path/to/this/repo/decoder \
--output_dir /path/to/output/decoder \
--moe_plugin disable \
--max_beam_width 4 \
--max_batch_size 8 \
--max_seq_len 114 \
--max_input_len 14 \
--max_encoder_input_len 3000 \
--gemm_plugin float16 \
--bert_attention_plugin float16 \
--gpt_attention_plugin float16
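Once both engines are built, you can try them with the run.py script that ships in the same examples/whisper directory. The command below is a minimal sketch assuming single-file transcription; paths are placeholders and flag names may differ across TensorRT-LLM versions (see python3 run.py --help).
# Minimal sketch: transcribe a single audio file with the built engines.
# Paths are placeholders; flag names may vary by TensorRT-LLM version.
cd TensorRT-LLM/examples/whisper
python3 run.py \
--engine_dir /path/to/output \
--name whisper_int8_test \
--input_file /path/to/audio.wav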
Performance Benefits
INT8 quantization typically provides:
- 2-4x faster inference speed
- Reduced memory usage
- Lower latency
These gains come while maintaining accuracy comparable to the full-precision model.
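To verify these numbers on your own hardware, a rough check is to time an end-to-end run of the INT8 engines and compare it against FP16 engines built from the original checkpoint, for example:
# Rough, illustrative timing check (paths and flags as assumed above);
# build FP16 engines from the original model for a fair baseline.
time python3 run.py --engine_dir /path/to/output --name int8_timing --input_file /path/to/audio.wav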
Please refer to the TensorRT-LLM Whisper documentation for more usage instructions.