metadata

license: mit
language:
  - th
base_model: biodatlab/whisper-th-small-combined
tags:
  - whisper
  - Pytorch

Whisper-th-small-ct2

whisper-th-small-ct2 is biodatlab/whisper-th-small-combined converted to the CTranslate2 format. It is compatible with WhisperX and faster-whisper, which enables:

🤏 Half the size of the original Hugging Face format.
⚡️ Batched inference, which WhisperX reports at 70x real-time transcription using Whisper large-v2.
🪶 A faster-whisper backend, requiring less than 8 GB of GPU memory for large-v2 with beam_size=5.
🎯 Accurate word-level timestamps using wav2vec2 alignment.
👯‍♂️ Multispeaker ASR using speaker diarization (includes speaker ID labels).
🗣️ VAD preprocessing, reducing hallucinations and allowing batching with no WER degradation.
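
Since WhisperX runs on a faster-whisper backend, the same CTranslate2 weights should also load in faster-whisper directly. The following is a minimal sketch under that assumption, not a tested recipe from the model authors. The helper names `pick_compute_type` and `transcribe_thai` are hypothetical, and `language="th"`, `beam_size=5`, and the `int8` fallback on CPU are illustrative choices.

```python
def pick_compute_type(device: str) -> str:
    """Return a CTranslate2 compute type for the device (illustrative choice)."""
    # float16 is the usual choice on GPU; int8 keeps CPU inference light.
    return "float16" if device == "cuda" else "int8"


def transcribe_thai(audio_path: str, device: str = "cuda") -> str:
    """Transcribe one audio file with faster-whisper and return the joined text."""
    # Imported lazily so pick_compute_type works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(
        "Thaweewat/whisper-th-small-ct2",
        device=device,
        compute_type=pick_compute_type(device),
    )
    # transcribe() returns a lazy generator of segments plus an info object.
    segments, _info = model.transcribe(audio_path, language="th", beam_size=5)
    return "".join(segment.text for segment in segments).strip()
```

Calling `transcribe_thai("audio.mp3")` downloads the model on first use and returns plain text only. Batching, VAD preprocessing, and diarization are WhisperX features and are not covered by this sketch.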

Usage

!pip install git+https://github.com/m-bain/whisperx.git

import whisperx 
import time 

# Settings
device = "cuda"
audio_file = "audio.mp3"
batch_size = 16  # reduce if GPU memory is low
compute_type = "float16"  # change to "int8" if GPU memory is low (may reduce accuracy)

# A Hugging Face token is required for the diarization model.
# You must also accept the model's terms and conditions before use.
# See the model page:
# https://huggingface.co/pyannote/segmentation-3.0
HF_TOKEN = ""


# Load the model and transcribe the audio
model = whisperx.load_model("Thaweewat/whisper-th-small-ct2", device, compute_type=compute_type)
st_time = time.time()  # timing starts after model loading
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Join the segment texts into one plain-text string if needed
combined_text = ' '.join(segment['text'] for segment in result['segments'])

print(f"Response time: {time.time() - st_time} seconds")
print(diarize_segments)
print(result)
print(combined_text)
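
Printing the raw `result` dictionary is hard to read, so a speaker-labeled transcript may be more useful. The sketch below assumes each entry in `result["segments"]` has `start`, `end`, and `text` keys, plus a `speaker` key when diarization could attribute the segment. The function name `format_speaker_transcript` and the `UNKNOWN` label are hypothetical, and the sample data is hand-written, not real model output.

```python
def format_speaker_transcript(segments):
    """Turn WhisperX-style segments into '[start-end] SPEAKER: text' lines."""
    lines = []
    for segment in segments:
        # Segments the diarizer could not attribute are assumed to lack 'speaker'.
        speaker = segment.get("speaker", "UNKNOWN")
        text = segment["text"].strip()
        lines.append(f"[{segment['start']:.2f}-{segment['end']:.2f}] {speaker}: {text}")
    return "\n".join(lines)


# Hand-written sample shaped like result["segments"]; not real model output.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " สวัสดีครับ", "speaker": "SPEAKER_00"},
    {"start": 2.5, "end": 4.0, "text": " สวัสดีค่ะ"},
]
transcript = format_speaker_transcript(sample_segments)
print(transcript)
```

With the usage script above, `format_speaker_transcript(result["segments"])` would produce one line per segment.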