Ratchada-Fang-Thon-Whisper
Model Description
Ratchada-Fang-Thon-Whisper is a fine-tuned version of the Whisper model, specifically adapted for Thai speech recognition in financial contexts. This model is designed to transcribe Thai audio with high accuracy, particularly for financial terminology and discussions.
Whisper is a state-of-the-art transformer model that transcribes speech into text with high accuracy and low latency. We fine-tuned the model with Hugging Face's Whisper implementation on our own GPU infrastructure, using a custom dataset of audio recordings and transcripts.
We also monitored the training process and evaluated model performance with TensorBoard, a visualization tool for machine learning experiments.
Key Features
- Specialized in Thai language transcription
- Fine-tuned for financial domain vocabulary
- Based on the Whisper medium model architecture
- Supports long-form transcription (see the chunked pipeline example under Usage)
Model Details
- Model Type: WhisperForConditionalGeneration
- Language: Thai
- Task: Automatic Speech Recognition (ASR)
- License: MIT
Usage
Standard Pipeline (Recommended)
You can use this model with the standard Transformers pipeline:
```python
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper",
    device=device,
    generate_kwargs={"language": "th", "task": "transcribe"},
)

result = pipe("path/to/audio/file.wav")  # path to an audio file, or a NumPy waveform array
print(result["text"])
```
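As noted under Key Features, the model supports long-form transcription. A minimal sketch using the pipeline's built-in chunking, reusing `pipe` from above (the file path is illustrative; `chunk_length_s` and `return_timestamps` are standard Transformers ASR pipeline parameters):

```python
# Chunked long-form transcription: the audio is split into 30-second
# windows, transcribed independently, and merged back into one text.
result = pipe(
    "path/to/long_recording.wav",  # hypothetical long audio file
    chunk_length_s=30,
    return_timestamps=True,  # also return per-segment timestamps
)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```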
Note: Audio input should be resampled to 16 kHz (sample_rate=16_000) beforehand.
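A minimal sketch of resampling on load with librosa (the file path is illustrative):

```python
import librosa

# librosa resamples on load; Whisper models expect 16 kHz mono audio.
waveform, sr = librosa.load("path/to/audio/file.wav", sr=16_000, mono=True)

# Pass the raw array together with its sampling rate to the pipeline.
result = pipe({"raw": waveform, "sampling_rate": sr})
print(result["text"])
```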
Using Transformers Directly
You can also use this model through the Transformers API directly:
```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from ratchada_processor import tokenize_text  # post-processor (strongly recommended)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

processor = AutoProcessor.from_pretrained("ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper")
model = AutoModelForSpeechSeq2Seq.from_pretrained("ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper").to(device)

# waveform is a NumPy array loaded with an audio library such as librosa or torchaudio
waveform, _ = librosa.load("path/to/audio/file.wav", sr=16_000)

input_features = processor(waveform.squeeze(), sampling_rate=16_000, return_tensors="pt").input_features.to(device)

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]  # first item of the batch

processed_text = tokenize_text(transcription)  # split the text into tokens and post-process them (see GitHub)
result = "".join(processed_text)
print(result)
```
Note: This method requires applying the post-processor manually to the model output. The post-processor is available as a package on PyPI:

```bash
python3 -m pip install ratchada-util
```
Training
Training Data
This model was fine-tuned on a proprietary dataset: ThinkingMachinesDataScience/Ratchada-STT. The dataset contains Thai speech audio from financial contexts.
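If the dataset is hosted as a gated dataset on the Hugging Face Hub under that ID (not confirmed here), it could presumably be loaded once access is granted; a sketch:

```python
from datasets import load_dataset

# Requires an authenticated Hugging Face login with access to the private dataset.
ds = load_dataset("ThinkingMachinesDataScience/Ratchada-STT", token=True)
```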
Training Procedure
The model was fine-tuned from the biodatlab/whisper-th-medium-combined checkpoint, a Thai-specific version of the Whisper medium model. After each model prediction, a post-processing step is applied to refine the results.
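The full fine-tuning recipe is not published in this card. As a minimal sketch of the starting point, assuming the standard Transformers Whisper API, the base checkpoint can be loaded and pinned to Thai transcription like this (the training loop and hyperparameters are omitted):

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the Thai-specific medium checkpoint used as the starting point.
processor = WhisperProcessor.from_pretrained("biodatlab/whisper-th-medium-combined")
model = WhisperForConditionalGeneration.from_pretrained("biodatlab/whisper-th-medium-combined")

# Pin the decoder to Thai transcription for fine-tuning and inference.
model.generation_config.language = "th"
model.generation_config.task = "transcribe"
```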
Limitations and Bias
- The model is specifically trained on Thai financial audio data and may not perform as well on general Thai speech or other domains.
- There might be biases present in the training data, which could affect the model's performance on certain types of speech or accents.
Evaluation Results
The results below were produced with our own evaluation pipeline; lower is better for every metric.
| Model | WER | CER (jiwer) | Deletions | Substitutions | Insertions |
|---|---|---|---|---|---|
| RATFT-WHISPER | 0.332685 | 0.272674 | 1884 | 1806 | 5466 |
| WHISPER-LARGE-V3 | 0.392162 | 0.318666 | 2499 | 1489 | 6752 |
| THON-WHISPER | 0.474360 | 0.405920 | 1722 | 2603 | 8597 |
| WHISPER-LARGE | 0.593637 | 0.578926 | 5441 | 1500 | 9433 |
| WHISPER-LARGE-V2 | 0.595292 | 0.652592 | 4924 | 1866 | 9580 |
| WHISPER-MEDIUM | 0.643084 | 0.665650 | 7471 | 1312 | 9090 |
| WHISPER-SMALL | 0.667453 | 0.603361 | 4397 | 1817 | 12028 |
| WHISPER-BASE | 0.791954 | 0.738960 | 3362 | 1906 | 16252 |
Note: WER and CER are computed with the jiwer library, a standard tool for evaluating automatic speech recognition systems.
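For reference, a minimal sketch of how such scores are computed with jiwer; the strings below are illustrative and whitespace-tokenized (Thai has no spaces between words, which is one reason the post-processor tokenizes transcripts before scoring), not taken from the actual evaluation set:

```python
import jiwer

# Whitespace-tokenized Thai text (illustrative ground truth and prediction).
reference = "ตลาด หุ้น ปรับ ตัว ขึ้น วันนี้"
hypothesis = "ตลาด หุ้น ปรับ ตัว ลง วันนี้"

print("WER:", jiwer.wer(reference, hypothesis))  # word error rate
print("CER:", jiwer.cer(reference, hypothesis))  # character error rate
```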
Ethical Considerations
Users should be aware that this model is designed for transcribing Thai speech in financial contexts. It should not be used for making financial decisions without human verification. Always cross-check important financial information obtained from this model.
Citations
If you use this model in your research, please cite:
```bibtex
@misc{Ratchada-Fang-Thon-Whisper,
  author = {ThinkingMachinesDataScience},
  title = {Ratchada-Fang-Thon-Whisper: Thai Financial Speech Recognition Model},
  year = {2023},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper}}
}
```
Contacts
For questions and feedback about this model, please open an issue on the ThinkingMachinesDataScience GitHub repository for this project.