metadata
language:
- zh
license: apache-2.0
tags:
- whisper-event
datasets:
- mozilla-foundation/common_voice_11_0
model-index:
- name: Whisper Small zh-HK - Alvin
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: mozilla-foundation/common_voice_11_0 zh-HK
type: mozilla-foundation/common_voice_11_0
config: zh-HK
split: test
args: zh-HK
metrics:
- name: Normalized CER
type: cer
value: 7.766
metrics:
- cer
pipeline_tag: automatic-speech-recognition
Whisper Large V2 zh-HK - Alvin
This model is a fine-tuned version of openai/whisper-large-v2 on the Common Voice 11.0 dataset. This is trained with PEFT LoRA+BNB INT8 with a Normalized CER of 7.77%
To use the model, use the following code. It should be able to inference with less than 4GB VRAM (batch size of 1).
from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer, WhisperTokenizer, WhisperProcessor
peft_model_id = "alvanlii/whisper-largev2-cantonese-peft-lora"
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)
task = "transcribe"
tokenizer = WhisperTokenizer.from_pretrained(peft_config.base_model_name_or_path, task=task)
processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path, task=task)
feature_extractor = processor.feature_extractor
forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)
pipe = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)
audio = # load audio here
text = pipe(audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]
Training and evaluation data
For training, three datasets were used:
- Common Voice 11 Canto Train Set
- CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, p. 2899-2906.
- Cantonse-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset", 2022. Link: https://arxiv.org/pdf/2201.02419.pdf
Training Hyperparameters
- learning_rate: 1e-3
- train_batch_size: 60 (on 1 3090 GPU)
- eval_batch_size: 10
- gradient_accumulation_steps: 1
- total_train_batch_size: 60x1x1=60
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 12000
- augmentation: SpecAugment
Training Results
Training Loss | Epoch | Step | Validation Loss | Normalized CER |
---|---|---|---|---|
0.8604 | 1.99 | 12000 | 0.2129 | 0.07766 |