---
license: apache-2.0
datasets:
- google/fleurs
- mozilla-foundation/common_voice_16_1
- vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
language: ["vi"]
library_name: peft
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
metrics: ["wer"]
model-index:
- name: doof-ferb/whisper-large-peft-lora-vi
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      type: mozilla-foundation/common_voice_16_1
      name: Mozilla CommonVoice (Vietnamese) v16.1
      config: vi
      split: test
    metrics:
    - type: wer
      value: 14.7
      verified: false
  - task:
      type: automatic-speech-recognition
    dataset:
      type: google/fleurs
      name: Google FLEURS (Vietnamese)
      config: vi_vn
      split: test
    metrics:
    - type: wer
      value: 14.7
      verified: false
  - task:
      type: automatic-speech-recognition
    dataset:
      type: vivos
      name: ĐHQG TPHCM VIVOS
      split: test
    metrics:
    - type: wer
      value: 9.4
      verified: false
---

Whisper large v3 fine-tuned with PEFT LoRA on a large collection of Vietnamese speech datasets.

TODO:
- [x] train, then publish checkpoint
- [x] evaluate WER on Common Voice, FLEURS & VIVOS

Training setup: 3.6k steps, 5% warm-up, batch size 16×2 (Kaggle free T4×2), fine-tuning roughly 3.6% of the 1.6B parameters.

WER manually evaluated on the Vietnamese test sets:

| WER @ `float16` | `CommonVoice v16.1` | `FLEURS` | `VIVOS` |
|---|---|---|---|
| original `whisper-large-v3` | 16.2% | 8.3% | 12.3% |
| this LoRA | 14.7% | 14.7% | 9.4% |

All training and evaluation scripts are in my repo: https://github.com/phineas-pta/fine-tune-whisper-vi

Usage example:

```python
# pip install transformers torchaudio peft accelerate bitsandbytes
import torch
import torchaudio
from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration, WhisperFeatureExtractor, WhisperTokenizer

PEFT_MODEL_ID = "doof-ferb/whisper-large-peft-lora-vi"
BASE_MODEL_ID = PeftConfig.from_pretrained(PEFT_MODEL_ID).base_model_name_or_path

FEATURE_EXTRACTOR = WhisperFeatureExtractor.from_pretrained(BASE_MODEL_ID)
TOKENIZER = WhisperTokenizer.from_pretrained(BASE_MODEL_ID)

# load the base model in float16, attach the LoRA adapter, then merge it back into the base weights
MODEL = PeftModel.from_pretrained(
    WhisperForConditionalGeneration.from_pretrained(BASE_MODEL_ID, torch_dtype=torch.float16).to("cuda:0"),
    PEFT_MODEL_ID
).merge_and_unload(progressbar=True)

# forced decoder prompt: Vietnamese, transcription task, no timestamps
DECODER_ID = torch.tensor(
    TOKENIZER.convert_tokens_to_ids(["<|startoftranscript|>", "<|vi|>", "<|transcribe|>", "<|notimestamps|>"]),
    device=MODEL.device
).unsqueeze(dim=0)

waveform, sampling_rate = torchaudio.load("audio.mp3")
if waveform.size(0) > 1:  # convert stereo to mono
    waveform = waveform.mean(dim=0, keepdim=True)
if sampling_rate != 16_000:  # Whisper expects 16 kHz audio
    waveform = torchaudio.functional.resample(waveform, orig_freq=sampling_rate, new_freq=16_000)

inputs = FEATURE_EXTRACTOR(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt").to(MODEL.device)
with torch.inference_mode(), torch.autocast(device_type="cuda"):  # autocast needed: float16 weights, float32 input features
    predicted_ids = MODEL.generate(input_features=inputs.input_features, decoder_input_ids=DECODER_ID)
print(TOKENIZER.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```
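
The WER numbers above can be reproduced along these lines. This is a minimal sketch, not the actual evaluation script (those are in the GitHub repo linked above): it reuses `MODEL`, `FEATURE_EXTRACTOR`, `TOKENIZER` and `DECODER_ID` from the usage example, assumes the VIVOS transcript column is named `sentence`, and applies no extra text normalization before scoring.

```python
# pip install datasets evaluate jiwer
import evaluate
from datasets import Audio, load_dataset

wer_metric = evaluate.load("wer")
# VIVOS test split, decoded at 16 kHz (column names are assumptions)
test_set = load_dataset("vivos", split="test", trust_remote_code=True).cast_column("audio", Audio(sampling_rate=16_000))

predictions, references = [], []
for sample in test_set:
    inputs = FEATURE_EXTRACTOR(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt").to(MODEL.device)
    with torch.inference_mode(), torch.autocast(device_type="cuda"):
        predicted_ids = MODEL.generate(input_features=inputs.input_features, decoder_input_ids=DECODER_ID)
    predictions.append(TOKENIZER.batch_decode(predicted_ids, skip_special_tokens=True)[0])
    references.append(sample["sentence"])  # assumed transcript column name

print(wer_metric.compute(predictions=predictions, references=references))
```

Scores obtained this way may differ slightly from the table depending on whether case and punctuation are normalized before computing WER.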