---
license: apache-2.0
base_model: openai/whisper-large-v3
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: Hibiki_ASR_Phonemizer
  results: []
language:
- ja
---

# Hibiki ASR Phonemizer

This model is a phoneme-level speech recognition network, a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) trained on a mixture of different Japanese datasets.

In addition to phoneme-level transcription, it can:

- detect and transcribe non-speech sounds such as gasps, erotic moans, etc.
- add punctuation more faithfully.

A grapheme-level version (i.e., normal Japanese text output) will probably be trained as well.

It achieves the following results on the evaluation set:
- Loss: 0.2186
- Wer: 21.6707

## Inference and Post-proc

The snippet below loads the model, transcribes an audio clip (resampled to 16 kHz), and runs `post_fix` to map any kana remaining in the decoded text to IPA.

```python
# The kana-to-IPA mapping below was borrowed and modified from Aaron Yinghao Li, the author of the StyleTTS paper.
# Multi-kana sequences are listed before the single-kana entries so that post_fix() replaces the longest match first.
kana_mapper = dict([
    ("ゔぁ","ba"), ("ゔぃ","bi"), ("ゔぇ","be"), ("ゔぉ","bo"), ("ゔゃ","bʲa"), ("ゔゅ","bʲɯ"), ("ゔゃ","bʲa"), ("ゔょ","bʲo"), ("ゔ","bɯ"),
    ("あぁ"," aː"), ("いぃ"," iː"), ("いぇ"," je"), ("いゃ"," ja"), ("うぅ"," ɯː"), ("えぇ"," eː"), ("おぉ"," oː"),
    ("かぁ"," kaː"), ("きぃ"," kiː"), ("くぅ","kɯː"), ("くゃ","ka"), ("くゅ","kʲɯ"), ("くょ","kʲo"), ("けぇ","keː"), ("こぉ","koː"),
    ("がぁ","gaː"), ("ぎぃ","giː"), ("ぐぅ","gɯː"), ("ぐゃ","gʲa"), ("ぐゅ","gʲɯ"), ("ぐょ","gʲo"), ("げぇ","geː"), ("ごぉ","goː"),
    ("さぁ","saː"), ("しぃ","ɕiː"), ("すぅ","sɯː"), ("すゃ","sʲa"), ("すゅ","sʲɯ"), ("すょ","sʲo"), ("せぇ","seː"), ("そぉ","soː"),
    ("ざぁ","zaː"), ("じぃ","dʑiː"), ("ずぅ","zɯː"), ("ずゃ","zʲa"), ("ずゅ","zʲɯ"), ("ずょ","zʲo"), ("ぜぇ","zeː"), ("ぞぉ","zeː"),
    ("たぁ","taː"), ("ちぃ","tɕiː"), ("つぁ","tsa"), ("つぃ","tsi"), ("つぅ","tsɯː"), ("つゃ","tɕa"), ("つゅ","tɕɯ"), ("つょ","tɕo"), ("つぇ","tse"), ("つぉ","tso"), ("てぇ","teː"), ("とぉ","toː"),
    ("だぁ","daː"), ("ぢぃ","dʑiː"), ("づぅ","dɯː"), ("づゃ","zʲa"), ("づゅ","zʲɯ"), ("づょ","zʲo"), ("でぇ","deː"), ("どぉ","doː"),
    ("なぁ","naː"), ("にぃ","niː"), ("ぬぅ","nɯː"), ("ぬゃ","nʲa"), ("ぬゅ","nʲɯ"), ("ぬょ","nʲo"), ("ねぇ","neː"), ("のぉ","noː"),
    ("はぁ","haː"), ("ひぃ","çiː"), ("ふぅ","ɸɯː"), ("ふゃ","ɸʲa"), ("ふゅ","ɸʲɯ"), ("ふょ","ɸʲo"), ("へぇ","heː"), ("ほぉ","hoː"),
    ("ばぁ","baː"), ("びぃ","biː"), ("ぶぅ","bɯː"), ("ふゃ","ɸʲa"), ("ぶゅ","bʲɯ"), ("ふょ","ɸʲo"), ("べぇ","beː"), ("ぼぉ","boː"),
    ("ぱぁ","paː"), ("ぴぃ","piː"), ("ぷぅ","pɯː"), ("ぷゃ","pʲa"), ("ぷゅ","pʲɯ"), ("ぷょ","pʲo"), ("ぺぇ","peː"), ("ぽぉ","poː"),
    ("まぁ","maː"), ("みぃ","miː"), ("むぅ","mɯː"), ("むゃ","mʲa"), ("むゅ","mʲɯ"), ("むょ","mʲo"), ("めぇ","meː"), ("もぉ","moː"),
    ("やぁ","jaː"), ("ゆぅ","jɯː"), ("ゆゃ","jaː"), ("ゆゅ","jɯː"), ("ゆょ","joː"), ("よぉ","joː"),
    ("らぁ","ɽaː"), ("りぃ","ɽiː"), ("るぅ","ɽɯː"), ("るゃ","ɽʲa"), ("るゅ","ɽʲɯ"), ("るょ","ɽʲo"), ("れぇ","ɽeː"), ("ろぉ","ɽoː"),
    ("わぁ","ɯaː"), ("をぉ","oː"),
    ("う゛","bɯ"), ("でぃ","di"), ("でぇ","deː"), ("でゃ","dʲa"), ("でゅ","dʲɯ"), ("でょ","dʲo"),
    ("てぃ","ti"), ("てぇ","teː"), ("てゃ","tʲa"), ("てゅ","tʲɯ"), ("てょ","tʲo"),
    ("すぃ","si"), ("ずぁ","zɯa"), ("ずぃ","zi"), ("ずぅ","zɯ"), ("ずゃ","zʲa"), ("ずゅ","zʲɯ"), ("ずょ","zʲo"), ("ずぇ","ze"), ("ずぉ","zo"),
    ("きゃ","kʲa"), ("きゅ","kʲɯ"), ("きょ","kʲo"),
    ("しゃ","ɕʲa"), ("しゅ","ɕʲɯ"), ("しぇ","ɕʲe"), ("しょ","ɕʲo"),
    ("ちゃ","tɕa"), ("ちゅ","tɕɯ"), ("ちぇ","tɕe"), ("ちょ","tɕo"),
    ("とぅ","tɯ"), ("とゃ","tʲa"), ("とゅ","tʲɯ"), ("とょ","tʲo"),
    ("どぁ","doa"), ("どぅ","dɯ"), ("どゃ","dʲa"), ("どゅ","dʲɯ"), ("どょ","dʲo"), ("どぉ","doː"),
    ("にゃ","nʲa"), ("にゅ","nʲɯ"), ("にょ","nʲo"),
    ("ひゃ","çʲa"), ("ひゅ","çʲɯ"), ("ひょ","çʲo"),
    ("みゃ","mʲa"), ("みゅ","mʲɯ"), ("みょ","mʲo"),
    ("りゃ","ɽʲa"), ("りぇ","ɽʲe"), ("りゅ","ɽʲɯ"), ("りょ","ɽʲo"),
    ("ぎゃ","gʲa"), ("ぎゅ","gʲɯ"), ("ぎょ","gʲo"),
    ("ぢぇ","dʑe"), ("ぢゃ","dʑa"), ("ぢゅ","dʑɯ"), ("ぢょ","dʑo"),
    ("じぇ","dʑe"), ("じゃ","dʑa"), ("じゅ","dʑɯ"), ("じょ","dʑo"),
    ("びゃ","bʲa"), ("びゅ","bʲɯ"), ("びょ","bʲo"),
    ("ぴゃ","pʲa"), ("ぴゅ","pʲɯ"), ("ぴょ","pʲo"),
    ("うぁ","ɯa"), ("うぃ","ɯi"), ("うぇ","ɯe"), ("うぉ","ɯo"), ("うゃ","ɯʲa"), ("うゅ","ɯʲɯ"), ("うょ","ɯʲo"),
    ("ふぁ","ɸa"), ("ふぃ","ɸi"), ("ふぅ","ɸɯ"), ("ふゃ","ɸʲa"), ("ふゅ","ɸʲɯ"), ("ふょ","ɸʲo"), ("ふぇ","ɸe"), ("ふぉ","ɸo"),
    ("あ"," a"), ("い"," i"), ("う","ɯ"), ("え"," e"), ("お"," o"),
    ("か"," ka"), ("き"," ki"), ("く"," kɯ"), ("け"," ke"), ("こ"," ko"),
    ("さ"," sa"), ("し"," ɕi"), ("す"," sɯ"), ("せ"," se"), ("そ"," so"),
    ("た"," ta"), ("ち"," tɕi"), ("つ"," tsɯ"), ("て"," te"), ("と"," to"),
    ("な"," na"), ("に"," ni"), ("ぬ"," nɯ"), ("ね"," ne"), ("の"," no"),
    ("は"," ha"), ("ひ"," çi"), ("ふ"," ɸɯ"), ("へ"," he"), ("ほ"," ho"),
    ("ま"," ma"), ("み"," mi"), ("む"," mɯ"), ("め"," me"), ("も"," mo"),
    ("ら"," ɽa"), ("り"," ɽi"), ("る"," ɽɯ"), ("れ"," ɽe"), ("ろ"," ɽo"),
    ("が"," ga"), ("ぎ"," gi"), ("ぐ"," gɯ"), ("げ"," ge"), ("ご"," go"),
    ("ざ"," za"), ("じ"," dʑi"), ("ず"," zɯ"), ("ぜ"," ze"), ("ぞ"," zo"),
    ("だ"," da"), ("ぢ"," dʑi"), ("づ"," zɯ"), ("で"," de"), ("ど"," do"),
    ("ば"," ba"), ("び"," bi"), ("ぶ"," bɯ"), ("べ"," be"), ("ぼ"," bo"),
    ("ぱ"," pa"), ("ぴ"," pi"), ("ぷ"," pɯ"), ("ぺ"," pe"), ("ぽ"," po"),
    ("や"," ja"), ("ゆ"," jɯ"), ("よ"," jo"),
    ("わ"," ɯa"), ("ゐ"," i"), ("ゑ"," e"),
    ("ん"," ɴ"), ("っ"," ʔ"), ("ー"," ː"),
    ("ぁ"," a"), ("ぃ"," i"), ("ぅ"," ɯ"), ("ぇ"," e"), ("ぉ"," o"),
    ("ゎ"," ɯa"), ("ぉ"," o"),
    ("を","o")
])


def post_fix(text):
    # Replace any kana left in the transcription with their IPA equivalents.
    for k, v in kana_mapper.items():
        text = text.replace(k, v)
    return text


from datasets import Dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("Respair/Hibiki_ASR_Phonemizer").to("cuda:0")

forced_decoder_ids = processor.get_decoder_prompt_ids(task="transcribe", language='japanese')

sample = Dataset.from_dict({"audio": ["/content/kl_chunk1987.wav"]}).cast_column("audio", Audio(16000))
sample = sample[0]['audio']

# Ensure the input features are on the same device as the model
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features.to("cuda:0")

# generate token ids
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids, repetition_penalty=1.2)

# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

# ad-hoc fixes for a few recurring mis-transcriptions
if "ki ni ɕinai" in transcription[0]:
    transcription[0] = transcription[0].replace("ki ni ɕinai", " ki ni ɕinai")

if ' ʔt' in transcription[0]:
    transcription[0] = transcription[0].replace(' ʔt', "ʔt")

if ' neɽitai ' in transcription[0]:
    transcription[0] = transcription[0].replace(' neɽitai ', "naɽitai")

if 'harɯdʑisama' in transcription[0]:
    transcription[0] = transcription[0].replace('harɯdʑisama', "arɯdʑisama")

if 'de aɽoɯ' in transcription[0]:
    transcription[0] = transcription[0].replace('de aɽoɯ', " de aɽoɯ")

print(post_fix(transcription[0].lstrip()))
```
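For repeated use, the same steps can be wrapped into a small helper. This is a minimal sketch rather than part of the original card: `transcribe_phonemes` is a hypothetical name, it reuses the `processor`, `model`, `forced_decoder_ids`, and `post_fix` objects defined above, and it skips the ad-hoc string fixes for brevity.

```python
def transcribe_phonemes(path: str) -> str:
    # Load and resample one audio file to 16 kHz, as in the snippet above.
    audio = Dataset.from_dict({"audio": [path]}).cast_column("audio", Audio(16000))[0]["audio"]
    features = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
    ).input_features.to("cuda:0")
    ids = model.generate(features, forced_decoder_ids=forced_decoder_ids, repetition_penalty=1.2)
    text = processor.batch_decode(ids, skip_special_tokens=True)[0]
    # Map any remaining kana to IPA before returning.
    return post_fix(text.lstrip())


print(transcribe_phonemes("/content/kl_chunk1987.wav"))
```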
## Intended uses & limitations

More information needed

## Training and evaluation data

- Japanese Common Voice 17
- ehehe Corpus
- Custom Game and Anime dataset (around 8 hours)

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 24
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 6000
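For reference, here is a minimal sketch of how these settings might be expressed with Transformers' `Seq2SeqTrainingArguments`. The `output_dir`, the eval/save cadence, and `predict_with_generate` are assumptions (the results table below only suggests evaluation every 1000 steps); they are not taken from the original training script.

```python
from transformers import Seq2SeqTrainingArguments

# A possible mapping of the listed hyperparameters onto Seq2SeqTrainingArguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="hibiki_asr_phonemizer",  # assumption: not stated in the card
    learning_rate=1e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=6000,
    evaluation_strategy="steps",  # assumption: eval every 1000 steps, per the results table
    eval_steps=1000,
    save_steps=1000,
    predict_with_generate=True,   # needed so WER can be computed from generated text
)
```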
### Training results

| Training Loss | Epoch  | Step | Validation Loss | Wer     |
|:-------------:|:------:|:----:|:---------------:|:-------:|
| 0.2101        | 0.8058 | 1000 | 0.2090          | 30.1840 |
| 0.1369        | 1.6116 | 2000 | 0.1837          | 27.6756 |
| 0.0838        | 2.4174 | 3000 | 0.1829          | 26.4036 |
| 0.0454        | 3.2232 | 4000 | 0.1922          | 20.9549 |
| 0.0434        | 4.0290 | 5000 | 0.2072          | 20.8898 |
| 0.021         | 4.8348 | 6000 | 0.2186          | 21.6707 |

### Framework versions

- Transformers 4.41.1
- Pytorch 2.4.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
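If results differ noticeably from the numbers above, one quick thing to check is whether the local packages match these versions. A small sanity-check sketch (the expected values are simply the versions listed above):

```python
import datasets
import tokenizers
import torch
import transformers

# Versions this card reports for training/evaluation.
expected = {
    "transformers": "4.41.1",
    "torch": "2.4.0+cu121",
    "datasets": "2.19.1",
    "tokenizers": "0.19.1",
}
installed = {
    "transformers": transformers.__version__,
    "torch": torch.__version__,
    "datasets": datasets.__version__,
    "tokenizers": tokenizers.__version__,
}
for name in expected:
    print(f"{name}: installed {installed[name]}, card used {expected[name]}")
```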