asahi417 committed 645cd6f (1 parent: c9c5c56): Update README.md
Files changed (1): README.md (+16 -13)

_Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR. Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, and a student model that consists of the full encoder of the
teacher Whisper model and a decoder with two layers initialized from the first and last layers of the teacher's decoder.
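
The following minimal sketch illustrates that initialization scheme (illustrative only, following the distil-whisper recipe; the actual training setup is in the kotoba-whisper repository linked below):

```python
import copy
import torch.nn as nn
from transformers import AutoModelForSpeechSeq2Seq

# sketch of the student initialization described above:
# keep the full teacher encoder, and build a two-layer decoder
# from the teacher's first and last decoder layers
teacher = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")
student = copy.deepcopy(teacher)
decoder_layers = teacher.model.decoder.layers
student.model.decoder.layers = nn.ModuleList(
    [copy.deepcopy(decoder_layers[0]), copy.deepcopy(decoder_layers[-1])]
)
student.config.decoder_layers = 2
```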
 
As the initial version, we release ***kotoba-whisper-v1.0***, trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec of audio with 18 text tokens on average) after
transcriptions with a WER higher than 10 are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for detail).
The model was trained for 8 epochs with batch size 256 at a sampling rate of 16kHz, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
 
Kotoba-whisper-v1.0 achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set
from ReazonSpeech, and achieves competitive CER and WER on out-of-domain test sets including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice) (see [Evaluation](#evaluation) for detail).
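
As a quick orientation before the numbers, here is a minimal transcription sketch with the 🤗 Transformers `pipeline` API (the audio file name and the `generate_kwargs` are illustrative assumptions, not the official example):

```python
import torch
from transformers import pipeline

# minimal usage sketch: transcribe a local Japanese audio file
pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v1.0",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)
# "sample.wav" is a placeholder path; language/task follow standard Whisper usage
result = pipe("sample.wav", generate_kwargs={"language": "japanese", "task": "transcribe"})
print(result["text"])
```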
 
  - ***CER***
 
See [https://huggingface.co/distil-whisper/distil-large-v3#model-details](https://huggingface.co/distil-whisper/distil-large-v3#model-details).

## Evaluation
The following code snippet demonstrates how to evaluate the kotoba-whisper model on the Japanese subset of CommonVoice 8.0.
First, we need to install the required packages, including 🤗 Datasets to load the audio data and 🤗 Evaluate to
perform the WER calculation:

```bash
# the exact install commands are elided in this diff; an assumed equivalent for the packages named above
pip install --upgrade transformers datasets[audio] evaluate jiwer
```

Then the evaluation can be run as follows (lines elided by the diff are reconstructed here as a hedged sketch and marked in the comments):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor  # import list partly elided; assumed
from datasets import load_dataset, features
import evaluate
from tqdm import tqdm

# config
model_id = "kotoba-tech/kotoba-whisper-v1.0"
dataset_name = "japanese-asr/ja_asr.common_voice_8_0"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
audio_column = 'audio'

# load the model and processor (exact loading lines elided; assumed)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# load the dataset and sample the audio with 16kHz
dataset = load_dataset(dataset_name, split="test")
dataset = dataset.cast_column(audio_column, features.Audio(sampling_rate=processor.feature_extractor.sampling_rate))
dataset = dataset.select([0, 1, 2, 3, 4, 5, 6])

# inference loop (elided in the diff; an assumed sketch, including the reference column name)
all_transcriptions, all_references = [], []
for sample in tqdm(dataset):
    inputs = processor(sample[audio_column]["array"], sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt")
    pred = model.generate(inputs.input_features.to(device, dtype=torch_dtype), language="ja", task="transcribe")
    all_transcriptions.append(processor.batch_decode(pred, skip_special_tokens=True)[0])
    all_references.append(sample["transcription"])

# compute the CER with 🤗 Evaluate
cer_metric = evaluate.load("cer")
cer = 100 * cer_metric.compute(predictions=all_transcriptions, references=all_references)
print(cer)
```
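
Since the text above frames this as a WER calculation while the snippet reports CER, here is a small hedged extension that reuses `all_transcriptions` and `all_references` from the snippet:

```python
import evaluate

# also report WER alongside CER, reusing the prediction/reference lists
# collected by the evaluation snippet above (jiwer backs both metrics)
wer_metric = evaluate.load("wer")
wer = 100 * wer_metric.compute(predictions=all_transcriptions, references=all_references)
print(wer)
```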

Links to the major Japanese ASR datasets for evaluation are summarized in [this Hugging Face collection](https://huggingface.co/collections/japanese-asr/japanese-asr-evaluation-dataset-66051a03d6ca494d40baaa26).
For example, to evaluate the model on JSUT Basic5000, change the `dataset_name`:

```diff
- dataset_name = "japanese-asr/ja_asr.common_voice_8_0"
+ dataset_name = "japanese-asr/ja_asr.jsut_basic5000"
```
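
Before rerunning the evaluation, it can help to sanity-check the swapped dataset; a minimal sketch (the `test` split and column layout are assumed to match the CommonVoice set above):

```python
from datasets import load_dataset

# hedged sanity check: load JSUT Basic5000 and inspect its columns
jsut = load_dataset("japanese-asr/ja_asr.jsut_basic5000", split="test")
print(jsut)
```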
 
  ## Acknowledgements
  * OpenAI for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).