kotoba-tech
/

kotoba-whisper-v1.0

@@ -24,30 +24,39 @@ model-index:
       - task:
           type: automatic-speech-recognition
         dataset:
-          name: japanese-asr/ja_asr.common_voice_8_0
           type: japanese-asr/ja_asr.common_voice_8_0
         metrics:
           - name: WER
             type: WER
-            value:
       - task:
           type: automatic-speech-recognition
         dataset:
-          name: japanese-asr/ja_asr.reazonspeech_test
           type: japanese-asr/ja_asr.reazonspeech_test
         metrics:
           - name: WER
             type: WER
-            value:
       - task:
           type: automatic-speech-recognition
         dataset:
-          name: japanese-asr/ja_asr.reazonspeech_test
-          type: japanese-asr/ja_asr.reazonspeech_test
         metrics:
           - name: WER
             type: WER
-            value:
 ---
 # Kotoba-Whisper
@@ -56,54 +65,19 @@ we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-larg
 teacher whisper model, and a decoder with two layers initialized from the first and last layer of the whisper model.
 As the initial version, we release ***kotoba-whisper-v1.0*** trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
 which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec of audio with 18 text tokens on average).
-### Benchmark
-***kotoba-whisper-v1.0*** achieves better WER than the [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) in the in-domain held-out test set from ReazonSpeech, and
-achieves competitive WER on the out-of-domain test set including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
-the Japanese subset from [CommonVoice 8.0](https://huggingface.co/datasets/common_voice).
-- CER
-| Model                                                                                             | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
-|:--------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
-| [***kotoba-tech/kotoba-whisper-v1.0***](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)   |               9.44 |             8.48 |               12.6  |
-| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                         |               8.52 |             7.18 |               15.18 |
-| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                             |              11.34 |             9.87 |               29.56 |
-| [openai/whisper-small](https://huggingface.co/openai/whisper-small)                               |              15.26 |            14.22 |               34.29 |
-| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)                                 |              46.86 |            35.69 |               96.69 |
-### Latency
-Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give
-**superior long-form transcription accuracy** with OpenAI's **sequential long-form algorithm**.
-The result is a distilled model that performs to within 1% WER of large-v3 on long-form audio using both the sequential
-and chunked algorithms, and outperforms distil-large-v2 by 4.8% using the sequential algorithm. The model is also faster
-than previous Distil-Whisper models: **6.3x faster than large-v3**, and 1.1x faster than distil-large-v2.
-| Model                                                                        | Params / M | Rel. Latency |
-|------------------------------------------------------------------------------|------------|--------------|
-| [large-v3](https://huggingface.co/openai/whisper-large-v3)                   | 1550       | 1.0          |
-| **[distil-large-v3](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)**| **756**    | **6.3**      |
 ## Table of Contents
-The models are shared via Hugging Face, and the distillation code was adapted from the [official distil-whisper training scripts](https://github.com/huggingface/distil-whisper/tree/main/training),
-which we release in this repository. As the training dataset, we employ [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
-one of the largest paired speech and text datasets in Japanese, and we evaluate our ASR models on [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
-the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice), as well as a held-out test set from ReazonSpeech.
 Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
 (Whisper cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries.
 You can expect significant performance gains by switching from previous Distil-Whisper checkpoints to distil-large-v3
 when using these libraries. For convenience, the weights for the most popular libraries are already converted,
 with instructions for getting started below.
-1. [Transformers Usage](#transformers-usage)
    * [Short-Form Transcription](#short-form-transcription)
    * [Sequential Long-Form](#sequential-long-form)
    * [Chunked Long-Form](#chunked-long-form)
@@ -115,6 +89,42 @@ with instructions for getting started below.
 3. [Model Details](#model-details)
 ## Transformers Usage
 distil-large-v3 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first

       - task:
           type: automatic-speech-recognition
         dataset:
+          name: CommonVoice_8.0 (Japanese)
           type: japanese-asr/ja_asr.common_voice_8_0
         metrics:
           - name: WER
             type: WER
+            value: 59.27
+          - name: CER
+            type: CER
+            value: 9.44
       - task:
           type: automatic-speech-recognition
         dataset:
+          name: ReazonSpeech (Test)
           type: japanese-asr/ja_asr.reazonspeech_test
         metrics:
           - name: WER
             type: WER
+            value: 56.62
+          - name: CER
+            type: CER
+            value: 12.60
       - task:
           type: automatic-speech-recognition
         dataset:
+          name: JSUT Basic5000
+          type: japanese-asr/ja_asr.jsut_basic5000
         metrics:
           - name: WER
             type: WER
+            value: 64.36
+          - name: CER
+            type: CER
+            value: 8.48
 ---
 # Kotoba-Whisper
 teacher whisper model, and a decoder with two layers initialized from the first and last layer of the whisper model.
 As the initial version, we release ***kotoba-whisper-v1.0*** trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
 which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec of audio with 18 text tokens on average).
+Kotoba-whisper-v1.0 is competitive with, and in some cases outperforms, the largest Whisper model on Japanese ASR benchmarks, while being 6.3 times faster.
 ## Table of Contents
 Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
 (Whisper cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries.
 You can expect significant performance gains by switching from previous Distil-Whisper checkpoints to distil-large-v3
 when using these libraries. For convenience, the weights for the most popular libraries are already converted,
 with instructions for getting started below.
+1. [Evaluation Results](#evaluation-results)
+2. [Transformers Usage](#transformers-usage)
    * [Short-Form Transcription](#short-form-transcription)
    * [Sequential Long-Form](#sequential-long-form)
    * [Chunked Long-Form](#chunked-long-form)
 3. [Model Details](#model-details)
+## Evaluation Results
+***kotoba-whisper-v1.0*** achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set from ReazonSpeech, and
+achieves competitive CER and WER on the out-of-domain test sets, namely [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
+the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice).
+### CER
+| Model                                                                                           | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
+|:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
+| [***kotoba-tech/kotoba-whisper-v1.0***](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) |                       9.44 |            8.48 |             12.60 |
+| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                       |                       8.52 |            7.18 |             15.18 |
+| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                           |                      11.34 |            9.87 |             29.56 |
+| [openai/whisper-small](https://huggingface.co/openai/whisper-small)                             |                      15.26 |           14.22 |             34.29 |
+| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)                               |                      46.86 |           35.69 |             96.69 |
+### WER
+| Model                                                                                           | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
+|:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
+| [***kotoba-tech/kotoba-whisper-v1.0***](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) |                      59.27 |           64.36 |             56.62 |
+| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                       |                      55.41 |           59.34 |             60.23 |
+| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                           |                      63.64 |           69.52 |             76.04 |
+| [openai/whisper-small](https://huggingface.co/openai/whisper-small)                             |                      74.21 |           82.02 |             82.99 |
+| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)                               |                      93.78 |           97.72 |             94.85 |
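
The CER and WER figures in the tables above are edit-distance error rates expressed as percentages. The following is a minimal sketch of how such scores can be computed, not the evaluation script behind the reported numbers: the source does not state the text normalization or tokenization used, so the helper names (`edit_distance`, `error_rate`, `cer`, `wer`), the whitespace removal for CER, and the whitespace split for WER are all assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference unit
                            curr[j - 1] + 1,      # insertion of a hypothesis unit
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, tokenize):
    """Corpus-level error rate in percent: total edits / total reference units."""
    edits = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units, hyp_units = tokenize(ref), tokenize(hyp)
        edits += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return 100.0 * edits / total


def cer(references, hypotheses):
    # Assumption: characters are the units and spaces are ignored.
    return error_rate(references, hypotheses, lambda s: list(s.replace(" ", "")))


def wer(references, hypotheses):
    # Assumption: words are whitespace-separated; Japanese text would need
    # to be segmented into words before this split is meaningful.
    return error_rate(references, hypotheses, str.split)
```

Because Japanese is written without spaces, a WER value depends heavily on how the text is segmented into words, whereas CER does not need a segmentation step.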
+### Latency
+As kotoba-whisper uses the same architecture as [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3),
+it inherits the benefit of the improved latency compared to [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
+(**6.3x faster than large-v3**, see the table below taken from [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)).
+| Model                                                                        | Params / M | Rel. Latency |
+|------------------------------------------------------------------------------|------------|--------------|
+| **[kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)**| **756**    | **6.3**      |
+| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                   | 1550       | 1.0          |
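
The "Rel. Latency" column reads as a speed-up factor relative to large-v3 (1.0 for the baseline, 6.3 for the distilled model). As a rough sketch of how such a ratio could be measured locally, the helpers below time two callables and divide the median durations. The names `time_call` and `relative_speedup` are hypothetical, and this is not the procedure used to produce the table, which is taken from the distil-large-v3 model card.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Return the wall-clock durations (in seconds) of `repeats` calls to fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def relative_speedup(baseline_durations, candidate_durations):
    """Median baseline duration divided by median candidate duration."""
    return statistics.median(baseline_durations) / statistics.median(candidate_durations)
```

In practice, `fn` would wrap a transcription call on a fixed audio sample for each model, with the same hardware, batch size, and decoding settings for both, and a warm-up run discarded before timing.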
 ## Transformers Usage
 distil-large-v3 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first