asahi417 committed on
Commit
8f11afe
1 Parent(s): 00f981e

Update README.md

Files changed (1)
  1. README.md +32 -12
README.md CHANGED
@@ -48,18 +48,30 @@ model-index:
  - name: WER
  type: WER
  value:
-
---

# Kotoba-Whisper

- # Distil-Whisper: distil-large-v3

- Distil-Whisper was proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).

- This is the third and final installment of the Distil-Whisper English series. It the knowledge distilled version of
- OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3), the latest and most performant Whisper model
- to date.

Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give
**superior long-form transcription accuracy** with OpenAI's **sequential long-form algorithm**.
@@ -68,11 +80,20 @@ The result is a distilled model that performs to within 1% WER of large-v3 on lo
and chunked algorithms, and outperforms distil-large-v2 by 4.8% using the sequential algorithm. The model is also faster
than previous Distil-Whisper models: **6.3x faster than large-v3**, and 1.1x faster than distil-large-v2.

- | Model | Params / M | Rel. Latency | Short-Form | Sequential Long-Form | Chunked Long-Form |
- |------------------------------------------------------------------------------|------------|--------------|------------|----------------------|-------------------|
- | [large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 | 8.4 | 10.0 | 11.0 |
- | **[distil-large-v3](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)** | **756** | **6.3** | **9.7** | **10.8** | **10.9** |
- | [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2) | 756 | 5.8 | 10.1 | 15.6 | 11.6 |

Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
(Whisper cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries.
@@ -80,7 +101,6 @@ You can expect significant performance gains by switching from previous Distil-W
when using these libraries. For convenience, the weights for the most popular libraries are already converted,
with instructions for getting started below.

- ## Table of Contents

1. [Transformers Usage](#transformers-usage)
   * [Short-Form Transcription](#short-form-transcription)
 
  - name: WER
  type: WER
  value:
---

# Kotoba-Whisper
+ _Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR. Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
+ we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, while the student model consists of the full encoder of the
+ teacher Whisper model and a two-layer decoder initialized from the first and last layers of the teacher's decoder.
+ As the initial version, we release ***kotoba-whisper-v1.0***, trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
+ which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (roughly 5 seconds of audio and 18 text tokens per sample on average).
+
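As a rough illustration of the student initialization described above, the following sketch (an assumption-based example built on the `transformers` Whisper classes, not the official distil-whisper training script) shows how a two-layer decoder can be seeded from the teacher:

```python
# Minimal sketch (illustrative only): build a student with the teacher's full encoder
# and a 2-layer decoder initialized from the teacher's first and last decoder layers.
from transformers import WhisperConfig, WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Student config: identical to the teacher except for a 2-layer decoder.
student_config = WhisperConfig.from_pretrained("openai/whisper-large-v3")
student_config.decoder_layers = 2
student = WhisperForConditionalGeneration(student_config)

# Copy the full encoder from the teacher.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())

# Initialize the two student decoder layers from the teacher's first and last decoder layers.
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())

# Copy the remaining decoder components (token/positional embeddings, final layer norm).
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(teacher.model.decoder.layer_norm.state_dict())
```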
+ ### Benchmark
+ ***kotoba-whisper-v1.0*** achieves a better WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set from ReazonSpeech, and
+ a competitive WER on out-of-domain test sets including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
+ the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice).

+ | model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
+ |:-------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
+ | [***kotoba-tech/kotoba-whisper-v1.0***](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 59.27 | 64.36 | 56.62 |
+ | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 55.41 | 59.34 | 60.23 |
+ | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 63.64 | 69.52 | 76.04 |
+ | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 74.21 | 82.02 | 82.99 |
+ | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 93.78 | 97.72 | 94.85 |
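The exact evaluation splits and text normalization behind these numbers are not spelled out in this excerpt; as a hedged sketch, a WER score on one of the test sets could be computed along these lines with `transformers`, `datasets`, and `evaluate`:

```python
# Illustrative WER evaluation sketch; the dataset choice, split, and normalization are
# assumptions and may differ from the setup used for the table above.
import torch
import evaluate
from datasets import load_dataset
from transformers import pipeline

# Assumes a CUDA GPU; drop device/torch_dtype to run on CPU.
pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v1.0",
    torch_dtype=torch.float16,
    device="cuda:0",
)
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# CommonVoice 8.0 (Japanese) test split; requires accepting the dataset terms on the Hub.
dataset = load_dataset("mozilla-foundation/common_voice_8_0", "ja", split="test")

predictions, references = [], []
for sample in dataset:
    out = pipe(sample["audio"], generate_kwargs=generate_kwargs)
    predictions.append(out["text"])
    references.append(sample["sentence"])

wer = evaluate.load("wer")
print(100 * wer.compute(predictions=predictions, references=references))
```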


+ ### Latency

Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give
**superior long-form transcription accuracy** with OpenAI's **sequential long-form algorithm**.

and chunked algorithms, and outperforms distil-large-v2 by 4.8% using the sequential algorithm. The model is also faster
than previous Distil-Whisper models: **6.3x faster than large-v3**, and 1.1x faster than distil-large-v2.

+ | Model | Params / M | Rel. Latency |
+ |------------------------------------------------------------------------------|------------|--------------|
+ | [large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 |
+ | **[distil-large-v3](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)** | **756** | **6.3** |
+
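Relative latency figures such as the ones above come from timing generation of the same audio with each model; the snippet below is a rough sketch under assumed settings (warm-up run, float16, a single short clip), not the measurement procedure used for the table:

```python
# Rough latency comparison sketch: time transcription of the same clip with two models.
import time

import torch
from datasets import load_dataset
from transformers import pipeline

# Any audio clip works for a rough comparison; this dummy dataset is just convenient.
clip = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]["audio"]

def timed_transcribe(model_id: str, audio: dict) -> float:
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        torch_dtype=torch.float16,
        device="cuda:0",
    )
    pipe(dict(audio))  # warm-up run (CUDA kernels, weight loading)
    start = time.perf_counter()
    pipe(dict(audio))  # pass a copy, since the pipeline may mutate its input dict
    return time.perf_counter() - start

t_teacher = timed_transcribe("openai/whisper-large-v3", clip)
t_student = timed_transcribe("kotoba-tech/kotoba-whisper-v1.0", clip)
print(f"relative latency: {t_teacher / t_student:.1f}x")
```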
+
+
+ ## Table of Contents
+
+ The models are shared via Hugging Face, and the distillation code was adapted from the [official distil-whisper training scripts](https://github.com/huggingface/distil-whisper/tree/main/training),
+ which we release in this repository. As the training dataset, we employ [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
+ one of the largest paired speech-and-text datasets in Japanese, and we evaluate our ASR models on [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
+ the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice), as well as a held-out test set from ReazonSpeech.
+

Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
(Whisper cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries.

when using these libraries. For convenience, the weights for the most popular libraries are already converted,
with instructions for getting started below.
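Those library-specific instructions sit outside this excerpt; as an illustration only, Faster-Whisper usage typically follows its standard API, with the converted repository id and audio path below being assumptions for the sake of example rather than confirmed names:

```python
# Illustrative Faster-Whisper usage; the CTranslate2-converted repo id and the local
# audio file are placeholders, not confirmed artifacts of this release.
from faster_whisper import WhisperModel

model = WhisperModel("kotoba-tech/kotoba-whisper-v1.0-faster", device="cuda", compute_type="float16")

segments, info = model.transcribe("sample_ja.wav", language="ja", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```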


1. [Transformers Usage](#transformers-usage)
   * [Short-Form Transcription](#short-form-transcription)