Automatic Speech Recognition
Transformers
Safetensors
Japanese
whisper
audio
hf-asr-leaderboard
Eval Results
Inference Endpoints
asahi417 committed
Commit 8f11afe · 1 parent: 00f981e

Update README.md

Files changed (1)
  1. README.md +32 -12
README.md CHANGED
@@ -48,18 +48,30 @@ model-index:
    - name: WER
      type: WER
      value:
-
  ---

  # Kotoba-Whisper

- # Distil-Whisper: distil-large-v3

- Distil-Whisper was proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).

- This is the third and final installment of the Distil-Whisper English series. It the knowledge distilled version of
- OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3), the latest and most performant Whisper model
- to date.

  Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give
  **superior long-form transcription accuracy** with OpenAI's **sequential long-form algorithm**.
@@ -68,11 +80,20 @@ The result is a distilled model that performs to within 1% WER of large-v3 on lo
  and chunked algorithms, and outperforms distil-large-v2 by 4.8% using the sequential algorithm. The model is also faster
  than previous Distil-Whisper models: **6.3x faster than large-v3**, and 1.1x faster than distil-large-v2.

- | Model | Params / M | Rel. Latency | Short-Form | Sequential Long-Form | Chunked Long-Form |
- |------------------------------------------------------------------------------|------------|--------------|------------|----------------------|-------------------|
- | [large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 | 8.4 | 10.0 | 11.0 |
- | **[distil-large-v3](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)** | **756** | **6.3** | **9.7** | **10.8** | **10.9** |
- | [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2) | 756 | 5.8 | 10.1 | 15.6 | 11.6 |

  Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
  (Whisper cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries.
@@ -80,7 +101,6 @@ You can expect significant performance gains by switching from previous Distil-W
  when using these libraries. For convenience, the weights for the most popular libraries are already converted,
  with instructions for getting started below.

- ## Table of Contents

  1. [Transformers Usage](#transformers-usage)
     * [Short-Form Transcription](#short-form-transcription)
 
    - name: WER
      type: WER
      value:
  ---

  # Kotoba-Whisper
+ _Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR. Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
+ we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, and a student model that consists of the full encoder of the
+ teacher Whisper model and a two-layer decoder initialized from the first and last layers of the teacher's decoder.
+ As the initial version, we release ***kotoba-whisper-v1.0***, trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
+ which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (on average, 5 seconds of audio with 18 text tokens per sample).
+
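As an illustration of this student architecture, the sketch below shows how such a model could be assembled with 🤗 Transformers. This is not the released training code (the distillation scripts are referenced later in this card); the layer-copying logic is an assumption made for exposition.

```python
# Sketch: assemble a Whisper student with the teacher's full encoder and a
# 2-layer decoder initialized from the teacher's first and last decoder layers.
# Assumption-laden illustration only; not the official distillation script.
import copy

from transformers import WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Student config: identical to the teacher except for the 2-layer decoder.
student_config = copy.deepcopy(teacher.config)
student_config.decoder_layers = 2
student = WhisperForConditionalGeneration(student_config)

# Reuse the teacher's encoder weights unchanged.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())

# Initialize the two student decoder layers from the teacher's first and last decoder layers.
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())

# Copy the shared decoder embeddings and final layer norm as well.
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(teacher.model.decoder.layer_norm.state_dict())
```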
+ ### Benchmark
+ ***kotoba-whisper-v1.0*** achieves a better WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set from ReazonSpeech, and
+ a competitive WER on out-of-domain test sets, including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
+ the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice).

+ | Model                                                                                            | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
+ |:-------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
+ | [***kotoba-tech/kotoba-whisper-v1.0***](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)   |                      59.27 |           64.36 |             56.62 |
+ | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                         |                      55.41 |           59.34 |             60.23 |
+ | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                             |                      63.64 |           69.52 |             76.04 |
+ | [openai/whisper-small](https://huggingface.co/openai/whisper-small)                               |                      74.21 |           82.02 |             82.99 |
+ | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)                                 |                      93.78 |           97.72 |             94.85 |
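The numbers above are the authors' reported results. As a minimal sketch, a score in the same metric can be computed with the `evaluate` library, assuming reference transcripts and model outputs are already available as parallel lists of strings (dataset loading and any text normalization used for the table are omitted here).

```python
# Minimal WER computation sketch with the `evaluate` library.
# `references` and `predictions` are placeholder examples; the normalization
# applied for the benchmark above is not reproduced here.
import evaluate

wer_metric = evaluate.load("wer")

references = ["これはテストです"]      # ground-truth transcripts
predictions = ["これは テスト です"]   # model outputs

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {100 * wer:.2f}")
```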

+ ### Latency

  Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give
  **superior long-form transcription accuracy** with OpenAI's **sequential long-form algorithm**.

  and chunked algorithms, and outperforms distil-large-v2 by 4.8% using the sequential algorithm. The model is also faster
  than previous Distil-Whisper models: **6.3x faster than large-v3**, and 1.1x faster than distil-large-v2.

+ | Model                                                                          | Params / M | Rel. Latency |
+ |--------------------------------------------------------------------------------|------------|--------------|
+ | [large-v3](https://huggingface.co/openai/whisper-large-v3)                      | 1550       | 1.0          |
+ | **[distil-large-v3](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)**   | **756**    | **6.3**      |
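As a rough way to check relative latency on your own hardware, one can time the same ASR pipeline with both checkpoints. The snippet below is a sketch only: the audio path is a placeholder, and absolute numbers depend on the GPU, precision, and generation settings.

```python
# Sketch: crude relative-latency comparison between two Whisper checkpoints.
# "sample_ja.wav" is a placeholder clip (short-form, < 30 s); requires ffmpeg.
import time

import torch
from transformers import pipeline


def time_transcription(model_id: str, audio_path: str, n_runs: int = 3) -> float:
    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        torch_dtype=torch.float16,
        device="cuda:0",
    )
    asr(audio_path)  # warm-up run
    start = time.perf_counter()
    for _ in range(n_runs):
        asr(audio_path)
    return (time.perf_counter() - start) / n_runs


baseline = time_transcription("openai/whisper-large-v3", "sample_ja.wav")
distilled = time_transcription("kotoba-tech/kotoba-whisper-v1.0", "sample_ja.wav")
print(f"relative speed-up: {baseline / distilled:.1f}x")
```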
+
+ ## Table of Contents
+
+ The models are shared via Hugging Face, and the distillation code was adapted from the [official distil-whisper training scripts](https://github.com/huggingface/distil-whisper/tree/main/training),
+ which we release in this repository. As the training dataset, we employ [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
+ one of the largest paired speech-text datasets in Japanese, and we evaluate our ASR models on [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
+ the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice), as well as a held-out test set from ReazonSpeech.
+
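For experimenting with the same corpora, the Hub datasets can be pulled with `datasets`; the config names below ("large", "ja") follow the respective dataset cards and should be treated as assumptions rather than part of this card. JSUT basic 5000 is distributed separately from its project page.

```python
# Sketch: loading the training/evaluation corpora mentioned above.
# ReazonSpeech ships a loading script (hence trust_remote_code); the "large"
# subset is ~1,253 hours, so a smaller config may be preferable for a quick look.
# CommonVoice 8.0 is gated and may require accepting its terms on the Hub first.
from datasets import load_dataset

reazonspeech = load_dataset(
    "reazon-research/reazonspeech", "large", trust_remote_code=True
)
common_voice_ja = load_dataset(
    "mozilla-foundation/common_voice_8_0", "ja", split="test"
)
print(reazonspeech)
print(common_voice_ja)
```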

  Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
  (Whisper cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries.
 
  when using these libraries. For convenience, the weights for the most popular libraries are already converted,
  with instructions for getting started below.

  1. [Transformers Usage](#transformers-usage)
     * [Short-Form Transcription](#short-form-transcription)
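As a preview of the Transformers usage that the table of contents points to, a minimal short-form transcription call might look as follows; the audio path and generation settings are assumptions for this sketch, not the card's official snippet.

```python
# Minimal short-form transcription sketch with the 🤗 Transformers pipeline.
# "audio.mp3" is a placeholder; generate_kwargs mirror typical Whisper usage
# for Japanese transcription.
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v1.0",
    torch_dtype=torch.float16,
    device="cuda:0",
)

result = pipe(
    "audio.mp3",
    generate_kwargs={"language": "japanese", "task": "transcribe"},
)
print(result["text"])
```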