---
language: ja
license: apache-2.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
metrics:
- wer
widget:
- example_title: CommonVoice 8.0 (Test Split)
  src: https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac
- example_title: JSUT Basic 5000
  src: https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac
- example_title: ReazonSpeech (Test Split)
  src: https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac
pipeline_tag: automatic-speech-recognition
model-index:
- name: kotoba-tech/kotoba-whisper-v1.1
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: CommonVoice_8.0 (Japanese)
      type: japanese-asr/ja_asr.common_voice_8_0
    metrics:
    - type: WER
      value: 59.27
      name: WER
    - type: CER
      value: 9.44
      name: CER
  - task:
      type: automatic-speech-recognition
    dataset:
      name: ReazonSpeech (Test)
      type: japanese-asr/ja_asr.reazonspeech_test
    metrics:
    - type: WER
      value: 56.62
      name: WER
    - type: CER
      value: 12.6
      name: CER
  - task:
      type: automatic-speech-recognition
    dataset:
      name: JSUT Basic5000
      type: japanese-asr/ja_asr.jsut_basic5000
    metrics:
    - type: WER
      value: 64.36
      name: WER
    - type: CER
      value: 8.48
      name: CER
---

# Kotoba-Whisper-v1.1
_Kotoba-Whisper-v1.1_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), with 
additional postprocessing stacks integrated as a [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features include 
(i) improved timestamps obtained with [stable-ts](https://github.com/jianfch/stable-ts) and (ii) punctuation added with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main). 
These libraries are integrated into Kotoba-Whisper-v1.1 via the pipeline and are applied to the transcription predicted by [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
The pipeline was developed in collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).


The following table presents the raw CER for each model. Unlike the usual CER, where punctuation is removed before computing the metric, the raw CER is computed with punctuation kept (see the evaluation script [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/blob/main/run_short_form_eval.py)).


| model                                                    |   CommonVoice 8.0 (Japanese) |   JSUT Basic 5000 |  ReazonSpeech Test |
|:---------------------------------------------------------|---------------------------------------:|-------------------------------------:|----------------------------------------:|
| kotoba-tech/kotoba-whisper-v1.0                          |                                   15.6 |                                 15.2 |                                    17.8 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) |                                   13.7 |                                 ***11.2*** |                                    ***17.4*** |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator)             |                                   13.9 |                                 11.4 |                                    18   |
| kotoba-tech/kotoba-whisper-v1.1 (stable-ts)              |                                   15.7 |                                 15   |                                    17.7 |
| openai/whisper-large-v3                                  |                                   ***12.9*** |                                 13.4 |                                    20.6 |

As for the normalized CER, the changes introduced in v1.1 are removed by the normalization, so kotoba-tech/kotoba-whisper-v1.1 yields the same CER values as [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
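
To make the difference between the raw and the normalized CER concrete, here is a minimal sketch. It is not the evaluation script linked above: the `strip_punctuation` helper is a simplified, hypothetical stand-in for the real normalization, and the sentence pair is hand-written for illustration.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)


def strip_punctuation(text: str) -> str:
    """Hypothetical normalizer: drop Unicode punctuation (P*) and whitespace."""
    return "".join(
        c for c in text
        if not unicodedata.category(c).startswith("P") and not c.isspace()
    )


# Hand-written example pair, not taken from any of the datasets above.
reference = "はい、そうです。"
hypothesis = "はいそうです"

raw = cer(reference, hypothesis)
normalized = cer(strip_punctuation(reference), strip_punctuation(hypothesis))
print(f"raw CER: {raw:.1f}, normalized CER: {normalized:.1f}")
# raw CER: 25.0, normalized CER: 0.0
```

A hypothesis that differs from the reference only in punctuation is penalized under the raw CER but not under the normalized one, which is why the punctuator affects only the raw numbers.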

### Latency
Kotoba-whisper-v1.1 improves the punctuation and the timestamps of the output from Kotoba-whisper-v1.0. However, since the punctuator and stable-ts are applied to each chunk,
timestamps must be computed, which increases the latency relative to the original kotoba-whisper-v1.0. The following table compares the inference speed when 
transcribing **50min** of Japanese speech audio, reported as the average over five independent runs.

| model                                                    | return_timestamps   |   time (mean) |
|:---------------------------------------------------------|:--------------------|--------------:|
| kotoba-tech/kotoba-whisper-v1.0                          | False               |          10.8 |
| kotoba-tech/kotoba-whisper-v1.0                          | True                |          15.7 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | True                |          17.9 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator)             | True                |          17.7 |
| kotoba-tech/kotoba-whisper-v1.1 (stable-ts)              | True                |          16.1 |
| openai/whisper-large-v3                                  | False               |          29.1 |
| openai/whisper-large-v3                                  | True                |          37.9 |


See the full table [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/raw/main/latency.csv).
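
A comparison like this one can be reproduced with a small timing harness. The sketch below is one possible way to average wall-clock time over several runs, not the script used to produce the table. The `mean_runtime` helper is hypothetical, and the workload is a stand-in to be replaced with a call to the pipeline.

```python
import statistics
import time
from typing import Callable


def mean_runtime(fn: Callable[[], object], n_runs: int = 5) -> float:
    """Call fn n_runs times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


# Stand-in workload; replace it with e.g.
# lambda: pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)
elapsed = mean_runtime(lambda: sum(range(100_000)))
print(f"mean over 5 runs: {elapsed:.4f} s")
```

When timing on a GPU, a warm-up call before measuring is usually worthwhile, since the first run tends to include one-off setup costs.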

## Transformers Usage
Kotoba-Whisper-v1.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first 
install the latest version of Transformers, together with stable-ts and punctuators.

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5
```

### Transcription
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio files as follows:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
    stable_ts=True,
    punctuator=True
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# run inference
result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)
```

- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```diff
- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)
```

- To deactivate stable-ts:
```diff
-     stable_ts=True,
+     stable_ts=False,
```

- To deactivate punctuator:
```diff
-     punctuator=True,
+     punctuator=False,
```
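
With `return_timestamps=True`, the result can be turned into timestamped lines. The sketch below assumes the output follows the usual Transformers ASR pipeline layout: a dict with a `"text"` field and a `"chunks"` list whose entries hold a `"timestamp"` pair and a `"text"`. The `example` dict is hand-written for illustration, not real model output, and the helper names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def chunks_to_lines(result: dict) -> list[str]:
    """Format each chunk as '[start -> end] text'."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # Guard against a missing end timestamp (assumed possible for a cut-off final segment).
        end_str = format_timestamp(end) if end is not None else "?"
        lines.append(f"[{format_timestamp(start)} -> {end_str}] {chunk['text']}")
    return lines


# Hand-written stand-in for a pipeline result.
example = {
    "text": "こんにちは。今日はいい天気です。",
    "chunks": [
        {"timestamp": (0.0, 1.52), "text": "こんにちは。"},
        {"timestamp": (1.52, 3.9), "text": "今日はいい天気です。"},
    ],
}
for line in chunks_to_lines(example):
    print(line)
```

In practice, `chunks_to_lines(result)` would be called on the `result` returned by `pipe(...)` in the example above.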

### Transcription with Prompt
Kotoba-whisper can condition its transcription on a text prompt, as shown below:

```python
import re
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")

# --- Without prompt ---
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
print(text)
# 81歳、力強い走りに変わってきます。

# --- With prompt ---: Let's change `81` to `91`.
prompt = "91歳"
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
# currently the pipeline for ASR appends the prompt at the beginning of the transcription, so remove it
text = re.sub(rf"\A\s*{prompt}\s*", "", text)
print(text)
# あっぶったでもスルガさん、91歳、力強い走りに変わってきます。
```
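
The prompt-removal step interpolates the prompt directly into a regular expression, which works for a plain prompt such as the one above. For prompts that may contain regex metacharacters, a slightly more defensive variant could look like the sketch below. The `strip_prompt` helper is hypothetical, and the input strings are hand-written stand-ins for pipeline output.

```python
import re


def strip_prompt(text: str, prompt: str) -> str:
    """Remove a leading copy of the prompt (and surrounding whitespace) from text."""
    # re.escape keeps characters such as "." or "(" in the prompt from acting as regex syntax.
    return re.sub(rf"\A\s*{re.escape(prompt)}\s*", "", text)


print(strip_prompt(" 91歳 あっぶったでもスルガさん、91歳、力強い走りに変わってきます。", "91歳"))
# あっぶったでもスルガさん、91歳、力強い走りに変わってきます。
print(strip_prompt("(a.b) hello", "(a.b)"))
# hello
```

Only the leading occurrence is removed because the pattern is anchored with `\A`, so a repeat of the prompt text later in the transcription is left intact.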

### Flash Attention 2
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) 
if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```bash
pip install flash-attn --no-build-isolation
```

Then pass `attn_implementation="flash_attention_2"` through `model_kwargs`, which the pipeline forwards to `from_pretrained`:

```diff
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
```
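
If the same script has to run on machines with and without Flash Attention installed, the choice can be made at runtime. The sketch below is one possible approach, with a hypothetical `pick_model_kwargs` helper. It only checks whether the `flash_attn` package is importable, which does not guarantee that the GPU itself supports it.

```python
import importlib.util


def pick_model_kwargs(has_cuda: bool) -> dict:
    """Choose an attention implementation based on what is available."""
    if not has_cuda:
        return {}  # CPU: fall back to the library default
    if importlib.util.find_spec("flash_attn") is not None:
        return {"attn_implementation": "flash_attention_2"}
    return {"attn_implementation": "sdpa"}


# In the examples above, has_cuda would be torch.cuda.is_available().
print(pick_model_kwargs(has_cuda=False))
# {}
```

The returned dict can be passed as `model_kwargs` when constructing the pipeline.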


## Acknowledgements
* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
* Hugging Face 🤗 for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech).