Automatic Speech Recognition
Transformers
Safetensors
Japanese
whisper
audio
hf-asr-leaderboard
Inference Endpoints
File size: 7,741 Bytes
a4a9954
c855343
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a4a9954
 
c855343
 
 
 
 
 
 
1b7749a
 
 
 
 
 
 
 
 
 
c855343
 
 
 
 
 
b03ec50
 
 
c855343
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b13ab57
 
c855343
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b13ab57
 
c855343
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
---
license: apache-2.0
language: ja
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: CommonVoice 8.0 (Test Split)
  src: >-
    https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac
- example_title: JSUT Basic 5000
  src: >-
    https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac
- example_title: ReazonSpeech (Test Split)
  src: >-
    https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac
pipeline_tag: automatic-speech-recognition
metrics:
- wer
model-index:
  - name: kotoba-tech/kotoba-whisper-v1.0
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: CommonVoice_8.0 (Japanese)
          type: japanese-asr/ja_asr.common_voice_8_0
        metrics:
          - name: WER
            type: WER
            value: TBA
          - name: CER
            type: CER
            value: TBA
      - task:
          type: automatic-speech-recognition
        dataset:
          name: ReazonSpeech (Test)
          type: japanese-asr/ja_asr.reazonspeech_test
        metrics:
          - name: WER
            type: WER
            value: TBA
          - name: CER
            type: CER
            value: TBA
      - task:
          type: automatic-speech-recognition
        dataset:
          name: JSUT Basic5000
          type: japanese-asr/ja_asr.jsut_basic5000
        metrics:
          - name: WER
            type: WER
            value: TBA
          - name: CER
            type: CER
            value: TBA
---

# Kotoba-Whisper-v1.1
_Kotoba-Whisper-v1.1_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), with 
additional postprocessing stacks integrated as [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features includes 
(i) improved timestamp achieved by [stable-ts](https://github.com/jianfch/stable-ts) and (ii) adding punctuation with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main). 
These libraries are merged into Kotoba-Whisper-v1.1 via pipeline and will be applied seamlessly to the predicted transcription from [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
The pipeline has been developed through the collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech)


Following table presents the raw CER (unlike usual CER where the punctuations are removed before computing the metrics).

| model                           |   CommonVoice 8.0 (Japanese) |   JSUT Basic 5000 |  ReazonSpeech Test |
|:--------------------------------|---------------------------------------:|-------------------------------------:|----------------------------------------:|
| kotoba-tech/kotoba-whisper-v1.0 |                                   17.8 |                                 15.2 |                                    17.8 |
| kotoba-tech/kotoba-whisper-v1.1 |                                   16   |                                 11.6 |                                    18.5 |
| openai/whisper-large-v3         |                                   15.4 |                                 13.6 |                                    20.7 |


## Transformers Usage
Kotoba-Whisper-v1.1 is supported in the Hugging Face πŸ€— Transformers library from version 4.39 onwards. To run the model, first 
install the latest version of Transformers.

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5
```

### Transcription
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio files as follows:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# run inference
result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)
```

- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```diff
- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)
```

### Transcription with Prompt
Kotoba-whisper can generate transcription with prompting as below:

```python
import re
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")

# --- Without prompt ---
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
print(text)
# 81ζ­³γ€εŠ›εΌ·γ„θ΅°γ‚Šγ«ε€‰γ‚γ£γ¦γγΎγ™γ€‚

# --- With prompt ---: Let's change `81` to `91`.
prompt = "91ζ­³"
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
# currently the pipeline for ASR appends the prompt at the beginning of the transcription, so remove it
text = re.sub(rf"\A\s*{prompt}\s*", "", text)
print(text)
# γ‚γ£γΆγ£γŸγ§γ‚‚γ‚Ήγƒ«γ‚¬γ•γ‚“γ€91ζ­³γ€εŠ›εΌ·γ„θ΅°γ‚Šγ«ε€‰γ‚γ£γ¦γγΎγ™γ€‚
```

### Flash Attention 2
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) 
if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```
pip install flash-attn --no-build-isolation
```

Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:

```diff
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
```


## Acknowledgements
* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
* Hugging Face πŸ€— [Transformers](https://github.com/huggingface/transformers) for the model integration.
* Hugging Face πŸ€— for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech).