File size: 10,100 Bytes
a4a9954 c855343 d90b22e c855343 d90b22e c855343 d90b22e c855343 d90b22e c855343 d90b22e c855343 27efd41 d90b22e a4a9954 c855343 1b7749a 1cb5c30 1b7749a 1cb5c30 bd9e759 1b7749a f89e76b 1b7749a ce2d123 1c228b5 8fccc61 ce2d123 98b3fc0 c855343 b03ec50 c855343 b13ab57 1cb5c30 ce2d123 1cb5c30 c855343 ce2d123 1cb5c30 ce2d123 1cb5c30 ce2d123 1cb5c30 c855343 b13ab57 c855343 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 |
---
language: ja
license: apache-2.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
metrics:
- wer
widget:
- example_title: CommonVoice 8.0 (Test Split)
src: https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac
- example_title: JSUT Basic 5000
src: https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac
- example_title: ReazonSpeech (Test Split)
src: https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac
pipeline_tag: automatic-speech-recognition
model-index:
- name: kotoba-tech/kotoba-whisper-v1.1
results:
- task:
type: automatic-speech-recognition
dataset:
name: CommonVoice_8.0 (Japanese)
type: japanese-asr/ja_asr.common_voice_8_0
metrics:
- type: WER
value: 59.27
name: WER
- type: CER
value: 9.44
name: CER
- task:
type: automatic-speech-recognition
dataset:
name: ReazonSpeech (Test)
type: japanese-asr/ja_asr.reazonspeech_test
metrics:
- type: WER
value: 56.62
name: WER
- type: CER
value: 12.6
name: CER
- task:
type: automatic-speech-recognition
dataset:
name: JSUT Basic5000
type: japanese-asr/ja_asr.jsut_basic5000
metrics:
- type: WER
value: 64.36
name: WER
- type: CER
value: 8.48
name: CER
---
# Kotoba-Whisper-v1.1
_Kotoba-Whisper-v1.1_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), with
additional postprocessing stacks integrated as [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features includes
(i) improved timestamp achieved by [stable-ts](https://github.com/jianfch/stable-ts) and (ii) adding punctuation with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main).
These libraries are merged into Kotoba-Whisper-v1.1 via pipeline and will be applied seamlessly to the predicted transcription from [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
The pipeline has been developed through the collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech)
Following table presents the raw CER (unlike usual CER where the punctuations are removed before computing the metrics, see the evaluation script [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/blob/main/run_short_form_eval.py))
along with the.
| model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
|:---------------------------------------------------------|---------------------------------------:|-------------------------------------:|----------------------------------------:|
| kotoba-tech/kotoba-whisper-v1.0 | 15.6 | 15.2 | 17.8 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | 13.7 | ***11.2*** | ***17.4*** |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator) | 13.9 | 11.4 | 18 |
| kotoba-tech/kotoba-whisper-v1.1 (stable-ts) | 15.7 | 15 | 17.7 |
| openai/whisper-large-v3 | ***12.9*** | 13.4 | 20.6 |
Regarding to the normalized CER, since those update from v1.1 will be removed by the normalization, kotoba-tech/kotoba-whisper-v1.1 marks the same CER values as [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
### Latency
Kotoba-whisper-v1.1 improves the punctuation and the timestamp of the output from Kotoba-whisper-v1.0. However, since we apply the punctuator and stable-ts to each chunk,
we need to obtain the timestamps, which decreases the latency of the original kotoba-whisper-v1.0. See the following table comparing the inference speed on
transcribing **50min** Japanese speech audio, where we report the average over five independent runs.
| model | return_timestamps | time (mean) |
|:---------------------------------------------------------|:--------------------|--------------:|
| kotoba-tech/kotoba-whisper-v1.0 | False | 10.8 |
| kotoba-tech/kotoba-whisper-v1.0 | True | 15.7 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | True | 17.9 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator) | True | 17.7 |
| kotoba-tech/kotoba-whisper-v1.1 (stable-ts) | True | 16.1 |
| openai/whisper-large-v3 | False | 29.1 |
| openai/whisper-large-v3 | True | 37.9 |
See the full table [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/raw/main/latency.csv).
## Transformers Usage
Kotoba-Whisper-v1.1 is supported in the Hugging Face π€ Transformers library from version 4.39 onwards. To run the model, first
install the latest version of Transformers.
```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5
```
### Transcription
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio files as follows:
```python
import torch
from transformers import pipeline
from datasets import load_dataset
# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}
# load model
pipe = pipeline(
model=model_id,
torch_dtype=torch_dtype,
device=device,
model_kwargs=model_kwargs,
chunk_length_s=15,
batch_size=16,
trust_remote_code=True,
stable_ts=True,
punctuator=True
)
# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]
# run inference
result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)
```
- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```diff
- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)
```
- To deactivate stable-ts:
```diff
- stable_ts=True,
+ stable_ts=False,
```
- To deactivate punctuator:
```diff
- punctuator=True,
+ punctuator=False,
```
### Transcription with Prompt
Kotoba-whisper can generate transcription with prompting as below:
```python
import re
import torch
from transformers import pipeline
from datasets import load_dataset
# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}
# load model
pipe = pipeline(
model=model_id,
torch_dtype=torch_dtype,
device=device,
model_kwargs=model_kwargs,
chunk_length_s=15,
batch_size=16,
trust_remote_code=True
)
# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
# --- Without prompt ---
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
print(text)
# 81ζ³γεεΌ·γθ΅°γγ«ε€γγ£γ¦γγΎγγ
# --- With prompt ---: Let's change `81` to `91`.
prompt = "91ζ³"
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
# currently the pipeline for ASR appends the prompt at the beginning of the transcription, so remove it
text = re.sub(rf"\A\s*{prompt}\s*", "", text)
print(text)
# γγ£γΆγ£γγ§γγΉγ«γ¬γγγ91ζ³γεεΌ·γθ΅°γγ«ε€γγ£γ¦γγΎγγ
```
### Flash Attention 2
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
```
pip install flash-attn --no-build-isolation
```
Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
```diff
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
```
## Acknowledgements
* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
* Hugging Face π€ [Transformers](https://github.com/huggingface/transformers) for the model integration.
* Hugging Face π€ for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech). |