File size: 10,100 Bytes

---
language: ja
license: apache-2.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
metrics:
- wer
widget:
- example_title: CommonVoice 8.0 (Test Split)
  src: https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac
- example_title: JSUT Basic 5000
  src: https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac
- example_title: ReazonSpeech (Test Split)
  src: https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac
pipeline_tag: automatic-speech-recognition
model-index:
- name: kotoba-tech/kotoba-whisper-v1.1
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: CommonVoice_8.0 (Japanese)
      type: japanese-asr/ja_asr.common_voice_8_0
    metrics:
    - type: WER
      value: 59.27
      name: WER
    - type: CER
      value: 9.44
      name: CER
  - task:
      type: automatic-speech-recognition
    dataset:
      name: ReazonSpeech (Test)
      type: japanese-asr/ja_asr.reazonspeech_test
    metrics:
    - type: WER
      value: 56.62
      name: WER
    - type: CER
      value: 12.6
      name: CER
  - task:
      type: automatic-speech-recognition
    dataset:
      name: JSUT Basic5000
      type: japanese-asr/ja_asr.jsut_basic5000
    metrics:
    - type: WER
      value: 64.36
      name: WER
    - type: CER
      value: 8.48
      name: CER
---

# Kotoba-Whisper-v1.1
_Kotoba-Whisper-v1.1_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), with 
additional postprocessing stacks integrated as [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features includes 
(i) improved timestamp achieved by [stable-ts](https://github.com/jianfch/stable-ts) and (ii) adding punctuation with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main). 
These libraries are merged into Kotoba-Whisper-v1.1 via pipeline and will be applied seamlessly to the predicted transcription from [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
The pipeline has been developed through the collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech)


Following table presents the raw CER (unlike usual CER where the punctuations are removed before computing the metrics, see the evaluation script [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/blob/main/run_short_form_eval.py))
along with the.


| model                                                    |   CommonVoice 8.0 (Japanese) |   JSUT Basic 5000 |  ReazonSpeech Test |
|:---------------------------------------------------------|---------------------------------------:|-------------------------------------:|----------------------------------------:|
| kotoba-tech/kotoba-whisper-v1.0                          |                                   15.6 |                                 15.2 |                                    17.8 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) |                                   13.7 |                                 ***11.2*** |                                    ***17.4*** |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator)             |                                   13.9 |                                 11.4 |                                    18   |
| kotoba-tech/kotoba-whisper-v1.1 (stable-ts)              |                                   15.7 |                                 15   |                                    17.7 |
| openai/whisper-large-v3                                  |                                   ***12.9*** |                                 13.4 |                                    20.6 |

Regarding to the normalized CER, since those update from v1.1 will be removed by the normalization, kotoba-tech/kotoba-whisper-v1.1 marks the same CER values as [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).

### Latency
Kotoba-whisper-v1.1 improves the punctuation and the timestamp of the output from Kotoba-whisper-v1.0. However, since we apply the punctuator and stable-ts to each chunk,
we need to obtain the timestamps, which decreases the latency of the original kotoba-whisper-v1.0. See the following table comparing the inference speed on 
transcribing **50min** Japanese speech audio, where we report the average over five independent runs.

| model                                                    | return_timestamps   |   time (mean) |
|:---------------------------------------------------------|:--------------------|--------------:|
| kotoba-tech/kotoba-whisper-v1.0                          | False               |          10.8 |
| kotoba-tech/kotoba-whisper-v1.0                          | True                |          15.7 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | True                |          17.9 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator)             | True                |          17.7 |
| kotoba-tech/kotoba-whisper-v1.1 (stable-ts)              | True                |          16.1 |
| openai/whisper-large-v3                                  | False               |          29.1 |
| openai/whisper-large-v3                                  | True                |          37.9 |


See the full table [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/raw/main/latency.csv).

## Transformers Usage
Kotoba-Whisper-v1.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first 
install the latest version of Transformers.

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5
```

### Transcription
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio files as follows:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
    stable_ts=True,
    punctuator=True
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# run inference
result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)
```

- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```diff
- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)
```

- To deactivate stable-ts:
```diff
-     stable_ts=True,
+     stable_ts=False,
```

- To deactivate punctuator:
```diff
-     punctuator=True,
+     punctuator=False,
```

### Transcription with Prompt
Kotoba-whisper can generate transcription with prompting as below:

```python
import re
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")

# --- Without prompt ---
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
print(text)
# 81歳、力強い走りに変わってきます。

# --- With prompt ---: Let's change `81` to `91`.
prompt = "91歳"
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
# currently the pipeline for ASR appends the prompt at the beginning of the transcription, so remove it
text = re.sub(rf"\A\s*{prompt}\s*", "", text)
print(text)
# あっぶったでもスルガさん、91歳、力強い走りに変わってきます。
```

### Flash Attention 2
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) 
if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```
pip install flash-attn --no-build-isolation
```

Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:

```diff
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
```


## Acknowledgements
* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
* Hugging Face 🤗 for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech).