---
language: ja
license: apache-2.0
tags:
  - audio
  - automatic-speech-recognition
  - hf-asr-leaderboard
metrics:
  - wer
widget:
  - example_title: CommonVoice 8.0 (Test Split)
    src: >-
      https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac
  - example_title: JSUT Basic 5000
    src: >-
      https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac
  - example_title: ReazonSpeech (Test Split)
    src: >-
      https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac
pipeline_tag: automatic-speech-recognition
model-index:
  - name: kotoba-tech/kotoba-whisper-v1.1
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: CommonVoice_8.0 (Japanese)
          type: japanese-asr/ja_asr.common_voice_8_0
        metrics:
          - type: WER
            value: 59.27
            name: WER
          - type: CER
            value: 9.44
            name: CER
      - task:
          type: automatic-speech-recognition
        dataset:
          name: ReazonSpeech (Test)
          type: japanese-asr/ja_asr.reazonspeech_test
        metrics:
          - type: WER
            value: 56.62
            name: WER
          - type: CER
            value: 12.6
            name: CER
      - task:
          type: automatic-speech-recognition
        dataset:
          name: JSUT Basic5000
          type: japanese-asr/ja_asr.jsut_basic5000
        metrics:
          - type: WER
            value: 64.36
            name: WER
          - type: CER
            value: 8.48
            name: CER
---

# Kotoba-Whisper-v1.1

Kotoba-Whisper-v1.1 is a Japanese ASR model based on kotoba-tech/kotoba-whisper-v1.0, with additional postprocessing stacks integrated into the pipeline. The new features include (i) improved timestamps via stable-ts and (ii) punctuation insertion with punctuators. These libraries are merged into Kotoba-Whisper-v1.1 via the pipeline and are applied seamlessly to the transcription predicted by kotoba-tech/kotoba-whisper-v1.0. The pipeline was developed through a collaboration between Asahi Ushio and Kotoba Technologies.

The following table presents the raw CER (unlike the usual CER, punctuation is kept when computing the metric; see the evaluation script here).

| model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
|:--|---:|---:|---:|
| kotoba-tech/kotoba-whisper-v1.0 | 15.6 | 15.2 | 17.8 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | 13.7 | 11.2 | 17.4 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator) | 13.9 | 11.4 | 18.0 |
| kotoba-tech/kotoba-whisper-v1.1 (stable-ts) | 15.7 | 15.0 | 17.7 |
| openai/whisper-large-v3 | 12.9 | 13.4 | 20.6 |

As for the normalized CER, since the v1.1 additions (punctuation and refined timestamps) are removed by the normalization, kotoba-tech/kotoba-whisper-v1.1 marks the same CER values as kotoba-tech/kotoba-whisper-v1.0.
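
To make the raw-vs-normalized distinction concrete, here is a minimal sketch using the Hugging Face `evaluate` CER metric; the sample strings and the punctuation-stripping rule are illustrative assumptions, not the actual evaluation script:

```python
import re
from evaluate import load

cer_metric = load("cer")
reference = "81歳、力強い走りに変わってきます。"  # ground truth with punctuation
prediction = "81歳力強い走りに変わってきます"     # hypothesis without punctuation

# Raw CER: punctuation is kept, so every missing mark counts as an error.
raw_cer = cer_metric.compute(references=[reference], predictions=[prediction])

# Normalized CER: strip punctuation from both sides before scoring,
# which cancels out the punctuator's contribution.
strip_punct = lambda s: re.sub(r"[、。]", "", s)
normalized_cer = cer_metric.compute(
    references=[strip_punct(reference)],
    predictions=[strip_punct(prediction)],
)
print(f"raw CER: {raw_cer:.3f}, normalized CER: {normalized_cer:.3f}")
```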

## Latency

Kotoba-Whisper-v1.1 improves the punctuation and the timestamps of the output of Kotoba-Whisper-v1.0. However, since the punctuator and stable-ts are applied to each chunk, the pipeline needs to obtain chunk-level timestamps, which increases the latency relative to the original kotoba-whisper-v1.0. The following table compares inference speed on transcribing 50 minutes of Japanese speech audio. In addition to the effect of timestamps, we compare different attention implementations, models (the Kotoba-Whisper models and whisper-large-v3), and activating/deactivating the punctuator and stable-ts for kotoba-whisper-v1.1.

| model | return_timestamps | stable_ts | punctuator | attention | time (mean) |
|:--|:--|:--|:--|:--|---:|
| kotoba-tech/kotoba-whisper-v1.0 | False | | | flash_attention_2 | 10.7136 |
| kotoba-tech/kotoba-whisper-v1.0 | False | | | sdpa | 10.7695 |
| kotoba-tech/kotoba-whisper-v1.0 | False | | | | 10.7792 |
| kotoba-tech/kotoba-whisper-v1.0 | True | | | flash_attention_2 | 15.5307 |
| kotoba-tech/kotoba-whisper-v1.0 | True | | | sdpa | 15.8254 |
| kotoba-tech/kotoba-whisper-v1.0 | True | | | | 15.7362 |
| kotoba-tech/kotoba-whisper-v1.1 | True | False | True | flash_attention_2 | 17.6345 |
| kotoba-tech/kotoba-whisper-v1.1 | True | False | True | sdpa | 18.0241 |
| kotoba-tech/kotoba-whisper-v1.1 | True | False | True | | 17.7098 |
| kotoba-tech/kotoba-whisper-v1.1 | True | True | False | flash_attention_2 | 16.0146 |
| kotoba-tech/kotoba-whisper-v1.1 | True | True | False | sdpa | 16.4895 |
| kotoba-tech/kotoba-whisper-v1.1 | True | True | False | | 16.1083 |
| kotoba-tech/kotoba-whisper-v1.1 | True | True | True | flash_attention_2 | 17.6783 |
| kotoba-tech/kotoba-whisper-v1.1 | True | True | True | sdpa | 18.2042 |
| kotoba-tech/kotoba-whisper-v1.1 | True | True | True | | 17.9164 |
| openai/whisper-large-v3 | False | | | flash_attention_2 | 28.436 |
| openai/whisper-large-v3 | False | | | sdpa | 28.9149 |
| openai/whisper-large-v3 | False | | | | 29.1029 |
| openai/whisper-large-v3 | True | | | | 37.871 |
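
Timings like those above can be reproduced along the following lines; the audio file name and repetition count are illustrative assumptions, and absolute numbers will depend on your hardware:

```python
import time
import torch
from transformers import pipeline

# One configuration from the table: v1.1 with stable-ts + punctuator, SDPA.
pipe = pipeline(
    model="kotoba-tech/kotoba-whisper-v1.1",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "sdpa"},
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
    stable_ts=True,
    punctuator=True,
)

# Time a few full transcription passes and report the mean.
times = []
for _ in range(3):  # repetition count is an arbitrary choice
    start = time.perf_counter()
    pipe(
        "long_audio_50min.mp3",  # hypothetical 50-minute recording
        return_timestamps=True,
        generate_kwargs={"language": "japanese", "task": "transcribe"},
    )
    times.append(time.perf_counter() - start)
print(f"mean time: {sum(times) / len(times):.4f}")
```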

## Transformers Usage

Kotoba-Whisper-v1.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers together with the postprocessing dependencies:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5
```
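
As a quick sanity check of the stated version requirement, you can verify the installed Transformers version (a minimal sketch; `packaging` ships as a Transformers dependency):

```python
from packaging import version
import transformers

# The model card states support from Transformers 4.39 onwards.
assert version.parse(transformers.__version__) >= version.parse("4.39.0"), \
    "Please upgrade: pip install --upgrade transformers"
```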

### Transcription

The model can be used with the pipeline class to transcribe audio files as follows:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
    stable_ts=True,
    punctuator=True
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# run inference
result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)
```
- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

```diff
- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)
```

- To deactivate stable-ts:

```diff
-     stable_ts=True,
+     stable_ts=False,
```

- To deactivate punctuator:

```diff
-     punctuator=True,
+     punctuator=False,
```
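
The `result` returned by the snippet above is a dictionary containing the full transcription and, since `return_timestamps=True`, per-segment chunks, following the standard Transformers ASR pipeline output format:

```python
# Inspect the pipeline output: full text plus timestamped chunks.
print(result["text"])
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]  # end may be None for a trailing chunk
    print(f"[{start} - {end}] {chunk['text']}")
```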

### Transcription with Prompt

Kotoba-Whisper can generate transcriptions with a prompt as follows:

```python
import re
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")

# --- Without prompt ---
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
print(text)
# 81ζ­³γ€εŠ›εΌ·γ„θ΅°γ‚Šγ«ε€‰γ‚γ£γ¦γγΎγ™γ€‚

# --- With prompt ---: Let's change `81` to `91`.
prompt = "91ζ­³"
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
# currently the pipeline for ASR appends the prompt at the beginning of the transcription, so remove it
text = re.sub(rf"\A\s*{prompt}\s*", "", text)
print(text)
# γ‚γ£γΆγ£γŸγ§γ‚‚γ‚Ήγƒ«γ‚¬γ•γ‚“γ€91ζ­³γ€εŠ›εΌ·γ„θ΅°γ‚Šγ«ε€‰γ‚γ£γ¦γγΎγ™γ€‚
```

### Flash Attention 2

We recommend using Flash-Attention 2 if your GPU allows for it. To do so, you first need to install Flash Attention:

```bash
pip install flash-attn --no-build-isolation
```

Then pass `attn_implementation="flash_attention_2"` in `model_kwargs`:

```diff
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
```
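
If you want the code to fall back gracefully when flash-attn is missing, Transformers exposes an availability check (a minimal sketch; behavior on older Transformers versions is an assumption):

```python
import torch
from transformers.utils import is_flash_attn_2_available

# Use Flash Attention 2 when available, otherwise fall back to SDPA.
attn = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
model_kwargs = {"attn_implementation": attn} if torch.cuda.is_available() else {}
```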

## Acknowledgements