|
--- |
|
language: ja |
|
license: apache-2.0 |
|
tags: |
|
- audio |
|
- automatic-speech-recognition |
|
- hf-asr-leaderboard |
|
metrics: |
|
- wer |
|
widget: |
|
- example_title: CommonVoice 8.0 (Test Split) |
|
src: https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac |
|
- example_title: JSUT Basic 5000 |
|
src: https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac |
|
- example_title: ReazonSpeech (Test Split) |
|
src: https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac |
|
pipeline_tag: automatic-speech-recognition |
|
model-index: |
|
- name: kotoba-tech/kotoba-whisper-v1.1 |
|
results: |
|
- task: |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: CommonVoice_8.0 (Japanese) |
|
type: japanese-asr/ja_asr.common_voice_8_0 |
|
metrics: |
|
- type: WER |
|
value: 59.27 |
|
name: WER |
|
- type: CER |
|
value: 9.44 |
|
name: CER |
|
- task: |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: ReazonSpeech (Test) |
|
type: japanese-asr/ja_asr.reazonspeech_test |
|
metrics: |
|
- type: WER |
|
value: 56.62 |
|
name: WER |
|
- type: CER |
|
value: 12.6 |
|
name: CER |
|
- task: |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: JSUT Basic5000 |
|
type: japanese-asr/ja_asr.jsut_basic5000 |
|
metrics: |
|
- type: WER |
|
value: 64.36 |
|
name: WER |
|
- type: CER |
|
value: 8.48 |
|
name: CER |
|
--- |
|
|
|
# Kotoba-Whisper-v1.1 |
|
_Kotoba-Whisper-v1.1_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), with
additional postprocessing stacks integrated as a [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features include
(i) improved timestamps obtained with [stable-ts](https://github.com/jianfch/stable-ts) and (ii) punctuation restoration with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main).
These libraries are integrated into Kotoba-Whisper-v1.1 via the pipeline and are applied seamlessly to the transcriptions predicted by [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
The pipeline was developed through a collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).
|
|
|
|
|
The following table presents the raw CER (unlike the usual CER, punctuation is not removed before computing the metric; see the evaluation script [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/blob/main/run_short_form_eval.py)).
|
|
|
|
|
| model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test | |
|
|:---------------------------------------------------------|---------------------------------------:|-------------------------------------:|----------------------------------------:| |
|
| kotoba-tech/kotoba-whisper-v1.0 | 17.8 | 15.2 | **17.8** | |
|
| kotoba-tech/kotoba-whisper-v1.1 (stable-ts) | 17.8 | 15.2 | **17.8** | |
|
| kotoba-tech/kotoba-whisper-v1.1 (punctuator) | 16.0 | **11.7** | 18.5 | |
|
| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | 16.0 | **11.7** | 18.5 | |
|
| openai/whisper-large-v3 | **15.2** | 13.4 | 20.6 | |
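
For reference, raw CER of this kind can be computed with the 🤗 [`evaluate`](https://github.com/huggingface/evaluate) library by scoring predictions and references without stripping punctuation. The snippet below is a minimal sketch with toy strings, not the actual evaluation data; the evaluation script linked above remains the authoritative setup.

```python
import evaluate

# raw CER: punctuation is kept in both predictions and references before scoring
cer_metric = evaluate.load("cer")

predictions = ["こんにちは。世界。"]  # toy model output (punctuation retained)
references = ["こんにちは、世界。"]   # toy reference transcription

print(cer_metric.compute(predictions=predictions, references=references))
```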
|
|
|
|
|
## Transformers Usage |
|
Kotoba-Whisper-v1.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
install the latest version of Transformers.
|
|
|
```bash |
|
pip install --upgrade pip |
|
pip install --upgrade transformers accelerate torchaudio |
|
pip install stable-ts==2.16.0 |
|
pip install punctuators==0.0.5 |
|
``` |
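
You can quickly verify that the installed version satisfies the 4.39 requirement (a simple sanity check, not part of the original setup):

```python
import transformers

# Kotoba-Whisper-v1.1 requires transformers >= 4.39
print(transformers.__version__)
```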
|
|
|
### Transcription |
|
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) |
|
class to transcribe audio files as follows: |
|
|
|
```python |
|
import torch |
|
from transformers import pipeline |
|
from datasets import load_dataset |
|
|
|
# config |
|
model_id = "kotoba-tech/kotoba-whisper-v1.1" |
|
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 |
|
device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {} |
|
generate_kwargs = {"language": "japanese", "task": "transcribe"} |
|
|
|
# load model |
|
pipe = pipeline( |
|
model=model_id, |
|
torch_dtype=torch_dtype, |
|
device=device, |
|
model_kwargs=model_kwargs, |
|
chunk_length_s=15, |
|
batch_size=16, |
|
trust_remote_code=True, |
|
stable_ts=False, |
|
punctuator=True |
|
) |
|
|
|
# load sample audio |
|
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test") |
|
sample = dataset[0]["audio"] |
|
|
|
# run inference |
|
result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs) |
|
print(result) |
|
``` |
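
With `return_timestamps=True`, the returned dictionary holds the full transcription under `text` and a list of timestamped segments under `chunks` (assuming the standard ASR pipeline output format); a minimal sketch for iterating over the segments:

```python
# each chunk carries a (start, end) timestamp tuple in seconds plus its text
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```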
|
|
|
- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline (for audio that is already loaded in memory, see the sketch after this list):
|
```diff |
|
- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs) |
|
+ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs) |
|
``` |
|
|
|
- By default, stable-ts is deactivated. To activate stable-ts:
|
```diff |
|
- stable_ts=False, |
|
+ stable_ts=True, |
|
``` |
|
|
|
- By default, the punctuator is activated. To deactivate the punctuator:
|
```diff |
|
- punctuator=True, |
|
+ punctuator=False, |
|
``` |
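
If the audio is already loaded in memory, the pipeline also accepts a dictionary carrying the raw waveform and its sampling rate. Below is a minimal sketch using `torchaudio` (installed above); `audio.mp3` is a placeholder, and Whisper models expect 16 kHz input, so the waveform is resampled first:

```python
import torchaudio

# load a local file and resample to the 16 kHz expected by Whisper
waveform, sampling_rate = torchaudio.load("audio.mp3")
waveform = torchaudio.functional.resample(waveform, sampling_rate, 16_000)

# pass the first channel as a numpy array together with its sampling rate
result = pipe(
    {"raw": waveform[0].numpy(), "sampling_rate": 16_000},
    return_timestamps=True,
    generate_kwargs=generate_kwargs,
)
print(result)
```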
|
|
|
### Transcription with Prompt |
|
Kotoba-Whisper can generate transcriptions with prompting, as shown below:
|
|
|
```python |
|
import re |
|
import torch |
|
from transformers import pipeline |
|
from datasets import load_dataset |
|
|
|
# config |
|
model_id = "kotoba-tech/kotoba-whisper-v1.1" |
|
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 |
|
device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {} |
|
generate_kwargs = {"language": "japanese", "task": "transcribe"} |
|
|
|
# load model |
|
pipe = pipeline( |
|
model=model_id, |
|
torch_dtype=torch_dtype, |
|
device=device, |
|
model_kwargs=model_kwargs, |
|
chunk_length_s=15, |
|
batch_size=16, |
|
trust_remote_code=True |
|
) |
|
|
|
# load sample audio |
|
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test") |
|
|
|
# --- Without prompt --- |
|
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text'] |
|
print(text) |
|
# 81歳、力強い走りに変わってきます。
|
|
|
# --- With prompt ---: Let's change `81` to `91`. |
|
prompt = "91歳"
|
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device) |
|
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text'] |
|
# currently the pipeline for ASR appends the prompt at the beginning of the transcription, so remove it |
|
text = re.sub(rf"\A\s*{prompt}\s*", "", text) |
|
print(text) |
|
# あっぶったでもスルガさん、91歳、力強い走りに変わってきます。
|
``` |
|
|
|
### Flash Attention 2 |
|
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
if your GPU supports it. To do so, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
|
|
|
```bash
|
pip install flash-attn --no-build-isolation |
|
``` |
|
|
|
Then pass `attn_implementation="flash_attention_2"` through `model_kwargs` (which is forwarded to `from_pretrained`):
|
|
|
```diff |
|
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {} |
|
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {} |
|
``` |
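
Because flash-attn is not installable on every machine, one option is to pick the attention implementation at runtime with the `is_flash_attn_2_available` utility from Transformers; a minimal sketch:

```python
import torch
from transformers.utils import is_flash_attn_2_available

# prefer Flash Attention 2 when it is installed, otherwise fall back to SDPA
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
model_kwargs = {"attn_implementation": attn_implementation} if torch.cuda.is_available() else {}
```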
|
|
|
|
|
## Acknowledgements |
|
* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3). |
|
* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
|
* Hugging Face 🤗 for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
|
* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech). |