	---
	language: ja
	license: apache-2.0
	tags:
	- audio
	- automatic-speech-recognition
	- hf-asr-leaderboard
	metrics:
	- wer
	widget:
	- example_title: CommonVoice 8.0 (Test Split)
	  src: https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac
	- example_title: JSUT Basic 5000
	  src: https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac
	- example_title: ReazonSpeech (Test Split)
	  src: https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac
	pipeline_tag: automatic-speech-recognition
	model-index:
	- name: kotoba-tech/kotoba-whisper-v1.1
	  results:
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: CommonVoice_8.0 (Japanese)
	      type: japanese-asr/ja_asr.common_voice_8_0
	    metrics:
	    - type: WER
	      value: 59.27
	      name: WER
	    - type: CER
	      value: 9.44
	      name: CER
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: ReazonSpeech (Test)
	      type: japanese-asr/ja_asr.reazonspeech_test
	    metrics:
	    - type: WER
	      value: 56.62
	      name: WER
	    - type: CER
	      value: 12.6
	      name: CER
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: JSUT Basic5000
	      type: japanese-asr/ja_asr.jsut_basic5000
	    metrics:
	    - type: WER
	      value: 64.36
	      name: WER
	    - type: CER
	      value: 8.48
	      name: CER
	---

	# Kotoba-Whisper-v1.1
	_Kotoba-Whisper-v1.1_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), with
	additional postprocessing stacks integrated as a [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features include
	(i) improved timestamps via [stable-ts](https://github.com/jianfch/stable-ts) and (ii) added punctuation via [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main).
	These libraries are merged into Kotoba-Whisper-v1.1 via the pipeline and are applied seamlessly to the transcription predicted by [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
	The pipeline was developed in collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).


	The following table presents the raw CER of each model. Unlike the usual CER, punctuation is not removed before computing the metric (see the evaluation script [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/blob/main/run_short_form_eval.py)).
	The best score in each column is shown in italics.


	| model                                                    | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
	|:---------------------------------------------------------|---------------------------:|----------------:|------------------:|
	| kotoba-tech/kotoba-whisper-v1.0                          | 15.6                       | 15.2            | 17.8              |
	| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | 13.7                       | *11.2*          | *17.4*            |
	| kotoba-tech/kotoba-whisper-v1.1 (punctuator)             | 13.9                       | 11.4            | 18                |
	| kotoba-tech/kotoba-whisper-v1.1 (stable-ts)              | 15.7                       | 15              | 17.7              |
	| openai/whisper-large-v3                                  | *12.9*                     | 13.4            | 20.6              |

	As for the normalized CER, the changes introduced in v1.1 are removed by the normalization, so kotoba-tech/kotoba-whisper-v1.1 yields the same CER values as [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
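
To make the difference between raw and normalized CER concrete, here is a minimal sketch. It is not the evaluation script linked above: `edit_distance`, `cer`, and `strip_punctuation` are hypothetical helpers written for illustration, and the toy normalizer only drops a handful of punctuation marks and whitespace.

```python
import re


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def strip_punctuation(text: str) -> str:
    """Toy normalizer (an assumption): drop a few punctuation marks and whitespace."""
    return re.sub(r"[、。，．,.!?！？\s]", "", text)


reference = "81歳、力強い走りに変わってきます。"
without_punct = "81歳力強い走りに変わってきます"    # a transcription with no punctuation
with_punct = "81歳、力強い走りに変わってきます。"   # the same text after a punctuator

# Raw CER counts the two missing punctuation marks as errors.
print(cer(reference, without_punct))
# Raw CER drops to zero once the punctuation matches the reference.
print(cer(reference, with_punct))
# Normalized CER ignores punctuation, so both hypotheses score the same.
print(cer(strip_punctuation(reference), strip_punctuation(without_punct)))
```

This is why the punctuator changes the raw CER in the table while leaving the normalized CER untouched.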

	### Latency
	Kotoba-whisper-v1.1 improves the punctuation and the timestamps of the output from Kotoba-whisper-v1.0. However, since the punctuator and stable-ts are applied to each chunk,
	the timestamps must be computed, which increases the latency relative to the original kotoba-whisper-v1.0. The following table compares the inference time for
	transcribing 50 minutes of Japanese speech audio, averaged over five independent runs.

	| model                                                    | return_timestamps | time (mean) |
	|:---------------------------------------------------------|:------------------|------------:|
	| kotoba-tech/kotoba-whisper-v1.0                          | False             | 10.8        |
	| kotoba-tech/kotoba-whisper-v1.0                          | True              | 15.7        |
	| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | True              | 17.9        |
	| kotoba-tech/kotoba-whisper-v1.1 (punctuator)             | True              | 17.7        |
	| kotoba-tech/kotoba-whisper-v1.1 (stable-ts)              | True              | 16.1        |
	| openai/whisper-large-v3                                  | False             | 29.1        |
	| openai/whisper-large-v3                                  | True              | 37.9        |


	See the full table [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/raw/main/latency.csv).
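
A comparable measurement can be sketched as follows. This is not the benchmark code behind the table: `mean_runtime` is a hypothetical helper that averages wall-clock time over repeated calls, and the workload shown is a stand-in so that the snippet runs without a model.

```python
import statistics
import time
from typing import Callable


def mean_runtime(fn: Callable[[], object], n_runs: int = 5) -> float:
    """Return the mean wall-clock time in seconds of `fn` over `n_runs` calls."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


# Stand-in workload; with a loaded pipeline this could instead be
# `lambda: pipe("audio.mp3", return_timestamps=True)`.
elapsed = mean_runtime(lambda: sum(range(100_000)), n_runs=5)
print(f"mean over 5 runs: {elapsed:.4f}s")
```

When timing a real model, it may be worth running one untimed warm-up call first so that one-off initialization costs do not skew the average.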

	## Transformers Usage
	Kotoba-Whisper-v1.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
	install the latest version of Transformers, together with the pinned versions of stable-ts and punctuators used by the pipeline.

	```bash
	pip install --upgrade pip
	pip install --upgrade transformers accelerate torchaudio
	pip install stable-ts==2.16.0
	pip install punctuators==0.0.5
	```

	### Transcription
	The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
	class to transcribe audio files as follows:

	```python
	import torch
	from transformers import pipeline
	from datasets import load_dataset

	# config
	model_id = "kotoba-tech/kotoba-whisper-v1.1"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
	generate_kwargs = {"language": "japanese", "task": "transcribe"}

	# load model
	pipe = pipeline(
	    model=model_id,
	    torch_dtype=torch_dtype,
	    device=device,
	    model_kwargs=model_kwargs,
	    chunk_length_s=15,
	    batch_size=16,
	    trust_remote_code=True,
	    stable_ts=True,
	    punctuator=True
	)

	# load sample audio
	dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
	sample = dataset[0]["audio"]

	# run inference
	result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
	print(result)
	```

	- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
	```diff
	- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
	+ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)
	```

	- To deactivate stable-ts:
	```diff
	- stable_ts=True,
	+ stable_ts=False,
	```

	- To deactivate punctuator:
	```diff
	- punctuator=True,
	+ punctuator=False,
	```
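
With `return_timestamps=True`, the timestamps can be turned into subtitles. The sketch below assumes the output follows the standard Transformers ASR format, where `result["chunks"]` is a list of dicts with a `"timestamp"` pair `(start, end)` in seconds and a `"text"` string. Whether the custom pipeline returns exactly this structure is an assumption, and `to_srt_time` and `chunks_to_srt` are hypothetical helpers.

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: list) -> str:
    """Build SRT text from a list of {"timestamp": (start, end), "text": ...} dicts."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # guard: the final chunk may come without an end time
            end = start
        entries.append(
            f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hand-written stand-in for `result["chunks"]`; the timestamps are made up.
example_chunks = [
    {"timestamp": (0.0, 1.5), "text": "81歳、"},
    {"timestamp": (1.5, 4.25), "text": "力強い走りに変わってきます。"},
]
srt_text = chunks_to_srt(example_chunks)
print(srt_text)
```

With a real transcription, `chunks_to_srt(result["chunks"])` would take the place of the stand-in data, provided the output has the assumed shape.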

	### Transcription with Prompt
	Kotoba-whisper can be guided with a prompt when generating a transcription, as shown below:

	```python
	import re
	import torch
	from transformers import pipeline
	from datasets import load_dataset

	# config
	model_id = "kotoba-tech/kotoba-whisper-v1.1"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
	generate_kwargs = {"language": "japanese", "task": "transcribe"}

	# load model
	pipe = pipeline(
	    model=model_id,
	    torch_dtype=torch_dtype,
	    device=device,
	    model_kwargs=model_kwargs,
	    chunk_length_s=15,
	    batch_size=16,
	    trust_remote_code=True
	)

	# load sample audio
	dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")

	# --- Without prompt ---
	text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
	print(text)
	# 81歳、力強い走りに変わってきます。

	# --- With prompt ---: Let's change `81` to `91`.
	prompt = "91歳"
	generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
	text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
	# the ASR pipeline currently prepends the prompt to the transcription, so remove it
	# (re.escape keeps prompts containing regex metacharacters from being misread as patterns)
	text = re.sub(rf"\A\s{re.escape(prompt)}\s", "", text)
	print(text)
	# あっぶったでもスルガさん、91歳、力強い走りに変わってきます。
	```
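
The prompt-removal step can be factored into a small helper. The following is a minimal sketch: `strip_prompt` is a hypothetical name, and it is slightly more lenient than the one-liner above, since it uses `\s*` and therefore also accepts outputs with no whitespace, or several whitespace characters, around the prompt.

```python
import re


def strip_prompt(text: str, prompt: str) -> str:
    """Remove `prompt` from the start of `text`, tolerating surrounding whitespace."""
    return re.sub(rf"\A\s*{re.escape(prompt)}\s*", "", text)


# Only the leading copy of the prompt is removed; later occurrences are kept.
print(strip_prompt(" 91歳 あっぶったでもスルガさん、91歳、力強い走りに変わってきます。", "91歳"))
# Regex metacharacters in the prompt are matched literally.
print(strip_prompt(" a.b c", "a.b"))
# Text that merely resembles the prompt is left unchanged.
print(strip_prompt("axb c", "a.b"))
```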

	### Flash Attention 2
	We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
	if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

	```bash
	pip install flash-attn --no-build-isolation
	```

	Then pass `attn_implementation="flash_attention_2"` through `model_kwargs`, which `pipeline` forwards to `from_pretrained`:

	```diff
	- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
	+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
	```
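
If the same script has to run on machines with and without Flash Attention installed, the choice can be made at runtime. The sketch below is one possible approach: `pick_model_kwargs` is a hypothetical helper, and it only checks that the `flash_attn` package is importable, not that the GPU actually supports it.

```python
import importlib.util


def pick_model_kwargs(cuda_available: bool) -> dict:
    """Choose `model_kwargs` for `pipeline` based on what is installed."""
    if not cuda_available:
        return {}  # CPU: keep the library default
    if importlib.util.find_spec("flash_attn") is not None:
        return {"attn_implementation": "flash_attention_2"}
    return {"attn_implementation": "sdpa"}  # fall back to PyTorch SDPA


# In the examples above this would be called as
# `pick_model_kwargs(torch.cuda.is_available())`.
print(pick_model_kwargs(cuda_available=False))
```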


	## Acknowledgements
	* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
	* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
	* Hugging Face 🤗 for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
	* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech).