	---
	language:
	- ja
	license: apache-2.0
	tags:
	- generated_from_trainer
	datasets:
	- mozilla-foundation/common_voice_11_0
	metrics:
	- wer
	- cer
	model-index:
	- name: wav2vec2-base-japanese-asr
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Speech Recognition
	    dataset:
	      name: common_voice_11_0
	      type: common_voice
	      args: ja
	    metrics:
	    - type: wer
	      value: 14.177284
	      name: Test WER
	    - type: cer
	      value: 6.462501
	      name: Test CER
	  - task:
	      name: Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Reazonspeech
	      type: custom
	      args: ja
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 40.864413
	    - name: Test CER
	      type: cer
	      value: 29.367348
	---

	# wav2vec2-base-asr

	This model is a fine-tuned version of [rinna/japanese-wav2vec2-base](https://huggingface.co/rinna/japanese-wav2vec2-base), trained on the Japanese subset of the [common_voice_11_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/ja) for automatic speech recognition (ASR).

	This model outputs Hiragana only; it does not produce Kanji or Katakana.

	## Acknowledgments

	The fine-tuning approach was inspired by, and draws on, the training methodology used in [vumichien/wav2vec2-large-xlsr-japanese-hiragana](https://huggingface.co/vumichien/wav2vec2-large-xlsr-japanese-hiragana).

	## Training Procedure

	Fine-tuning on the common_voice_11_0 dataset produced the following results (WER is reported as a fraction, so 1.0 corresponds to 100%):

	| Step | Training Loss | Validation Loss | WER      |
	|------|---------------|-----------------|----------|
	| 1000 | 6.088100      | 3.452597        | 1.000000 |
	| 2000 | 2.816600      | 0.756278        | 0.263624 |
	| 3000 | 0.837600      | 0.471486        | 0.185915 |
	| 4000 | 0.624900      | 0.420854        | 0.159801 |
	| 5000 | 0.533300      | 0.392494        | 0.149141 |
	| 6000 | 0.490000      | 0.394669        | 0.144826 |
	| 7000 | 0.441600      | 0.379999        | 0.141807 |
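
For readers unfamiliar with the metric, WER is the word-level edit distance between the prediction and the reference, divided by the number of reference words; CER applies the same idea to characters. The following is a minimal pure-Python sketch of that definition. The helper names `edit_distance` and `error_rate` are illustrative assumptions and are not part of this model's training code, which relies on the `evaluate` library.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Edit distance divided by reference length, as a fraction.

    unit="word" splits on whitespace (WER); unit="char" compares
    characters (CER). Assumption of this sketch: spaces count as
    characters in the character-level case.
    """
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy example: one of three words is wrong.
reference = "きょう は はれ"
hypothesis = "きょう は あめ"
print(error_rate(reference, hypothesis, unit="word"))
print(error_rate(reference, hypothesis, unit="char"))
```

Multiplying the returned fraction by 100 gives a percentage.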

	### Training hyperparameters

	The following hyperparameters were used throughout fine-tuning:

	- learning_rate: 1e-4
	- train_batch_size: 16
	- eval_batch_size: 16
	- seed: 42
	- gradient_accumulation_steps: 2
	- num_train_epochs: 20
	- warmup_steps: 2000
	- lr_scheduler_type: linear
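
To make the schedule concrete, the sketch below computes the learning rate implied by `learning_rate`, `warmup_steps`, and `lr_scheduler_type: linear`: a linear ramp from 0 to the peak rate over the warmup steps, followed by a linear decay to 0. The total number of optimizer steps is not stated in this card, so `total_steps=8000` is a hypothetical placeholder, and the effective batch size assumes training on a single device.

```python
def linear_lr(step, peak_lr=1e-4, warmup_steps=2000, total_steps=8000):
    """Learning rate at a given optimizer step for linear warmup + linear decay.

    total_steps is a hypothetical value chosen for illustration only.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Samples per optimizer step: train_batch_size * gradient_accumulation_steps
# (assuming a single device).
effective_batch_size = 16 * 2

for step in (0, 1000, 2000, 5000, 8000):
    print(step, linear_lr(step))
```

With these settings the learning rate is still ramping up at the first logged checkpoint (step 1000) and reaches its peak at step 2000.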

	### How to evaluate the model

	```python
	import re

	import librosa
	import MeCab
	import pykakasi
	import torch
	import torchaudio
	from datasets import load_dataset
	from evaluate import load
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

	MODEL_ID = "TKU410410103/wav2vec2-base-japanese-asr"
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).to(device)
	model.eval()
	processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)

	# Load the Japanese test split and keep only the audio and transcript columns
	test_dataset = load_dataset('mozilla-foundation/common_voice_11_0', 'ja', split='test')
	remove_columns = [col for col in test_dataset.column_names if col not in ['audio', 'sentence']]
	test_dataset = test_dataset.remove_columns(remove_columns)

	# Resample every clip to the 16 kHz rate the model expects
	def process_waveforms(batch):
	    speech_arrays = []
	    sampling_rates = []

	    for audio in batch['audio']:
	        # Use the file's actual sampling rate instead of assuming 48 kHz
	        speech_array, orig_sr = torchaudio.load(audio['path'])
	        speech_array_resampled = librosa.resample(speech_array[0].numpy(), orig_sr=orig_sr, target_sr=16000)
	        speech_arrays.append(speech_array_resampled)
	        sampling_rates.append(16000)

	    batch["array"] = speech_arrays
	    batch["sampling_rate"] = sampling_rates

	    return batch

	# Normalize reference transcripts to Hiragana and strip punctuation
	CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
	                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
	                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
	                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
	                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "'", "ʻ", "ˆ"]
	chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

	# MeCab splits the sentence into words; pykakasi converts Kanji (J) and Katakana (K) to Hiragana (H)
	wakati = MeCab.Tagger("-Owakati")
	kakasi = pykakasi.kakasi()
	kakasi.setMode("J","H")
	kakasi.setMode("K","H")
	kakasi.setMode("r","Hepburn")
	conv = kakasi.getConverter()

	def prepare_char(batch):
	    batch["sentence"] = conv.do(wakati.parse(batch["sentence"]).strip())
	    batch["sentence"] = re.sub(chars_to_ignore_regex,'', batch["sentence"]).strip()
	    return batch


	resampled_eval_dataset = test_dataset.map(process_waveforms, batched=True, batch_size=50, num_proc=4)
	eval_dataset = resampled_eval_dataset.map(prepare_char, num_proc=4)

	# Begin the evaluation process
	wer = load("wer")
	cer = load("cer")

	def evaluate(batch):
	    inputs = processor(batch["array"], sampling_rate=16_000, return_tensors="pt", padding=True)
	    with torch.no_grad():
	        logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits
	    # Take the most likely token per frame, then decode to text
	    pred_ids = torch.argmax(logits, dim=-1)
	    batch["pred_strings"] = processor.batch_decode(pred_ids)
	    return batch

	columns_to_remove = [column for column in eval_dataset.column_names if column != "sentence"]
	batch_size = 16
	result = eval_dataset.map(evaluate, remove_columns=columns_to_remove, batched=True, batch_size=batch_size)

	wer_result = wer.compute(predictions=result["pred_strings"], references=result["sentence"])
	cer_result = cer.compute(predictions=result["pred_strings"], references=result["sentence"])

	# Metrics are returned as fractions; print them as percentages
	print("WER: {:.6f}%".format(100 * wer_result))
	print("CER: {:.6f}%".format(100 * cer_result))
	```
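
The `torch.argmax` and `processor.batch_decode` calls in the script above amount to greedy CTC decoding: take the most likely token for each audio frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates this on a toy example. The vocabulary, the blank id of 0, and the function name `ctc_greedy_decode` are hypothetical and do not reflect this model's actual tokenizer.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame predictions, then remove blanks."""
    # Step 1: merge runs of identical ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank token and map the remaining ids to text.
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Toy vocabulary (hypothetical); id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "あ", 2: "い", 3: "う"}

# Per-frame argmax ids. The blank between the two runs of 1 keeps them
# as two separate "あ" characters rather than merging them into one.
frames = [0, 1, 1, 0, 1, 2, 2, 0, 0, 3]
print(ctc_greedy_decode(frames, toy_vocab))  # prints "ああいう"
```

A real tokenizer does more than this, such as mapping a word-delimiter token to a space, so treat this as an illustration of the core idea only.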

	### Test results

	The final model was evaluated as follows:

	On common_voice_11_0:
	- WER: 14.177284%
	- CER: 6.462501%

	On reazonspeech (tiny):
	- WER: 40.864413%
	- CER: 29.367348%

	### Framework versions

	- Transformers 4.39.1
	- PyTorch 2.2.1+cu118
	- Datasets 2.17.1