TKU410410103 committed on
Commit
de7c257
1 Parent(s): 3636af5

Update README.md

  1. README.md +148 -0
README.md CHANGED
@@ -1,3 +1,151 @@
  ---
  license: apache-2.0
+ tags:
+ - generated_from_trainer
+ metrics:
+ - wer
+ - cer
+ model-index:
+ - name: wav2vec2-base-japanese-asr
+   results:
+   - task:
+       name: Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: common_voice_11_0
+       type: common_voice
+       args: ja
+     metrics:
+     - name: Test WER
+       type: wer
+       value:
+     - name: Test CER
+       type: cer
+       value:
+ datasets:
+ - mozilla-foundation/common_voice_11_0
+ language:
+ - ja
  ---
+
+ # wav2vec2-base-japanese-asr
+
+ This model is a fine-tuned version of [rinna/japanese-wav2vec2-base](https://huggingface.co/rinna/japanese-wav2vec2-base), trained on the Japanese subset of the [common_voice_11_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/ja) for automatic speech recognition (ASR).
+
+ ## Acknowledgments
+
+ The fine-tuning approach is inspired by, and draws on, the training methodology used in [vumichien/wav2vec2-large-xlsr-japanese-hiragana](https://huggingface.co/vumichien/wav2vec2-large-xlsr-japanese-hiragana).
+
+ ## Training Procedure
+
+ The model was fine-tuned on the common_voice_11_0 dataset with the settings listed in this section.
+
+ ### Training hyperparameters
+
+ The following hyperparameters were held fixed throughout fine-tuning:
+
+ - learning_rate: 1e-4
+ - train_batch_size: 16
+ - eval_batch_size: 16
+ - seed: 42
+ - gradient_accumulation_steps: 2
+ - num_train_epochs: 30
+ - lr_scheduler_type: linear
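
As a rough illustration of what these settings imply, the sketch below computes the effective batch size (per-device batch size times gradient accumulation steps) and the learning rate under a linear decay schedule. The single-device setup, the total step count of 1000, and the absence of warmup are assumptions made for the example; the model card does not state them, and the function names are illustrative.

```python
# Values taken from the hyperparameter list above.
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 2


def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to each optimizer update."""
    return per_device * accumulation_steps * num_devices


def linear_lr(step: int, total_steps: int, base_lr: float, warmup_steps: int = 0) -> float:
    """Optional linear warmup, then linear decay from base_lr to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    # Assumed: one device and 1000 optimizer steps in total (hypothetical).
    print(effective_batch_size(TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))
    for step in (0, 500, 1000):
        print(step, linear_lr(step, total_steps=1000, base_lr=LEARNING_RATE))
```

With gradient accumulation, each optimizer update sees 32 samples per device even though only 16 are held in memory at a time.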
+
+ ### How to evaluate the model
+
+ ```python
+ import re
+
+ import librosa
+ import MeCab
+ import pykakasi
+ import torch
+ import torchaudio
+ from datasets import load_dataset
+ from evaluate import load
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ # Run on a GPU when one is available.
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+ model = Wav2Vec2ForCTC.from_pretrained("TKU410410103/wav2vec2-base-japanese-asr").to(device)
+ model.eval()
+ processor = Wav2Vec2Processor.from_pretrained("TKU410410103/wav2vec2-base-japanese-asr")
+
+ # Load the test split and keep only the audio and transcript columns.
+ test_dataset = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="test")
+ remove_columns = [col for col in test_dataset.column_names if col not in ["audio", "sentence"]]
+ test_dataset = test_dataset.remove_columns(remove_columns)
+
+ # Resample every clip to the 16 kHz rate the model expects.
+ def process_waveforms(batch):
+     speech_arrays = []
+     sampling_rates = []
+
+     for audio in batch["audio"]:
+         # Keep the first channel; resample from the file's own sampling rate.
+         speech_array, orig_sr = torchaudio.load(audio["path"])
+         speech_array_resampled = librosa.resample(speech_array[0].numpy(), orig_sr=orig_sr, target_sr=16000)
+         speech_arrays.append(speech_array_resampled)
+         sampling_rates.append(16000)
+
+     batch["array"] = speech_arrays
+     batch["sampling_rate"] = sampling_rates
+
+     return batch
+
+ # Punctuation and symbols stripped from the reference transcripts.
+ CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
+                    "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
+                    "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
+                    "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
+                    "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "'", "ʻ", "ˆ"]
+ chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"
+
+ # Segment with MeCab, then convert kanji and katakana to hiragana
+ # (this uses pykakasi's older setMode/getConverter interface).
+ wakati = MeCab.Tagger("-Owakati")
+ kakasi = pykakasi.kakasi()
+ kakasi.setMode("J", "H")
+ kakasi.setMode("K", "H")
+ kakasi.setMode("r", "Hepburn")
+ conv = kakasi.getConverter()
+
+ def prepare_char(batch):
+     batch["sentence"] = conv.do(wakati.parse(batch["sentence"]).strip())
+     batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).strip()
+     return batch
+
+
+ resampled_eval_dataset = test_dataset.map(process_waveforms, batched=True, batch_size=50, num_proc=4)
+ eval_dataset = resampled_eval_dataset.map(prepare_char, num_proc=4)
+
+ # Begin the evaluation process.
+ wer = load("wer")
+ cer = load("cer")
+
+ def evaluate(batch):
+     inputs = processor(batch["array"], sampling_rate=16_000, return_tensors="pt", padding=True)
+     with torch.no_grad():
+         logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits
+     # Greedy decoding: most likely token per frame, collapsed by the tokenizer.
+     pred_ids = torch.argmax(logits, dim=-1)
+     batch["pred_strings"] = processor.batch_decode(pred_ids)
+     return batch
+
+ columns_to_remove = [column for column in eval_dataset.column_names if column != "sentence"]
+ batch_size = 16
+ result = eval_dataset.map(evaluate, remove_columns=columns_to_remove, batched=True, batch_size=batch_size)
+
+ wer_result = wer.compute(predictions=result["pred_strings"], references=result["sentence"])
+ cer_result = cer.compute(predictions=result["pred_strings"], references=result["sentence"])
+
+ print("WER: {:.2f}%".format(100 * wer_result))
+ print("CER: {:.2f}%".format(100 * cer_result))
+ ```
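
The `torch.argmax` and `processor.batch_decode` steps in the script amount to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank (padding) token. The sketch below shows that collapse rule in plain Python. The toy vocabulary and the blank ID of 0 are assumptions for the example, not the model's actual tokenizer, and the function names are illustrative.

```python
from itertools import groupby

BLANK_ID = 0  # assumed blank/pad token ID for this toy example
TOY_VOCAB = {1: "こ", 2: "ん", 3: "に", 4: "ち", 5: "は"}  # hypothetical vocabulary


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive duplicate IDs, then remove blank tokens."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


def decode(frame_ids, vocab=TOY_VOCAB):
    """Map the collapsed ID sequence back to text."""
    return "".join(vocab[token_id] for token_id in ctc_greedy_collapse(frame_ids))


if __name__ == "__main__":
    # Per-frame argmax IDs, as torch.argmax(logits, dim=-1) would yield for one clip.
    frames = [0, 1, 1, 0, 2, 2, 2, 0, 3, 0, 4, 4, 5, 0]
    print(decode(frames))
```

A blank between two identical IDs keeps them as separate characters, which is how CTC represents genuinely doubled characters.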
+
+ ### Test results
+
+ The final model was evaluated as follows:
+
+ On common_voice_11_0:
+ - WER:
+ - CER:
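
For reference, WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the reference length (in words for WER, in characters for CER). The following is a minimal plain-Python sketch of that definition, not the implementation used by the `evaluate` library. The function names are illustrative, and counting spaces as characters in the CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance over reference word count (non-empty reference assumed)."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def char_error_rate(reference: str, prediction: str) -> float:
    """Character-level edit distance over reference length (non-empty reference assumed)."""
    return edit_distance(reference, prediction) / len(reference)


if __name__ == "__main__":
    print(word_error_rate("きょう は はれ", "きょう は あめ"))
    print(char_error_rate("こんにちは", "こんばんは"))
```

Because Japanese text has no spaces, WER depends on how the text is segmented into words (here, MeCab), while CER does not.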
+
+ ### Framework versions
+
+ - Transformers 4.39.1
+ - PyTorch 2.2.1+cu118
+ - Datasets 2.17.1