metadata
language: ja
datasets:
  - common_voice
metrics:
  - wer
  - cer
model-index:
  - name: >-
      wav2vec2-xls-r-300m finetuned on Japanese Hiragana with no word boundaries
      by Hyungshin Ryu of SLPlab
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice Japanese
          type: common_voice
          args: ja
        metrics:
          - name: Test WER
            type: wer
            value: 90.66
          - name: Test CER
            type: cer
            value: 19.35

Wav2Vec2-XLS-R-300M-Japanese-Hiragana

This model is facebook/wav2vec2-xls-r-300m fine-tuned on Japanese Hiragana characters using the Common Voice and JSUT corpora. Output sentences contain no word boundaries. Audio input must be sampled at 16 kHz.
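Because the model expects 16 kHz audio, it can help to check a file's sample rate before running inference. The sketch below is a minimal illustration using only Python's standard `wave` module, so it applies to uncompressed WAV files only (Common Voice clips are distributed in a compressed format and need a loader such as torchaudio instead). The function name `needs_resampling` and the constant `EXPECTED_RATE` are hypothetical and not part of this model's code.

```python
import wave

# Sample rate the model card asks for (16 kHz).
EXPECTED_RATE = 16_000


def needs_resampling(wav_path, expected_rate=EXPECTED_RATE):
    """Return True when the WAV file's sample rate differs from expected_rate."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != expected_rate
```

If the function returns True, the audio should be resampled to 16 kHz first, for example with `torchaudio.transforms.Resample` as in the evaluation script in the Usage section.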

Usage

The model can be used directly as follows:

!pip install mecab-python3
!pip install unidic-lite
!pip install pykakasi


import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset, load_metric
import pykakasi
import MeCab
import re


# load datasets, processor, and model
test_dataset = load_dataset("common_voice", "ja", split="test")
wer = load_metric("wer")
cer = load_metric("cer")
PTM = "slplab/wav2vec2-xls-r-300m-japanese-hiragana"
print("PTM:", PTM)
processor = Wav2Vec2Processor.from_pretrained(PTM)
model = Wav2Vec2ForCTC.from_pretrained(PTM)
# fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)


# preprocess datasets
wakati = MeCab.Tagger("-Owakati")
kakasi = pykakasi.kakasi()
chars_to_ignore_regex = "[、,。]"

def speech_file_to_array_fn_hiragana_nospace(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).strip()
    batch["sentence"] = ''.join([d['hira'] for d in kakasi.convert(batch["sentence"])])
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    batch["speech"] = resampler(speech_array).squeeze()

    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn_hiragana_nospace)


# evaluate
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
for i in range(10):
    print("="*20)
    print("Prd:", result[i]["pred_strings"])
    print("Ref:", result[i]["sentence"])
    
# "{:.2f}" prints two decimal places ("{:2f}" only sets a minimum field width)
print("WER: {:.2f}%".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
print("CER: {:.2f}%".format(100 * cer.compute(predictions=result["pred_strings"], references=result["sentence"])))

| Original Text | Prediction |
| --- | --- |
| この料理は家庭で作れます。 | このりょうりはかていでつくれます |
| 日本人は、決して、ユーモアと無縁な人種ではなかった。 | にっぽんじんはけしてゆうもあどむえんなじんしゅではなかった |
| 木村さんに電話を貸してもらいました。 | きむらさんにでんわおかしてもらいました |
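
The predictions above come from the greedy decoding step in the evaluation script: `torch.argmax` picks the most likely token for every audio frame, and `processor.batch_decode` then collapses repeated tokens and drops the padding token, which serves as the CTC blank. The following is a minimal pure-Python sketch of that collapsing rule, not the library's implementation. The toy vocabulary, the blank id of 0, and the function name `ctc_greedy_decode` are assumptions made for illustration.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse consecutive repeats, then remove blank tokens (greedy CTC)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it differs from the frame before it.
        if token_id != previous:
            collapsed.append(token_id)
        previous = token_id
    # Blanks separate genuine repeats, so they are removed last.
    return "".join(id_to_token[t] for t in collapsed if t != blank_id)


# Hypothetical vocabulary; the real model has its own token ids.
TOY_VOCAB = {0: "<pad>", 1: "か", 2: "て", 3: "い"}

# Per-frame argmax ids for a short stretch of audio.
frames = [0, 1, 1, 0, 2, 2, 3, 0, 0]
print(ctc_greedy_decode(frames, TOY_VOCAB))  # prints かてい
```

A blank between two identical tokens keeps both of them, so `[1, 0, 1]` decodes to `かか` rather than `か`.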

Test Results

WER: 90.66%, CER: 19.35%
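
The large gap between WER and CER is most likely a consequence of the missing word boundaries: a word-level metric that splits on whitespace sees each sentence as a single word, so one wrong character makes the whole sentence count as an error. The sketch below illustrates this with a small edit-distance implementation. It is not the code behind the `wer` and `cer` metrics used above, the helper names are hypothetical, and the hiragana reference for the second sentence is an assumed transcription.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit errors divided by total reference tokens."""
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return errors / total


# References (second one is an assumed hiragana transcription) and predictions.
refs = ["このりょうりはかていでつくれます", "きむらさんにでんわをかしてもらいました"]
hyps = ["このりょうりはかていでつくれます", "きむらさんにでんわおかしてもらいました"]

wer_value = error_rate(refs, hyps, str.split)  # whitespace tokens: one per sentence
cer_value = error_rate(refs, hyps, list)       # one token per character

print("WER: {:.2f}%".format(100 * wer_value))
print("CER: {:.2f}%".format(100 * cer_value))
```

In this toy case a single substituted character (を predicted as お) gives a WER of 50.00% but a CER of only 2.86%, which is why CER is the more informative number for this model.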

Training

The model was trained on JSUT and on the train and validation splits of Common Voice Japanese. It was evaluated on the test split of Common Voice Japanese.