metadata

language: zh
datasets:
  - aishell1
metrics:
  - wer
tags:
  - audio
  - automatic-speech-recognition
  - speech
  - xlsr-fine-tuning-week
license: apache-2.0
model-index:
  - name: XLSR Wav2Vec2 Large 53 - Chinese (zh-CN), by Yue Qin
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: AISHELL-1 zh-CN
          type: aishell1
          args: zh-CN
        metrics:
          - name: Test WER
            type: wer
            value: 7.04

Wav2Vec2-Large-XLSR-53-Chinese-zh-CN-aishell1

This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Mandarin Chinese using the AISHELL-1 dataset. When using this model, make sure that your speech input is sampled at 16 kHz.
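As a quick sanity check before inference, the sampling rate of a PCM WAV file can be read from its header with the standard library alone. The sketch below is illustrative only: the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model or of `transformers`.

```python
import wave

EXPECTED_RATE = 16000  # sampling rate (Hz) this model card asks for


def wav_sample_rate(path):
    """Return the sampling rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file's rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns True, the audio should be resampled first, for example by loading it with `librosa.load(path, sr=16000)` as in the usage code.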

Usage

The model can be used directly (without a language model) as follows:

import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

device = "cuda:0" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained(
    'qinyue/wav2vec2-large-xlsr-53-chinese-zn-cn-aishell1')
model = Wav2Vec2ForCTC.from_pretrained(
    'qinyue/wav2vec2-large-xlsr-53-chinese-zn-cn-aishell1').to(device)

filepath = 'test.wav'
audio, sr = librosa.load(filepath, sr=16000, mono=True)
# The keyword is `sampling_rate`; it lets the processor verify the input rate.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(inputs.input_values,
                   attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
pred_str = processor.decode(predicted_ids[0])

print(pred_str)
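The `torch.argmax` plus `processor.decode` steps above amount to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. The following is a minimal pure-Python sketch of that collapse, not the tokenizer's actual implementation. The vocabulary, the frame ids, and the choice of id 0 as the blank are made-up assumptions for illustration; in Wav2Vec2 the padding token serves as the CTC blank.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_token, blank_id=0):
    """Collapse per-frame argmax ids into text as greedy CTC decoding does."""
    # 1. Merge each run of identical consecutive ids into a single id.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank id, which separates genuine repeats and fills silence.
    kept = [token_id for token_id in merged if token_id != blank_id]
    return "".join(id_to_token[token_id] for token_id in kept)


# Hypothetical three-character vocabulary; id 0 plays the role of the blank.
toy_vocab = {1: "北", 2: "京", 3: "市"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3]
print(ctc_greedy_collapse(frames, toy_vocab))  # prints 北京市
```

A blank between two identical ids keeps both characters, so `[1, 0, 1]` decodes to two characters rather than one.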

Evaluation

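import numpy as np
from datasets import load_metric

# `processor` is the Wav2Vec2Processor loaded in the Usage section.
wer_metric = load_metric("wer")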

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids, spaces_between_special_tokens=True)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False, spaces_between_special_tokens=True)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

Results

Reference	Prediction
据伟业我爱我家市场研究院测算	据北业我爱我家市场研究院测算
七月北京公积金贷款成交量提升了百分之五	七月北京公积金贷款成交量提升了百分之五
培育门类丰富层次齐用的综合利用产业	培育门类丰富层资集业的综合利用产业
我们迎来了赶超发达国家的难得机遇	我们迎来了赶超发达国家的单得机遇
坚持基本草原保护制度	坚持基本草员保护制度
强化水生生态修复和建设	强化水生生态修复和建设
温州两男子为争女人驾奔驰宝马街头四次对撞	温州两男子为争女人架奔驰宝马接头四次对重
她表示应该是吃吃饭看电影之类的	他表示一的是吃吃饭看电影之理
加强畜禽遗传资源和农业野生植物资源保护	加强续紧遗传资源和农业野生职物资源保护
两人都是依赖电话沟通	两人都是依赖电话沟通
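To make the comparison in the table above concrete, here is a minimal sketch of an edit-distance-based error rate. It assumes errors are counted per character, a common convention for Mandarin; this card does not state how the text was tokenized for the `wer` metric, so the numbers from this sketch need not match the reported figure. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def char_error_rate(references, predictions):
    """Total character edits divided by total reference characters."""
    edits = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    total = sum(len(r) for r in references)
    return edits / total


# Two reference/prediction pairs taken from the table above.
refs = ["坚持基本草原保护制度", "强化水生生态修复和建设"]
preds = ["坚持基本草员保护制度", "强化水生生态修复和建设"]
print(char_error_rate(refs, preds))  # 1 substitution over 21 reference characters
```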

Test Result:

In the table below I report the Word Error Rate (WER) of the model on the AISHELL-1 test dataset.

Model	WER	WER-with-LM
qinyue/wav2vec2-large-xlsr-53-chinese-zn-cn-aishell1	7.04%	3.96%