Vietnamese Speech Recognition using Wav2vec 2.0
Model Description
This is a Wav2vec2-based model fine-tuned on about 160 hours of Vietnamese speech from several sources, including VIVOS, COMMON VOICE, FOSD and VLSP 100h. We have not yet incorporated a language model (LM) into our ASR system, but the results are already promising.
Implementation
We also provide code for Pre-training and Fine-tuning the Wav2vec2 model. If you wish to train on your dataset, check it out here:
Benchmark WER Result
| | VIVOS (WER %) | COMMON VOICE 8.0 (WER %) |
|---|---|---|
| without LM | 15.05 | 10.78 |
| with LM | in progress | in progress |
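For readers unfamiliar with the metric, WER (word error rate) is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The following is a minimal sketch for a single utterance, not the implementation behind the numbers above; the function name `word_error_rate` and the sample sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from any test set):
# one substituted word and one missing word out of four reference words.
print(100 * word_error_rate("xin chào việt nam", "xin chao việt"))
```

Because the denominator is the reference length, a transcript with many inserted words can score above 100%.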
Example Usage
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import librosa
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)
model.eval()

def transcribe(wav):
    # wav: 1-D float array sampled at 16 kHz
    input_values = processor(wav, sampling_rate=16000, return_tensors="pt").input_values
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(input_values.to(device)).logits
    # greedy decoding: pick the most likely token for every frame
    pred_ids = torch.argmax(logits, dim=-1)
    pred_transcript = processor.batch_decode(pred_ids)[0]
    return pred_transcript

# librosa resamples the file to 16 kHz on load
wav, _ = librosa.load('path/to/your/audio/file', sr=16000)
print(f"transcript: {transcribe(wav)}")
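The `argmax` followed by `batch_decode` in the snippet above amounts to greedy CTC decoding: take the best token per frame, merge consecutive repeats, drop the blank token, and turn word delimiters into spaces. Here is a rough, library-free sketch of that idea; the toy frame sequence and the token names `<pad>` (blank) and `|` (word delimiter) are assumptions for illustration and may not match this model's actual vocabulary.

```python
from itertools import groupby

# Assumed special tokens for this illustration only.
BLANK = "<pad>"
WORD_DELIMITER = "|"


def ctc_greedy_decode(frame_tokens, blank=BLANK, word_delimiter=WORD_DELIMITER):
    """Collapse a per-frame token sequence into text, CTC style."""
    # 1. Merge runs of identical consecutive tokens.
    collapsed = [token for token, _ in groupby(frame_tokens)]
    # 2. Drop blanks. A blank between two identical letters keeps both.
    kept = [token for token in collapsed if token != blank]
    # 3. Join characters and map word delimiters to single spaces.
    text = "".join(kept).replace(word_delimiter, " ")
    return " ".join(text.split())


# Hypothetical per-frame predictions for a short utterance.
frames = ["<pad>", "x", "x", "i", "<pad>", "n", "|", "|",
          "c", "h", "<pad>", "à", "o", "o", "<pad>"]
print(ctc_greedy_decode(frames))
```

Greedy decoding uses no language model, which matches the "without LM" setting reported in the benchmark.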
Evaluation
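from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, load_metric, Audio
import torch
import re

# load the word error rate metric
# note: load_metric is deprecated in newer releases of datasets; evaluate.load("wer") is its replacement
wer = load_metric("wer")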
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)
model.eval()
# Load dataset
test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test", use_auth_token="your_huggingface_auth_token")
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
chars_to_ignore = r'[,?.!\-;:"“%\'�]' # ignore special characters
# preprocess data
def preprocess(batch):
    audio = batch["audio"]
    batch["input_values"] = audio["array"]
    batch["transcript"] = re.sub(chars_to_ignore, '', batch["sentence"]).lower()
    return batch
# run inference
def inference(batch):
    input_values = processor(batch["input_values"],
                             sampling_rate=16000,
                             return_tensors="pt").input_values
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_transcript"] = processor.batch_decode(pred_ids)
    return batch
test_dataset = test_dataset.map(preprocess)
result = test_dataset.map(inference, batched=True, batch_size=1)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_transcript"], references=result["transcript"])))
Test Result: 10.78%
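The reference transcripts are normalized before scoring: the characters in `chars_to_ignore` are removed and the text is lowercased, so that punctuation and casing are not counted as word errors. A small standalone illustration of that step, reusing the same pattern as the evaluation script; the function name `normalize_transcript` and the sample sentence are made up for this sketch.

```python
import re

# Same pattern as in the evaluation script above.
chars_to_ignore = r'[,?.!\-;:"“%\'�]'


def normalize_transcript(sentence: str) -> str:
    """Strip the ignored punctuation characters and lowercase the text."""
    return re.sub(chars_to_ignore, '', sentence).lower()


# Illustrative sentence, not taken from the dataset.
print(normalize_transcript("Xin chào, Việt Nam!"))
```

Note that the pattern only deletes characters. It does not collapse whitespace, so a punctuation mark surrounded by spaces leaves two adjacent spaces behind.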
Citation
@misc{Duy_Khanh_Finetune_Wav2vec_2_0_2022,
author = {Duy Khanh, Le},
doi = {10.5281/zenodo.6542357},
license = {CC-BY-NC-4.0},
month = {5},
title = {{Finetune Wav2vec 2.0 For Vietnamese Speech Recognition}},
url = {https://github.com/khanld/ASR-Wa2vec-Finetune},
year = {2022}
}
APA
Duy Khanh, L. (2022). Finetune Wav2vec 2.0 For Vietnamese Speech Recognition [Data set]. https://doi.org/10.5281/zenodo.6542357
Contact
Evaluation results
- Test WER on common-voice-vietnamese (self-reported): 10.780
- Test WER on VIVOS (self-reported): 15.050