Wav2Vec2 base model trained on 3K hours of Vietnamese speech

The base model is pre-trained on 16 kHz sampled speech audio from a Vietnamese speech corpus containing 3K hours of spontaneous, read, and broadcast speech. When using the model, make sure that your speech input is also sampled at 16 kHz. Note that this model should be fine-tuned on a downstream task, such as Vietnamese automatic speech recognition.
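As a minimal sketch of the sampling-rate check described above, the snippet below reads the header of a PCM WAV file with Python's standard `wave` module and reports whether it is sampled at 16 kHz. The helper name `is_16khz_wav` is hypothetical and not part of this model's tooling. It only detects a mismatch, so resampling would still have to be done with an audio library of your choice.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model was pre-trained on


def is_16khz_wav(path):
    """Return True if the PCM WAV file at `path` is sampled at 16 kHz.

    Hypothetical helper for illustration; it reads only the file header.
    """
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == TARGET_SAMPLE_RATE
```

Files that fail this check should be resampled before they are passed to the model.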

Note: This model does not have a tokenizer, as it was pre-trained on audio alone. To use this model for speech recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out this blog for a more detailed explanation of how to fine-tune the model.
Facebook's Wav2Vec2 blog Paper
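One common way to create the tokenizer mentioned above is to build a character-level vocabulary from the training transcripts. The sketch below is an assumption about how that could look, not the author's procedure: the function name `build_vocab`, the `|` word delimiter, and the `[UNK]` and `[PAD]` tokens are illustrative choices. Text is normalized to Unicode NFC first, because Vietnamese diacritics can be encoded in composed or decomposed form.

```python
import json
import unicodedata


def build_vocab(transcripts):
    """Build a character-level vocabulary (char -> id) from transcripts.

    Illustrative sketch: the special tokens and delimiter are assumptions.
    """
    # Normalize so that composed and decomposed diacritics map to one symbol
    normalized = [unicodedata.normalize("NFC", t.lower()) for t in transcripts]
    chars = sorted(set("".join(normalized)))
    vocab = {char: idx for idx, char in enumerate(chars)}

    # Use a visible word delimiter instead of the space character
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")

    # Append special tokens for unknown characters and padding
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


vocab = build_vocab(["xin chào", "chào bạn"])
vocab_json = json.dumps(vocab, ensure_ascii=False)  # could be saved as vocab.json
```

The resulting mapping can then be written to a JSON file and loaded by whichever CTC tokenizer class you use for fine-tuning.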

Usage

See this notebook for more information on how to fine-tune the English pre-trained model.

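import torch
from transformers import Wav2Vec2Model

# Load the pre-trained encoder (no tokenizer or task-specific head)
model = Wav2Vec2Model.from_pretrained("dragonSwing/viwav2vec2-base-3k")
model.eval()

# Sanity check: one second of random values standing in for 16 kHz audio
inputs = torch.rand([1, 16000])
with torch.no_grad():
    outputs = model(inputs)
print(outputs.last_hidden_state.shape)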
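To interpret the output of the sanity check, it can help to know how many feature frames the convolutional encoder produces for a given number of input samples. The sketch below assumes this checkpoint uses the standard Wav2Vec2 base feature-encoder layout (seven convolution layers with the kernel sizes and strides listed in `CONV_LAYERS`); this is an assumption and can be verified against `model.config.conv_kernel` and `model.config.conv_stride`. The function name `num_output_frames` is hypothetical.

```python
# Assumed (kernel, stride) pairs of the standard Wav2Vec2 base feature encoder.
# Verify against model.config.conv_kernel and model.config.conv_stride.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_output_frames(num_samples, conv_layers=CONV_LAYERS):
    """Return the number of encoder frames for `num_samples` audio samples."""
    length = num_samples
    for kernel, stride in conv_layers:
        # Output length of an unpadded 1-D convolution
        length = (length - kernel) // stride + 1
    return length


frames = num_output_frames(16000)  # one second of 16 kHz audio
```

Under that assumption, one second of audio yields 49 frames, or roughly one frame per 20 ms.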