NVIDIA FastConformer-Hybrid Large (arm)


This model transcribes Armenian speech into text without punctuation and capitalization. It is a "large" version of the FastConformer Transducer-CTC model with approximately 115M parameters. This hybrid model is trained with two losses: Transducer (default) and CTC. See the Model Architecture section and the NeMo documentation for complete architecture details.

NVIDIA NeMo: Training

To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend you install it after you have installed the latest PyTorch version.

pip install nemo_toolkit['all']

How to Use this Model

The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="mheryerznka/stt_arm_fastconformer_hybrid_large_no_pc")

Transcribing using Python

First, let's get a sample:

wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1Np_gMOeSac-Yc8GZ-yrq2xq9wsl7zT1_' -O hy_am-test-26-audio-audio.wav

Then simply do:

asr_model.transcribe(['hy_am-test-26-audio-audio.wav'])

Transcribing many audio files

Using Transducer mode inference:

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="mheryerznka/stt_arm_fastconformer_hybrid_large_no_pc" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"

Using CTC mode inference:

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="mheryerznka/stt_arm_fastconformer_hybrid_large_no_pc" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
 decoder_type="ctc"

Input

This model accepts 16,000 Hz mono-channel audio (WAV files) as input.
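
If your recordings are not already 16 kHz mono WAV, here is a minimal conversion sketch, assuming librosa and soundfile are available (both are pulled in by NeMo's ASR dependencies); the file names are placeholders:

# Convert an arbitrary recording to 16 kHz mono WAV before transcription.
# The input/output file names below are placeholders.
import librosa
import soundfile as sf

audio, sr = librosa.load("my_recording.mp3", sr=16000, mono=True)  # resample and downmix
sf.write("my_recording_16k_mono.wav", audio, 16000)

asr_model.transcribe(["my_recording_16k_mono.wav"])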

Output

This model provides transcribed speech as a string for a given audio sample.
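
A quick way to inspect the returned value (note that the exact return type varies across NeMo versions):

# Depending on the NeMo version, transcribe() may return a list of plain
# strings, a list of Hypothesis objects (text in the .text attribute), or a
# tuple of hypothesis lists for RNNT decoding.
output = asr_model.transcribe(["hy_am-test-26-audio-audio.wav"])
print(output)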

Model Architecture

FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with joint Transducer and CTC decoder loss. You may find more information on the details of FastConformer here: Fast-Conformer Model and about Hybrid Transducer-CTC training here: Hybrid Transducer-CTC.
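
Because both decoders share the same encoder, you can switch between them at inference time. A minimal sketch, assuming a recent NeMo release where change_decoding_strategy accepts a decoder_type argument:

# Switch the hybrid model between its two decoders for inference.
asr_model.change_decoding_strategy(decoder_type="rnnt")  # default Transducer decoder
asr_model.transcribe(["hy_am-test-26-audio-audio.wav"])

asr_model.change_decoding_strategy(decoder_type="ctc")   # auxiliary CTC decoder
asr_model.transcribe(["hy_am-test-26-audio-audio.wav"])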

Training

The NeMo toolkit was used to train the model for 50 epochs on A100 GPUs at Yerevan State University. The model was trained with this example script and this base config.
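
As a rough sketch (not the exact command used for this model), a fine-tuning run with NeMo's hybrid Transducer-CTC example script might look like the following; the script name follows the current NeMo repository layout, while the config directory, manifests, and tokenizer directory are placeholders you need to fill in:

python [NEMO_GIT_FOLDER]/examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \
 --config-path=<PATH TO CONFIG DIRECTORY> \
 --config-name=fastconformer_hybrid_transducer_ctc_bpe \
 model.train_ds.manifest_filepath=<TRAIN MANIFEST> \
 model.validation_ds.manifest_filepath=<VALIDATION MANIFEST> \
 model.tokenizer.dir=<TOKENIZER DIRECTORY> \
 trainer.max_epochs=50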

The training process also incorporated a technique called slimIPL (slim Iterative Pseudo-Labeling), a form of self-training with intermediate pseudo-labels. The slimIPL algorithm iteratively refines the model using high-confidence pseudo-labels generated on unlabeled YouTube data.
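
The following is an illustrative sketch of the iterative pseudo-labeling idea only, not the project's training code; every helper in it is a hypothetical placeholder:

# Illustrative pseudo-labeling loop in the spirit of slimIPL. All helpers and
# data containers here are hypothetical placeholders, not NeMo APIs.
from typing import List, Tuple

def transcribe_with_confidence(model, audio_paths: List[str]) -> List[Tuple[str, float]]:
    """Placeholder: return (hypothesis, confidence) pairs for unlabeled audio."""
    return [("", 0.0) for _ in audio_paths]

def retrain(model, labeled_data, pseudo_labeled_data):
    """Placeholder: run one training round on labeled plus pseudo-labeled data."""
    return model

def pseudo_label_loop(model, labeled_data, unlabeled_audio, rounds=3, threshold=0.9):
    for _ in range(rounds):
        hyps = transcribe_with_confidence(model, unlabeled_audio)
        # Keep only high-confidence pseudo-labels before mixing them with real labels.
        pseudo = [(path, text) for path, (text, conf) in zip(unlabeled_audio, hyps) if conf > threshold]
        model = retrain(model, labeled_data, pseudo)
    return model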

Datasets

The model in this collection is trained on a composite dataset comprising several hundred hours of Armenian speech.

Performance

The performance of automatic speech recognition models is measured using Word Error Rate (WER). This model was specifically designed to handle the complexities of the Armenian language. The following table summarizes its performance with the RNN-Transducer and CTC decoders, reported as WER.

On data without punctuation and capitalization (Transducer and CTC decoders)

| Vocabulary Size | MCV17 TEST (RNN-T) | MCV17 TEST (CTC) | Google FLEURS TEST (RNN-T) | Google FLEURS TEST (CTC) |
|---|---|---|---|---|
| 256 | 9.03 | 10.77 | 7.41 | 9.09 |
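
To check WER on your own Armenian test data, here is a minimal sketch using NeMo's word_error_rate metric; the file path and reference text are placeholders:

# Compute WER for this model on your own (audio, reference) pairs.
# word_error_rate() comes from NeMo; the path and reference are placeholders.
from nemo.collections.asr.metrics.wer import word_error_rate

references = ["<reference Armenian transcript>"]
hypotheses = asr_model.transcribe(["<path to matching wav>"])
# Adjust for your NeMo version if transcribe() returns Hypothesis objects.
texts = [h if isinstance(h, str) else h.text for h in hypotheses]
print(word_error_rate(hypotheses=texts, references=references))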

Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular it has not been trained on, especially Western Armenian. The model might also perform worse on accented speech.
