Introduction

This repo contains a TorchScript model of Conformer CTC from NeMo.

See https://registry.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_small

The following code is used to obtain model.pt and tokens.txt:

import nemo.collections.asr as nemo_asr

m = nemo_asr.models.EncDecCTCModelBPE.from_pretrained('stt_en_conformer_ctc_small')
m.export("model.pt")

# Caution: We use 0 for blank here, while NeMo treats the last token as blank.
# For instance, when len(m.decoder.vocabulary) is 1024, NeMo treats
# ID 1024 as blank, but we treat 0 as blank.
with open('tokens.txt', 'w', encoding='utf-8') as f:
  f.write("<blk> 0\n")
  for i, s in enumerate(m.decoder.vocabulary):
    f.write(f"{s} {i+1}\n")
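
To make the file format concrete, here is a minimal sketch that reads such a tokens.txt back into an ID-to-symbol dictionary. The helper name load_tokens and the 3-symbol demo vocabulary are hypothetical and used only for illustration; the real file is produced by the code above.

```python
import os
import tempfile


def load_tokens(path):
    """Read a tokens.txt file and return a dict mapping ID -> symbol."""
    id2token = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last space: the ID is always the final field.
            symbol, idx = line.rsplit(" ", 1)
            id2token[int(idx)] = symbol
    return id2token


# Demo with a hypothetical 3-symbol vocabulary, written in the same
# "<symbol> <id>" format as above (blank first, then vocabulary shifted by 1).
demo_path = os.path.join(tempfile.mkdtemp(), "tokens.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("<blk> 0\n")
    for i, s in enumerate(["▁hello", "▁world", "s"]):
        f.write(f"{s} {i+1}\n")

id2token = load_tokens(demo_path)
print(id2token)
```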

Caution

The exported model takes log-filterbank features as input and does not include the preprocessor.

You can use the following code to replace the preprocessor:

import kaldifeat
opts = kaldifeat.FbankOptions()
opts.device = "cpu"
opts.frame_opts.dither = 0
opts.frame_opts.snip_edges = False
opts.frame_opts.samp_freq = 16000
opts.frame_opts.window_type = "povey"
opts.mel_opts.num_bins = 80

fbank = kaldifeat.Fbank(opts)

import torch
import torchaudio
samples, sample_rate = torchaudio.load("./test_wavs/0.wav")
assert sample_rate == 16000

features = fbank(samples[0])
mean = features.mean(dim=0, keepdim=True)
std = features.std(dim=0, keepdim=True)

features = (features - mean) / std
features = features.unsqueeze(0).permute(0, 2, 1)
# Note: features has shape (N, C, T)

model = torch.jit.load('model.pt')
logprob = model(features, torch.tensor([features.shape[2]]))
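
Because tokens.txt uses 0 for blank while NeMo uses the last ID, the IDs predicted by the model have to be shifted before they are looked up in tokens.txt. The following is a minimal sketch of greedy CTC decoding with that shift. It assumes the exported model keeps NeMo's ordering (blank in the last column) and that logprob has shape (N, T, vocab_size + 1); the function names and the tiny vocabulary are hypothetical.

```python
def nemo_id_to_token_id(nemo_id, vocab_size):
    """Map a NeMo output ID (blank == vocab_size) to a tokens.txt ID (blank == 0)."""
    return (nemo_id + 1) % (vocab_size + 1)


def ctc_greedy_decode(frame_ids, vocab_size):
    """Collapse repeated IDs, drop blanks, and return tokens.txt IDs."""
    blank = vocab_size  # assumption: NeMo puts blank at the last ID
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(nemo_id_to_token_id(i, vocab_size))
        prev = i
    return out


# Hypothetical example with a 3-symbol vocabulary, so NeMo's blank ID is 3.
# With the real model, frame_ids could be obtained with something like
#   frame_ids = logprob[0].argmax(dim=-1).tolist()
id2token = {0: "<blk>", 1: "▁hello", 2: "▁world", 3: "s"}
frame_ids = [3, 0, 0, 3, 0, 1, 1, 3, 2]

token_ids = ctc_greedy_decode(frame_ids, vocab_size=3)
# "▁" marks a word boundary in SentencePiece-style BPE tokens.
text = "".join(id2token[i] for i in token_ids).replace("▁", " ").strip()
print(token_ids, text)
```

Shifting every ID by one (modulo the output size) is equivalent to rolling the last dimension of logprob by one position, so either approach can be used as long as it is applied consistently.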