Predictions are cut off in random samples.

#1
by huks - opened

I am currently training a Citrinet-512 model. I copied the config from your example configs. During training and evaluation, a lot of sentences are randomly cut off in the predictions, e.g.:

[NeMo I 2024-06-02 22:59:01 wer_bpe:302] reference:so ließ es sich nicht ändern er blieb oberleutnant und um so lieber weil ihm herr bantes sein gewesener vormund längst den winzigen rest seines väterlichen erbteils ausgehändigt hatte und dieses längst schon zu allen heiden ausgewandert war
[NeMo I 2024-06-02 22:59:01 wer_bpe:303] predicted:so ließ es sich nicht ändern er blieb oberleutnant und um so lieber weil ihm herr bantes sein gewesener vormund längst winzigen rest seines väterlichen erbteils ausgehändigt hatte

This happens after just a few epochs (about 2) and doesn't vanish even after 90 epochs. It's not a dataset-specific issue; it occurs randomly across several datasets. It does not happen for every sample, but for a relevant share of the data, and it happens during both training and evaluation. I tried to figure out what the problem relates to, or whether any parameter could solve it, but couldn't pinpoint the cause. It seems to depend somewhat on length: it happens more often for longer sentences.

Have you encountered this problem as well? Were you able to solve it?
Furthermore, could you also upload your tokenizer to the model card? I would appreciate it a lot.

Neon AI org

The tokenizer is uploaded here

Without your training code and config I can't really help

But you can try to look here,
at the line max_duration: 16.7, so that you won't get a memory overflow
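
If you want to see how many of your samples hit that limit, a minimal sketch like the following can count them (it assumes a standard NeMo JSON-lines manifest with a duration field; the path and threshold are placeholders, adjust them to your setup):

import json

MANIFEST = "/train_cleaned.json"  # placeholder path
MAX_DURATION = 16.0               # placeholder threshold

total = too_long = 0
with open(MANIFEST, encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)  # one JSON object per manifest line
        total += 1
        if entry["duration"] > MAX_DURATION:
            too_long += 1

print(f"{too_long} of {total} samples exceed {MAX_DURATION}s and would be dropped")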

Below is the config.
I didn't change the training code; it's the standard NeMo code.

Are your training code and ONNX/SPM conversion code publicly available?
Can I find the tokenizer vocab and vocab.txt somewhere?

name: &name "Citrinet-512 training"

model:
  sample_rate: &sample_rate 16000
  log_prediction: true

  train_ds:
    manifest_filepath: "/train_cleaned.json"
    sample_rate: 16000
    batch_size: 32
    trim_silence: false
    max_duration: 16.0
    min_duration: 1.0
    shuffle: true
    use_start_end_token: false
    num_workers: 8
    pin_memory: true
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    bucketing_strategy: 'synced_randomized'
    bucketing_batch_size: null

  validation_ds:
    manifest_filepath: "/dev_cleaned.json"
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true

  test_ds:
    manifest_filepath: "/test_cleaned.json"
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true

  model_defaults:
    repeat: 5
    dropout: 0.1
    separable: true
    se: true
    se_context_size: -1
    kernel_size_factor: 1
    filters: 512
    enc_final: 640

  tokenizer:
    dir: "tokenizer_spe_unigram_v1024" # path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe)
    type: "bpe" # Can be either bpe or wpe

  preprocessor:
    target: 'nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor'
    sample_rate: 16000
    normalize: 'per_feature'
    window_size: 0.025
    window_stride: 0.01
    window: 'hann'
    features: &n_mels 80
    n_fft: 512
    frame_splicing: 1
    dither: 1e-05
    pad_to: 16
    stft_conv: false

  spec_augment:
    target: 'nemo.collections.asr.modules.SpectrogramAugmentation'
    freq_masks: 2
    time_masks: 5
    freq_width: 27
    time_width: 0.05

  encoder:
    target: nemo.collections.asr.modules.ConvASREncoder
    feat_in: *n_mels
    activation: relu
    conv_mask: true
    jasper:
      - filters: 512
        repeat: 1
        kernel: [5]
        stride: [1]
        dilation: [1]
        dropout: 0.0
        residual: false
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [11]
        stride: [2]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        stride_last: true
        residual_mode: "stride_add"
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [13]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [15]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [17]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [19]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [21]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [13]
        stride: [2]  # *stride
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        stride_last: true
        residual_mode: "stride_add"
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [15]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [17]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [19]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [21]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [23]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [25]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [25]
        stride: [2]  # stride
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        stride_last: true
        residual_mode: "stride_add"
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [27]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [29]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [31]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [33]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [35]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [37]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [39]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: ${model.model_defaults.enc_final}
        repeat: 1
        kernel: [41]
        stride: [1]
        dilation: [1]
        dropout: 0.0
        residual: false
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

  decoder:
    target: 'nemo.collections.asr.modules.ConvASRDecoder'
    feat_in: 640
    num_classes: 1024
    vocabulary: []

  optim:
    name: 'novograd'
    lr: 0.005
    betas: [0.8, 0.25]
    weight_decay: 0.0001
    sched:
      name: 'CosineAnnealing'
      warmup_steps: null
      warmup_ratio: 0.1
      min_lr: 1e-05
      last_epoch: -1

  target: 'nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE'
  nemo_version: '1.12.0'

  decoding:
    strategy: 'greedy'
    preserve_alignments: null
    compute_timestamps: null
    word_seperator: ' '
    ctc_timestamp_type: 'all'
    batch_dim_index: 0
    greedy:
      preserve_alignments: false
      compute_timestamps: false

trainer:
  devices: 2 # number of gpus
  max_epochs: 100
  max_steps: -1 # computed at runtime if not set
  num_nodes: 1
  accelerator: gpu
  strategy: auto
  accumulate_grad_batches: 1
  enable_checkpointing: True # Provided by exp_manager
  enable_progress_bar: True
  logger: false # Provided by exp_manager
  log_every_n_steps: 50 # Interval of logging.
  val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
  check_val_every_n_epoch: 1
  precision: 32
  sync_batchnorm: false
  benchmark: false # needs to be false for models with variable-length speech input as it slows down training

exp_manager:
  exp_dir: null
  name: *name
  create_tensorboard_logger: false
  create_checkpoint_callback: false
  create_mlflow_logger: true
  mlflow_logger_kwargs:
    experiment_name: "training-citrinet-512"
    tracking_uri: [removed, here would be the tracking uri]
  checkpoint_callback_params:
    monitor: "val_wer"
    mode: "min"
    save_top_k: 3
    always_save_nemo: True # not tested yet, found this in a nemo repo
  create_wandb_logger: false
  wandb_logger_kwargs:
    name: null
    project: null
    entity: null
  resume_if_exists: false
  resume_ignore_no_checkpoint: false
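
For reference, here is a minimal sketch to check that the config above parses and that the ${model.model_defaults.*} interpolations resolve (the filename is a placeholder; omegaconf ships as a NeMo dependency):

from omegaconf import OmegaConf

cfg = OmegaConf.load("citrinet_512.yaml")  # placeholder filename
print("encoder blocks:", len(cfg.model.encoder.jasper))
print(OmegaConf.to_yaml(cfg, resolve=True))  # resolves ${model.model_defaults.*}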

Neon AI org
edited Jun 4

A *.nemo file is just a renamed tar archive
Open it and you will find all the configs and files there
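
For example (a minimal sketch; the checkpoint filename is a placeholder):

import tarfile

with tarfile.open("model.nemo", "r") as tar:  # placeholder filename
    tar.list()                        # lists model_config.yaml, tokenizer files, weights
    tar.extractall("nemo_contents")   # unpack everything for inspection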

Neon AI org
edited Jun 4

The onnx/tokenizer.spm is the one you need
It's a SentencePiece tokenizer that NeMo simply calls tokenizer.model
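
You can confirm the two files are interchangeable by loading either one with the sentencepiece package directly (a minimal sketch; the sample text is arbitrary):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # or onnx/tokenizer.spm
print(sp.get_piece_size())                                     # vocabulary size
print(sp.encode_as_pieces("so ließ es sich nicht ändern"))     # subword pieces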

Neon AI org

Drop examples that are longer than 16 seconds
Or increase max_duration in the config

In my understanding, setting max_duration to 16 already filters out samples longer than 16 seconds.

I now used your tokenizer.model, vocab.txt and tokenizer.vocab.

What's the difference between your tokenizer.spm and tokenizer.model?

Thanks for helping out, by the way; I appreciate it a lot.

Neon AI org

Just renamed; .spm is the default extension for a SentencePiece model

NeonBohdan changed discussion status to closed

The problem isn't solved yet; I tried it with your tokenizer.

Do you have any idea what could be the reason?

Neon AI org

I don't think the problem is with the tokenizer
I've never seen anything like this

Debug your code; I think the audio is already trimmed by the time it is fed to the STT model
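
One way to test that hypothesis is to compare the duration recorded in the manifest against the actual length of each audio file (a minimal sketch, assuming the soundfile package and a NeMo-style manifest with audio_filepath and duration fields; the path is a placeholder):

import json
import soundfile as sf

with open("/train_cleaned.json", encoding="utf-8") as f:  # placeholder path
    for line in f:
        entry = json.loads(line)
        info = sf.info(entry["audio_filepath"])
        actual = info.frames / info.samplerate
        if actual < entry["duration"] - 0.1:  # file noticeably shorter than manifest claims
            print(f"{entry['audio_filepath']}: manifest {entry['duration']:.2f}s, file {actual:.2f}s")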
