#####################
ASR Language Modeling
#####################

Language models have been shown to improve the accuracy of ASR models. NeMo supports the following two approaches to incorporate language models into ASR models:

* :ref:`ngram_modeling`
* :ref:`neural_rescoring`

It is possible to use both approaches on the same ASR model.

.. _ngram_modeling:

************************
N-gram Language Modeling
************************

In this approach, an N-gram LM is trained on text data, then it is used in fusion with beam search decoding to find the
best candidates. The beam search decoders in NeMo support language models trained with the KenLM library
(`https://github.com/kpu/kenlm <https://github.com/kpu/kenlm>`__).

The beam search decoders and KenLM library are not installed by default in NeMo; you need to install them to be
able to use beam search decoding and N-gram LM. Please refer to
`scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh` for instructions on how to install them.

NeMo supports both character-based and BPE-based models for N-gram LMs. An N-gram LM can be used with beam search
decoders on top of ASR models to produce more accurate candidates. The beam search decoder incorporates
the scores produced by the N-gram LM into its score calculation as follows:

.. code-block::

    final_score = acoustic_score + beam_alpha*lm_score + beam_beta*seq_length

where `acoustic_score` is the score predicted by the acoustic encoder and `lm_score` is the one estimated by the LM.
The parameter `beam_alpha` specifies the amount of importance to place on the N-gram language model, and `beam_beta`
is a penalty term that accounts for sequence length in the scores. A larger alpha places more importance on the LM
and less on the acoustic model. Negative values for beta penalize longer sequences and make the decoder
prefer shorter predictions, while positive values favor longer candidates.
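
As a rough illustration of how these weights interact, the following sketch combines the three terms for two
hypothetical candidates with made-up scores (the names and numbers are illustrative only, not taken from the decoder):

.. code-block:: python

    # Illustrative only: made-up scores for two hypothetical candidates.
    beam_alpha, beam_beta = 1.0, -0.5  # negative beta penalizes longer sequences

    candidates = [
        # (text, acoustic_score, lm_score)
        ("the cat sat", -4.2, -9.1),
        ("the cat sat down", -4.0, -8.5),
    ]

    for text, acoustic_score, lm_score in candidates:
        seq_length = len(text.split())  # the decoder may measure length differently
        final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length
        print(f"{text!r}: final_score = {final_score:.2f}")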

Train N-gram LM
===============

The script to train an N-gram language model with KenLM can be found at
`scripts/asr_language_modeling/ngram_lm/train_kenlm.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/train_kenlm.py>`__.

This script trains an N-gram language model with the KenLM library, which can then be used with the beam search decoders
on top of ASR models. The script supports both character-level and BPE-level encodings and models; the encoding type
is detected automatically from the type of the model.

You may train the N-gram model as follows:

.. code-block::

    python train_kenlm.py --nemo_model_file <path to the .nemo file of the model> \
        --train_file <path to the training text or JSON manifest file> \
        --kenlm_bin_path <path to the bin folder of KenLM library> \
        --kenlm_model_file <path to store the binary KenLM model> \
        --ngram_length <order of N-gram model> \
        --preserve_arpa

The train file specified by `--train_file` can be a text file or a JSON manifest. If the file's extension is anything
other than `.json`, the data is assumed to be plain text. In plain-text format, each line should contain one
sample. A JSON manifest file should contain one JSON-formatted sample per line, like this:

.. code-block::

    {"audio_filepath": "/data_path/file1.wav", "text": "The transcript of the audio file."}

The script extracts the `text` field from each line to create the training text file. After the N-gram model is trained,
it is stored at the path specified by `--kenlm_model_file`.
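
Conceptually, the extraction step is equivalent to the following sketch (the actual logic lives inside
``train_kenlm.py``; the file names here are placeholders):

.. code-block:: python

    import json

    # Pull the `text` field out of a JSON manifest, one sample per line,
    # to build a plain-text corpus for KenLM training.
    with open("manifest.json", encoding="utf-8") as fin, \
            open("train_text.txt", "w", encoding="utf-8") as fout:
        for line in fin:
            sample = json.loads(line)
            fout.write(sample["text"] + "\n")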

The following is the list of the arguments for the training script:

+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| **Argument**     | **Type** | **Default** | **Description**                                                                                 |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| nemo_model_file  | str      | Required    | The path of the `.nemo` file of the ASR model. It is needed to extract the tokenizer.           |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| train_file       | str      | Required    | Path to the training file; it can be a text file or a JSON manifest.                            |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| kenlm_model_file | str      | Required    | The path to store the KenLM binary model file.                                                  |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| kenlm_bin_path   | str      | Required    | The path to the bin folder of KenLM. It is a folder named `bin` under where KenLM is installed. |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| ngram_length**   | int      | Required    | Specifies the order of the N-gram LM.                                                           |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| do_lower_case    | bool     | ``False``   | Whether to make the training text all lower case.                                               |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| preserve_arpa    | bool     | ``False``   | Whether to preserve the intermediate ARPA file after construction of the BIN file.              |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+

** Note: it is recommended to use 6 as the order of the N-gram model for BPE-based models. Higher orders may require re-compilation of KenLM to support them.

Evaluate by Beam Search Decoding and N-gram LM
==============================================

NeMo's beam search decoders are capable of using KenLM's N-gram models to find the best candidates.
The script to evaluate an ASR model with beam search decoding and N-gram models can be found at
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__.

This script has a large number of possible argument overrides; therefore, it is advised to use ``python eval_beamsearch_ngram.py --help`` to see the full list of arguments.

You may evaluate an ASR model as follows:

.. code-block::

    python eval_beamsearch_ngram.py nemo_model_file=<path to the .nemo file of the model> \
        input_manifest=<path to the evaluation JSON manifest file> \
        kenlm_model_file=<path to the binary KenLM model> \
        beam_width=[<list of the beam widths, separated with commas>] \
        beam_alpha=[<list of the beam alphas, separated with commas>] \
        beam_beta=[<list of the beam betas, separated with commas>] \
        preds_output_folder=<optional folder to store the predictions> \
        probs_cache_file=null \
        decoding_mode=beamsearch_ngram \
        decoding_strategy="<Beam library such as beam, pyctcdecode or flashlight>"

It can evaluate a model in the following three modes by setting the argument `decoding_mode`:

* greedy: Only greedy decoding is done; no beam search decoding is performed.
* beamsearch: Beam search decoding is done, but without using the N-gram language model; final results are equivalent to setting the weight of the LM (`beam_alpha`) to zero.
* beamsearch_ngram: Beam search decoding is done with an N-gram LM.

The `beamsearch` mode evaluates by beam search decoding without any language model.
It reports the performance in terms of Word Error Rate (WER) and Character Error Rate (CER). Moreover,
the WER/CER obtained when the best candidate is selected among all the candidates is also reported as the best WER/CER.
It can be an indicator of how good the predicted candidates are.

The script first loads the ASR model and predicts the outputs of the model's encoder as log probabilities.
This part is computed in batches on a device selected by `device`, which can be CPU (`device=cpu`) or a
single GPU (`device=cuda:0`). The batch size of this part can be specified via `acoustic_batch_size`. Using
the largest feasible batch size speeds up the calculation of the log probabilities. You may also set `use_amp=true`
to speed up the calculation of the log probabilities and make it possible to use larger values for `acoustic_batch_size`.
Currently, multi-GPU is not supported for calculating the log probabilities, but using `probs_cache_file` can help.
It stores the log probabilities produced by the model's encoder in a pickle file so that on the next run the first step
can be skipped.
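
The caching behaves conceptually like the sketch below, where ``compute_log_probs`` is a hypothetical stand-in
for the model's encoder forward pass, not a NeMo API:

.. code-block:: python

    import os
    import pickle

    probs_cache_file = "logprobs.pkl"

    if os.path.exists(probs_cache_file):
        # Subsequent runs skip the expensive forward pass.
        with open(probs_cache_file, "rb") as f:
            all_log_probs = pickle.load(f)
    else:
        all_log_probs = compute_log_probs(model, manifest)  # hypothetical helper
        with open(probs_cache_file, "wb") as f:
            pickle.dump(all_log_probs, f)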

The following is the list of the important arguments for the evaluation script:

+---------------------+----------+------------------+-------------------------------------------------------------------------+
| **Argument**        | **Type** | **Default**      | **Description**                                                         |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| nemo_model_file     | str      | Required         | The path of the `.nemo` file of the ASR model to extract the tokenizer. |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| input_manifest      | str      | Required         | Path to the evaluation JSON manifest file.                              |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| kenlm_model_file    | str      | Required         | The path of the KenLM binary model file.                                |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| preds_output_folder | str      | None             | The path to an optional folder to store the predictions.                |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| probs_cache_file    | str      | None             | The cache file for storing the outputs of the model.                    |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| acoustic_batch_size | int      | 16               | The batch size to calculate log probabilities.                          |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| use_amp             | bool     | False            | Whether to use AMP if available to calculate log probabilities.         |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| device              | str      | cuda             | The device to load the model onto to calculate log probabilities.       |
|                     |          |                  | It can be `cpu`, `cuda`, `cuda:0`, `cuda:1`, ...                        |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| decoding_mode       | str      | beamsearch_ngram | The decoding scheme to be used for evaluation.                          |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| beam_width          | float    | Required         | List of the beam widths of the beam search decoding.                    |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| beam_alpha          | float    | Required         | List of the alpha parameters for the beam search decoding.              |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| beam_beta           | float    | Required         | List of the beta parameters for the beam search decoding.               |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| beam_batch_size     | int      | 128              | The batch size to be used for beam search decoding.                     |
|                     |          |                  | A larger batch size can be a little faster, but uses more memory.       |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| decoding_strategy   | str      | beam             | String argument for the type of decoding strategy for the model.        |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| decoding            | Dict     | BeamCTC          | Subdict of beam search configs. Values can be found via                 |
|                     | Config   | InferConfig      | ``python eval_beamsearch_ngram.py --help``                              |
+---------------------+----------+------------------+-------------------------------------------------------------------------+

The width of the beam search (`beam_width`) specifies the number of top candidates/predictions the beam search decoder
searches for. Larger beams result in more accurate but slower predictions.

.. note::

    The ``eval_beamsearch_ngram.py`` script contains the entire subconfig used for CTC Beam Decoding.
    Therefore it is possible to forward arguments for various beam search libraries such as ``flashlight``
    and ``pyctcdecode`` via the ``decoding`` subconfig.

There is also a tutorial to learn more about evaluating ASR models with N-gram LMs here:
`Offline ASR Inference with Beam Search and External Language Model Rescoring <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/Offline_ASR.ipynb>`_

Beam Search Engines
-------------------

NeMo ASR CTC supports multiple beam search engines for decoding. The default engine is ``beam``, which uses the
OpenSeq2Seq decoding library.

OpenSeq2Seq (``beam``)
~~~~~~~~~~~~~~~~~~~~~~

A CPU-based beam search engine that is quite efficient and supports char and subword models. It requires a
character/subword KenLM model to be provided.

The config for this decoding library is described above.

Flashlight (``flashlight``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Flashlight is a C++ library for ASR decoding provided at `https://github.com/flashlight/flashlight <https://github.com/flashlight/flashlight>`_. It is a CPU- and CUDA-based beam search engine that is quite efficient and supports
char and subword models. It requires an ARPA KenLM file.

It supports several advanced features such as lexicon-based / lexicon-free decoding, beam pruning thresholds, and more.

.. code-block:: python

    import math
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FlashlightConfig:
        lexicon_path: Optional[str] = None
        beam_size_token: int = 16
        beam_threshold: float = 20.0
        unk_weight: float = -math.inf
        sil_weight: float = 0.0
        unit_lm: bool = False

.. code-block::

    # Lexicon-based decoding
    python eval_beamsearch_ngram.py ... \
        decoding_strategy="flashlight" \
        decoding.beam.flashlight_cfg.lexicon_path='/path/to/lexicon.lexicon' \
        decoding.beam.flashlight_cfg.beam_size_token=32 \
        decoding.beam.flashlight_cfg.beam_threshold=25.0

    # Lexicon-free decoding
    python eval_beamsearch_ngram.py ... \
        decoding_strategy="flashlight" \
        decoding.beam.flashlight_cfg.beam_size_token=32 \
        decoding.beam.flashlight_cfg.beam_threshold=25.0

PyCTCDecode (``pyctcdecode``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PyCTCDecode is a Python library for ASR decoding provided at `https://github.com/kensho-technologies/pyctcdecode <https://github.com/kensho-technologies/pyctcdecode>`_. It is a CPU-based beam search engine that is somewhat efficient for a pure Python library, and supports char and subword models. It requires a character/subword KenLM ARPA or BINARY model to be provided.

It has advanced features such as word boosting, which can be useful for transcript customization.

.. code-block:: python

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class PyCTCDecodeConfig:
        beam_prune_logp: float = -10.0
        token_min_logp: float = -5.0
        prune_history: bool = False
        hotwords: Optional[List[str]] = None
        hotword_weight: float = 10.0

.. code-block::

    # PyCTCDecode decoding
    python eval_beamsearch_ngram.py ... \
        decoding_strategy="pyctcdecode" \
        decoding.beam.pyctcdecode_cfg.beam_prune_logp=-10.0 \
        decoding.beam.pyctcdecode_cfg.token_min_logp=-5.0 \
        decoding.beam.pyctcdecode_cfg.hotwords=[<List of str words>] \
        decoding.beam.pyctcdecode_cfg.hotword_weight=10.0

Hyperparameter Grid Search
--------------------------

Beam search decoding with an N-gram LM has three main hyperparameters: `beam_width`, `beam_alpha`, and `beam_beta`.
The accuracy of the model depends on the values of these parameters, especially `beam_alpha` and `beam_beta`.
You may specify a single value or a list of values for each of these parameters to perform a grid search. The script
performs beam search decoding on all combinations of these three hyperparameters.
For instance, the following set of parameters would result in 2*1*2=4 beam search decodings:

.. code-block::

    python eval_beamsearch_ngram.py ... \
        beam_width=[64,128] \
        beam_alpha=[1.0] \
        beam_beta=[1.0,0.5]
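
The grid expansion itself is simply the Cartesian product of the three lists, as in this sketch:

.. code-block:: python

    from itertools import product

    beam_widths = [64, 128]
    beam_alphas = [1.0]
    beam_betas = [1.0, 0.5]

    # 2 * 1 * 2 = 4 combinations, each triggering one beam search decoding run.
    for width, alpha, beta in product(beam_widths, beam_alphas, beam_betas):
        print(f"beam_width={width}, beam_alpha={alpha}, beam_beta={beta}")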

.. _neural_rescoring:

****************
Neural Rescoring
****************

In this approach, a neural network is used to score candidates. A candidate is a text transcript predicted by the decoder of the ASR model.
The top K candidates produced by beam search decoding (with a beam width of K) are given to a neural language model to rank them.
The language model assigns a score to each candidate, and this score is usually combined with the scores from
the beam search decoding to produce the final scores and rankings.

Train Neural Rescorer
=====================

An example script to train such a language model with Transformer can be found at `examples/nlp/language_modeling/transformer_lm.py <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/language_modeling/transformer_lm.py>`__.
It trains a ``TransformerLMModel`` which can be used as a neural rescorer for an ASR system. Full documentation on language model training is available at:

:doc:`../nlp/language_modeling`

You may also use a pretrained language model from the HuggingFace library, such as Transformer-XL or GPT, instead of training your own model.
Models like BERT and RoBERTa are not supported by this script because they are trained as masked language models and are not efficient or effective at scoring sentences out of the box.
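
As a rough sketch of why causal models work out of the box here, the snippet below scores a sentence by its
average per-token log-likelihood under ``gpt2`` using the HuggingFace ``transformers`` library (this only
illustrates the scoring idea; it is not what ``eval_neural_rescorer.py`` does internally):

.. code-block:: python

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def sentence_score(text: str) -> float:
        """Average per-token log-likelihood of `text` under the LM (higher is better)."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # With labels == input_ids, the model returns the mean
            # cross-entropy over the shifted tokens.
            loss = model(**inputs, labels=inputs["input_ids"]).loss
        return -loss.item()

    print(sentence_score("the cat sat on the mat"))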

Evaluation
==========

Given a trained TransformerLMModel `.nemo` file or a pretrained HF model, the script available at
`scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py>`__
can be used to re-score beams obtained with an ASR model. To use this script, you need a `.tsv` file containing the candidates produced
by the acoustic model and the beam search decoding. The candidates can be the result of beam search
decoding alone or of fusion with an N-gram LM. You may generate this file by specifying `preds_output_folder` for
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__.

The neural rescorer re-scores the beams/candidates using two parameters, `rescorer_alpha` and `rescorer_beta`, as follows:

.. code-block::

    final_score = beam_search_score + rescorer_alpha*neural_rescorer_score + rescorer_beta*seq_length

The parameter `rescorer_alpha` specifies the amount of importance to place on the neural rescorer model, and `rescorer_beta` is
a penalty term that accounts for sequence length in the scores. They have effects similar to those of the parameters
`beam_alpha` and `beam_beta` of the beam search decoder and N-gram LM.
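
Conceptually, the re-ranking reduces to the sketch below over a list of (candidate, beam search score, rescorer
score) triples; all names and numbers here are illustrative:

.. code-block:: python

    rescorer_alpha, rescorer_beta = 1.0, 0.0

    # (candidate text, beam search score, neural rescorer score) -- made-up values
    beams = [
        ("the cat sat on the mat", -12.3, -18.7),
        ("the cat sat on a mat", -12.1, -20.2),
    ]

    def final_score(beam_search_score, neural_rescorer_score, seq_length):
        return (beam_search_score
                + rescorer_alpha * neural_rescorer_score
                + rescorer_beta * seq_length)

    best = max(beams, key=lambda b: final_score(b[1], b[2], len(b[0].split())))
    print(best[0])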

You may follow these steps to evaluate a neural LM:

#. Obtain a `.tsv` file with the beams and their corresponding scores. Scores can be from a regular beam search decoder or
   from fusion with N-gram LM scores. For a given beam size `beam_size` and a number of examples
   for evaluation `num_eval_examples`, the file should contain (`num_eval_examples` x `beam_size`) lines of the
   form `beam_candidate_text \t score`. This file can be generated by `scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__.

#. Rescore the candidates using `scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py>`__:

.. code-block::

    python eval_neural_rescorer.py \
        --lm_model=[path to .nemo file of the LM or the name of a HF pretrained model] \
        --beams_file=[path to beams .tsv file] \
        --beam_size=[size of the beams] \
        --eval_manifest=[path to eval manifest .json file] \
        --batch_size=[batch size used for inference on the LM model] \
        --alpha=[the value for the parameter rescorer_alpha] \
        --beta=[the value for the parameter rescorer_beta] \
        --scores_output_file=[the optional path to store the rescored candidates]

The candidates along with their new scores are stored in the file specified by `--scores_output_file`.

The following is the list of the arguments for the evaluation script:

+---------------------+--------+------------------+-------------------------------------------------------------------------+
| **Argument**        |**Type**| **Default**      | **Description**                                                         |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| lm_model            | str    | Required         | The path of the '.nemo' file of the LM model, or the name of a          |
|                     |        |                  | HuggingFace pretrained model like 'transfo-xl-wt103' or 'gpt2'          |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| eval_manifest       | str    | Required         | Path to the evaluation manifest file (.json manifest file)              |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| beams_file          | str    | Required         | Path to the beams file (.tsv) containing the candidates and their scores|
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| beam_size           | int    | Required         | The width of the beams (number of candidates) generated by the decoder  |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| alpha               | float  | None             | The value for the parameter rescorer_alpha.                             |
|                     |        |                  | Not passing a value enables a linear search for rescorer_alpha.         |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| beta                | float  | None             | The value for the parameter rescorer_beta.                              |
|                     |        |                  | Not passing a value enables a linear search for rescorer_beta.          |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| batch_size          | int    | 16               | The batch size used to calculate the scores                             |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| max_seq_length      | int    | 512              | Maximum sequence length (in tokens) for the input                       |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| scores_output_file  | str    | None             | The optional file to store the rescored beams                           |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| use_amp             | bool   | ``False``        | Whether to use AMP if available to calculate the scores                 |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| device              | str    | cuda             | The device to load the LM model onto to calculate the scores            |
|                     |        |                  | It can be 'cpu', 'cuda', 'cuda:0', 'cuda:1', ...                        |
+---------------------+--------+------------------+-------------------------------------------------------------------------+

Hyperparameter Linear Search
----------------------------

This script also supports linear search for the parameters `alpha` and `beta`. If either of the two is not
provided, a linear search is performed to find the best value for that parameter. When linear search is used,
`beta` is initially set to zero and the best value for `alpha` is found; then `alpha` is fixed at
that value and another linear search is done to find the best value for `beta`.
If either of these two parameters is already specified, the search for that one is skipped. After each search for a
parameter, a plot of WER% for different values of the parameter is also shown.
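
The procedure amounts to the coordinate-wise sweep sketched below, where ``wer_for(alpha, beta)`` is a hypothetical
stand-in for running the rescorer with those values and measuring WER:

.. code-block:: python

    import numpy as np

    alpha_grid = np.linspace(0.0, 4.0, 21)
    beta_grid = np.linspace(-2.0, 2.0, 21)

    # Stage 1: fix beta at zero and sweep alpha.
    best_alpha = min(alpha_grid, key=lambda a: wer_for(a, 0.0))
    # Stage 2: fix the best alpha and sweep beta.
    best_beta = min(beta_grid, key=lambda b: wer_for(best_alpha, b))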

It is recommended to first use the linear search for both parameters on a validation set by not providing any values for `--alpha` and `--beta`.
Then check the WER curves and decide on the best values for each parameter. Finally, evaluate the best values on the test set.