#####################
ASR Language Modeling
#####################

Language models have been shown to improve the accuracy of ASR models. NeMo supports the following two approaches to incorporate language models into ASR models:

* :ref:`ngram_modeling`
* :ref:`neural_rescoring`

It is possible to use both approaches on the same ASR model.

.. _ngram_modeling:

************************
N-gram Language Modeling
************************

In this approach, an N-gram LM is trained on text data, then it is used in fusion with beam search decoding to find the
best candidates. The beam search decoders in NeMo support language models trained with the KenLM library
(`https://github.com/kpu/kenlm <https://github.com/kpu/kenlm>`__).

The beam search decoders and KenLM library are not installed by default in NeMo; you need to install them to be
able to use beam search decoding and N-gram LM. Please refer to
`scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh` for instructions on how to install them.

NeMo supports both character-based and BPE-based models for N-gram LMs. An N-gram LM can be used with beam search
decoders on top of ASR models to produce more accurate candidates. The beam search decoder incorporates
the scores produced by the N-gram LM into its score calculation as follows:

.. code-block::

    final_score = acoustic_score + beam_alpha*lm_score + beam_beta*seq_length

where `acoustic_score` is the score predicted by the acoustic encoder and `lm_score` is the one estimated by the LM.
The parameter `beam_alpha` specifies the amount of importance to place on the N-gram language model, and `beam_beta`
is a penalty term that accounts for sequence length in the scores. A larger alpha places more importance on the LM
and less on the acoustic model. Negative values for beta penalize longer sequences and make the decoder
prefer shorter predictions, while positive values favor longer candidates.
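
As a rough illustration of how these weights interact, the following sketch combines the three terms for two
hypothetical candidates with made-up scores (the names and numbers are illustrative only, not taken from the decoder):

.. code-block:: python

    # Illustrative only: made-up scores for two hypothetical candidates.
    beam_alpha, beam_beta = 1.0, -0.5  # negative beta penalizes longer sequences

    candidates = [
        # (text, acoustic_score, lm_score)
        ("the cat sat", -4.2, -9.1),
        ("the cat sat down", -4.0, -8.5),
    ]

    for text, acoustic_score, lm_score in candidates:
        seq_length = len(text.split())  # the decoder may measure length differently
        final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length
        print(f"{text!r}: final_score = {final_score:.2f}")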

Train N-gram LM
===============

The script to train an N-gram language model with KenLM can be found at
`scripts/asr_language_modeling/ngram_lm/train_kenlm.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/train_kenlm.py>`__.

This script trains an N-gram language model with the KenLM library, which can then be used with the beam search decoders
on top of ASR models. The script supports both character-level and BPE-level encodings and models; the encoding type
is detected automatically from the type of the model.

You may train the N-gram model as follows:

.. code-block::

    python train_kenlm.py --nemo_model_file <path to the .nemo file of the model> \
        --train_file <path to the training text or JSON manifest file> \
        --kenlm_bin_path <path to the bin folder of KenLM library> \
        --kenlm_model_file <path to store the binary KenLM model> \
        --ngram_length <order of N-gram model> \
        --preserve_arpa

The train file specified by `--train_file` can be a text file or a JSON manifest. If the file's extension is anything
other than `.json`, the data is assumed to be plain text. In plain-text format, each line should contain one
sample. A JSON manifest file should contain one JSON-formatted sample per line, like this:

.. code-block::

    {"audio_filepath": "/data_path/file1.wav", "text": "The transcript of the audio file."}

The script extracts the `text` field from each line to create the training text file. After the N-gram model is trained,
it is stored at the path specified by `--kenlm_model_file`.
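
Conceptually, the extraction step is equivalent to the following sketch (the actual logic lives inside
``train_kenlm.py``; the file names here are placeholders):

.. code-block:: python

    import json

    # Pull the `text` field out of a JSON manifest, one sample per line,
    # to build a plain-text corpus for KenLM training.
    with open("manifest.json", encoding="utf-8") as fin, \
            open("train_text.txt", "w", encoding="utf-8") as fout:
        for line in fin:
            sample = json.loads(line)
            fout.write(sample["text"] + "\n")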

The following is the list of the arguments for the training script:

+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| **Argument**     | **Type** | **Default** | **Description**                                                                                 |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| nemo_model_file  | str      | Required    | The path of the `.nemo` file of the ASR model. It is needed to extract the tokenizer.           |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| train_file       | str      | Required    | Path to the training file; it can be a text file or a JSON manifest.                            |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| kenlm_model_file | str      | Required    | The path to store the KenLM binary model file.                                                  |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| kenlm_bin_path   | str      | Required    | The path to the bin folder of KenLM. It is a folder named `bin` under where KenLM is installed. |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| ngram_length**   | int      | Required    | Specifies the order of the N-gram LM.                                                           |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| do_lower_case    | bool     | ``False``   | Whether to make the training text all lower case.                                               |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+
| preserve_arpa    | bool     | ``False``   | Whether to preserve the intermediate ARPA file after construction of the BIN file.              |
+------------------+----------+-------------+-------------------------------------------------------------------------------------------------+

** Note: it is recommended to use 6 as the order of the N-gram model for BPE-based models. Higher orders may require re-compilation of KenLM to support them.

Evaluate by Beam Search Decoding and N-gram LM
==============================================

NeMo's beam search decoders are capable of using KenLM's N-gram models to find the best candidates.
The script to evaluate an ASR model with beam search decoding and N-gram models can be found at
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__.

This script has a large number of possible argument overrides; therefore, it is advised to use ``python eval_beamsearch_ngram.py --help`` to see the full list of arguments.

You may evaluate an ASR model as follows:

.. code-block::

    python eval_beamsearch_ngram.py nemo_model_file=<path to the .nemo file of the model> \
        input_manifest=<path to the evaluation JSON manifest file> \
        kenlm_model_file=<path to the binary KenLM model> \
        beam_width=[<list of the beam widths, separated with commas>] \
        beam_alpha=[<list of the beam alphas, separated with commas>] \
        beam_beta=[<list of the beam betas, separated with commas>] \
        preds_output_folder=<optional folder to store the predictions> \
        probs_cache_file=null \
        decoding_mode=beamsearch_ngram \
        decoding_strategy="<Beam library such as beam, pyctcdecode or flashlight>"

It can evaluate a model in the following three modes by setting the argument `decoding_mode`:

* greedy: Only greedy decoding is done; no beam search decoding is performed.
* beamsearch: Beam search decoding is done, but without using the N-gram language model; final results are equivalent to setting the weight of the LM (`beam_alpha`) to zero.
* beamsearch_ngram: Beam search decoding is done with an N-gram LM.

The `beamsearch` mode evaluates by beam search decoding without any language model.
It reports the performance in terms of Word Error Rate (WER) and Character Error Rate (CER). Moreover,
the WER/CER obtained when the best candidate is selected among all the candidates is also reported as the best WER/CER.
It can be an indicator of how good the predicted candidates are.

The script first loads the ASR model and predicts the outputs of the model's encoder as log probabilities.
This part is computed in batches on a device selected by `device`, which can be CPU (`device=cpu`) or a
single GPU (`device=cuda:0`). The batch size of this part can be specified via `acoustic_batch_size`. Using
the largest feasible batch size speeds up the calculation of the log probabilities. You may also set `use_amp=true`
to speed up the calculation of the log probabilities and make it possible to use larger values for `acoustic_batch_size`.
Currently, multi-GPU is not supported for calculating the log probabilities, but using `probs_cache_file` can help.
It stores the log probabilities produced by the model's encoder in a pickle file so that on the next run the first step
can be skipped.
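
The caching behaves conceptually like the sketch below, where ``compute_log_probs`` is a hypothetical stand-in
for the model's encoder forward pass, not a NeMo API:

.. code-block:: python

    import os
    import pickle

    probs_cache_file = "logprobs.pkl"

    if os.path.exists(probs_cache_file):
        # Subsequent runs skip the expensive forward pass.
        with open(probs_cache_file, "rb") as f:
            all_log_probs = pickle.load(f)
    else:
        all_log_probs = compute_log_probs(model, manifest)  # hypothetical helper
        with open(probs_cache_file, "wb") as f:
            pickle.dump(all_log_probs, f)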

The following is the list of the important arguments for the evaluation script:

+---------------------+----------+------------------+-------------------------------------------------------------------------+
| **Argument**        | **Type** | **Default**      | **Description**                                                         |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| nemo_model_file     | str      | Required         | The path of the `.nemo` file of the ASR model to extract the tokenizer. |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| input_manifest      | str      | Required         | Path to the evaluation JSON manifest file.                              |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| kenlm_model_file    | str      | Required         | The path of the KenLM binary model file.                                |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| preds_output_folder | str      | None             | The path to an optional folder to store the predictions.                |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| probs_cache_file    | str      | None             | The cache file for storing the outputs of the model.                    |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| acoustic_batch_size | int      | 16               | The batch size to calculate log probabilities.                          |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| use_amp             | bool     | False            | Whether to use AMP if available to calculate log probabilities.         |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| device              | str      | cuda             | The device to load the model onto to calculate log probabilities.       |
|                     |          |                  | It can be `cpu`, `cuda`, `cuda:0`, `cuda:1`, ...                        |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| decoding_mode       | str      | beamsearch_ngram | The decoding scheme to be used for evaluation.                          |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| beam_width          | float    | Required         | List of the beam widths of the beam search decoding.                    |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| beam_alpha          | float    | Required         | List of the alpha parameters for the beam search decoding.              |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| beam_beta           | float    | Required         | List of the beta parameters for the beam search decoding.               |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| beam_batch_size     | int      | 128              | The batch size to be used for beam search decoding.                     |
|                     |          |                  | A larger batch size can be a little faster, but uses more memory.       |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| decoding_strategy   | str      | beam             | String argument for the type of decoding strategy for the model.        |
+---------------------+----------+------------------+-------------------------------------------------------------------------+
| decoding            | Dict     | BeamCTC          | Subdict of beam search configs. Values can be found via                 |
|                     | Config   | InferConfig      | ``python eval_beamsearch_ngram.py --help``                              |
+---------------------+----------+------------------+-------------------------------------------------------------------------+

The width of the beam search (`beam_width`) specifies the number of top candidates/predictions the beam search decoder
searches for. Larger beams result in more accurate but slower predictions.

.. note::

    The ``eval_beamsearch_ngram.py`` script contains the entire subconfig used for CTC Beam Decoding.
    Therefore it is possible to forward arguments for various beam search libraries such as ``flashlight``
    and ``pyctcdecode`` via the ``decoding`` subconfig.

There is also a tutorial to learn more about evaluating ASR models with N-gram LMs here:
`Offline ASR Inference with Beam Search and External Language Model Rescoring <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/Offline_ASR.ipynb>`_

Beam Search Engines
-------------------

NeMo ASR CTC supports multiple beam search engines for decoding. The default engine is ``beam``, which uses the
OpenSeq2Seq decoding library.

OpenSeq2Seq (``beam``)
~~~~~~~~~~~~~~~~~~~~~~

A CPU-based beam search engine that is quite efficient and supports char and subword models. It requires a
character/subword KenLM model to be provided.

The config for this decoding library is described above.

Flashlight (``flashlight``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Flashlight is a C++ library for ASR decoding provided at `https://github.com/flashlight/flashlight <https://github.com/flashlight/flashlight>`_. It is a CPU- and CUDA-based beam search engine that is quite efficient and supports
char and subword models. It requires an ARPA KenLM file.

It supports several advanced features such as lexicon-based / lexicon-free decoding, beam pruning thresholds, and more.

.. code-block:: python

    import math
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FlashlightConfig:
        lexicon_path: Optional[str] = None
        beam_size_token: int = 16
        beam_threshold: float = 20.0
        unk_weight: float = -math.inf
        sil_weight: float = 0.0
        unit_lm: bool = False

.. code-block::

    # Lexicon-based decoding
    python eval_beamsearch_ngram.py ... \
        decoding_strategy="flashlight" \
        decoding.beam.flashlight_cfg.lexicon_path='/path/to/lexicon.lexicon' \
        decoding.beam.flashlight_cfg.beam_size_token=32 \
        decoding.beam.flashlight_cfg.beam_threshold=25.0

    # Lexicon-free decoding
    python eval_beamsearch_ngram.py ... \
        decoding_strategy="flashlight" \
        decoding.beam.flashlight_cfg.beam_size_token=32 \
        decoding.beam.flashlight_cfg.beam_threshold=25.0

PyCTCDecode (``pyctcdecode``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PyCTCDecode is a Python library for ASR decoding provided at `https://github.com/kensho-technologies/pyctcdecode <https://github.com/kensho-technologies/pyctcdecode>`_. It is a CPU-based beam search engine that is somewhat efficient for a pure Python library, and supports char and subword models. It requires a character/subword KenLM ARPA or BINARY model to be provided.

It has advanced features such as word boosting, which can be useful for transcript customization.

.. code-block:: python

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class PyCTCDecodeConfig:
        beam_prune_logp: float = -10.0
        token_min_logp: float = -5.0
        prune_history: bool = False
        hotwords: Optional[List[str]] = None
        hotword_weight: float = 10.0

.. code-block::

    # PyCTCDecode decoding
    python eval_beamsearch_ngram.py ... \
        decoding_strategy="pyctcdecode" \
        decoding.beam.pyctcdecode_cfg.beam_prune_logp=-10.0 \
        decoding.beam.pyctcdecode_cfg.token_min_logp=-5.0 \
        decoding.beam.pyctcdecode_cfg.hotwords=[<List of str words>] \
        decoding.beam.pyctcdecode_cfg.hotword_weight=10.0

Hyperparameter Grid Search
--------------------------

Beam search decoding with an N-gram LM has three main hyperparameters: `beam_width`, `beam_alpha`, and `beam_beta`.
The accuracy of the model depends on the values of these parameters, especially `beam_alpha` and `beam_beta`.
You may specify a single value or a list of values for each of these parameters to perform a grid search. The script
performs beam search decoding on all combinations of these three hyperparameters.
For instance, the following set of parameters would result in 2*1*2=4 beam search decodings:

.. code-block::

    python eval_beamsearch_ngram.py ... \
        beam_width=[64,128] \
        beam_alpha=[1.0] \
        beam_beta=[1.0,0.5]
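
The grid expansion itself is simply the Cartesian product of the three lists, as in this sketch:

.. code-block:: python

    from itertools import product

    beam_widths = [64, 128]
    beam_alphas = [1.0]
    beam_betas = [1.0, 0.5]

    # 2 * 1 * 2 = 4 combinations, each triggering one beam search decoding run.
    for width, alpha, beta in product(beam_widths, beam_alphas, beam_betas):
        print(f"beam_width={width}, beam_alpha={alpha}, beam_beta={beta}")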

.. _neural_rescoring:

****************
Neural Rescoring
****************

In this approach, a neural network is used to score candidates. A candidate is a text transcript predicted by the decoder of the ASR model.
The top K candidates produced by beam search decoding (with a beam width of K) are given to a neural language model to rank them.
The language model assigns a score to each candidate, and this score is usually combined with the scores from
the beam search decoding to produce the final scores and rankings.

Train Neural Rescorer
=====================

An example script to train such a language model with Transformer can be found at `examples/nlp/language_modeling/transformer_lm.py <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/language_modeling/transformer_lm.py>`__.
It trains a ``TransformerLMModel`` which can be used as a neural rescorer for an ASR system. Full documentation on language model training is available at:

:doc:`../nlp/language_modeling`

You may also use a pretrained language model from the HuggingFace library, such as Transformer-XL or GPT, instead of training your own model.
Models like BERT and RoBERTa are not supported by this script because they are trained as masked language models and are not efficient or effective at scoring sentences out of the box.
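
As a rough sketch of why causal models work out of the box here, the snippet below scores a sentence by its
average per-token log-likelihood under ``gpt2`` using the HuggingFace ``transformers`` library (this only
illustrates the scoring idea; it is not what ``eval_neural_rescorer.py`` does internally):

.. code-block:: python

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def sentence_score(text: str) -> float:
        """Average per-token log-likelihood of `text` under the LM (higher is better)."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # With labels == input_ids, the model returns the mean
            # cross-entropy over the shifted tokens.
            loss = model(**inputs, labels=inputs["input_ids"]).loss
        return -loss.item()

    print(sentence_score("the cat sat on the mat"))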

Evaluation
==========

Given a trained TransformerLMModel `.nemo` file or a pretrained HF model, the script available at
`scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py>`__
can be used to re-score beams obtained with an ASR model. To use this script, you need a `.tsv` file containing the candidates produced
by the acoustic model and the beam search decoding. The candidates can be the result of beam search
decoding alone or of fusion with an N-gram LM. You may generate this file by specifying `preds_output_folder` for
`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__.

The neural rescorer re-scores the beams/candidates using two parameters, `rescorer_alpha` and `rescorer_beta`, as follows:

.. code-block::

    final_score = beam_search_score + rescorer_alpha*neural_rescorer_score + rescorer_beta*seq_length

The parameter `rescorer_alpha` specifies the amount of importance to place on the neural rescorer model, and `rescorer_beta` is
a penalty term that accounts for sequence length in the scores. They have effects similar to those of the parameters
`beam_alpha` and `beam_beta` of the beam search decoder and N-gram LM.
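
Conceptually, the re-ranking reduces to the sketch below over a list of (candidate, beam search score, rescorer
score) triples; all names and numbers here are illustrative:

.. code-block:: python

    rescorer_alpha, rescorer_beta = 1.0, 0.0

    # (candidate text, beam search score, neural rescorer score) -- made-up values
    beams = [
        ("the cat sat on the mat", -12.3, -18.7),
        ("the cat sat on a mat", -12.1, -20.2),
    ]

    def final_score(beam_search_score, neural_rescorer_score, seq_length):
        return (beam_search_score
                + rescorer_alpha * neural_rescorer_score
                + rescorer_beta * seq_length)

    best = max(beams, key=lambda b: final_score(b[1], b[2], len(b[0].split())))
    print(best[0])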

You may follow these steps to evaluate a neural LM:

#. Obtain a `.tsv` file with the beams and their corresponding scores. Scores can be from a regular beam search decoder or
   from fusion with N-gram LM scores. For a given beam size `beam_size` and a number of examples
   for evaluation `num_eval_examples`, the file should contain (`num_eval_examples` x `beam_size`) lines of the
   form `beam_candidate_text \t score`. This file can be generated by `scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py>`__.

#. Rescore the candidates using `scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py <https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py>`__:

.. code-block::

    python eval_neural_rescorer.py \
        --lm_model=[path to .nemo file of the LM or the name of a HF pretrained model] \
        --beams_file=[path to beams .tsv file] \
        --beam_size=[size of the beams] \
        --eval_manifest=[path to eval manifest .json file] \
        --batch_size=[batch size used for inference on the LM model] \
        --alpha=[the value for the parameter rescorer_alpha] \
        --beta=[the value for the parameter rescorer_beta] \
        --scores_output_file=[the optional path to store the rescored candidates]

The candidates along with their new scores are stored in the file specified by `--scores_output_file`.

The following is the list of the arguments for the evaluation script:

+---------------------+--------+------------------+-------------------------------------------------------------------------+
| **Argument**        |**Type**| **Default**      | **Description**                                                         |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| lm_model            | str    | Required         | The path of the '.nemo' file of the LM model, or the name of a          |
|                     |        |                  | HuggingFace pretrained model like 'transfo-xl-wt103' or 'gpt2'          |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| eval_manifest       | str    | Required         | Path to the evaluation manifest file (.json manifest file)              |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| beams_file          | str    | Required         | Path to the beams file (.tsv) containing the candidates and their scores|
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| beam_size           | int    | Required         | The width of the beams (number of candidates) generated by the decoder  |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| alpha               | float  | None             | The value for the parameter rescorer_alpha.                             |
|                     |        |                  | Not passing a value enables a linear search for rescorer_alpha.         |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| beta                | float  | None             | The value for the parameter rescorer_beta.                              |
|                     |        |                  | Not passing a value enables a linear search for rescorer_beta.          |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| batch_size          | int    | 16               | The batch size used to calculate the scores                             |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| max_seq_length      | int    | 512              | Maximum sequence length (in tokens) for the input                       |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| scores_output_file  | str    | None             | The optional file to store the rescored beams                           |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| use_amp             | bool   | ``False``        | Whether to use AMP if available to calculate the scores                 |
+---------------------+--------+------------------+-------------------------------------------------------------------------+
| device              | str    | cuda             | The device to load the LM model onto to calculate the scores            |
|                     |        |                  | It can be 'cpu', 'cuda', 'cuda:0', 'cuda:1', ...                        |
+---------------------+--------+------------------+-------------------------------------------------------------------------+

Hyperparameter Linear Search
----------------------------

This script also supports linear search for the parameters `alpha` and `beta`. If either of the two is not
provided, a linear search is performed to find the best value for that parameter. When linear search is used,
`beta` is initially set to zero and the best value for `alpha` is found; then `alpha` is fixed at
that value and another linear search is done to find the best value for `beta`.
If either of these two parameters is already specified, the search for that one is skipped. After each search for a
parameter, a plot of WER% for different values of the parameter is also shown.
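
The procedure amounts to the coordinate-wise sweep sketched below, where ``wer_for(alpha, beta)`` is a hypothetical
stand-in for running the rescorer with those values and measuring WER:

.. code-block:: python

    import numpy as np

    alpha_grid = np.linspace(0.0, 4.0, 21)
    beta_grid = np.linspace(-2.0, 2.0, 21)

    # Stage 1: fix beta at zero and sweep alpha.
    best_alpha = min(alpha_grid, key=lambda a: wer_for(a, 0.0))
    # Stage 2: fix the best alpha and sweep beta.
    best_beta = min(beta_grid, key=lambda b: wer_for(best_alpha, b))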

It is recommended to first use the linear search for both parameters on a validation set by not providing any values for `--alpha` and `--beta`.
Then check the WER curves and decide on the best values for each parameter. Finally, evaluate the best values on the test set.