ngram

NAME

ngram - apply N-gram language models

SYNOPSIS

ngram [-help] option ...

DESCRIPTION

ngram performs various operations with N-gram-based and related language models, including sentence scoring, perplexity computation, sentence generation, and various types of model interpolation. The N-gram language models are read from files in ARPA ngram-format(5); various extended language model formats are described with the options below.

Each filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.
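For example, a typical invocation that computes the test-set perplexity of a trigram model looks as follows (file names are illustrative; the options are described below):
ngram -order 3 -lm corpus.3bo -ppl test.txt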

OPTIONS

-help
Print option summary.
-order n
Set the maximal N-gram order to be used, by default 3. NOTE: The order of the model is not set automatically when a model file is read, so the same file can be used at various orders. To use models of order higher than 3 it is always necessary to specify this option.
-debug level
Set the debugging output level (0 means no debugging output). Debugging messages are sent to stderr, with the exception of -ppl output as explained below.
-memuse
Print memory usage statistics for the LM.

The following options determine the type of LM to be used.

-null
Use a `null' LM as the main model (one that gives probability 1 to all words). This is useful in combination with mixture creation or for debugging.
-lm file
Read the (main) N-gram model from file. This option is always required, unless -null was chosen.
-tagged
Interpret the LM as containing word/tag N-grams.
-skip
Interpret the LM as a ``skip'' N-gram model.
-hidden-vocab file
Interpret the LM as an N-gram model containing hidden events between words. The list of hidden event tags is read from file.
Hidden event definitions may also follow the N-gram definitions in the LM file (the argument to -lm). The format for such definitions is
event [-delete D] [-repeat R] [-insert w] [-observed] [-omit]
The optional flags after the event name modify the default behavior of hidden events in the model. By default events are unobserved pseudo-words of which at most one can occur between regular words, and which are added to the context to predict following words and events. (A typical use would be to model hidden sentence boundaries.) -delete indicates that upon encountering the event, D words are deleted from the next word's context. -repeat indicates that after the event the next R words from the context are to be repeated. -insert specifies that an (unobserved) word w is to be inserted into the history. -observed specifies the event tag is not hidden, but observed in the word stream. -omit indicates that the event tag itself is not to be added to the history for predicting the following words.
The hidden event mechanism represents a generalization of the disfluency LM enabled by -df.
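For example, to model hidden sentence boundaries one might declare a single event tag (the name S_BOUNDARY is illustrative), either by listing it in the -hidden-vocab file or by appending it after the N-gram definitions in the LM file; given without flags it acts as an unobserved pseudo-word that is added to the context:
S_BOUNDARY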
-hidden-not
Modifies processing of hidden event N-grams for the case that the event tags are embedded in the word stream, as opposed to inferred through dynamic programming.
-df
Interpret the LM as containing disfluency events. This enables an older form of hidden-event LM used in Stolcke & Shriberg (1996). It is roughly equivalent to a hidden-event LM with
UH -observed -omit (filled pause)
UM -observed -omit (filled pause)
@SDEL -insert <s> (sentence restart)
@DEL1 -delete 1 -omit (1-word deletion)
@DEL2 -delete 2 -omit (2-word deletion)
@REP1 -repeat 1 -omit (1-word repetition)
@REP2 -repeat 2 -omit (2-word repetition)
-classes file
Interpret the LM as an N-gram over word classes. The expansions of the classes are given in file in classes-format(5). Tokens in the LM that are not defined as classes in file are assumed to be plain words, so that the LM can contain mixed N-grams over both words and word classes.
Class definitions may also follow the N-gram definitions in the LM file (the argument to -lm). In that case -classes /dev/null should be specified to trigger interpretation of the LM as a class-based model. Otherwise, class definitions specified with this option override any definitions found in the LM file itself.
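For example, a class-based model whose class expansions are kept in a separate definitions file could be evaluated as follows (file names are illustrative):
ngram -lm class.3bo -classes class.defs -ppl test.txt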
-simple-classes
Assume a "simple" class model: each word is a member of at most one word class, and class expansions are exactly one word long.
-expand-classes k
Replace the read class-N-gram model with an (approximately) equivalent word-based N-gram. The argument k limits the length of the N-grams included in the new model (k=0 allows N-grams of arbitrary length).
-expand-exact k
Use a more exact (but also more expensive) algorithm to compute the conditional probabilities of N-grams expanded from classes, for N-grams of length k or longer (k=0 is a special case and the default; it disables the exact algorithm for all N-grams). The exact algorithm is recommended for class-N-gram models that contain multi-word class expansions, for N-gram lengths exceeding the order of the underlying class N-grams.
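For example, a class-based trigram could be converted to an (approximately) equivalent word-based trigram and written out as follows (file names are illustrative):
ngram -order 3 -lm class.3bo -classes class.defs -expand-classes 3 -write-lm word.3bo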
-decipher
Use the N-gram model exactly as the Decipher(TM) recognizer would, i.e., choosing the backoff path if it has a higher probability than the bigram transition, and rounding log probabilities to bytelog precision.
-hmm
Use an HMM of N-grams language model. The -lm option specifies a file that describes a probabilistic graph, with each line corresponding to a node or state. A line has the format:
statename ngram-file s1 p1 s2 p2 ...
where statename is a string identifying the state, ngram-file names a file containing a backoff N-gram model, s1,s2 ... are names of follow-states, and p1,p2 ... are the associated transition probabilities. A filename of ``-'' can be used to indicate the N-gram model data is included in the HMM file, after the current line. (Further HMM states may be specified after the N-gram data.)
The names INITIAL and FINAL denote the start and end states, respectively, and have no associated N-gram model (ngram-file must be specified as ``.'' for these). The -order option specifies the maximal N-gram length in the component models.
The semantics of an HMM of N-grams is as follows: as each state is visited, words are emitted from the associated N-gram model. The first state (corresponding to the start-of-sentence) is INITIAL. A state is left with the probability of the end-of-sentence token in the respective model, and the next state is chosen according to the state transition probabilities. Each state has to emit at least one word. The actual end-of-sentence is emitted if and only if the FINAL state is reached. Each word probability is conditioned on all preceding words, regardless of whether they were emitted in the same or a previous state.
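For example, a minimal HMM description following the format above might consist of a single emitting state that loops on itself (the state name body and model file name body.3bo are illustrative):
INITIAL . body 1.0
body body.3bo body 0.2 FINAL 0.8
FINAL .
Here body emits words from body.3bo; whenever that model's end-of-sentence probability is taken, the next state is body again with probability 0.2 or FINAL with probability 0.8, the latter ending the sentence.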
-vocab file
Initialize the vocabulary for the LM from file. This is especially useful if the LM itself does not specify a complete vocabulary, e.g., as with -null.
-nonevents file
Read a list of words from file that are to be considered non-events, i.e., that should only occur in LM contexts, but not as predictions. Such words are excluded from sentence generation (-gen) and probability summation (-ppl -debug 3).
-limit-vocab
Discard LM parameters on reading that do not pertain to the words specified in the vocabulary. The default is that words used in the LM are automatically added to the vocabulary. This option can be used to reduce the memory requirements for large LMs that are going to be evaluated only on a small vocabulary subset.
-unk
Indicates that the LM contains the unknown word, i.e., is an open-class LM.
-map-unk word
Map out-of-vocabulary words to word, rather than the default <unk> tag.
-tolower
Map all vocabulary to lowercase. Useful if case conventions for text/counts and language model differ.
-multiwords
Split input words consisting of multiwords joined by underscores into their components, before evaluating LM probabilities.
-mix-lm file
Read a second N-gram model for interpolation purposes. The second and any additional interpolated models can also be class N-grams (using the same -classes definitions), but are otherwise constrained to be standard N-grams, i.e., the options -df, -tagged, -skip, and -hidden-vocab do not apply to them.
NOTE: Unless -bayes (see below) is specified, -mix-lm triggers a static interpolation of the models in memory. In most cases the more efficient dynamic interpolation, requested by -bayes 0, is sufficient.
-lambda weight
Set the weight of the main model when interpolating with -mix-lm. Default value is 0.5.
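For example, the main model and a second model could be interpolated dynamically, with the main model receiving weight 0.7, while computing test-set perplexity (file names are illustrative; see -bayes below):
ngram -order 3 -lm main.3bo -mix-lm other.3bo -lambda 0.7 -bayes 0 -ppl test.txt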
-mix-lm2 file
-mix-lm3 file
-mix-lm4 file
-mix-lm5 file
-mix-lm6 file
-mix-lm7 file
-mix-lm8 file
-mix-lm9 file
These options allow up to eight additional N-gram models (beyond -mix-lm) to be specified for interpolation.
-mix-lambda2 weight
-mix-lambda3 weight
-mix-lambda4 weight
-mix-lambda5 weight
-mix-lambda6 weight
-mix-lambda7 weight
-mix-lambda8 weight
-mix-lambda9 weight
These are the weights for the additional mixture components, corresponding to -mix-lm2 through -mix-lm9. The weight for the -mix-lm model is 1 minus the sum of -lambda and -mix-lambda2 through -mix-lambda9.
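For example, with -lambda 0.5 and -mix-lambda2 0.2 (and no further components specified), the main model receives weight 0.5, the -mix-lm2 model receives weight 0.2, and the -mix-lm model receives the remaining 1 - 0.5 - 0.2 = 0.3.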
-bayes length
Interpolate the second and the main model using posterior probabilities for local N-gram contexts of length length. The -lambda value is used as a prior mixture weight in this case.
-bayes-scale scale
Set the exponential scale factor on the context likelihood in conjunction with the -bayes function. Default value is 1.0.
-cache length
Interpolate the main LM (or the one resulting from operations above) with a unigram cache language model based on a history of the last length words.
-cache-lambda weight
Set interpolation weight for the cache LM. Default value is 0.05.
-dynamic
Interpolate the main LM (or the one resulting from operations above) with a dynamically changing LM. LM changes are indicated by the tag ``<LMstate>'' starting a line in the input to -ppl, -counts, or -rescore, followed by a filename containing the new LM.
-dynamic-lambda weight
Set interpolation weight for the dynamic LM. Default value is 0.05.

The following options specify the operations performed on/with the LM constructed as per the options above.

-renorm
Renormalize the main model by recomputing backoff weights for the given probabilities.
-prune threshold
Prune N-gram probabilities if their removal causes (training set) perplexity of the model to increase by less than threshold relative.
-prune-lowprobs
Prune N-gram probabilities that are lower than the corresponding backed-off estimates. This generates N-gram models that can be correctly converted into probabilistic finite-state networks.
-minprune n
Only prune N-grams of length at least n. The default (and minimum allowed value) is 2, i.e., only unigrams are excluded from pruning. This option applies to both -prune and -prune-lowprobs.
-write-lm file
Write a model back to file. The output will be in the same format as read by -lm, except if operations such as -mix-lm or -expand-classes were applied, in which case the output will contain the generated single N-gram backoff model in ARPA ngram-format(5).
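For example, a large trigram model could be pruned and the result written back out in ARPA format as follows (file names and threshold value are illustrative):
ngram -order 3 -lm big.3bo -prune 1e-7 -write-lm pruned.3bo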
-write-vocab file
Write the LM's vocabulary to file.
-gen number
Generate number random sentences from the LM.
-seed value
Initialize the random number generator used for sentence generation using seed value. The default is to use a seed that should be close to unique for each invocation of the program.
-ppl textfile
Compute sentence scores (log probabilities) and perplexities from the sentences in textfile, which should contain one sentence per line. The -debug option controls the level of detail printed; unlike other debugging output, -ppl output goes to stdout rather than stderr. At level 0, only summary statistics for the entire corpus are printed: the number of sentences, words, out-of-vocabulary words, and zero-probability tokens in the input, as well as the total log probability and perplexity. Perplexity is given with two different normalizations: counting all input tokens (``ppl'') and excluding end-of-sentence tags (``ppl1''). At level 1, statistics for individual sentences are printed as well. At level 2, probabilities for each word are printed, along with LM-dependent details about the backoff used, etc. At level 3, the probabilities of all words are summed in each context and the sum is printed; if it differs significantly from 1, a warning message is issued to stderr.
-nbest file
Read an N-best list in nbest-format(5) and rerank the hypotheses using the specified LM. The reordered N-best list is written to stdout. If the N-best list is given in ``NBestList1.0'' format and contains composite acoustic/language model scores, then -decipher-lm and the recognizer language model and word transition weights (see below) need to be specified so the original acoustic scores can be recovered.
-max-nbest n
Limits the number of hypotheses read from an N-best list. Only the first n hypotheses are processed.
-rescore file
Similar to -nbest, but the input is processed as a stream of N-best hypotheses (without header). The output consists of the rescored hypotheses in SRILM format (the third of the formats described in nbest-format(5)).
-decipher-lm model-file
Designates the N-gram backoff model (typically a bigram) that was used by the Decipher(TM) recognizer in computing composite scores for the hypotheses fed to -rescore or -nbest. Used to compute acoustic scores from the composite scores.
-decipher-order N
Specifies the order of the Decipher N-gram model used (default is 2).
-decipher-nobackoff
Indicates that the Decipher N-gram model does not contain backoff nodes, i.e., all recognizer LM scores are correct up to rounding.
-decipher-lmw weight
Specifies the language model weight used by the recognizer. Used to compute acoustic scores from the composite scores.
-decipher-wtw weight
Specifies the word transition weight used by the recognizer. Used to compute acoustic scores from the composite scores.
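For example, an N-best list in ``NBestList1.0'' format with composite scores might be reranked with a new trigram as follows (file names and weight values are illustrative):
ngram -order 3 -lm new.3bo -nbest hyps.nbest -decipher-lm bigram.2bo -decipher-lmw 8.0 -decipher-wtw 0.0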
-escape string
Set an ``escape string'' for the -ppl, -counts, and -rescore computations. Input lines starting with string are not processed as sentences and are instead passed unchanged to stdout. This allows associated information to be passed to scoring scripts, etc.
-counts countsfile
Perform a computation similar to -ppl, but based only on the N-gram counts found in countsfile. Probabilities are computed for the last word of each N-gram, using the other words as contexts, and scaling by the associated N-gram count. Summary statistics are output at the end, as well as before each escaped input line.
-count-order n
Use only counts of order n in the -counts computation. The default value is 0, meaning use all counts.
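For example, -counts statistics restricted to trigram counts could be computed as follows (file names are illustrative):
ngram -order 3 -lm main.3bo -counts test.counts -count-order 3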
-skipoovs
Instruct the LM to skip over contexts that contain out-of-vocabulary words, instead of using a backoff strategy in these cases.
-noise noise-tag
Designate noise-tag as a vocabulary item that is to be ignored by the LM. (This is typically used to identify a noise marker.) Note that the LM specified by -decipher-lm does NOT ignore this noise-tag since the DECIPHER recognizer treats noise as a regular word.
-noise-vocab file
Read several noise tags from file, instead of, or in addition to, the single noise tag specified by -noise.
-reverse
Reverse the words in a sentence for LM scoring purposes. (This assumes the LM used is a ``right-to-left'' model.) Note that the LM specified by -decipher-lm is always applied to the original, left-to-right word sequence.

SEE ALSO

ngram-count(1), ngram-class(1), lm-scripts(1), ppl-scripts(1), pfsg-scripts(1), nbest-scripts(1), ngram-format(5), nbest-format(5), classes-format(5).
M. Weintraub et al., ``Fast Training and Portability,'' Research Note No. 1, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, Feb. 1996.
A. Stolcke, ``Entropy-based Pruning of Backoff Language Models,'' Proc. DARPA Broadcast News Transcription and Understanding Workshop, pp. 270-274, Lansdowne, VA, 1998.
A. Stolcke et al., ``Automatic Detection of Sentence Boundaries and Disfluencies Based on Recognized Words,'' Proc. ICSLP, pp. 2247-2250, Sydney, 1998.
A. Stolcke and E. Shriberg, ``Statistical Language Modeling for Speech Disfluencies,'' Proc. IEEE ICASSP, pp. 405-409, Atlanta, GA, 1996.

BUGS

Some LM types (such as Bayes-interpolated LMs) currently do not support the -write-lm function.

For the -limit-vocab option to work correctly with hidden-event and class N-gram LMs, the event/class vocabularies have to be specified by options (-hidden-vocab and -classes, respectively). Embedding event/class definitions only in the LM file will not work correctly.

Sentence generation is slow and takes time proportional to the vocabulary size.

AUTHOR

Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1995-2003 SRI International