ngram
NAME
ngram - apply N-gram language models
SYNOPSIS
ngram [ -help ] option ...
DESCRIPTION
ngram
performs various operations with N-gram-based and related language models,
including sentence scoring, perplexity computation, sentence generation,
and various types of model interpolation.
The N-gram language models are read from files in ARPA
ngram-format(5);
various extended language model formats are described with the options
below.
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
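For example, the perplexity of a test set can be computed with a trigram model
using a command like the following (file names are illustrative):
    ngram -lm trigram.lm.gz -ppl test.txt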
OPTIONS
- -help
-
Print option summary.
- -order n
-
Set the maximal N-gram order to be used, by default 3.
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
To use models of order higher than 3 it is always necessary to specify this
option.
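For example, a 4-gram model would be evaluated as follows (file names are
illustrative):
    ngram -order 4 -lm 4gram.lm.gz -ppl test.txt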
- -debug level
-
Set the debugging output level (0 means no debugging output).
Debugging messages are sent to stderr, with the exception of
-ppl
output as explained below.
- -memuse
-
Print memory usage statistics for the LM.
The following options determine the type of LM to be used.
- -null
-
Use a `null' LM as the main model (one that gives probability 1 to all words).
This is useful in combination with mixture creation or for debugging.
- -lm file
-
Read the (main) N-gram model from
file.
This option is always required, unless
-null
was chosen.
- -df
-
Interpret the LM as containing disfluency events.
- -tagged
-
Interpret the LM as containing word/tag N-grams.
- -skip
-
Interpret the LM as a ``skip'' N-gram model.
- -hidden-vocab file
-
Interpret the LM as an N-gram containing hidden events between words.
The list of hidden event tags is read from
file.
- -hidden-not
-
Modifies processing of hidden event N-grams for the case that
the event tags are embedded in the word stream, as opposed to inferred
through dynamic programming.
- -classes file
-
Interpret the LM as an N-gram over word classes.
The expansions of the classes are given in
file
in
classes-format(5).
Tokens in the LM that are not defined as classes in
file
are assumed to be plain words, so that the LM can contain mixed N-grams over
both words and word classes.
Class definitions may also follow the N-gram definitions in the
LM file (the argument to
-lm).
In that case
-classes /dev/null
should be specified to trigger interpretation of the LM as a class-based model.
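For example, a class-based model can be applied to a test set as follows
(file names are illustrative):
    ngram -lm class.3gram.lm -classes class.defs -ppl test.txt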
- -expand-classes k
-
Replace the read class-N-gram model with an (approximately) equivalent
word-based N-gram.
The argument
k
limits the length of the N-grams included in the new model
(k=0
allows N-grams of arbitrary length).
- -expand-exact k
-
Use a more exact (but also more expensive) algorithm to compute the
conditional probabilities of N-grams expanded from classes, for
N-grams of length
k
or longer
(k=0
is a special case and the default; it disables the exact algorithm for all
N-grams).
The exact algorithm is recommended for class-N-gram models that contain
multi-word class expansions, for N-gram lengths exceeding the order of
the underlying class N-grams.
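For example, a class-based trigram could be expanded into an approximately
equivalent word-based trigram and written out as follows (file names are
illustrative):
    ngram -lm class.3gram.lm -classes class.defs \
        -expand-classes 3 -expand-exact 3 -write-lm word.3gram.lm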
- -decipher
-
Use the N-gram model exactly as the Decipher(TM) recognizer would,
i.e., choosing the backoff path if it has a higher probability than
the bigram transition, and rounding log probabilities to bytelog
precision.
- -hmm
-
Use an HMM of N-grams language model.
The
-lm
option specifies a file that describes a probabilistic graph, with each
line corresponding to a node or state.
A line has the format:
statename ngram-file s1 p1 s2 p2 ...
where
statename
is a string identifying the state,
ngram-file
names a file containing a backoff N-gram model,
s1,s2
... are names of follow-states, and
p1,p2
... are the associated transition probabilities.
A filename of ``-'' can be used to indicate the N-gram model data
is included in the HMM file, after the current line.
(Further HMM states may be specified after the N-gram data.)
The names
INITIAL
and
FINAL
denote the start and end states, respectively, and have no associated
N-gram model (ngram-file
must be specified as ``.'' for these).
The
-order
option specifies the maximal N-gram length in the component models.
The semantics of an HMM of N-grams is as follows: as each state is visited,
words are emitted from the associated N-gram model.
The first state (corresponding to the start-of-sentence) is
INITIAL.
A state is left with the probability of the end-of-sentence token
in the respective model, and the next state is chosen according to
the state transition probabilities.
Each state has to emit at least one word.
The actual end-of-sentence is emitted if and only if the
FINAL
state is reached.
Each word probability is conditioned on all preceding words, regardless
of whether they were emitted in the same or a previous state.
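For illustration, a hypothetical graph file with two emitting states might
look as follows (state names, N-gram file names, and transition probabilities
are made up):
    INITIAL . S1 1.0
    S1 first.3gram.lm S2 0.7 FINAL 0.3
    S2 second.3gram.lm FINAL 1.0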
- -vocab file
-
Initialize the vocabulary for the LM from
file.
This is especially useful if the LM itself does not specify a complete
vocabulary, e.g., as with
-null.
- -unk
-
Indicates that the LM contains the unknown word, i.e., is an open-class LM.
- -tolower
-
Map all vocabulary to lowercase.
Useful if case conventions for text/counts and language model differ.
- -mix-lm file
-
Read a second N-gram model for interpolation purposes.
The second and any additional interpolated models can also be class N-grams
(using the same
-classes
definitions), but are otherwise constrained to be standard N-grams, i.e.,
the options
-df,
-tagged,
-skip,
and
-hidden-vocab
do not apply to them.
NOTE:
Unless
-bayes
(see below) is specified,
-mix-lm
triggers a static interpolation of the models in memory.
In most cases a more efficient, dynamic interpolation is sufficient; it is
requested by
-bayes 0.
- -lambda weight
-
Set the weight of the main model when interpolating with
-mix-lm.
Default value is 0.5.
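For example, two models can be interpolated dynamically, with weight 0.7 on
the main model, as follows (file names are illustrative):
    ngram -lm main.lm -mix-lm other.lm -lambda 0.7 -bayes 0 -ppl test.txt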
- -mix-lm2 file
-
- -mix-lm3 file
-
- -mix-lm4 file
-
- -mix-lm5 file
-
Up to 4 more N-gram models can be specified for interpolation.
- -mix-lambda2 weight
-
- -mix-lambda3 weight
-
- -mix-lambda4 weight
-
- -mix-lambda5 weight
-
These are the weights for the additional mixture components, corresponding
to
-mix-lm2
through
-mix-lm5.
The weight for the
-mix-lm
model is 1 minus the sum of
-lambda
and
-mix-lambda2
through
-mix-lambda5.
- -bayes length
-
Interpolate the second and the main model using posterior probabilities
for local N-gram contexts of length
length.
The
-lambda
value is used as a prior mixture weight in this case.
- -bayes-scale scale
-
Set the exponential scale factor on the context likelihood in conjunction
with the
-bayes
function.
Default value is 1.0.
- -cache length
-
Interpolate the main LM (or the one resulting from operations above) with
a unigram cache language model based on a history of
length
words.
- -cache-lambda weight
-
Set interpolation weight for the cache LM.
Default value is 0.05.
- -dynamic
-
Interpolate the main LM (or the one resulting from operations above) with
a dynamically changing LM.
LM changes are indicated by the tag ``<LMstate>'' starting a line in the
input to
-ppl,
followed by a filename containing the new LM.
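For illustration, with -dynamic in effect the input to -ppl might contain
lines such as the following (the LM file name is made up); sentences after
the tag line are scored with the new dynamic component interpolated in:
    <LMstate> topic1.lm
    this sentence is scored with topic1.lm as the dynamic component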
- -dynamic-lambda weight
-
Set interpolation weight for the dynamic LM.
Default value is 0.05.
The following options specify the operations performed on/with the LM
constructed as per the options above.
- -renorm
-
Renormalize the main model by recomputing backoff weights for the given
probabilities.
- -prune threshold
-
Prune N-gram probabilities if their removal causes (training set)
perplexity of the model to increase by less than
threshold
relative.
- -prune-lowprobs
-
Prune N-gram probabilities that are lower than the corresponding
backed-off estimates.
This generates N-gram models that can be correctly
converted into probabilistic finite-state networks.
- -minprune n
-
Only prune N-grams of length at least
n.
The default (and minimum allowed value) is 2, i.e., only unigrams are excluded
from pruning.
This option applies to both
-prune
and
-prune-lowprobs.
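For example, a large model can be pruned and written back out as follows
(file names and threshold are illustrative):
    ngram -lm big.lm.gz -prune 1e-8 -write-lm pruned.lm.gz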
- -write-lm file
-
Write a model back to
file.
The output will be in the same format as read by
-lm,
except if operations such as
-mix-lm
or
-expand-classes
were applied, in which case the output will contain the generated
single N-gram backoff model in ARPA
ngram-format(5).
- -write-vocab file
-
Write the LM's vocabulary to
file.
- -gen number
-
Generate
number
random sentences from the LM.
- -seed value
-
Initialize the random number generator used for sentence generation
using seed
value.
The default is to use a seed that should be close to unique for each
invocation of the program.
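For example, 10 random sentences can be generated reproducibly as follows
(file name and seed value are illustrative):
    ngram -lm trigram.lm.gz -gen 10 -seed 42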
- -ppl textfile
-
Compute sentence scores (log probabilities) and perplexities from
the sentences in
textfile,
which should contain one sentence per line.
The
-debug
option controls the level of detail printed, even though output is
to stdout (not stderr).
At level 0, only summary statistics for the entire corpus are printed.
At level 1, statistics for individual sentences are printed.
At level 2, probabilities for each word, plus LM-dependent details about backoff
used etc., are printed.
At level 3, the probabilities for all words are summed in each context, and
the sum is printed. If this differs significantly from 1, a warning message
to stderr will be issued.
- -nbest file
-
Read an N-best list in
nbest-format(5)
and rerank the hypotheses using the specified LM.
The reordered N-best list is written to stdout.
If the N-best list is given in
``NBestList1.0'' format and contains
composite acoustic/language model scores, then
-decipher-lm
and the recognizer language model and word transition weights (see below)
need to be specified so the original acoustic scores can be recovered.
- -max-nbest n
-
Limits the number of hypotheses read from an N-best list.
Only the first
n
hypotheses are processed.
- -rescore file
-
Similar to
-nbest,
but the input is processed as a stream of N-best hypotheses (without header).
The output consists of the rescored hypotheses in
SRILM format (the third of the formats described in
nbest-format(5)).
- -decipher-lm model-file
-
Designates the N-gram backoff model (typically a bigram) that was used by the
Decipher(TM) recognizer in computing composite scores for the hypotheses fed to
-rescore
or
-nbest.
Used to compute acoustic scores from the composite scores.
- -decipher-order N
-
Specifies the order of the Decipher N-gram model used (default is 2).
- -decipher-nobackoff
-
Indicates that the Decipher N-gram model does not contain backoff nodes,
i.e., all recognizer LM scores are correct up to rounding.
- -decipher-lmw weight
-
Specifies the language model weight used by the recognizer.
Used to compute acoustic scores from the composite scores.
- -decipher-wtw weight
-
Specifies the word transition weight used by the recognizer.
Used to compute acoustic scores from the composite scores.
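For example, an N-best list in ``NBestList1.0'' format with composite
Decipher scores could be reranked with a new LM as follows (file names and
weights are illustrative):
    ngram -nbest hyps.nbest -lm new.3gram.lm -decipher-lm decipher.2gram.lm \
        -decipher-lmw 8.0 -decipher-wtw 0.0 > hyps.rescored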
- -escape string
-
Set an ``escape string'' for the
-ppl
and
-rescore
computations.
Input lines starting with
string
are not processed as sentences, but are passed unchanged to stdout instead.
This allows associated information to be passed to scoring scripts etc.
- -counts countsfile
-
Perform a computation similar to
-ppl,
but based only on the N-gram counts found in
countsfile.
Probabilities are computed for the last word of each N-gram, using the
other words as contexts, and scaling by the associated N-gram count.
- -count-order n
-
Use only counts of order
n
in the
-counts
computation.
The default value is 0, meaning use all counts.
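For example, a perplexity-like statistic can be computed from trigram counts
alone as follows (file names are illustrative):
    ngram -lm trigram.lm.gz -counts test.counts -count-order 3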
- -skipoovs
-
Instruct the LM to skip over contexts that contain out-of-vocabulary
words, instead of using a backoff strategy in these cases.
- -noise noise-tag
-
Designate
noise-tag
as a vocabulary item that is to be ignored by the LM.
(This is typically used to identify a noise marker.)
Note that the LM specified by
-decipher-lm
does NOT ignore this
noise-tag
since the DECIPHER recognizer treats noise as a regular word.
- -noise-vocab file
-
Read several noise tags from
file,
instead of, or in addition to, the single noise tag specified by
-noise.
- -reverse
-
Reverse the words in a sentence for LM scoring purposes.
(This assumes the LM used is a ``right-to-left'' model.)
Note that the LM specified by
-decipher-lm
is always applied to the original, left-to-right word sequence.
SEE ALSO
ngram-count(1), ngram-class(1), lm-scripts(1), ppl-scripts(1),
pfsg-scripts(1), nbest-scripts(1),
ngram-format(5), nbest-format(5), classes-format(5).
M. Weintraub et al., ``Fast Training and Portability,''
in Research Note No. 1, Center for Language and Speech Processing,
Johns Hopkins University, Baltimore, Feb. 1996.
A. Stolcke, ``Entropy-based Pruning of Backoff Language Models,''
Proc. DARPA Broadcast News Transcription and Understanding Workshop,
270-274, Lansdowne, VA, 1998.
A. Stolcke et al., ``Automatic Detection of Sentence Boundaries and
Disfluencies based on Recognized Words,'' Proc. ICSLP, 2247-2250,
Sydney, 1998.
BUGS
Some LM types (such as Bayes-interpolated LMs) currently do not support the
-write-lm
function.
Sentence generation is slow and takes time proportional to the vocabulary
size.
AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1995-2000 SRI International