ngram-count

NAME

ngram-count - count N-grams and estimate language models

SYNOPSIS

ngram-count [-help] option ...

DESCRIPTION

ngram-count generates and manipulates N-gram counts, and estimates N-gram language models from them. The program first builds an internal N-gram count set, either by reading counts from a file, or by scanning text input. Following that, the resulting counts can be output back to a file or used for building an N-gram language model in ARPA ngram-format(5). Each of these actions is triggered by corresponding options, as described below.
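
For example, counting a text corpus and estimating a trigram model from it can be combined in a single invocation (corpus.txt and corpus.lm are illustrative file names):
ngram-count -order 3 -text corpus.txt -lm corpus.lm
The intermediate counts can also be saved with -write and reloaded later with -read.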

OPTIONS

Each filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

-help
Print option summary.
-version
Print version information.
-order n
Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3.
-vocab file
Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts and text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
-write-vocab file
Write the vocabulary built in the counting process to file.
-tagged
Interpret text and N-grams as consisting of word/tag pairs.
-tolower
Map all vocabulary to lowercase.
-memuse
Print memory usage statistics.

Counting Options

-text textfile
Generate N-gram counts from text file. textfile should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
-read countsfile
Read N-gram counts from a file. Each line contains an N-gram of words, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added. Thus several count files can be merged by using cat(1) and feeding the result to ngram-count -read - (but see ngram-merge(1) for merging counts that exceed available memory). Counts collected by -text and -read are additive as well.
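For example, counts from two files (part1.counts and part2.counts are illustrative names) can be merged as follows, assuming the combined counts fit in memory:
cat part1.counts part2.counts | ngram-count -read - -write merged.counts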
-write file
Write total counts to file.
-write-order n
Order of counts to write. The default is 0, which stands for N-grams of all lengths.
-writen file
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Writes only counts of the indicated order to file. This is convenient for generating counts of different orders separately in a single pass.
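For example, a single pass over a corpus might produce separate count files by order (file names are illustrative):
ngram-count -order 3 -text corpus.txt -write1 corpus.1cnt -write2 corpus.2cnt -write3 corpus.3cnt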
-sort
Output counts in lexicographic order, as required for ngram-merge(1).
-recompute
Regenerate lower-order counts by summing the highest-order counts for each N-gram prefix.

LM Options

-lm lmfile
Estimate a backoff N-gram model from the total counts, and write it to lmfile in ngram-format(5).
-nonevents file
Read a list of words from file that are to be considered non-events, i.e., that can only occur in the context of an N-gram. Such words are given zero probability mass in model estimation.
-float-counts
Enable manipulation of fractional counts. Only certain discounting methods support non-integer counts.
-skip
Estimate a ``skip'' N-gram model, which predicts a word by an interpolation of the immediate context and the context one word prior. This also triggers N-gram counts to be generated that are one word longer than the indicated order. The following four options control the EM estimation algorithm used for skip-N-grams; a combined example is shown after them.
-init-lm lmfile
Load an LM to initialize the parameters of the skip-N-gram.
-skip-init value
The initial skip probability for all words.
-em-iters n
The maximum number of EM iterations.
-em-delta d
The convergence criterion for EM: if the relative change in log likelihood falls below the given value, iteration stops.
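A combined example of the skip-N-gram options above (all file names and numeric values are merely illustrative):
ngram-count -order 3 -text corpus.txt -skip -init-lm init.lm -skip-init 0.5 -em-iters 20 -em-delta 0.0001 -lm skip.lm
Here init.lm would be a previously estimated conventional LM used to initialize the EM algorithm.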
-unk
Build an ``open vocabulary'' LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.
-map-unk word
Map out-of-vocabulary words to word, rather than the default <unk> tag.
-trust-totals
Force the lower-order counts to be used as total counts in estimating N-gram probabilities. Usually these totals are recomputed from the higher-order counts.
-prune threshold
Prune N-gram probabilities if their removal causes the (training set) perplexity of the model to increase by less than threshold, interpreted as a relative change.
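For example, a previously written count file might be turned into a pruned model as follows (the threshold value and file names are illustrative):
ngram-count -order 3 -read corpus.counts -prune 1e-8 -lm corpus.pruned.lm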
-minprune n
Only prune N-grams of length at least n. The default (and minimum allowed value) is 2, i.e., only unigrams are excluded from pruning.
-debug level
Set debugging output from estimated LM at level. Level 0 means no debugging. Debugging messages are written to stderr.
-gtnmin count
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. If n is omitted the parameter for N-grams of order > 9 is set.
NOTE: This option affects not only the default Good-Turing discounting but the alternative discounting methods described below as well.
-gtnmax count
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that will receive maximum likelihood estimates. Discounting can be effectively disabled by setting this to 0. If n is omitted the parameter for N-grams of order > 9 is set.
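For example, Good-Turing cutoffs might be set per order as follows (the cutoff values and file names are illustrative):
ngram-count -order 3 -read corpus.counts -gt1min 1 -gt2min 1 -gt3min 2 -gt3max 7 -lm corpus.gt.lm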

In the following discounting parameter options, the order n may be omitted, in which case a default for all N-gram orders is set. The corresponding discounting method then becomes the default method for all orders, unless specifically overridden by an option with n. If no discounting method is specified, Good-Turing is used.

-gtn gtfile
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Save or retrieve Good-Turing parameters (cutoffs and discounting factors) in/from gtfile. This is useful as GT parameters should always be determined from unlimited vocabulary counts, whereas the eventual LM may use a limited vocabulary. The parameter files may also be hand-edited. If an -lm option is specified the GT parameters are read from gtfile, otherwise they are computed from the current counts and saved in gtfile.
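For example, a two-pass sketch in which GT parameters are first computed from full-vocabulary counts and then reused for a limited-vocabulary model (all file names are illustrative):
ngram-count -order 3 -read full.counts -gt1 gt1.params -gt2 gt2.params -gt3 gt3.params
ngram-count -order 3 -read full.counts -vocab wordlist.txt -gt1 gt1.params -gt2 gt2.params -gt3 gt3.params -lm limited.lm
The first run has no -lm option, so the parameters are computed and saved; the second run reads them back because -lm is present.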
-cdiscountn discount
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Ney's absolute discounting for N-grams of order n, using discount as the constant to subtract.
-wbdiscountn
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Witten-Bell discounting for N-grams of order n. (This is the estimator where the first occurrence of each word is taken to be a sample for the ``unseen'' event.)
-ndiscountn
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Ristad's natural discounting law for N-grams of order n.
-kndiscountn
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Chen and Goodman's modified Kneser-Ney discounting for N-grams of order n.
-kn-counts-modified
Indicates that input counts have already been modified for Kneser-Ney smoothing. If this option is not given, the KN discounting method modifies counts (except those of highest order) in order to estimate the backoff distributions. When using the -write and related options the output will reflect the modified counts.
-kn-modify-counts-at-end
Modify Kneser-Ney counts after estimating discounting constants, rather than before as is the default.
-knn knfile
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Save or retrieve Kneser-Ney parameters (cutoff and discounting constants) in/from knfile. This is useful as smoothing parameters should always be determined from unlimited vocabulary counts, whereas the eventual LM may use a limited vocabulary. The parameter files may also be hand-edited. If an -lm option is specified the KN parameters are read from knfile, otherwise they are computed from the current counts and saved in knfile.
-ukndiscountn
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use the original (unmodified) Kneser-Ney discounting method for N-grams of order n.

As noted above, if the parameter n is omitted in these discounting options, the corresponding method becomes the default for all N-gram orders.

-interpolaten
where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Causes the discounted N-gram probability estimates at the specified order n to be interpolated with lower-order estimates. (The result of the interpolation is encoded as a standard backoff model and can be evaluated as such -- the interpolation happens at estimation time.) This sometimes yields better models with some smoothing methods (see Chen & Goodman, 1998). Only Witten-Bell, absolute discounting, and modified Kneser-Ney smoothing currently support interpolation.
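For example, an interpolated modified Kneser-Ney trigram model might be estimated as follows (file names are illustrative; as described above, the forms without n set the default for all orders):
ngram-count -order 3 -text corpus.txt -kndiscount -interpolate -lm corpus.kn.lm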
-meta-tag string
Interpret words starting with string as count-of-count (meta-count) tags. For example, an N-gram
a b string3 4
means that there were 4 trigrams starting with "a b" that occurred 3 times each (here string3 stands for the meta-tag string immediately followed by the count 3). Meta-tags are only allowed in the last position of an N-gram.
Note: when using -tolower the meta-tag string must not contain any uppercase characters.
-read-with-mincounts
Save memory by eliminating N-grams with counts that fall below the thresholds set by the -gtnmin options during the -read operation (this assumes the input counts contain no duplicate N-grams). Also, if -meta-tag is defined, these low-count N-grams will be converted to count-of-count N-grams, so that smoothing methods that need this information still work correctly.
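For example, a large count file might be read with low-count N-grams collapsed into meta-counts, while discounting parameters estimated earlier from the full counts are read from files (the tag string, thresholds, and file names are all illustrative):
ngram-count -order 3 -read big.counts -read-with-mincounts -gt2min 2 -gt3min 2 -meta-tag __meta__ -kndiscount -interpolate -kn1 kn1.params -kn2 kn2.params -kn3 kn3.params -lm big.kn.lm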

SEE ALSO

ngram-merge(1), ngram(1), ngram-class(1), training-scripts(1), lm-scripts(1), ngram-format(5).
S. M. Katz, ``Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer,'' IEEE Trans. ASSP 35(3), 400-401, 1987.
H. Ney and U. Essen, ``On Smoothing Techniques for Bigram-based Natural Language Modelling,'' Proc. ICASSP, 825-828, 1991.
I. H. Witten and T. C. Bell, ``The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression,'' IEEE Trans. Information Theory 37(4), 1085-1094, 1991.
E. S. Ristad, ``A Natural Law of Succession,'' CS-TR-495-95, Comp. Sci. Dept., Princeton Univ., 1995.
R. Kneser and H. Ney, ``Improved backing-off for M-gram language modeling,'' Proc. ICASSP, 181-184, 1995.
S. F. Chen and J. Goodman, ``An Empirical Study of Smoothing Techniques for Language Modeling,'' TR-10-98, Computer Science Group, Harvard Univ., 1998.

BUGS

Several of the LM types supported by ngram(1) don't have explicit support in ngram-count. Instead, they are built by separately manipulating N-gram counts, followed by standard N-gram model estimation.
LM support for tagged words is incomplete.
Only absolute and Witten-Bell discounting currently support fractional counts.
The combination of -read-with-mincounts and -meta-tag preserves enough count-of-count information for applying discounting parameters to the input counts, but it does not necessarily allow the parameters to be correctly estimated. Therefore, discounting parameters should always be estimated from full counts (e.g., using the helper training-scripts(1)), and then read from files.

AUTHOR

Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1995-2004 SRI International