ngram-count

NAME

ngram-count - count N-grams and estimate language models

SYNOPSIS

ngram-count [-help] option ...

DESCRIPTION

ngram-count generates and manipulates N-gram counts, and estimates N-gram language models from them. The program first builds an internal N-gram count set, either by reading counts from a file, or by scanning text input. Following that, the resulting counts can be output back to a file or used for building an N-gram language model in ARPA ngram-format(5). Each of these actions is triggered by corresponding options, as described below.

Each filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.
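
For example, assuming a training corpus in a file named corpus.txt (the file names here are purely illustrative), a trigram model can be estimated either directly from text or via an intermediate counts file:

    ngram-count -order 3 -text corpus.txt -lm corpus.lm

    ngram-count -order 3 -text corpus.txt -write corpus.counts
    ngram-count -order 3 -read corpus.counts -lm corpus.lm

The two-step variant is convenient when the same counts are reused for several models. All options used here are described below.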

OPTIONS

-help
Print option summary.
-order n
Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3.
-vocab file
Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts and text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
-write-vocab file
Write the vocabulary built in the counting process to file.
-tagged
Interpret text and N-grams as consisting of word/tag pairs.
-tolower
Map all vocabulary to lowercase.
-memuse
Print memory usage statistics.
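
The vocabulary-related options above combine as in the following sketch (file names are illustrative): counting is restricted to a fixed word list, the text is mapped to lowercase, and the resulting vocabulary and counts are written to files:

    ngram-count -vocab wordlist.txt -tolower -text corpus.txt \
        -write-vocab observed.vocab -write corpus.counts

Words in corpus.txt that are not listed in wordlist.txt are replaced with the unknown-word token, as described under -vocab.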

Counting Options

-text textfile
Generate N-gram counts from textfile. textfile should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
-read countsfile
Read N-gram counts from a file. Each line contains an N-gram of words, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added. Thus several count files can be merged by using cat(1) and feeding the result to ngram-count -read -, as shown in the example at the end of this subsection (but see ngram-merge(1) for merging counts that exceed available memory). Counts collected by -text and -read are additive as well.
-write file
Write total counts to file.
-write-order n
Order of counts to write. The default is 0, which stands for N-grams of all lengths.
-writen file
where n is 1, 2, 3, 4, 5, or 6. Writes only counts of the indicated order to file. This is convenient for generating counts of different orders separately in a single pass.
-sort
Output counts in lexicographic order, as required for ngram-merge(1).
-recompute
Regenerate lower-order counts by summing the highest-order counts for each N-gram prefix.
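
A sketch of how these counting options combine in practice, again with illustrative file names: several existing count files can be merged through cat(1) and -read -, and counts of different orders can be written to separate files in a single pass over the text:

    cat part1.counts part2.counts | ngram-count -read - -sort -write merged.counts

    ngram-count -order 3 -text corpus.txt \
        -write1 corpus.1grams -write2 corpus.2grams -write3 corpus.3grams

The -sort option is included so that the merged counts come out in the lexicographic order expected by ngram-merge(1).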

LM Options

-lm lmfile
Estimate a backoff N-gram model from the total counts, and write it to lmfile in ngram-format(5).
-float-counts
Enable manipulation of fractional counts. Only certain discounting methods support non-integer counts.
-skip
Estimate a ``skip'' N-gram model, which predicts a word by an interpolation of the immediate context and the context one word prior. This also triggers N-gram counts to be generated that are one word longer than the indicated order. The following four options control the EM estimation algorithm used for skip-N-grams.
-init-lm lmfile
Load an LM to initialize the parameters of the skip-N-gram.
-skip-init value
The initial skip probability for all words.
-em-iters n
The maximum number of EM iterations.
-em-delta d
The convergence criterion for EM: if the relative change in log likelihood falls below the given value, iteration stops.
-unk
Build an ``open vocabulary'' LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.
-trust-totals
Force the lower-order counts to be used as total counts in estimating N-gram probabilities. Usually these totals are recomputed from the higher-order counts.
-prune threshold
Prune N-gram probabilities if their removal causes (training set) perplexity of the model to increase by less than threshold relative.
-minprune n
Only prune N-grams of length at least n. The default (and minimum allowed value) is 2, i.e., only unigrams are excluded from pruning.
-debug level
Set the debugging output level for the estimated LM to level. Level 0 means no debugging output. Debugging messages are written to stderr.
-gtnmin count
where n is 1, 2, 3, 4, 5, or 6. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. NOTE: This option affects not only the default Good-Turing discounting but the alternative discounting methods described below as well.
-gtnmax count
where n is 1, 2, 3, 4, 5, or 6. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that will receive maximum likelihood estimates. Discounting can be effectively disabled by setting this to 0.
-gtn gtfile
where n is 1, 2, 3, 4, 5, or 6. Save or retrieve Good-Turing parameters (cutoffs and discounting factors) in/from gtfile. This is useful because GT parameters should always be determined from unlimited-vocabulary counts, whereas the eventual LM may use a limited vocabulary. The parameter files may also be hand-edited. If an -lm option is specified, the GT parameters are read from gtfile; otherwise they are computed from the current counts and saved in gtfile. (See the example at the end of this subsection.)
-cdiscountn discount
where n is 1, 2, 3, 4, 5, or 6. Use Ney's absolute discounting for N-grams of order n, using discount as the constant to subtract.
-wbdiscountn
where n is 1, 2, 3, 4, 5, or 6. Use Witten-Bell discounting for N-grams of order n. (This is the estimator where the first occurrence of each word is taken to be a sample for the ``unseen'' event.)
-ndiscountn
where n is 1, 2, 3, 4, 5, or 6. Use Ristad's natural discounting law for N-grams of order n.
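
To illustrate the discounting options (file names are again illustrative), the following sketch first computes Good-Turing parameters from unrestricted-vocabulary counts and saves them, then reuses them when estimating a limited-vocabulary model, as suggested under -gtn; a Witten-Bell model is shown as an alternative:

    ngram-count -order 3 -read all.counts \
        -gt1 gt1.params -gt2 gt2.params -gt3 gt3.params

    ngram-count -order 3 -read all.counts -vocab wordlist.txt \
        -gt1 gt1.params -gt2 gt2.params -gt3 gt3.params -lm limited.lm

    ngram-count -order 3 -text corpus.txt \
        -wbdiscount1 -wbdiscount2 -wbdiscount3 -lm corpus.wb.lm

In the first command no -lm option is given, so the Good-Turing parameters are computed from the counts and saved; in the second, the presence of -lm causes them to be read back from the parameter files.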

SEE ALSO

ngram-merge(1), ngram(1), ngram-class(1), training-scripts(1), lm-scripts(1), ngram-format(5).
S. M. Katz, ``Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer,'' IEEE Trans. ASSP 35(3), 400-401, 1987.
H. Ney and U. Essen, ``On Smoothing Techniques for Bigram-based Natural Language Modelling,'' Proc. ICASSP, 825-828, 1991.
I. H. Witten and T. C. Bell, ``The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression,'' IEEE Trans. Information Theory 37(4), 1085-1094, 1991.
E. S. Ristad, ``A Natural Law of Succession,'' CS-TR-495-95, Comp. Sci. Dept., Princeton Univ., 1995.

BUGS

Several of the LM types supported by ngram(1) don't have explicit support in ngram-count. Instead, they are built by separately manipulating N-gram counts, followed by standard N-gram model estimation.
LM support for tagged words is incomplete.
Only absolute and Witten-Bell discounting currently support fractional counts.

AUTHOR

Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1995-1999 SRI International