hidden-ngram

NAME

hidden-ngram - tag hidden events between words

SYNOPSIS

hidden-ngram [-help] option ...

DESCRIPTION

hidden-ngram tags a stream of word tokens with hidden events occurring between words. For example, an unsegmented text could be tagged for sentence boundaries (the hidden events in this case being `boundary' and `no-boundary'). The most likely hidden tag sequence consistent with the given word sequence is found according to an N-gram language model over both words and hidden tags.

hidden-ngram is a generalization of segment(1).
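
For example, a sentence-boundary tagging run might look as follows (the file names here are hypothetical, and the language model is assumed to be an N-gram over both words and boundary tags, as described under -lm below):

     hidden-ngram -lm boundary.lm -hidden-vocab boundary.tags -text unsegmented.txt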

OPTIONS

-help
Print option summary.
-text file
Specifies the file containing the word sequences to be tagged (one sentence per line). Start- and end-of-sentence tags are not added by the program, but should be included in the input if the language model uses them.
-escape string
Set an ``escape string.'' Input lines starting with string are not processed and are instead passed unchanged to stdout. This allows associated information to be passed through to scoring scripts, etc.
-text-map file
Read the input words from a map file containing both the words and additional likelihoods of hidden events following each word. Each line contains one input word, plus optional hidden-event/likelihood pairs in the format
w e1 [p1] e2 [p2] ...
If a p value is omitted a likelihood of 1 is assumed. All events not explicitly listed are given likelihood 0, and are hence excluded for that word. In particular, the label *noevent* must be listed to allow absence of a hidden event. Input word strings are assembled from multiple lines of -text-map input until either an end-of-sentence token </s> is found, or an escaped line (see -escape) is encountered.
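For illustration, a hypothetical -text-map fragment for a sentence-boundary task might look as follows (the event name BOUNDARY is made up; *noevent* is the literal label described above):
     this *noevent*
     is *noevent*
     it *noevent* 0.3 BOUNDARY 0.7
     </s>
Here the first two words allow only the absence of an event, the word it is followed by a boundary with likelihood 0.7 (and by no event with likelihood 0.3), and the </s> token ends the input sentence.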
-logmap
Interpret numeric values in the -text-map file as log probabilities, rather than probabilities.
-lm file
Specifies the word/tag language model as a standard ARPA N-gram backoff model file in ngram-format(5).
-order n
Set the effective N-gram order used by the language model to n. Default is 3 (use a trigram model).
-classes file
Interpret the LM as an N-gram over word classes. The expansions of the classes are given in file in classes-format(5). Tokens in the LM that are not defined as classes in file are assumed to be plain words, so that the LM can contain mixed N-grams over both words and word classes.
-simple-classes
Assume a "simple" class model: each word is a member of at most one word class, and class expansions are exactly one word long.
-mix-lm file
Read a second N-gram model for interpolation purposes. The second and any additional interpolated models can also be class N-grams (using the same -classes definitions).
-lambda weight
Set the weight of the main model when interpolating with -mix-lm. Default value is 0.5.
-mix-lm2 file
-mix-lm3 file
-mix-lm4 file
-mix-lm5 file
-mix-lm6 file
-mix-lm7 file
-mix-lm8 file
-mix-lm9 file
Together with -mix-lm, up to 9 additional N-gram models can be specified for interpolation.
-mix-lambda2 weight
-mix-lambda3 weight
-mix-lambda4 weight
-mix-lambda5 weight
-mix-lambda6 weight
-mix-lambda7 weight
-mix-lambda8 weight
-mix-lambda9 weight
These are the weights for the additional mixture components, corresponding to -mix-lm2 through -mix-lm9. The weight for the -mix-lm model is 1 minus the sum of -lambda and -mix-lambda2 through -mix-lambda9.
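For example, a three-way interpolation might be specified as follows (model file names hypothetical); with -lambda 0.5 and -mix-lambda2 0.2, the -mix-lm model receives the remaining weight 1 - 0.5 - 0.2 = 0.3:
     hidden-ngram -lm main.lm -mix-lm second.lm -mix-lm2 third.lm \
          -lambda 0.5 -mix-lambda2 0.2 -text input.txt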
-lmw W
Scales the language model probabilities by a factor W. Default language model weight is 1.
-mapw W
Scales the likelihood map probability by a factor W. Default map weight is 1.
-tolower
Map vocabulary to lowercase, removing case distinctions.
-hidden-vocab file
Read the list of hidden tags from file. Note: This is a subset of the vocabulary contained in the language model.
-force-event
Forces a non-default event after every word. This is useful for language models that represent the default event explicitly with a tag, rather than implicitly by the absence of a tag between words (which is the default).
-keep-unk
Do not map unknown input words to the <unk> token. Instead, output the input word unchanged. Also, with this option the LM is assumed to be open-vocabulary (the default is closed-vocabulary).
-fb
Perform forward-backward decoding of the input token sequence. Outputs the tags that have the highest posterior probability, for each position. The default is to use Viterbi decoding, i.e., the output is the tag sequence with the highest joint posterior probability.
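A hypothetical invocation using forward-backward decoding instead of the Viterbi default (file names made up):
     hidden-ngram -lm boundary.lm -text input.txt -fb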
-fw-only
Similar to -fb, but uses only the forward probabilities for computing posteriors. This may be used to simulate on-line prediction of tags, without the benefit of future context.
-continuous
Process all words in the input as one sequence of words, irrespective of line breaks. Normally each line is processed separately as a sentence. Input tokens are output one-per-line, followed by event tokens.
-posteriors
Output the table of posterior probabilities for each tag position. If -fb is also specified the posterior probabilities will be computed using forward-backward probabilities; otherwise an approximation will be used that is based on the probability of the most likely path containing a given tag at given position.
-totals
Output the total string probability for each input sentence. If -fb is also specified this probability is obtained by summing over all hidden event sequences; otherwise it is calculated (i.e., underestimated) using only the most probable hidden event sequence.
-nbest N
Output the N best hypotheses instead of just the first best when doing Viterbi search. If N>1, each hypothesis is prefixed by the tag NBEST_n_x, where n is the rank of the hypothesis in the N-best list and x is its score, the negative log of the combined probability of transitions and observations of the corresponding HMM path.
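Schematically, with -nbest 3 each hypothesis line would begin with such a tag, followed by the tagged word sequence (the ranks and scores below are purely illustrative):
     NBEST_0_-1053.21 ...
     NBEST_1_-1055.87 ...
     NBEST_2_-1060.04 ...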
-write-counts file
Write the posterior weighted counts of n-grams, including those with hidden tags, summed over the entire input data, to file. The posterior probabilities should normally be computed with the forward-backward algorithm (instead of Viterbi), so the -fb option is usually also specified. Only n-grams whose contexts occur in the language model are output.
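A typical invocation would therefore combine this option with -fb (file names hypothetical):
     hidden-ngram -lm boundary.lm -text corpus.txt -fb -write-counts hidden.counts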
-unk-prob L
Specifies that unknown words and other words having zero probability in the language model be assigned a log probability of L. This is -100 by default but might be set to 0, e.g., to compute perplexities excluding unknown words.
-debug
Sets debugging output level.

Each filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.
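
For example, the following two commands should be equivalent, the first reading the compressed file directly and the second reading it via stdin (file names hypothetical):

     hidden-ngram -lm lm.gz -text input.txt.gz
     gunzip -c input.txt.gz | hidden-ngram -lm lm.gz -text -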

BUGS

The -continuous and -text-map options effectively disable -keep-unk, i.e., unknown input words are always mapped to <unk>. Also, -continuous does not preserve the positions of escaped input lines relative to the input.
The dynamic programming for event decoding is not efficiently interleaved with that required to evaluate class N-gram models; therefore, the state space generated in decoding with -classes quickly becomes infeasibly large unless -simple-classes is also specified.

SEE ALSO

ngram-count(1), disambig(1), segment(1), ngram-format(5), classes-format(5).
A. Stolcke et al., ``Automatic Detection of Sentence Boundaries and Disfluencies based on Recognized Words,'' Proc. ICSLP, pp. 2247-2250, Sydney, 1998.

AUTHORS

Andreas Stolcke <stolcke@speech.sri.com>,
Anand Venkataraman <anand@speech.sri.com>.
Copyright 1998-2002 SRI International