hidden-ngram
NAME
hidden-ngram - tag hidden events between words
SYNOPSIS
hidden-ngram
[-help]
option
...
DESCRIPTION
hidden-ngram
tags a stream of word tokens with hidden events occurring between words.
For example, an unsegmented text could be tagged for sentence boundaries
(the hidden events in this case being `boundary' and `no-boundary').
The most likely hidden tag sequence consistent with the given word
sequence is found according to an N-gram language model over both
words and hidden tags.
hidden-ngram
is a generalization of
segment(1).
OPTIONS
- -help
-
Print option summary.
- -text file
-
Specifies the file containing the word sequences to be tagged
(one sentence per line).
- -escape string
-
Set an ``escape string.''
Input lines starting with
string
are not processed and passed unchanged to stdout instead.
This allows associated information to be passed to scoring scripts etc.
- -text-map file
-
Read the input words from a map file containing both the words and
additional likelihoods of events following each word.
Each line contains one input word, plus optional hidden-event/likelihood
pairs in the format
w e1 [p1] e2 [p2] ...
If a p value is omitted, a likelihood of 1 is assumed.
All events not explicitly listed are given likelihood 0, and are
hence excluded for that word.
In particular, the label
*noevent*
must be listed to allow absence of a hidden event.
Input word strings are assembled from multiple lines of
-text-map
input until either an end-of-sentence token </s> is found, or an escaped
line (see
-escape)
is encountered.
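As an illustration of the map format described above, the following Python sketch parses one
-text-map
line into a word and its event likelihoods.
It is not part of hidden-ngram: the function name is hypothetical, the tag SB is an invented example tag, and the sketch assumes that event names never look like numbers, so that a numeric token following an event can be read as its likelihood.

```python
def parse_text_map_line(line):
    """Parse one 'w e1 [p1] e2 [p2] ...' line into (word, {event: likelihood}).

    Assumption: a token that parses as a float directly after an event
    name is that event's likelihood; event names are never numeric.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty map line")
    word, rest = tokens[0], tokens[1:]
    events = {}
    i = 0
    while i < len(rest):
        event = rest[i]
        likelihood = 1.0  # an omitted p value means likelihood 1
        if i + 1 < len(rest):
            try:
                likelihood = float(rest[i + 1])
                i += 1  # consume the numeric token
            except ValueError:
                pass  # next token is another event name
        events[event] = likelihood
        i += 1
    return word, events


# "SB" is a hypothetical hidden tag used only for this example.
word, events = parse_text_map_line("hello *noevent* 0.7 SB 0.3")
# Events that are not listed get likelihood 0 for this word.
missing = events.get("OTHER", 0.0)
```

Note that a line listing only SB would exclude the absence of an event after that word, because *noevent* would then have likelihood 0.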
- -logmap
-
Interpret numeric values in the
-text-map
file as log probabilities, rather
than probabilities.
- -lm file
-
Specifies the word/tag language model as a standard ARPA N-gram backoff model
file in
ngram-format(5).
- -order n
-
Set the effective N-gram order used by the language model to
n.
Default is 3 (use a trigram model).
- -lmw W
-
Scales the language model probabilities by a factor
W.
Default language model weight is 1.
- -mapw W
-
Scales the likelihood map probability by a factor
W.
Default map weight is 1.
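One common way to apply such weights is as multipliers on log probabilities, which is equivalent to raising the probabilities to the power W.
The Python sketch below assumes that interpretation and base-10 logarithms; neither is stated in this manual page, and the function name is hypothetical.

```python
import math


def combined_logscore(lm_prob, map_likelihood, lmw=1.0, mapw=1.0):
    """Combine LM and map scores in the log domain.

    Assumption: each weight multiplies the corresponding log10 score.
    """
    return lmw * math.log10(lm_prob) + mapw * math.log10(map_likelihood)


default_score = combined_logscore(0.01, 0.1)          # both weights 1
lm_heavy_score = combined_logscore(0.01, 0.1, lmw=2)  # LM counts double
```

Under this reading, raising one weight relative to the other shifts the decision toward that knowledge source without changing the other's scores.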
- -tolower
-
Map vocabulary to lowercase, removing case distinctions.
- -hidden-vocab file
-
Read the list of hidden tags from
file.
Note: This is a subset of the vocabulary contained in the language model.
- -force-event
-
Forces a non-default event after every word.
This is useful for language models that represent the default event
explicitly with a tag, rather than implicitly by the absence of a tag
between words (which is the default).
- -keep-unk
-
Do not map unknown input words to the <unk> token.
Instead, output the input word unchanged.
- -fb
-
Perform forward-backward decoding of the input token sequence.
Outputs, for each position, the tag that has the highest posterior
probability.
The default is to use Viterbi decoding, i.e., the output is the
tag sequence with the highest joint posterior probability.
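The two decoding criteria can give different answers.
The Python sketch below shows this on a toy first-order tag model; it is a generic illustration, not hidden-ngram's implementation, which works over an N-gram model of words and tags.
All names and numbers are invented, and SB is a hypothetical tag.

```python
def viterbi(init, trans, emit, n):
    """Return the single tag sequence with the highest joint probability."""
    tags = list(init)
    best = {t: (init[t] * emit[0][t], [t]) for t in tags}
    for pos in range(1, n):
        new = {}
        for t in tags:
            # Best predecessor for tag t at this position.
            p, path = max(
                ((best[s][0] * trans[s][t], best[s][1]) for s in tags),
                key=lambda x: x[0],
            )
            new[t] = (p * emit[pos][t], path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]


def forward_backward(init, trans, emit, n):
    """Return, for each position, the tag with the highest posterior."""
    tags = list(init)
    fwd = [{t: init[t] * emit[0][t] for t in tags}]
    for pos in range(1, n):
        prev = fwd[-1]
        fwd.append(
            {t: emit[pos][t] * sum(prev[s] * trans[s][t] for s in tags) for t in tags}
        )
    bwd = [{t: 1.0 for t in tags}]
    for pos in range(n - 1, 0, -1):
        nxt = bwd[0]
        bwd.insert(
            0, {s: sum(trans[s][t] * emit[pos][t] * nxt[t] for t in tags) for s in tags}
        )
    # Posterior at each position is proportional to forward * backward.
    return [max(tags, key=lambda t: fwd[pos][t] * bwd[pos][t]) for pos in range(n)]


# Toy model: two positions, two tags, flat per-position likelihoods.
init = {"*noevent*": 0.6, "SB": 0.4}
trans = {"*noevent*": {"*noevent*": 0.5, "SB": 0.5}, "SB": {"*noevent*": 1.0, "SB": 0.0}}
emit = [{"*noevent*": 1.0, "SB": 1.0}] * 2

viterbi_tags = viterbi(init, trans, emit, 2)      # ['SB', '*noevent*']
fb_tags = forward_backward(init, trans, emit, 2)  # ['*noevent*', '*noevent*']
```

Here the best single path starts with SB (probability 0.4), yet *noevent* has the higher posterior at the first position (0.6), because its probability mass is split across two paths.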
- -continuous
-
Process all words in the input as one sequence of words, irrespective of
line breaks.
Normally each line is processed separately as a sentence.
Input tokens are output one-per-line, followed by event tokens.
- -posteriors
-
Output the table of posterior probabilities for each
tag position.
If
-fb
is also specified the posterior probabilities will be computed using
forward-backward probabilities; otherwise an approximation will be used
that is based on the probability of the most likely path containing
a given tag at a given position.
- -totals
-
Output the total string probability for each input sentence.
If
-fb
is also specified this probability is obtained by summing over all
hidden event sequences; otherwise it is calculated (i.e., underestimated)
using the most probable hidden event sequence.
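To see why the single-path value underestimates the total, the Python sketch below enumerates every tag sequence of a toy model by brute force and compares the sum with the maximum.
The model, its numbers, and the tag SB are invented for illustration; this is not how hidden-ngram computes the totals.

```python
from itertools import product


def path_prob(path, init, trans):
    """Joint probability of one tag path under a toy first-order model."""
    p = init[path[0]]
    for prev, cur in zip(path, path[1:]):
        p *= trans[prev][cur]
    return p


def totals(init, trans, n):
    """Return (sum over all paths, probability of the best single path)."""
    probs = [path_prob(path, init, trans) for path in product(init, repeat=n)]
    return sum(probs), max(probs)


# Initial scores already include the (invented) word probabilities,
# so they need not sum to 1.
init = {"*noevent*": 0.06, "SB": 0.04}
trans = {"*noevent*": {"*noevent*": 0.5, "SB": 0.5}, "SB": {"*noevent*": 1.0, "SB": 0.0}}

total, best = totals(init, trans, 2)  # roughly 0.1 versus 0.04
```

The maximum can never exceed the sum, since it is one of the non-negative terms being summed.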
- -write-counts file
-
Write the posterior weighted counts of n-grams, including those
with hidden tags, summed over the entire input data, to
file.
The posterior probabilities should normally be computed with the
forward-backward algorithm (instead of Viterbi), so the
-fb
option is usually also specified.
Only n-grams whose contexts occur in the language model are output.
- -unk-prob L
-
Specifies that unknown words and other words having zero probability in
the language model be assigned a log probability of
L.
This is -100 by default but might be set to 0, e.g., to compute
perplexities excluding unknown words.
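The effect of this setting on a sentence score can be sketched in Python as follows.
The function name is hypothetical, and None is used here only as a marker for a word that has zero probability in the language model.

```python
def total_logprob(word_logprobs, unk_logprob=-100.0):
    """Sum per-word log probabilities.

    None marks a word with zero LM probability; it is scored with
    unk_logprob instead (default -100, as for -unk-prob).
    """
    return sum(unk_logprob if lp is None else lp for lp in word_logprobs)


scores = [-1.5, None, -2.0]                      # middle word is unknown
penalized = total_logprob(scores)                # -103.5
excluded = total_logprob(scores, unk_logprob=0)  # -3.5
```

With a value of 0 the unknown word contributes nothing to the total log probability, which is what makes it drop out of a perplexity computed from that total.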
- -debug
-
Sets debugging output level.
Each filename argument can be an ASCII file, or a compressed
file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
BUGS
The
-continuous
option effectively disables
-keep-unk,
i.e., unknown input words are always mapped to <unk>.
Also,
-continuous
doesn't preserve the positions of escaped input lines relative to
the input.
SEE ALSO
ngram-count(1), disambig(1), segment(1), ngram-format(5).
A. Stolcke et al., ``Automatic Detection of Sentence Boundaries and
Disfluencies based on Recognized Words,''
Proc. ICSLP, 2247-2250, Sydney.
AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1998-1999 SRI International