disambig

NAME

disambig - disambiguate text tokens using an N-gram model

SYNOPSIS

disambig [-help] option ...

DESCRIPTION

disambig translates a stream of tokens from a vocabulary V1 to a corresponding stream of tokens from a vocabulary V2, according to a probabilistic, 1-to-many mapping. Ambiguities in the mapping are resolved by finding the V2 sequence with the highest posterior probability given the V1 sequence. This probability is computed from pairwise conditional probabilities P(V1|V2), as well as a language model for sequences over V2.
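The decoding rule described above can be illustrated on a single ambiguous token. This is a toy sketch with invented words and probabilities, not SRILM code: the chosen V2 token maximizes the product of the map probability P(v1|v2) and the language model probability of v2 in context.

```python
# Toy illustration (made-up numbers) of the disambiguation rule:
# pick the V2 token maximizing P(v1 | v2) * P_lm(v2 | context).
cond = {"bank_river": 0.3, "bank_money": 0.7}   # P(v1="bank" | v2), from the map
lm   = {"bank_river": 0.6, "bank_money": 0.1}   # P_lm(v2 | context), from the LM

best = max(cond, key=lambda v2: cond[v2] * lm[v2])
assert best == "bank_river"   # 0.3*0.6 = 0.18 beats 0.7*0.1 = 0.07
```

Even though "bank_money" has the higher map probability, the language model context tips the decision the other way, which is exactly the ambiguity resolution the tool performs over whole sequences.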

OPTIONS

Each filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

-help
Print option summary.
-version
Print version information.
-text file
Specifies the file containing the V1 sentences.
-map file
Specifies the file containing the V1-to-V2 mapping information. Each line of file contains the mapping for a single word in V1:
w1 w21 [p21] w22 [p22] ...
where w1 is a word from V1, which has possible mappings w21, w22, ... from V2. Optionally, each of these can be followed by a numeric string for the probability p21, which defaults to 1. The number is used as the conditional probability P(w1|w21), but the program does not depend on these numbers being properly normalized.
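A hypothetical -map file following this format might look as follows (words and probabilities are invented for illustration; fields are whitespace-separated):

```
bank    bank_river 0.3    bank_money 0.7
saw     saw_noun 0.2      saw_verb 0.8
the     the_det
```

Here "the" has a single unambiguous mapping, so its probability defaults to 1.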
-text-map file
Processes a combined text/map file. The format of file is the same as for -map, except that the w1 field on each line is interpreted as a word token rather than a word type. Hence, the V1 text input consists of all words in column 1 of file in order of appearance. This is convenient if different instances of a word have different mappings. There is no implicit insertion of begin/end sentence tokens in this mode. Sentence boundaries should be indicated explicitly by lines of the form
</s> </s>
<s> <s>
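A hypothetical -text-map fragment for a short sentence might look as follows (again with invented words and probabilities). Each line is one token instance, and the sentence boundaries are written out explicitly:

```
<s>     <s>
the     the_det
saw     saw_noun 0.2    saw_verb 0.8
</s>    </s>
```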
-classes file
Specifies the V1-to-V2 mapping information in classes-format(5). Class labels are interpreted as V2 words, and expansions as V1 words. Multi-word expansions are not allowed.
-scale
Interpret the numbers in the mapping as P(w21|w1). This is done by dividing probabilities by the unigram probabilities of w21, obtained from the V2 language model.
-logmap
Interpret numeric values in map file as log probabilities, not probabilities.
-lm file
Specifies the V2 language model as a standard ARPA N-gram backoff model file; see ngram-format(5). The default is not to use a language model, i.e., to choose V2 tokens based only on the probabilities in the map file.
-order n
Set the effective N-gram order used by the language model to n. Default is 2 (use a bigram model).
-lmw W
Scales the language model probabilities by a factor W. Default language model weight is 1.
-mapw W
Scales the likelihood map probability by a factor W. Default map weight is 1. Note: for Viterbi decoding (the default), using -lmw W is equivalent to using -mapw 1/W; for forward-backward computation it is not.
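The equivalence noted above can be checked on a toy example (made-up path scores, not SRILM code). In the log domain a path's score is lmw * log P_lm + mapw * log P_map; scaling both weights by the same constant scales every path's score equally, so the Viterbi argmax is unchanged, but normalized posteriors (as used in forward-backward computation) are not.

```python
# Sketch (hypothetical numbers): -lmw W vs. -mapw 1/W.
import math

# toy log-probabilities for two candidate paths (invented values)
paths = {
    "path_a": {"log_lm": math.log(0.4), "log_map": math.log(0.2)},
    "path_b": {"log_lm": math.log(0.1), "log_map": math.log(0.7)},
}

def best_path(lmw, mapw):
    # Viterbi picks the single path with the highest weighted score.
    return max(paths, key=lambda p: lmw * paths[p]["log_lm"] + mapw * paths[p]["log_map"])

def posterior(p, lmw, mapw):
    # Forward-backward style quantity: normalized weighted probability mass.
    scores = {q: math.exp(lmw * d["log_lm"] + mapw * d["log_map"])
              for q, d in paths.items()}
    return scores[p] / sum(scores.values())

W = 3.0
assert best_path(W, 1.0) == best_path(1.0, 1.0 / W)        # same Viterbi answer
# ...but the posteriors differ, so forward-backward results change:
assert abs(posterior("path_a", W, 1.0) - posterior("path_a", 1.0, 1.0 / W)) > 0.05
```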
-tolower1
Map input vocabulary (V1) to lowercase, removing case distinctions.
-tolower2
Map output vocabulary (V2) to lowercase, removing case distinctions.
-keep-unk
Do not map unknown input words to the <unk> token; instead, output the input word unchanged. This is like having an implicit default mapping for unknown words to themselves, except that the word will still be treated as <unk> by the language model. Also, with this option the LM is assumed to be open-vocabulary (the default is closed-vocabulary).
-no-eos
Do not assume that each input line contains a complete sentence. This prevents end-of-sentence tokens </s> from being appended automatically.
-continuous
Process all words in the input as one sequence of words, irrespective of line breaks. Normally each line is processed separately as a sentence. V2 tokens are output one-per-line. This option also prevents sentence start/end tokens (<s> and </s>) from being added to the input.
-fb
Perform forward-backward decoding of the input (V1) token sequence: for each position, output the V2 token with the highest posterior probability. The default is Viterbi decoding, i.e., the output is the V2 sequence with the highest joint probability.
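The difference between the two decoding modes can be shown on a tiny two-state example (invented numbers, brute-force computation rather than SRILM's dynamic programming): the best single sequence (Viterbi) need not agree with the per-position posterior maxima (forward-backward).

```python
# Minimal sketch (made-up parameters): Viterbi vs. forward-backward decoding.
import itertools

states = ["A", "B"]                        # stand-ins for V2 tokens
init   = {"A": 0.5, "B": 0.5}
trans  = {("A","A"): 0.9, ("A","B"): 0.1,
          ("B","A"): 0.5, ("B","B"): 0.5}
emit   = [{"A": 0.4, "B": 0.6},            # per-position P(v1 | v2)
          {"A": 0.5, "B": 0.5}]

def joint(path):
    # joint probability of one V2 path together with the observations
    p = init[path[0]] * emit[0][path[0]]
    for t in range(1, len(path)):
        p *= trans[(path[t-1], path[t])] * emit[t][path[t]]
    return p

paths = list(itertools.product(states, repeat=2))
viterbi = max(paths, key=joint)            # best single sequence

# per-position posteriors by brute-force marginalization over all paths
total = sum(joint(p) for p in paths)
def posterior(t, s):
    return sum(joint(p) for p in paths if p[t] == s) / total

fb = tuple(max(states, key=lambda s: posterior(t, s)) for t in range(2))

assert viterbi == ("A", "A")               # highest joint probability (0.09)
assert fb == ("B", "A")                    # position-wise posterior maxima
```

Here the forward-backward output ("B", "A") is not even the second-best path by joint probability, which is why the two modes can disagree.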
-fw-only
Similar to -fb, but uses only the forward probabilities for computing posteriors. This may be used to simulate on-line prediction of tags, without the benefit of future context.
-totals
Output the total string probability for each input sentence.
-posteriors
Output the table of posterior probabilities for each input (V1) token and each V2 token, in the same format as required for the -map file. If -fb is also specified, the posterior probabilities will be computed using forward-backward probabilities; otherwise an approximation will be used that is based on the probability of the most likely path containing a given V2 token at a given position.
-nbest N
Output the N best hypotheses instead of just the first best when doing Viterbi search. If N>1, each hypothesis is prefixed by the tag NBEST_n x, where n is the rank of the hypothesis in the N-best list and x its score, the negative log of the combined probability of transitions and observations of the corresponding HMM path.
-write-counts file
Outputs the V2-V1 bigram counts corresponding to the tagging performed on the input data. If -fb was specified, these are expected counts; otherwise they reflect the 1-best tagging decisions.
-write-vocab1 file
Writes the input vocabulary from the map (V1) to file.
-write-vocab2 file
Writes the output vocabulary from the map (V2) to file. The vocabulary will also include the words specified in the language model.
-write-map file
Writes the map back to a file for validation purposes.
-debug level
Sets the debugging output level to level.

BUGS

The -continuous and -text-map options effectively disable -keep-unk, i.e., unknown input words are always mapped to <unk>.

SEE ALSO

ngram-count(1), hidden-ngram(1), training-scripts(1), ngram-format(5), classes-format(5).

AUTHOR

Andreas Stolcke <stolcke@speech.sri.com>,
Anand Venkataraman <anand@speech.sri.com>.
Copyright 1995-2004 SRI International