ngram-merge

NAME

ngram-merge - merge N-gram counts

SYNOPSIS

ngram-merge [-help] [-write outfile] infile1 infile2 ...

DESCRIPTION

ngram-merge reads two or more lexicographically sorted ngram count files (as produced by ngram-count -sort) and outputs the merged, sorted counts. The output is itself sorted, and is therefore suitable as input to subsequent merging steps.

The input format consists of one ngram count per line:
word1 word2 ... wordn count
The lines must be sorted lexicographically on the words, leftmost first. The input may contain ngrams of different lengths.
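
As an illustration of the merge described above, the following Python sketch combines already-sorted count streams in a single pass. It is not the ngram-merge implementation; it assumes that counts of identical ngrams are summed, that fields are separated by whitespace, that counts are integers, and that words compare as ordinary Python strings. The names parse_line and merge_counts are hypothetical.

```python
import heapq
from itertools import groupby


def parse_line(line):
    """Split 'word1 ... wordn count' into (tuple_of_words, count)."""
    *words, count = line.split()
    return tuple(words), int(count)


def merge_counts(*streams):
    """Merge sorted iterables of count lines into one sorted stream.

    Assumption: identical ngrams from different inputs are combined by
    adding their counts. Comparing tuples of words orders entries on
    the leftmost word first, and a shorter ngram sorts before any
    longer ngram that it is a prefix of.
    """
    parsed = [
        (parse_line(line) for line in stream if line.strip())
        for stream in streams
    ]
    # heapq.merge is lazy: it holds one pending entry per input, so
    # memory use does not grow with the size of the count files.
    merged = heapq.merge(*parsed, key=lambda item: item[0])
    for words, group in groupby(merged, key=lambda item: item[0]):
        total = sum(count for _, count in group)
        yield " ".join(words) + " " + str(total)


if __name__ == "__main__":
    first = ["a 1", "a b 2", "b 5"]
    second = ["a b 3", "c 1"]
    for line in merge_counts(first, second):
        print(line)
```

Because each input is read strictly front to back, the sketch only works if every input really is sorted in the same order, which is why the sort requirement above matters.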

Each filename argument can be a plain ASCII count file, a compressed file (name ending in .Z or .gz), or "-" to indicate stdin/stdout.

ngram-merge is recommended in cases where the full counts would far exceed available real memory. Although an arbitrary number of input count files is accepted, it is best to use the program as follows. First, partition the input text into the largest chunks for which ngram-count can still run in real memory, and produce one sorted count file per chunk. Then merge the resulting sorted counts pairwise using ngram-merge, and continue doing so in a binary tree pattern until a single count file containing all ngrams remains.
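
The binary tree pattern can be sketched as a small planning function. The Python below only builds the command lines; it does not run them. It uses the -write option and the two-input form from the SYNOPSIS. The function name plan_merges and the intermediate file names (merged.LEVEL.INDEX.gz) are hypothetical, and the .gz suffix assumes that compressed output is wanted.

```python
def plan_merges(files, prefix="merged"):
    """Plan pairwise ngram-merge runs in a binary tree pattern.

    Returns (commands, final_file). Each command is an argument list
    of the form ['ngram-merge', '-write', out, in1, in2]. Output names
    are invented here for illustration only.
    """
    if not files:
        raise ValueError("at least one count file is required")
    commands = []
    current = list(files)
    level = 0
    while len(current) > 1:
        next_level = []
        # Pair up neighbours: (0, 1), (2, 3), ...
        for i in range(0, len(current) - 1, 2):
            out = f"{prefix}.{level}.{i // 2}.gz"
            commands.append(
                ["ngram-merge", "-write", out, current[i], current[i + 1]]
            )
            next_level.append(out)
        # With an odd number of files, the last one moves up unmerged.
        if len(current) % 2:
            next_level.append(current[-1])
        current = next_level
        level += 1
    return commands, current[0]


if __name__ == "__main__":
    chunks = ["c1.gz", "c2.gz", "c3.gz", "c4.gz"]
    plan, final = plan_merges(chunks)
    for cmd in plan:
        print(" ".join(cmd))
    print("final counts in", final)
```

Each command in the plan could then be executed in order, for example with subprocess.run(cmd, check=True). For n input files the plan contains n - 1 merges, and every merge within one level is independent of the others at that level.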

OPTIONS

-help
Print option and usage summary.
-write outfile
Write merged counts to outfile, instead of standard output.

SEE ALSO

ngram-count(1), ngram(1).

AUTHOR

Andreas Stolcke <stolcke@speech.sri.com>
Copyright 1995 SRI International