File size: 2,351 Bytes
dc78b20 |
1 |
even moderately long documents typically address several topics or different aspects of the same topic. the aim of linear text segmentation is to discover the topic boundaries. the uses of this procedure include information retrieval (hearst and plaunt, 1993; hearst, 1994; yaari, 1997; reynar, 1999), summarization (reynar, 1998), text understanding, anaphora resolution (kozima, 1993), language modelling (morris and hirst, 1991; beeferman et al., 1997b) and improving document navigation for the visually disabled (choi, 2000). this paper focuses on domain independent methods for segmenting written text. we present a new algorithm that builds on previous work by reynar (reynar, 1998; reynar, 1994). the primary distinction of our method is the use of a ranking scheme and the cosine similarity measure (van rijsbergen, 1979) in formulating the similarity matrix. we propose that the similarity values of short text segments is statistically insignificant. thus, one can only rely on their order, or rank, for clustering.even moderately long documents typically address several topics or different aspects of the same topic. a segmentation algorithm has two key elements, a, clustering strategy and a similarity measure. we would also like to develop a linear time and multi-source version of the algorithm. thus, one can only rely on their order, or rank, for clustering. the significance of our results has been confirmed by both t-test and ks-test. the definition of a topic segment ranges from complete stories (allan et al., 1998) to summaries (ponte and croft, 1997). given the quality of an algorithm is task dependent, the following experiments focus on the relative performance. c99, k98 and r98 are all polynomial time algorithms. existing work falls into one of two categories, lexical cohesion methods and multi-source methods (yaari, 1997). it would be interesting to compare c99 with the multi-source method described in (beeferman et al., 1999) using the tdt corpus. if one disregards segmentation accuracy, h94 has the best algorithmic performance (linear). our evaluation strategy is a variant of that described in (reynar, 1998, 71-73) and the tdt segmentation task (allan et al., 1998). our results show divisive clustering (r98) is more precise than sliding window (h94) and lexical chains (k98) for locating topic boundaries. |