alwaysaditi's picture
End of training
dc78b20 verified
the statistical corpus-based renaissance in computational linguistics has produced a number of interesting technologies, including part-of-speech tagging and bilingual word alignment. unfortunately, these technologies are still not as widely deployed in practical applications as they might be. part-ofspeech taggers are used in a few applications, such as speech synthesis (sproat et al., 1992) and question answering (kupiec, 1993b). word alignment is newer, found only in a few places (gale and church, 1991a; brown et al., 1993; dagan et al., 1993). it is used at ibm for estimating parameters of their statistical machine translation prototype (brown et al., 1993). we suggest that part of speech tagging and word alignment could have an important role in glossary construction for translation. glossaries are extremely important for translation. how would microsoft, or some other software vendor, want the term "character menu" to be translated in their manuals? technical terms are difficult for translators because they are generally not as familiar with the subject domain as either the author of the source text or the reader of the target text. in many cases, there may be a number of acceptable translations, but it is important for the sake of consistency to standardize on a single one. it would be unacceptable for a manual to use a variety of synonyms for a particular menu or button. customarily, translation houses make extensive job-specific glossaries to ensure consistency and correctness of technical terminology for large jobs. a glossary is a list of terms and their translations.' we will subdivide the task of constructing a glossary into two subtasks: (1) generating a list of terms, and (2) finding the translation equivalents. the first task will be referred to as the monolingual task and the second as the bilingual task. how should a glossary be constructed? translation schools teach their students to read as much background material as possible in both the source and target languages, an extremely time-consuming process, as the introduction to hann's (1992, p. 8) text on technical translation indicates: contrary to popular opinion, the job of a technical translator has little in common with other linguistic professions, such as literature translation, foreign correspondence or interpreting. apart from an expert knowledge of both languages..., all that is required for the latter professions is a few general dictionaries, whereas a technical translator needs a whole library of specialized dictionaries, encyclopedias and 'the source and target fields are standard, though many other fields can also be found, e.g., usage notes, part of speech constraints, comments, etc. technical literature in both languages; he is more concerned with the exact meanings of terms than with stylistic considerations and his profession requires certain 'detective' skills as well as linguistic and literary ones. beginners in this profession have an especially hard time... this book attempts to meet this requirement. unfortunately, the academic prescriptions are often too expensive for commercial practice. translators need just-in-time glossaries. they cannot afford to do a lot of background reading and "detective" work when they are being paid by the word. they need something more practical. we propose a tool, termight, that automates some of the more tedious and laborious aspects of terminology research. the tool relies on part-of-speech tagging and word-alignment technologies to extract candidate terms and translations. it then sorts the extracted candidates and presents them to the user along with reference concordance lines, supporting efficient construction of glossaries. the tool is currently being used by the translators at at&t business translation services (formerly at&t language line services). termight may prove useful in contexts other than human-based translation. primarily, it can support customization of machine translation (mt) lexicons to a new domain. in fact, the arguments for constructing a job-specific glossary for human-based translation may hold equally well for an mt-based process, emphasizing the need for a productivity tool. the monolingual component of termight can be used to construct terminology lists in other applications, such as technical writing, book indexing, hypertext linking, natural language interfaces, text categorization and indexing in digital libraries and information retrieval (salton, 1988; cherry, 1990; harding, 1982; bourigault, 1992; damerau, 1993), while the bilingual component can be useful for information retrieval in multilingual text collections (landauer and littman, 1990).we have shown that terminology research provides a good application for robust natural language technology, in particular for part-of-speech tagging and word-alignment algorithms. the statistical corpus-based renaissance in computational linguistics has produced a number of interesting technologies, including part-of-speech tagging and bilingual word alignment. in particular, we have found the following to be very effective: as the need for efficient knowledge acquisition tools becomes widely recognized, we hope that this experience with termight will be found useful for other text-related systems as well. unfortunately, these technologies are still not as widely deployed in practical applications as they might be. in fact, the arguments for constructing a job-specific glossary for human-based translation may hold equally well for an mt-based process, emphasizing the need for a productivity tool. the monolingual component of termight can be used to construct terminology lists in other applications, such as technical writing, book indexing, hypertext linking, natural language interfaces, text categorization and indexing in digital libraries and information retrieval (salton, 1988; cherry, 1990; harding, 1982; bourigault, 1992; damerau, 1993), while the bilingual component can be useful for information retrieval in multilingual text collections (landauer and littman, 1990). primarily, it can support customization of machine translation (mt) lexicons to a new domain.