Spaces:
Running
on
A10G
Running
on
A10G
uroman version 1.2.8 | |
Release date: April 23, 2021 | |
Author: Ulf Hermjakob, USC Information Sciences Institute | |
uroman is a universal romanizer. It converts text in any script to the Latin alphabet. | |
Usage: uroman.pl [-l <lang-code>] [--chart] [--no-cache] < STDIN | |
where the optional <lang-code> is a 3-letter languages code, e.g. ara, bel, bul, deu, ell, eng, fas, | |
grc, ell, eng, heb, kaz, kir, lav, lit, mkd, mkd2, oss, pnt, pus, rus, srp, srp2, tur, uig, ukr, yid. | |
--chart specifies chart output (in JSON format) to represent alternative romanizations. | |
--no-cache disables caching. | |
Examples: bin/uroman.pl < text/zho.txt | |
bin/uroman.pl -l tur < text/tur.txt | |
bin/uroman.pl -l heb --chart < text/heb.txt | |
bin/uroman.pl < test/multi-script.txt > test/multi-script.uroman.txt | |
Identifying the input as Arabic, Belarusian, Bulgarian, English, Farsi, German, | |
Ancient Greek, Modern Greek, Pontic Greek, Hebrew, Kazakh, Kyrgyz, Latvian, | |
Lithuanian, North Macedonian, Russian, Serbian, Turkish, Ukrainian, Uyghur or Yiddish | |
will improve romanization for those languages as some letters in those languages | |
have different sound values from other languages using the same script. | |
No effect for other languages in this version. | |
Bibliography: Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of Association for Computational Linguistics, Demo Track. [Best Demo Paper Award] | |
Changes in version 1.2.8 | |
* Improved support for Georgian. | |
* Updated UnicodeData.txt to version 13 (2021) with several new scripts (10% larger). | |
* Preserve various symbols (as opposed to mapping to the symbols' names). | |
* Various small improvements. | |
Changes in version 1.2.7 | |
* Improved support for Pashto. | |
Changes in version 1.2.6 | |
* Improved support for Ukrainian, Russian and Ogham (ancient Irish script). | |
* Added support for English Braille. | |
* Added alternative Romanization for North Macedonian and Serbian (mkd2/srp2) | |
reflecting a casual style that many native speakers of those languages use | |
when writing text in Latin script, e.g. non-accented single letters (e.g. "s") | |
rather than phonetically motivated combinations of letters (e.g. "sh"). | |
* When a line starts with "::lcode xyz ", the new uroman version will switch to | |
that language for that line. This is used for the new reference test file. | |
* Various small improvements. | |
Changes in version 1.2.5 | |
* Improved support for Armenian and eight languages using Cyrillic scripts. | |
-- For Serbian and Macedonian, which are often written in both Cyrillic | |
and Latin scripts, uroman will map both official versions to the same | |
romanized text, e.g. both "Ниш" and "Niš" will be mapped to "Nish" (which | |
properly reflects the pronunciation of the city's name). | |
For both Serbian and Macedonian, casual writers often use a simplified | |
Latin form without diacritics, e.g. "s" to represent not only Cyrillic "с" | |
and Latin "s", but also "ш" or "š", even if this conflates "s" and "sh" and | |
other such pairs. The casual romanization can be simulated by using | |
alternative uroman language codes "srp2" and "mkd2", which romanize | |
both "Ниш" and "Niš" to "Nis" to reflect the casual Latin spelling. | |
* Various small improvements. | |
Changes in version 1.2.4 | |
* Added support for Tifinagh (a script used for Berber languages). | |
* Bug-fix that generated two emtpy lines for each empty line in cache mode. | |
Changes in version 1.2.3 | |
* Exclude emojis, dingbats, many other pictographs from being romanized (e.g. to "face") | |
Changes in version 1.2 | |
* Run-time improvement based on (1) token-based caching and (2) shortcut | |
romanization (identity) of ASCII strings for default 1-best (non-chart) | |
output. Speed-up by a factor of 10 for Bengali and Uyghur on medium and | |
large size texts. | |
* Incremental improvements for Farsi, Amharic, Russian, Hebrew and related | |
languages. | |
* Richer lattice structure (more alternatives) for "Romanization" of English | |
to support better matching to romanizations of other languages. | |
Changes output only when --chart option is specified. No change in output for | |
default 1-best output, which for ASCII characters is always the input string. | |
Changes in version 1.1 (major upgrade) | |
* Offers chart output (in JSON format) to represent alternative romanizations. | |
-- Location of first character is defined to be "line: 1, start:0, end:0". | |
* Incremental improvements of Hebrew and Greek romanization; Chinese numbers. | |
* Improved web-interface at http://www.isi.edu/~ulf/uroman.html | |
-- Shows corresponding original and romanization text in red | |
when hovering over a text segment. | |
-- Shows alternative romanizations when hovering over romanized text | |
marked by dotted underline. | |
-- Added right-to-left script detection and improved display for right-to-left | |
script text (as determined line by line). | |
-- On-page support for some scripts that are often not pre-installed on users' | |
computers (Burmese, Egyptian, Klingon). | |
Changes in version 1.0 (major upgrade) | |
* Upgraded principal internal data structure from string to lattice. | |
* Improvements mostly in vowelization of South and Southeast Asian languages. | |
* Vocalic 'r' more consistently treated as vowel (no additional vowel added). | |
* Repetition signs (Japanese/Chinese/Thai/Khmer/Lao) are mapped to superscript 2. | |
* Japanese Katakana middle dots now mapped to ASCII space. | |
* Tibetan intersyllabic mark now mapped to middle dot (U+00B7). | |
* Some corrections regarding analysis of Chinese numbers. | |
* Many more foreign diacritics and punctuation marks dropped or mapped to ASCII. | |
* Zero-width characters dropped, except line/sentence-initial byte order marks. | |
* Spaces normalized to ASCII space. | |
* Fixed bug that in some cases mapped signs (such as dagger or bullet) to their verbal descriptions. | |
* Tested against previous version of uroman with a new uroman visual diff tool. | |
* Almost an order of magnitude faster. | |
Changes in version 0.7 (minor upgrade) | |
* Added script uroman-quick.pl for Arabic script languages, incl. Uyghur. | |
Much faster, pre-caching mapping of Arabic to Latin characters, simple greedy processing. | |
Will not convert material from non-Arabic blocks such as any (somewhat unusual) Cyrillic | |
or Chinese characters in Uyghur texts. | |
Changes in version 0.6 (minor upgrade) | |
* Added support for two letter characters used in Uzbek: | |
(1) character "ʻ" ("modifier letter turned comma", which modifies preceding "g" and "u" letters) | |
(2) character "ʼ" ("modifier letter apostrophe", which Uzbek uses to mark a glottal stop). | |
Both are now mapped to "'" (plain ASCII apostrophe). | |
* Added support for Uyghur vowel characters such as "ې" (Arabic e) and "ۆ" (Arabic oe) | |
even when they are not preceded by "ئ" (yeh with hamza above). | |
* Added support for Arabic semicolon "؛", Arabic ligature forms for phrases such as "ﷺ" | |
("sallallahou alayhe wasallam" = "prayer of God be upon him and his family and peace") | |
* Added robustness for Arabic letter presentation forms (initial/medial/final/isolated). | |
However, it is strongly recommended to normalize any presentation form Arabic letters | |
to their non-presentation form before calling uroman. | |
* Added force flush directive ($|=1;). | |
Changes in version 0.5 (minor upgrade) | |
* Improvements for Uyghur (make sure to use language option: -l uig) | |
Changes in version 0.4 (minor upgrade) | |
* Improvements for Thai (special cases for vowel/consonant reordering, e.g. for "sara o"; dropped some aspiration 'h's) | |
* Minor change for Arabic (added "alef+fathatan" = "an") | |
New features in version 0.3 | |
* Covers Mandarin (Chinese) | |
* Improved romanization for numerous languages | |
* Preserves capitalization (e.g. from Latin, Cyrillic, Greek scripts) | |
* Maps from native digits to Western numbers | |
* Faster for South Asian languages | |
Other features | |
* Web interface: http://www.isi.edu/~ulf/uroman.html | |
* Vowelization is provided when locally computable, e.g. for many South Asian | |
languages and Tibetan. | |
Limitations | |
* This version of uroman assumes all CJK ideographs to be Mandarin (Chinese). | |
This means that Japanese kanji are incorrectly romanized; however, Japanese | |
hiragana and katakana are properly romanized. | |
* A romanizer is not a full transliterator. For example, this version of | |
uroman does not vowelize text that lacks explicit vowelization such as | |
normal text in Arabic and Hebrew (without diacritics/points). | |