utkarsh2299 committed
Upload 97 files
This view is limited to 50 files because it contains too many changes.
- Unified_parser/.vscode/tasks.json +28 -0
- Unified_parser/LICENSE +21 -0
- Unified_parser/README.md +34 -0
- Unified_parser/__pycache__/get_phone_mapped_python.cpython-311.pyc +0 -0
- Unified_parser/__pycache__/get_phone_mapped_python.cpython-37.pyc +0 -0
- Unified_parser/__pycache__/globals.cpython-310.pyc +0 -0
- Unified_parser/__pycache__/globals.cpython-311.pyc +0 -0
- Unified_parser/__pycache__/globals.cpython-37.pyc +0 -0
- Unified_parser/__pycache__/globals.cpython-38.pyc +0 -0
- Unified_parser/__pycache__/helpers.cpython-310.pyc +0 -0
- Unified_parser/__pycache__/helpers.cpython-311.pyc +0 -0
- Unified_parser/__pycache__/helpers.cpython-37.pyc +0 -0
- Unified_parser/__pycache__/helpers.cpython-38.pyc +0 -0
- Unified_parser/__pycache__/parallelparser.cpython-37.pyc +0 -0
- Unified_parser/__pycache__/parallelparser.cpython-38.pyc +0 -0
- Unified_parser/__pycache__/uparser.cpython-310.pyc +0 -0
- Unified_parser/__pycache__/uparser.cpython-37.pyc +0 -0
- Unified_parser/common.map +128 -0
- Unified_parser/common_hindi.map +128 -0
- Unified_parser/common_telugu.map +128 -0
- Unified_parser/dict/english.dict +0 -0
- Unified_parser/dict/english.dict_old +1 -0
- Unified_parser/dict/hindi.dict1 +1 -0
- Unified_parser/dict/malayalam.dict +1 -0
- Unified_parser/extract_words.py +33 -0
- Unified_parser/get_phone_mapped_python.py +76 -0
- Unified_parser/globals.py +71 -0
- Unified_parser/helpers.py +1031 -0
- Unified_parser/ply/__init__.py +5 -0
- Unified_parser/ply/__pycache__/__init__.cpython-310.pyc +0 -0
- Unified_parser/ply/__pycache__/__init__.cpython-311.pyc +0 -0
- Unified_parser/ply/__pycache__/__init__.cpython-37.pyc +0 -0
- Unified_parser/ply/__pycache__/__init__.cpython-38.pyc +0 -0
- Unified_parser/ply/__pycache__/lex.cpython-310.pyc +0 -0
- Unified_parser/ply/__pycache__/lex.cpython-311.pyc +0 -0
- Unified_parser/ply/__pycache__/lex.cpython-37.pyc +0 -0
- Unified_parser/ply/__pycache__/lex.cpython-38.pyc +0 -0
- Unified_parser/ply/__pycache__/yacc.cpython-310.pyc +0 -0
- Unified_parser/ply/__pycache__/yacc.cpython-311.pyc +0 -0
- Unified_parser/ply/__pycache__/yacc.cpython-37.pyc +0 -0
- Unified_parser/ply/__pycache__/yacc.cpython-38.pyc +0 -0
- Unified_parser/ply/lex.py +110 -0
- Unified_parser/ply/yacc.py +0 -0
- Unified_parser/punjabi/extract_punjabi.py +15 -0
- Unified_parser/punjabi/punjabi_asr_sample +0 -0
- Unified_parser/punjabi/punjabi_results.txt +0 -0
- Unified_parser/punjabi/punjabi_words.txt +0 -0
- Unified_parser/punjabi/runner_punjabi.py +13 -0
- Unified_parser/pypi_package/LICENSE +21 -0
- Unified_parser/pypi_package/README.md +34 -0
Unified_parser/.vscode/tasks.json
ADDED
@@ -0,0 +1,28 @@
+{
+    "tasks": [
+        {
+            "type": "cppbuild",
+            "label": "C/C++: gcc build active file",
+            "command": "/usr/bin/gcc",
+            "args": [
+                "-fdiagnostics-color=always",
+                "-g",
+                "${file}",
+                "-o",
+                "${fileDirname}/${fileBasenameNoExtension}"
+            ],
+            "options": {
+                "cwd": "${fileDirname}"
+            },
+            "problemMatcher": [
+                "$gcc"
+            ],
+            "group": {
+                "kind": "build",
+                "isDefault": true
+            },
+            "detail": "Task generated by Debugger."
+        }
+    ],
+    "version": "2.0.0"
+}
Unified_parser/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2022 vikram-kv
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
Unified_parser/README.md
ADDED
@@ -0,0 +1,34 @@
+# Python_Unified_Parser
+
+This parser unifies Indian languages through the Common Label Set, capitalising on the syllable structure that these languages share. The Unified Parser converts UTF-8 text to the common label set, applies letter-to-sound rules, and generates the corresponding phoneme sequences. The effort is a step towards a natural language understanding system that operates on Indian languages and generates parsed output. The method is structured and requires only basic knowledge of the language. With good lexicons, more than 95% of the words in a language can be parsed correctly. The method can be extended to other Indian languages with minimal time and effort: given the unity in the diversity of Indian languages, developing parsers for new languages is easy with the unified approach.
+
+Our Python parser, [uparser.py](src/indic-unified-parser/uparser.py), combines lex and yacc functionality in a single Python script using the [PLY](src/indic-unified-parser/ply) framework.
+
+## Publications
+[Baby, Arun, et al. "A unified parser for developing Indian language text to speech synthesizers." Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings 19. Springer International Publishing, 2016.](https://www.iitm.ac.in/donlab/tts/downloads/unified/unified.pdf)
+
+## Installation
+
+```bash
+pip install indic_unified_parser
+```
+
+## How to use
+
+```python
+from indic_unified_parser.uparser import wordparse
+parsed_output_string = wordparse(<word : str>, <lsflag : int>, <wfflag : int>, <clearflag : int>)
+```
+
+1. `lsflag`: always 0. Deprecated.
+2. `wfflag`: 0 for monophone parsing, 1 for syllable parsing, 2 for akshara parsing.
+3. `clearflag`: 1 to strip the Lisp-like output format and produce space-separated output; otherwise 0.
+
+## Examples
+
+See `run_parser_all_lang_all_opt.py` for examples of calling the `wordparse` function.
+
+
+
+## URLs
+[Homepage](https://github.com/vikram-kv/Unified_Parser)
Unified_parser/__pycache__/get_phone_mapped_python.cpython-311.pyc
ADDED
Binary file (2.75 kB)
Unified_parser/__pycache__/get_phone_mapped_python.cpython-37.pyc
ADDED
Binary file (1.63 kB)
Unified_parser/__pycache__/globals.cpython-310.pyc
ADDED
Binary file (3.01 kB)
Unified_parser/__pycache__/globals.cpython-311.pyc
ADDED
Binary file (4.65 kB)
Unified_parser/__pycache__/globals.cpython-37.pyc
ADDED
Binary file (3.14 kB)
Unified_parser/__pycache__/globals.cpython-38.pyc
ADDED
Binary file (3.15 kB)
Unified_parser/__pycache__/helpers.cpython-310.pyc
ADDED
Binary file (21 kB)
Unified_parser/__pycache__/helpers.cpython-311.pyc
ADDED
Binary file (45.8 kB)
Unified_parser/__pycache__/helpers.cpython-37.pyc
ADDED
Binary file (24.3 kB)
Unified_parser/__pycache__/helpers.cpython-38.pyc
ADDED
Binary file (22.8 kB)
Unified_parser/__pycache__/parallelparser.cpython-37.pyc
ADDED
Binary file (5.17 kB)
Unified_parser/__pycache__/parallelparser.cpython-38.pyc
ADDED
Binary file (5.33 kB)
Unified_parser/__pycache__/uparser.cpython-310.pyc
ADDED
Binary file (5.22 kB)
Unified_parser/__pycache__/uparser.cpython-37.pyc
ADDED
Binary file (6.19 kB)
Unified_parser/common.map
ADDED
@@ -0,0 +1,128 @@
1 |
+
0 mq q q q q ऀ mq mq mq mq M
|
2 |
+
1 mq q q q q ँ ঁ ઁ ଁ ਁ
|
3 |
+
2 q ം ஂ ం ಂ ं ং ં ଂ ਂ
|
4 |
+
3 h ഃ ஃ ః ಃ ः ঃ ઃ ଃ ਃ H
|
5 |
+
4 a a a a a ऄ a a a a
|
6 |
+
5 a അ அ అ ಅ अ অ અ ଅ ਅ
|
7 |
+
6 aa ആ ஆ ఆ ಆ आ আ આ ଆ ਆ A
|
8 |
+
7 i ഇ இ ఇ ಇ इ ই ઇ ଇ ਇ
|
9 |
+
8 ii ഈ ஈ ఈ ಈ ई ঈ ઈ ଈ ਈ I
|
10 |
+
9 u ഉ உ ఉ ಉ उ উ ઉ ଉ ਉ
|
11 |
+
10 uu ഊ ஊ ఊ ಊ ऊ ঊ ઊ ଊ ਊ U
|
12 |
+
11 rq ഋ rx ఋ ಋ ऋ ঋ ઋ ଋ r R
|
13 |
+
12 uu uu uu uu uu ऌ ঌ ઌ ଌ uu
|
14 |
+
13 ae e e e e ऍ ee ઍ ee ee ऍ
|
15 |
+
14 e എ எ ఎ ಎ ऎ ee ee ee ee
|
16 |
+
15 ee ഏ ஏ ఏ ಏ ए এ એ ଏ ਏ E
|
17 |
+
16 ei ഐ ஐ ఐ ಐ ऐ ঐ ઐ ଐ ਐ ऐ
|
18 |
+
17 ax a a a a ऑ a ઑ a a ऑ
|
19 |
+
18 o ഒ ஒ ఒ ಒ ऒ oo oo oo oo
|
20 |
+
19 oo ഓ ஓ ఓ ಓ ओ ও ઓ ଓ ਓ O
|
21 |
+
20 ou ഔ ஔ ఔ ಔ औ ঔ ઔ ଔ ਔ औ
|
22 |
+
21 k ക க క ಕ क ক ક କ ਕ
|
23 |
+
22 kh ഖ k ఖ ಖ ख খ ખ ଖ ਖ ख
|
24 |
+
23 g ഗ k గ ಗ ग গ ગ ଗ ਗ
|
25 |
+
24 gh ഘ k ఘ ಘ घ ঘ ઘ ଘ ਘ घ
|
26 |
+
25 ng ങ ங ఙ ಙ ङ ঙ ઙ ଙ ਙ ङ
|
27 |
+
26 c ച ச చ ಚ च চ ચ ଚ ਚ
|
28 |
+
27 ch ഛ c ఛ ಛ छ ছ છ ଛ ਛ C
|
29 |
+
28 j ജ ஜ జ ಜ ज জ જ ଜ ਜ
|
30 |
+
29 jh ഝ j ఝ ಝ झ ঝ ઝ ଝ ਝ J
|
31 |
+
30 nj ഞ ஞ ఞ ಞ ञ ঞ ઞ ଞ ਞ ञ
|
32 |
+
31 tx ട ட ట ಟ ट ট ટ ଟ ਟ ट
|
33 |
+
32 txh ഠ tx ఠ ಠ ठ ঠ ઠ ଠ ਠ ठ
|
34 |
+
33 dx ഡ tx డ ಡ ड ড ડ ଡ ਡ ड
|
35 |
+
34 dxh ഢ tx ఢ ಢ ढ ঢ ઢ ଢ ਢ ढ
|
36 |
+
35 nx ണ ண ణ ಣ ण ণ ણ ଣ ਣ ण
|
37 |
+
36 t ത த త ತ त ত ત ତ ਤ
|
38 |
+
37 th ഥ t థ ಥ थ থ થ ଥ ਥ थ
|
39 |
+
38 d ദ t ద ದ द দ દ ଦ ਦ
|
40 |
+
39 dh ധ t ధ ಧ ध ধ ધ ଧ ਧ ध
|
41 |
+
40 n ന ந న ನ न ন ન ନ ਨ
|
42 |
+
41 nd ഩ ன n n ऩ n n n n न
|
43 |
+
42 p പ ப ప ಪ प প પ ପ ਪ
|
44 |
+
43 ph ഫ p ఫ ಫ फ ফ ફ ଫ ਫ P
|
45 |
+
44 b ബ p బ ಬ ब ব બ ବ ਬ
|
46 |
+
45 bh ഭ p భ ಭ भ ভ ભ ଭ ਭ B
|
47 |
+
46 m മ ம మ ಮ म ম મ ମ ਮ
|
48 |
+
47 y യ ய య ಯ य য ય ୟ ਯ
|
49 |
+
48 r ര ர ర ರ र র ર ର ਰ
|
50 |
+
49 rx റ ற r r ऱ r r r r र
|
51 |
+
50 l ല ல ల ಲ ल ল લ ଲ ਲ
|
52 |
+
51 lx ള ள ళ ಳ ळ l ળ ଳ ਲ਼ ള
|
53 |
+
52 zh ഴ ழ lx lx ऴ lx lx lx lx Z
|
54 |
+
53 w വ வ వ ವ व b વ ଵ ਵ
|
55 |
+
54 sh ശ ஶ శ ಶ श শ શ ଶ ਸ਼ श
|
56 |
+
55 sx ഷ ஷ ష ಷ ष ষ ષ ଷ # ष
|
57 |
+
56 s സ ஸ స ಸ स স સ ସ ਸ
|
58 |
+
57 h ഹ ஹ హ ಹ ह হ હ ହ ਹ
|
59 |
+
58 a a a a a ऺ a a a a
|
60 |
+
59 aav aav aav aav aav ऻ aav aav aav aav
|
61 |
+
60 nk a a a a ़ ় ઼ ଼ ਼ Y
|
62 |
+
61 ag a a a a ऽ ঽ ઽ ଽ # ऽ
|
63 |
+
62 aav ാ ா ా ಾ ा া ા ା ਾ
|
64 |
+
63 iv ി ி ి ಿ ि ি િ ି ਿ
|
65 |
+
64 iiv ീ ீ ీ ೀ ी ী ી ୀ ੀ
|
66 |
+
65 uv ു ு ు ು ु ু ુ ୁ ੁ
|
67 |
+
66 uuv ൂ ூ ూ ೂ ू ূ ૂ ୂ ੂ
|
68 |
+
67 rqv ൃ uv ృ ೃ ृ ৃ ૃ ୃ uv
|
69 |
+
68 rqwv ൄ uuv ౄ ೄ ॄ ৄ ૄ rqv uuv ॠ
|
70 |
+
69 aev ev ev ev ev ॅ eev eev eev eev
|
71 |
+
70 ev െ ெ ె ೆೆ ॆ eev eev ୄ eev
|
72 |
+
71 eev േ ே ే ೇ े ে ે େ ੇ
|
73 |
+
72 eiv ൈ ை ై ೇೈ ै ৈ ૈ ୈ ੈ ऐ
|
74 |
+
73 axv aav aav aav aav ॉ aav ૉ aav aav ऑ
|
75 |
+
74 ov ൊ ொ ొ ೊ ॊ oov oov oov oov
|
76 |
+
75 oov ോ ோ ో ೋ ो ো ો ୋ ੋ O
|
77 |
+
76 ouv ൌ ௌ ౌ ೌ ौ ৌ ૌ ୌ ੌ औ
|
78 |
+
77 eu ് ் ్ ್ ् ্ ્ ୍ ੍ உ
|
79 |
+
78 tv a a a a ॎ ৎ a a a
|
80 |
+
79 $ ouv ouv ouv ouv ॏ ouv ouv ouv ouv
|
81 |
+
80 $ # # # # ॐ # ૐ # ੴ
|
82 |
+
81 $ a a a a ॓ a a a a
|
83 |
+
82 $ a a a a ॔ a a a a
|
84 |
+
83 $ # # # # # # # # #
|
85 |
+
84 $ # # # # # # # # #
|
86 |
+
85 aav a a a a ॕ a a a a
|
87 |
+
86 aav a a a a ॖ a a ୖ a
|
88 |
+
87 auv ൗ a a a ॗ ৗ a ୗ a औ
|
89 |
+
88 kq k k k k क़ k k k k क
|
90 |
+
89 khq kh kh kh kh ख़ kh kh kh ਖ਼ K
|
91 |
+
90 gq g g g g ग़ g g g ਗ਼ G
|
92 |
+
91 z j j j j ज़ j j j ਜ਼
|
93 |
+
92 dxq dx dx dx dx ड़ ড় dx ଡ଼ ੜ D
|
94 |
+
93 dxhq dxh dxh dxh dxh ढ़ ঢ় dxh ଢ଼ dxh T
|
95 |
+
94 f f f f f फ़ f f f ਫ਼
|
96 |
+
95 y y y y y य़ য় y ୟ y
|
97 |
+
96 rqw ൠ ற ౠ ೠ ॠ ৠ ૠ ୠ r ॠ
|
98 |
+
97 $ # # # # ॡ ৡ ૡ ୡ #
|
99 |
+
98 $ # # # # ॢ ৢ ૢ # #
|
100 |
+
99 $ # # # # ॣ ৣ ૣ ୢ #
|
101 |
+
100 $ # # # # । # # # #
|
102 |
+
101 $ # # # # ॥ # # ୣ #
|
103 |
+
102 0 ൦ ௦ ౦ ೦ ० ০ ૦ ୦ ੦
|
104 |
+
103 1 ൧ ௧ ౧ ೧ १ ১ ૧ ୧ ੧
|
105 |
+
104 2 ൨ ௨ ౨ ೨ २ ২ ૨ ୨ ੨
|
106 |
+
105 3 ൩ ௩ ౩ ೩ ३ ৩ ૩ ୩ ੩
|
107 |
+
106 4 ൪ ௪ ౪ ೪ ४ ৪ ૪ ୪ ੪
|
108 |
+
107 5 ൫ ௫ ౫ ೫ ५ ৫ ૫ ୫ ੫
|
109 |
+
108 6 ൬ ௬ ౬ ೬ ६ ৬ ૬ ୬ ੬
|
110 |
+
109 7 ൭ ௭ ౭ ೭ ७ ৭ ૭ ୭ ੭
|
111 |
+
110 8 ൮ ௮ ౮ ೮ ८ ৮ ૮ ୮ ੮
|
112 |
+
111 9 ൯ ௯ ౯ ೯ ९ ৯ ૯ ୯ ੯
|
113 |
+
112 rv r r r r ॰ ৰ ૰ ୰ r
|
114 |
+
113 wv w w w w ॱ ৱ ૱ ୱ w W
|
115 |
+
114 $ a a a a ॲ ৲ a ୲ a
|
116 |
+
115 $ a a a a ॳ ৳ a ୳ a
|
117 |
+
116 $ aa aa aa aa ॴ ৴ aa ୴ aa
|
118 |
+
117 $ ou ou ou ou ॵ ৵ ou ୵ ou
|
119 |
+
118 $ a a a a ॶ ৶ a ୶ a
|
120 |
+
119 $ a a a a ॷ ৷ a ୷ a
|
121 |
+
120 $ dx dx dx dx ॸ ৸ dx dx dx
|
122 |
+
121 $ j j j j ॹ ৹ z z z
|
123 |
+
122 nwv ൺ nx nx nx ॺ ৺ y y y ൺ
|
124 |
+
123 nnv ൻ n n n ॻ ৻ g g g N
|
125 |
+
124 rwv ർ rx rx rx ॼ j j j j ർ
|
126 |
+
125 lwv ൽ l l l ॽ sp sp sp sp ൽ
|
127 |
+
126 lnv ൾ l l l ॾ dx dx dx dx ൾ
|
128 |
+
127 $ b b b b ॿ b b b b
|
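Reading the rows of `common.map`: each row appears to hold a row index, a Common Label Set label, nine script columns, and on some rows a trailing single-character code. The script columns seem to follow the language IDs used in `globals.py` (Malayalam 1 through Punjabi 9), and a cell may hold a fallback label such as `k` instead of a glyph when the script lacks that letter. The following is a minimal sketch of a loader under those assumptions; `build_symbol_table` and `SAMPLE_ROWS` are hypothetical names and are not part of the uploaded code.

```python
# Hypothetical helper, not taken from the repository.
# Three rows copied from common.map, assumed whitespace-separated.
SAMPLE_ROWS = [
    "5 a അ அ అ ಅ अ অ અ ଅ ਅ",
    "21 k ക க క ಕ क ক ક କ ਕ",
    "22 kh ഖ k ఖ ಖ ख খ ખ ଖ ਖ ख",
]


def build_symbol_table(rows, lang_id):
    """Map each glyph in the column for `lang_id` (1-9) to its label."""
    table = {}
    for row in rows:
        cols = row.split()
        if len(cols) < 11:
            continue  # index + label + nine script columns expected
        label = cols[1]
        cell = cols[1 + lang_id]
        if cell.isascii():
            continue  # fallback label or '#', not a script glyph
        table[cell] = label
    return table


hindi = build_symbol_table(SAMPLE_ROWS, lang_id=5)
print(hindi)
```

Skipping ASCII cells keeps fallback labels (for example the Tamil cell of the `kh` row) from being treated as input characters.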
Unified_parser/common_hindi.map
ADDED
@@ -0,0 +1,128 @@
1 |
+
0 mq q q q q ऀ mq mq mq mq M
|
2 |
+
1 mq q q q q ँ ঁ ઁ ଁ ਁ
|
3 |
+
2 q ം ஂ m ಂ ं ং ં ଂ ਂ
|
4 |
+
3 hq ഃ ஃ ః ಃ ः ঃ ઃ ଃ ਃ H
|
5 |
+
4 a a a aa a ऄ a a a a
|
6 |
+
5 a അ அ అ ಅ o অ અ ଅ ਅ
|
7 |
+
6 aa ആ ஆ ఆ ಆ आ আ આ ଆ ਆ A
|
8 |
+
7 i ഇ இ ఇ ಇ इ ই ઇ ଇ ਇ
|
9 |
+
8 ii ഈ ஈ ఈ ಈ ई ঈ ઈ ଈ ਈ I
|
10 |
+
9 u ഉ உ ఉ ಉ उ উ ઉ ଉ ਉ
|
11 |
+
10 uu ഊ ஊ ఊ ಊ ऊ ঊ ઊ ଊ ਊ U
|
12 |
+
11 rq ഋ rx ఋ ಋ ऋ ঋ ઋ ଋ r R
|
13 |
+
12 uu uu uu uu uu l&i ঌ ઌ ଌ uu
|
14 |
+
13 ae e e e e ऍ ee ઍ ee ee ऍ
|
15 |
+
14 e എ எ ఎ ಎ ऎ ee ee ee ee
|
16 |
+
15 ee ഏ ஏ ఏ ಏ ए এ એ ଏ ਏ E
|
17 |
+
16 ei ഐ ஐ ఐ ಐ o&i ঐ ઐ ଐ ਐ ऐ
|
18 |
+
17 ax a a a a ऑ a ઑ a a ऑ
|
19 |
+
18 o ഒ ஒ ఒ ಒ ऒ oo oo oo oo
|
20 |
+
19 oo ഓ ஓ ఓ ಓ ओ ও ઓ ଓ ਓ O
|
21 |
+
20 ou ഔ ஔ ఔ ಔ औ ঔ ઔ ଔ ਔ औ
|
22 |
+
21 k ക க క ಕ क ক ક କ ਕ
|
23 |
+
22 kh ഖ k ఖ ಖ ख খ ખ ଖ ਖ ख
|
24 |
+
23 g ഗ k గ ಗ ग গ ગ ଗ ਗ
|
25 |
+
24 gh ഘ k ఘ ಘ घ ঘ ઘ ଘ ਘ घ
|
26 |
+
25 ng ങ ங ఙ ಙ ङ ঙ ઙ ଙ ਙ ङ
|
27 |
+
26 c ച ச చ ಚ च চ ચ ଚ ਚ
|
28 |
+
27 ch ഛ c ఛ ಛ छ ছ છ ଛ ਛ C
|
29 |
+
28 j ജ ஜ జ ಜ ज জ જ ଜ ਜ
|
30 |
+
29 jh ഝ j ఝ ಝ झ ঝ ઝ ଝ ਝ J
|
31 |
+
30 nj ഞ ஞ ఞ ಞ ञ ঞ ઞ ଞ ਞ ञ
|
32 |
+
31 tx ട ட ట ಟ ट ট ટ ଟ ਟ ट
|
33 |
+
32 txh ഠ tx ఠ ಠ ठ ঠ ઠ ଠ ਠ ठ
|
34 |
+
33 dx ഡ tx డ ಡ ड ড ડ ଡ ਡ ड
|
35 |
+
34 dxh ഢ tx ఢ ಢ ढ ঢ ઢ ଢ ਢ ढ
|
36 |
+
35 nx ണ ண ణ ಣ ण ণ ણ ଣ ਣ ण
|
37 |
+
36 t ത த త ತ त ত ત ତ ਤ
|
38 |
+
37 th ഥ t థ ಥ थ থ થ ଥ ਥ थ
|
39 |
+
38 d ദ t ద ದ द দ દ ଦ ਦ
|
40 |
+
39 dh ധ t ధ ಧ ध ধ ધ ଧ ਧ ध
|
41 |
+
40 n ന ந న ನ न ন ન ନ ਨ
|
42 |
+
41 nd ഩ ன n n ऩ n n n n न
|
43 |
+
42 p പ ப ప ಪ प প પ ପ ਪ
|
44 |
+
43 ph ഫ p ఫ ಫ फ ফ ફ ଫ ਫ P
|
45 |
+
44 b ബ p బ ಬ ब ব બ ବ ਬ
|
46 |
+
45 bh ഭ p భ ಭ भ ভ ભ ଭ ਭ B
|
47 |
+
46 m മ ம మ ಮ म ম મ ମ ਮ
|
48 |
+
47 y യ ய య ಯ j য ય ୟ ਯ
|
49 |
+
48 r ര ர ర ರ र র ર ର ਰ
|
50 |
+
49 rx റ ற r r ऱ r r r r र
|
51 |
+
50 l ല ல ల ಲ ल ল લ ଲ ਲ
|
52 |
+
51 lx ള ள ళ ಳ ळ l ળ ଳ ਲ਼ ള
|
53 |
+
52 zh ഴ ழ lx lx ऴ lx lx lx lx Z
|
54 |
+
53 w വ வ వ ವ व b વ ଵ ਵ
|
55 |
+
54 sh ശ ஶ sx ಶ श শ શ ଶ ਸ਼ श
|
56 |
+
55 sx ഷ ஷ ష ಷ ष ষ ષ ଷ # ष
|
57 |
+
56 s സ ஸ స ಸ स স સ ସ ਸ
|
58 |
+
57 h ഹ ஹ హ ಹ ह হ હ ହ ਹ
|
59 |
+
58 a a a a a ऺ a a a a
|
60 |
+
59 aav aav aav aav aav ऻ aav aav aav aav
|
61 |
+
60 nk a a a a ़ ় ઼ ଼ ਼ Y
|
62 |
+
61 ag a a a a ऽ ঽ ઽ ଽ # ऽ
|
63 |
+
62 aav ാ ா ా ಾ ा া ા ା ਾ
|
64 |
+
63 iv ി ி ి ಿ ि ি િ ି ਿ
|
65 |
+
64 iiv ീ ீ ీ ೀ ी ী ી ୀ ੀ
|
66 |
+
65 uv ു ு ు ು ु ু ુ ୁ ੁ
|
67 |
+
66 uuv ൂ ூ ూ ೂ ू ূ ૂ ୂ ੂ
|
68 |
+
67 rqv ൃ uv ృ ೃ ृ ৃ ૃ ୃ uv
|
69 |
+
68 rqwv ൄ uuv ౄ ೄ ॄ ৄ ૄ rqv uuv ॠ
|
70 |
+
69 aev ev ev ev ev ॅ eev eev eev eev
|
71 |
+
70 ev െ ெ ె ೆೆ ॆ eev eev ୄ eev
|
72 |
+
71 eev േ ே ే ೇ े ে ે େ ੇ
|
73 |
+
72 eiv ൈ ை ై ೇೈ ै ৈ ૈ ୈ ੈ ऐ
|
74 |
+
73 axv aav aav aav aav ॉ aav ૉ aav aav ऑ
|
75 |
+
74 ov ൊ ொ ొ ೊ ॊ oov oov oov oov
|
76 |
+
75 oov ോ ோ ో ೋ ो ো ો ୋ ੋ O
|
77 |
+
76 ouv ൌ ௌ ౌ ೌ ौ ৌ ૌ ୌ ੌ औ
|
78 |
+
77 eu ് ் ్ ್ ् ্ ્ ୍ ੍ உ
|
79 |
+
78 tv a a a a ॎ ৎ a a a
|
80 |
+
79 $ ouv ouv ouv ouv ॏ ouv ouv ouv ouv
|
81 |
+
80 $ # # # # ॐ o&u&m ૐ # ੴ
|
82 |
+
81 $ a a a a ॓ a a a a
|
83 |
+
82 $ a a a a ॔ a a a a
|
84 |
+
83 $ # # # # # # # # #
|
85 |
+
84 $ # # # # # # # # #
|
86 |
+
85 aav a a a a ॕ a a a a
|
87 |
+
86 aav a a a a ॖ a a ୖ a
|
88 |
+
87 auv ൗ a a a ॗ ৗ a ୗ a औ
|
89 |
+
88 kq k k k k क़ k k k k क
|
90 |
+
89 khq kh kh kh kh ख़ kh kh kh ਖ਼ K
|
91 |
+
90 gq g g g g ग़ g g g ਗ਼ G
|
92 |
+
91 z j j j j ज़ j j j ਜ਼
|
93 |
+
92 dxq dx dx dx dx ड़ rx dx ଡ଼ ੜ D
|
94 |
+
93 dxhq dxh dxh dxh dxh ढ़ ঢ় dxh ଢ଼ dxh T
|
95 |
+
94 f f f f f फ़ f f f ਫ਼
|
96 |
+
95 y y y y y य़ য় y ୟ y
|
97 |
+
96 rqw ൠ ற ౠ ೠ ॠ ৠ ૠ ୠ r ॠ
|
98 |
+
97 $ # # # # ॡ ৡ ૡ ୡ #
|
99 |
+
98 $ # # # # ॢ ৢ ૢ # #
|
100 |
+
99 $ # # # # ॣ ৣ ૣ ୢ #
|
101 |
+
100 $ # # # # । # # # #
|
102 |
+
101 $ # # # # ॥ # # ୣ #
|
103 |
+
102 0 ൦ ௦ ౦ ೦ ० ০ ૦ ୦ ੦
|
104 |
+
103 1 ൧ ௧ ౧ ೧ १ ১ ૧ ୧ ੧
|
105 |
+
104 2 ൨ ௨ ౨ ೨ २ ২ ૨ ୨ ੨
|
106 |
+
105 3 ൩ ௩ ౩ ೩ ३ ৩ ૩ ୩ ੩
|
107 |
+
106 4 ൪ ௪ ౪ ೪ ४ ৪ ૪ ୪ ੪
|
108 |
+
107 5 ൫ ௫ ౫ ೫ ५ ৫ ૫ ୫ ੫
|
109 |
+
108 6 ൬ ௬ ౬ ೬ ६ ৬ ૬ ୬ ੬
|
110 |
+
109 7 ൭ ௭ ౭ ೭ ७ ৭ ૭ ୭ ੭
|
111 |
+
110 8 ൮ ௮ ౮ ೮ ८ ৮ ૮ ୮ ੮
|
112 |
+
111 9 ൯ ௯ ౯ ೯ ९ ৯ ૯ ୯ ੯
|
113 |
+
112 rv r r r r ॰ ৰ ૰ ୰ r
|
114 |
+
113 wv w w w w ॱ ৱ ૱ ୱ w W
|
115 |
+
114 $ a a a a ॲ ৲ a ୲ a
|
116 |
+
115 $ a a a a ॳ ৳ a ୳ a
|
117 |
+
116 $ aa aa aa aa ॴ ৴ aa ୴ aa
|
118 |
+
117 $ ou ou ou ou ॵ ৵ ou ୵ ou
|
119 |
+
118 $ a a a a ॶ ৶ a ୶ a
|
120 |
+
119 $ a a a a ॷ ৷ a ୷ a
|
121 |
+
120 $ dx dx dx dx ॸ ৸ dx dx dx
|
122 |
+
121 $ j j j j ॹ ৹ z z z
|
123 |
+
122 nwv ൺ nx nx nx ॺ ৺ y y y ൺ
|
124 |
+
123 nnv ൻ n n n ॻ ৻ g g g N
|
125 |
+
124 rwv ർ rx rx rx ॼ j j j j ർ
|
126 |
+
125 lwv ൽ l l l ॽ sp sp sp sp ൽ
|
127 |
+
126 lnv ൾ l l l ॾ dx dx dx dx ൾ
|
128 |
+
127 $ b b b b ॿ b b b b
|
Unified_parser/common_telugu.map
ADDED
@@ -0,0 +1,128 @@
1 |
+
0 mq q q q q ऀ mq mq mq mq M
|
2 |
+
1 mq q q q q ँ ঁ ઁ ଁ ਁ
|
3 |
+
2 q ം ஂ ం ಂ ं ং ં ଂ ਂ
|
4 |
+
3 hq ഃ ஃ ః ಃ ः ঃ ઃ ଃ ਃ H
|
5 |
+
4 a a a a a ऄ a a a a
|
6 |
+
5 a അ அ అ ಅ अ অ અ ଅ ਅ
|
7 |
+
6 aa ആ ஆ ఆ ಆ आ আ આ ଆ ਆ A
|
8 |
+
7 i ഇ இ ఇ ಇ इ ই ઇ ଇ ਇ
|
9 |
+
8 ii ഈ ஈ ఈ ಈ ई ঈ ઈ ଈ ਈ I
|
10 |
+
9 u ഉ உ ఉ ಉ उ উ ઉ ଉ ਉ
|
11 |
+
10 uu ഊ ஊ ఊ ಊ ऊ ঊ ઊ ଊ ਊ U
|
12 |
+
11 rq ഋ rx ఋ ಋ ऋ ঋ ઋ ଋ r R
|
13 |
+
12 uu uu uu uu uu ऌ ঌ ઌ ଌ uu
|
14 |
+
13 ae e e e e ऍ ee ઍ ee ee ऍ
|
15 |
+
14 e എ எ ఎ ಎ ऎ ee ee ee ee
|
16 |
+
15 ee ഏ ஏ ఏ ಏ ए এ એ ଏ ਏ E
|
17 |
+
16 ei ഐ ஐ ee ಐ ऐ ঐ ઐ ଐ ਐ ऐ
|
18 |
+
17 ax a a a a ऑ a ઑ a a ऑ
|
19 |
+
18 o ഒ ஒ ఒ ಒ ऒ oo oo oo oo
|
20 |
+
19 oo ഓ ஓ ఓ ಓ ओ ও ઓ ଓ ਓ O
|
21 |
+
20 ou ഔ ஔ ఔ ಔ औ ঔ ઔ ଔ ਔ औ
|
22 |
+
21 k ക க క ಕ क ক ક କ ਕ
|
23 |
+
22 kh ഖ k ఖ ಖ ख খ ખ ଖ ਖ ख
|
24 |
+
23 g ഗ k గ ಗ ग গ ગ ଗ ਗ
|
25 |
+
24 gh ഘ k ఘ ಘ घ ঘ ઘ ଘ ਘ घ
|
26 |
+
25 ng ങ ங ఙ ಙ ङ ঙ ઙ ଙ ਙ ङ
|
27 |
+
26 c ച ச చ ಚ च চ ચ ଚ ਚ
|
28 |
+
27 ch ഛ c ఛ ಛ छ ছ છ ଛ ਛ C
|
29 |
+
28 j ജ ஜ జ ಜ ज জ જ ଜ ਜ
|
30 |
+
29 jh ഝ j ఝ ಝ झ ঝ ઝ ଝ ਝ J
|
31 |
+
30 nj ഞ ஞ ఞ ಞ ञ ঞ ઞ ଞ ਞ ञ
|
32 |
+
31 tx ട ட ట ಟ ट ট ટ ଟ ਟ ट
|
33 |
+
32 txh ഠ tx ఠ ಠ ठ ঠ ઠ ଠ ਠ ठ
|
34 |
+
33 dx ഡ tx డ ಡ ड ড ડ ଡ ਡ ड
|
35 |
+
34 dxh ഢ tx ఢ ಢ ढ ঢ ઢ ଢ ਢ ढ
|
36 |
+
35 nx ണ ண ణ ಣ ण ণ ણ ଣ ਣ ण
|
37 |
+
36 t ത த త ತ त ত ત ତ ਤ
|
38 |
+
37 th ഥ t థ ಥ थ থ થ ଥ ਥ थ
|
39 |
+
38 d ദ t ద ದ द দ દ ଦ ਦ
|
40 |
+
39 dh ധ t ధ ಧ ध ধ ધ ଧ ਧ ध
|
41 |
+
40 n ന ந న ನ न ন ન ନ ਨ
|
42 |
+
41 nd ഩ ன n n ऩ n n n n न
|
43 |
+
42 p പ ப ప ಪ प প પ ପ ਪ
|
44 |
+
43 ph ഫ p ఫ ಫ फ ফ ફ ଫ ਫ P
|
45 |
+
44 b ബ p బ ಬ ब ব બ ବ ਬ
|
46 |
+
45 bh ഭ p భ ಭ भ ভ ભ ଭ ਭ B
|
47 |
+
46 m മ ம మ ಮ म ম મ ମ ਮ
|
48 |
+
47 y യ ய య ಯ य য ય ୟ ਯ
|
49 |
+
48 r ര ர ర ರ र র ર ର ਰ
|
50 |
+
49 rx റ ற r r ऱ r r r r र
|
51 |
+
50 l ല ல ల ಲ ल ল લ ଲ ਲ
|
52 |
+
51 lx ള ள ళ ಳ ळ l ળ ଳ ਲ਼ ള
|
53 |
+
52 zh ഴ ழ lx lx ऴ lx lx lx lx Z
|
54 |
+
53 w വ வ వ ವ व b વ ଵ ਵ
|
55 |
+
54 sh ശ ஶ శ ಶ श শ શ ଶ ਸ਼ श
|
56 |
+
55 sx ഷ ஷ ష ಷ ष ষ ષ ଷ # ष
|
57 |
+
56 s സ ஸ స ಸ स স સ ସ ਸ
|
58 |
+
57 h ഹ ஹ హ ಹ ह হ હ ହ ਹ
|
59 |
+
58 a a a a a ऺ a a a a
|
60 |
+
59 aav aav aav aav aav ऻ aav aav aav aav
|
61 |
+
60 nk a a a a ़ ় ઼ ଼ ਼ Y
|
62 |
+
61 ag a a a a ऽ ঽ ઽ ଽ # ऽ
|
63 |
+
62 aav ാ ா ా ಾ ा া ા ା ਾ
|
64 |
+
63 iv ി ி ి ಿ ि ি િ ି ਿ
|
65 |
+
64 iiv ീ ீ ీ ೀ ी ী ી ୀ ੀ
|
66 |
+
65 uv ു ு ు ು ु ু ુ ୁ ੁ
|
67 |
+
66 uuv ൂ ூ ూ ೂ ू ূ ૂ ୂ ੂ
|
68 |
+
67 rqv ൃ uv ృ ೃ ृ ৃ ૃ ୃ uv
|
69 |
+
68 rqwv ൄ uuv ౄ ೄ ॄ ৄ ૄ rqv uuv ॠ
|
70 |
+
69 aev ev ev ev ev ॅ eev eev eev eev
|
71 |
+
70 ev െ ெ ె ೆೆ ॆ eev eev ୄ eev
|
72 |
+
71 eev േ ே ే ೇ े ে ે େ ੇ
|
73 |
+
72 eiv ൈ ை eev ೇೈ ै ৈ ૈ ୈ ੈ ऐ
|
74 |
+
73 axv aav aav aav aav ॉ aav ૉ aav aav ऑ
|
75 |
+
74 ov ൊ ொ ొ ೊ ॊ oov oov oov oov
|
76 |
+
75 oov ോ ோ ో ೋ ो ো ો ୋ ੋ O
|
77 |
+
76 ouv ൌ ௌ ౌ ೌ ौ ৌ ૌ ୌ ੌ औ
|
78 |
+
77 eu ് ் ్ ್ ् ্ ્ ୍ ੍ உ
|
79 |
+
78 tv a a a a ॎ ৎ a a a
|
80 |
+
79 $ ouv ouv ouv ouv ॏ ouv ouv ouv ouv
|
81 |
+
80 $ # # # # ॐ # ૐ # ੴ
|
82 |
+
81 $ a a a a ॓ a a a a
|
83 |
+
82 $ a a a a ॔ a a a a
|
84 |
+
83 $ # # # # # # # # #
|
85 |
+
84 $ # # # # # # # # #
|
86 |
+
85 aav a a a a ॕ a a a a
|
87 |
+
86 aav a a a a ॖ a a ୖ a
|
88 |
+
87 auv ൗ a a a ॗ ৗ a ୗ a औ
|
89 |
+
88 kq k k k k क़ k k k k क
|
90 |
+
89 khq kh kh kh kh ख़ kh kh kh ਖ਼ K
|
91 |
+
90 gq g g g g ग़ g g g ਗ਼ G
|
92 |
+
91 z j j j j ज़ j j j ਜ਼
|
93 |
+
92 dxq dx dx dx dx ड़ ড় dx ଡ଼ ੜ D
|
94 |
+
93 dxhq dxh dxh dxh dxh ढ़ ঢ় dxh ଢ଼ dxh T
|
95 |
+
94 f f f f f फ़ f f f ਫ਼
|
96 |
+
95 y y y y y य़ য় y ୟ y
|
97 |
+
96 rqw ൠ ற ౠ ೠ ॠ ৠ ૠ ୠ r ॠ
|
98 |
+
97 $ # # # # ॡ ৡ ૡ ୡ #
|
99 |
+
98 $ # # # # ॢ ৢ ૢ # #
|
100 |
+
99 $ # # # # ॣ ৣ ૣ ୢ #
|
101 |
+
100 $ # # # # । # # # #
|
102 |
+
101 $ # # # # ॥ # # ୣ #
|
103 |
+
102 0 ൦ ௦ ౦ ೦ ० ০ ૦ ୦ ੦
|
104 |
+
103 1 ൧ ௧ ౧ ೧ १ ১ ૧ ୧ ੧
|
105 |
+
104 2 ൨ ௨ ౨ ೨ २ ২ ૨ ୨ ੨
|
106 |
+
105 3 ൩ ௩ ౩ ೩ ३ ৩ ૩ ୩ ੩
|
107 |
+
106 4 ൪ ௪ ౪ ೪ ४ ৪ ૪ ୪ ੪
|
108 |
+
107 5 ൫ ௫ ౫ ೫ ५ ৫ ૫ ୫ ੫
|
109 |
+
108 6 ൬ ௬ ౬ ೬ ६ ৬ ૬ ୬ ੬
|
110 |
+
109 7 ൭ ௭ ౭ ೭ ७ ৭ ૭ ୭ ੭
|
111 |
+
110 8 ൮ ௮ ౮ ೮ ८ ৮ ૮ ୮ ੮
|
112 |
+
111 9 ൯ ௯ ౯ ೯ ९ ৯ ૯ ୯ ੯
|
113 |
+
112 rv r r r r ॰ ৰ ૰ ୰ r
|
114 |
+
113 wv w w w w ॱ ৱ ૱ ୱ w W
|
115 |
+
114 $ a a a a ॲ ৲ a ୲ a
|
116 |
+
115 $ a a a a ॳ ৳ a ୳ a
|
117 |
+
116 $ aa aa aa aa ॴ ৴ aa ୴ aa
|
118 |
+
117 $ ou ou ou ou ॵ ৵ ou ୵ ou
|
119 |
+
118 $ a a a a ॶ ৶ a ୶ a
|
120 |
+
119 $ a a a a ॷ ৷ a ୷ a
|
121 |
+
120 $ dx dx dx dx ॸ ৸ dx dx dx
|
122 |
+
121 $ j j j j ॹ ৹ z z z
|
123 |
+
122 nwv ൺ nx nx nx ॺ ৺ y y y ൺ
|
124 |
+
123 nnv ൻ n n n ॻ ৻ g g g N
|
125 |
+
124 rwv ർ rx rx rx ॼ j j j j ർ
|
126 |
+
125 lwv ൽ l l l ॽ sp sp sp sp ൽ
|
127 |
+
126 lnv ൾ l l l ॾ dx dx dx dx ൾ
|
128 |
+
127 $ b b b b ॿ b b b b
|
Unified_parser/dict/english.dict
ADDED
The diff for this file is too large to render.
Unified_parser/dict/english.dict_old
ADDED
@@ -0,0 +1 @@
+english noun ( (( "Ei" "n" "g" ) 0)(( "l" "i" "sh" ) 0) )
Unified_parser/dict/hindi.dict1
ADDED
@@ -0,0 +1 @@
+अंगारित ( (( "अं" ) 0) (( "गा" ) 0) (( "रित्" ) 0) ) ( (( "a" "q" ) 0) (( "g" "aa" ) 0) (( "r" "i" "t" ) 0) )
Unified_parser/dict/malayalam.dict
ADDED
@@ -0,0 +1 @@
+സ്ത്രീ ( (( "സ്ത്രീ" ) 0) ) ( (( "s" "t" "r" "ii" ) 0) )
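The dictionary entries above store each word as a Lisp-like list of syllables, where every syllable has the shape `(( "s1" "s2" ... ) 0)`. As a rough illustration of how such a string could be flattened into space-separated labels, here is a small sketch; `parse_syllables` is a hypothetical helper written for this note, and it is not claimed to match how the parser itself reads or writes these entries.

```python
import re


def parse_syllables(entry):
    """Return a list of syllables, each a list of the quoted symbols."""
    syllables = []
    # One syllable looks like: (( "g" "aa" ) 0)
    for body in re.findall(r'\(\(\s*([^()]*?)\s*\)\s*\d+\s*\)', entry):
        syllables.append(re.findall(r'"([^"]*)"', body))
    return syllables


entry = '( (( "a" "q" ) 0) (( "g" "aa" ) 0) (( "r" "i" "t" ) 0) )'
sylls = parse_syllables(entry)
print(sylls)
# Flatten to a space-separated label string
print(" ".join(label for syl in sylls for label in syl))
```

The same function also handles entries where syllables are written back to back without a space, as in the `english.dict_old` line.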
Unified_parser/extract_words.py
ADDED
@@ -0,0 +1,33 @@
+import os, shutil
+from uparser import wordparse
+from joblib import Parallel, delayed
+from tqdm import tqdm
+
+num_jobs = 20
+infolder = 'Original'
+outfolder = 'Words'
+
+for fdr in [outfolder]:
+    if os.path.exists(fdr):
+        shutil.rmtree(fdr)
+    os.mkdir(fdr)
+
+flist = os.listdir(infolder)
+for fname in flist:
+    with open(f'{infolder}/{fname}', 'r') as f:
+        cnts = f.readlines()
+
+    i = 0
+
+    words = []
+    for l in cnts:
+        l = l.strip().split('\t')
+        words.append(l[0])
+
+    fout = fname.split('_')[1]
+    fout = fout.split('.')[0]
+    print(fout)
+
+    with open(f'{outfolder}/{fout}.words', 'w') as f:
+        for w in words:
+            f.write(w + '\n')
Unified_parser/get_phone_mapped_python.py
ADDED
@@ -0,0 +1,76 @@
+class TextReplacer:
+    def __init__(self):
+        self.replacements = {
+            'aa':'A',
+            'ae':'ऍ',
+            'ag':'ऽ',
+            'ai':'ऐ',
+            'au':'औ',
+            'axx':'अ',
+            'ax':'ऑ',
+            'bh':'B',
+            'ch':'C',
+            'dh':'ध',
+            'dxhq':'T',
+            'dxh':'ढ',
+            'dxq':'D',
+            'dx':'ड',
+            'ee':'E',
+            'ei':'ऐ',
+            'eu':'உ',
+            'gh':'घ',
+            'gq':'G',
+            'hq':'H',
+            'ii':'I',
+            'jh':'J',
+            'khq':'K',
+            'kh':'ख',
+            'kq':'क',
+            'ln':'ൾ',
+            'lw':'ൽ',
+            'lx':'ള',
+            'mq':'M',
+            'nd':'ऩ',
+            'ng':'ङ',
+            'nj':'ञ',
+            'nk':'Y',
+            'nn':'N',
+            'nw':'ൺ',
+            'nx':'ण',
+            'oo':'O',
+            'ou':'औ',
+            'ph':'P',
+            'rqw':'ॠ',
+            'rq':'R',
+            'rw':'ർ',
+            'rx':'ऱ',
+            'sh':'श',
+            'sx':'ष',
+            'txh':'ठ',
+            'th':'थ',
+            'tx':'ट',
+            'uu':'U',
+            'wv':'W',
+            'zh':'Z'
+
+            # ... Add more replacements as needed
+        }
+
+    def apply_replacements(self, text):
+        for key, value in self.replacements.items():
+            # print('KEY AND VALUE OF PARSED OUTPUT',key, value)
+            text = text.replace(key, value)
+        temp=""
+        for i in range(len(text)):
+            if text[i]!=" ":
+                temp=temp+text[i]
+
+        return temp
+
+    def apply_replacements_by_phonems(self, text):
+        ans=self.replacements[text]
+        # for key, value in self.replacements.items():
+        #     # print('KEY AND VALUE OF PARSED OUTPUT',key, value)
+        #     text = text.replace(key, value)
+        return ans
+
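A note on `apply_replacements`: because it calls `str.replace` once per key, the result depends on the dictionary's insertion order (longer keys such as `dxhq` must come before `dxh` and `dx`), and a key also matches inside any longer label that contains it (for instance `aa` inside `aav`). The sketch below shows an order-independent alternative that looks up each label whole. It assumes the input is a space-separated label string; `map_tokens` and the reduced `REPLACEMENTS` table are hypothetical and only illustrate the idea.

```python
# Subset of the mapping in TextReplacer, for illustration only.
REPLACEMENTS = {"aa": "A", "bh": "B", "dxhq": "T", "dxh": "ढ", "dx": "ड"}


def map_tokens(text, table=REPLACEMENTS):
    """Replace each whitespace-separated label by its mapped symbol.

    Labels without an entry are kept unchanged; spaces are dropped,
    as in TextReplacer.apply_replacements.
    """
    return "".join(table.get(tok, tok) for tok in text.split())


print(map_tokens("bh aa r a t"))  # BArat
print(map_tokens("dx aav"))       # 'aav' has no entry, so it is left intact
```

With whole-label lookup the order of entries in the table no longer matters, at the cost of requiring the caller to keep labels separated by spaces.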
Unified_parser/globals.py
ADDED
@@ -0,0 +1,71 @@
+# global CONSTANTs for languages. Uses the same values as the enum at
+# lines 11-13 of unified.y
+
+import sys, os
+SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
+
+class FLAGS:
+    DEBUG = False
+    parseLevel = 0
+    syllTagFlag = 0
+    LangSpecificCorrectionFlag = 1
+    writeFormat = 0
+
+class WORDS:
+    wordCopy = ""
+    syllabifiedWord = ""
+    phonifiedWord = ""
+    unicodeWord = ""
+    syllabifiedWordOut = ""
+    outputText = ""
+
+class STRINGS:
+    bi = 0
+    leftStr = ['' for _ in range(1100)]
+    rightStr = ['' for _ in range(1100)]
+    def refresh(self):
+        self.leftStr = ['' for _ in range(1100)]
+        self.rightStr = ['' for _ in range(1100)]
+        self.bi = 0
+
+class GLOBALS:
+    def __init__(self):
+        self.flags = FLAGS()
+        self.words = WORDS()
+        self.combvars = STRINGS()
+
+        self.MALAYALAM = 1
+        self.TAMIL = 2
+        self.TELUGU = 3
+        self.KANNADA = 4
+        self.HINDI = 5
+        self.BENGALI = 6
+        self.GUJARATHI = 7
+        self.ODIYA = 8
+        self.PUNJABI = 9
+        self.ENGLISH = 10 # new value from 9 to 10
+
+        self.langId = 0
+        self.isSouth = False
+        self.syllableCount = 0
+
+        self.rootPath = SCRIPT_DIR+'/'
+        self.commonFile = "common.map"
+        self.outputFile = ""
+
+        self.symbolTable = [['' for _ in range(2)] for _ in range(128)]
+        self.ROW = 128
+        self.COL = 2
+        self.syllableList = []
+
+        self.VOWELSSIZE=18
+        self.CONSONANTSSIZE=34
+        self.SEMIVOWELSSIZE=4
+
+        self.VOWELS = ["a","e","i","o","u","aa","mq","aa","ii", "uu","rq","au","ee","ei","ou","oo","ax","ai"]
+        self.CONSONANTS = ["k","kh","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh","nx","t","th","d","dh","n","p","ph","b","bh","m","sh","sx","zh","s","h","lx","rx","f","dxq"]
+        self.SEMIVOWELS = ["y","w","r","l",]
+
+        # variable to indicate current language being parsed.
+        self.currLang = self.ENGLISH
+        self.answer = ''
Unified_parser/helpers.py
ADDED
@@ -0,0 +1,1031 @@
1 |
+
# import sys, os
|
2 |
+
# SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
|
3 |
+
# sys.path.append(SCRIPT_DIR)
|
4 |
+
|
5 |
+
from Unified_parser.globals import *
|
6 |
+
# contains helper functions used in parser.py
|
7 |
+
|
8 |
+
# repeatedly replace the substring sub with tar in input until no further change happens
|
9 |
+
def rec_replace(input : str, sub : str, tar : str):
|
10 |
+
while True:
|
11 |
+
output = input.replace(sub, tar)
|
12 |
+
if output == input:
|
13 |
+
break
|
14 |
+
input = output
|
15 |
+
return output
|
16 |
+
|
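To make the behaviour of `rec_replace` concrete, here is a minimal standalone sketch of the same fixed-point idea. The name `replace_until_stable` and the up-front check are assumptions added for illustration and are not part of the repository. The check only catches the most common way the loop can run forever (when `tar` contains `sub`); it is not a general termination proof.

```python
def replace_until_stable(text: str, sub: str, tar: str) -> str:
    """Apply text.replace(sub, tar) until the string stops changing."""
    # Hypothetical guard: if tar contains sub, every pass creates new matches.
    if sub != tar and sub in tar:
        raise ValueError("tar contains sub, so the loop would never settle")
    while True:
        updated = text.replace(sub, tar)
        if updated == text:
            return updated
        text = updated


# One pass of str.replace is not enough to collapse a long run of '&':
# 'a&&&&b' -> 'a&&b' -> 'a&b'
print(replace_until_stable("a&&&&b", "&&", "&"))
```

This is why the helpers below call `rec_replace` rather than `str.replace` whenever they collapse `&&` into `&`.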
17 |
+
# function - RemoveUnwanted() - referenced in lines 63 - 109 of unified.y
|
18 |
+
def RemoveUnwanted(input : str) -> str:
|
19 |
+
# ignore punctuation
|
20 |
+
punctuationList = ["!",";",":","@","#","$","%","^","&","*",",",".","/","'","’","”","“","।", "]", "[", "×", "ñ", "∙","•"]
|
21 |
+
|
22 |
+
# replacing problematic unicode characters that look the same but have different encodings.
|
23 |
+
# punjabi update
|
24 |
+
replaceDict = {"ऩ":"ऩ", "ऱ":"ऱ", "क़":"क़", "ख़":"ख़", "ग़":"ग़", "ज़":"ज़", "ड़":"ड़", "ढ़":"ढ़", "ढ़":"ढ़", "फ़":"फ़", "य़":"य़", "ऴ":"ऴ",
|
25 |
+
"ொ":"ொ", "ோ":"ோ",
|
26 |
+
"ൊ":"ൊ", "ോ":"ോ", "ല്":"ൽ", "ള്":"ൾ", "ര്":"ർ", "ന്":"ൻ", "ണ്":"ൺ"}
|
27 |
+
|
28 |
+
output = ""
|
29 |
+
for c in input:
|
30 |
+
if c in punctuationList:
|
31 |
+
continue
|
32 |
+
output += c
|
33 |
+
|
34 |
+
for k in replaceDict.keys():
|
35 |
+
output = rec_replace(output, k, replaceDict[k])
|
36 |
+
return output
|
37 |
+
|
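The two steps in `RemoveUnwanted` (drop punctuation, then unify look-alike Unicode sequences) can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function name, the shortened punctuation set, and above all the direction of the mappings (decomposed sequence to single code point) are assumptions, because the rendered characters in the dictionary above do not show which side is which.

```python
# Subset of the punctuation list above, kept short for the example.
PUNCTUATION = "!;:@#$%^&*,./'"

# Assumed direction: two-code-point sequence -> single precomposed code point.
REPLACEMENTS = {
    "\u0928\u093c": "\u0929",  # DEVANAGARI NA + NUKTA -> NNNA
    "\u0bc6\u0bbe": "\u0bca",  # TAMIL vowel sign E + AA -> vowel sign O
}


def strip_and_unify(text: str) -> str:
    # Remove every punctuation character in a single pass.
    text = text.translate({ord(ch): None for ch in PUNCTUATION})
    # Rewrite each look-alike sequence to its chosen form.
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


print(len(strip_and_unify("\u0928\u093c.")))  # one code point remains
```

Spelling out the code points with `\u` escapes, as here, avoids the ambiguity of a table whose keys and values render identically.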
38 |
+
# function to replace GetFile in lines 132 - 156 of unified.y
|
39 |
+
# gives the filename according to language and type
|
40 |
+
def GetFile(g : GLOBALS, LangId : int, type : int) -> str:
|
41 |
+
fileName = g.rootPath
|
42 |
+
|
43 |
+
# return common file that contains the CPS mapping
|
44 |
+
if type == 0:
|
45 |
+
fileName += g.commonFile
|
46 |
+
#print("file",fileName)
|
47 |
+
return fileName
|
48 |
+
|
49 |
+
elif type == 1:
|
50 |
+
fileName += "dict/"
|
51 |
+
|
52 |
+
elif type == 2:
|
53 |
+
fileName += "rules/"
|
54 |
+
|
55 |
+
langIdNameMapping = { 1 : "malayalam", 2 : "tamil", 3 : "telugu",
|
56 |
+
4 : "kannada", 5 : "hindi", 6 : "bengali",
|
57 |
+
7 : "gujarathi", 8 : "odiya", 9 : "punjabi", 10 : "english" }
|
58 |
+
|
59 |
+
if LangId in langIdNameMapping.keys():
|
60 |
+
fileName += langIdNameMapping[LangId]
|
61 |
+
|
62 |
+
if type == 1:
|
63 |
+
fileName += ".dict"
|
64 |
+
elif type == 2:
|
65 |
+
fileName += ".rules"
|
66 |
+
|
67 |
+
return fileName
|
68 |
+
|
69 |
+
# function to replace SetlangId in lines 62-80 of unified.y
|
70 |
+
def SetlangId(g : GLOBALS, fl : str):
|
71 |
+
id = ord(fl)
|
72 |
+
if(id>=3328 and id<=3455):
|
73 |
+
g.currLang = g.MALAYALAM; #malayalam
|
74 |
+
elif(id>=2944 and id<=3055):
|
75 |
+
g.currLang = g.TAMIL; #tamil
|
76 |
+
elif(id>=3202 and id<=3311):
|
77 |
+
g.currLang = g.KANNADA; #KANNADA
|
78 |
+
elif(id>=3072 and id<=3198):
|
79 |
+
g.currLang = g.TELUGU; #telugu
|
80 |
+
elif(id>=2304 and id<=2431):
|
81 |
+
g.currLang = g.HINDI; #hindi
|
82 |
+
elif(id>=2432 and id<=2559):
|
83 |
+
g.currLang = g.BENGALI; #BENGALI
|
84 |
+
elif(id>=2688 and id<=2815):
|
85 |
+
g.currLang = g.GUJARATHI; #gujarathi
|
86 |
+
elif(id>=2816 and id<=2943):
|
87 |
+
g.currLang = g.ODIYA; #odia
|
88 |
+
elif(id>=2560 and id <= 2687): # punjabi
|
89 |
+
g.currLang = g.PUNJABI
|
90 |
+
elif(id>=64 and id<=123):
|
91 |
+
g.currLang = g.ENGLISH; #english
|
92 |
+
|
93 |
+
g.langId = g.currLang
|
94 |
+
|
95 |
+
if(g.langId < 5):
|
96 |
+
g.isSouth = 1
|
97 |
+
if(g.langId == 0):
|
98 |
+
print(f"UNKNOWN LANGUAGE - id = {fl}")
|
99 |
+
exit(0)
|
100 |
+
return 1
|
101 |
+
|
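The chain of `elif` range checks in `SetlangId` amounts to a lookup of the first character's code point in a table of script ranges. A minimal sketch of that idea follows; the ranges are copied from the function above, while the table layout, the function name, and returning a name instead of setting `g.currLang` are assumptions made for the example.

```python
# (low, high, language) triples, in the same order as the checks above.
SCRIPT_RANGES = [
    (3328, 3455, "malayalam"),
    (2944, 3055, "tamil"),
    (3202, 3311, "kannada"),
    (3072, 3198, "telugu"),
    (2304, 2431, "hindi"),
    (2432, 2559, "bengali"),
    (2688, 2815, "gujarathi"),
    (2816, 2943, "odiya"),
    (2560, 2687, "punjabi"),
    (64, 123, "english"),
]


def detect_language(first_letter: str):
    """Return the language name for a character, or None if no range matches."""
    code = ord(first_letter)
    for low, high, name in SCRIPT_RANGES:
        if low <= code <= high:
            return name
    return None


print(detect_language("\u0d05"))  # MALAYALAM LETTER A
print(detect_language("a"))
```

Note that the detection is per script, not per language: any text in Devanagari is labelled `hindi`, for example.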
102 |
+
# replacement for function in lines 158 - 213. Sets the language features
|
103 |
+
def SetlanguageFeat(g : GLOBALS, input : str) -> int:
|
104 |
+
|
105 |
+
# open common file
|
106 |
+
#print("entered here")
|
107 |
+
try:
|
108 |
+
with open(GetFile(g, 0,0), 'r') as infile:
|
109 |
+
lines = infile.readlines()
|
110 |
+
#print("linessss", lines)
|
111 |
+
|
112 |
+
except (OSError, UnicodeError):
|
113 |
+
print("Couldn't open common file for reading")
|
114 |
+
return 0
|
115 |
+
|
116 |
+
str1 = input
|
117 |
+
length = len(str1)
|
118 |
+
if (length == 0):
|
119 |
+
return 0  # empty input: there is no first letter to detect the language from
|
120 |
+
|
121 |
+
for j in range(0,length):
|
122 |
+
# for skipping invisible char
|
123 |
+
if (ord(str1[j]) < 8204):
|
124 |
+
firstLet = str1[j]
|
125 |
+
break
|
126 |
+
|
127 |
+
SetlangId(g, firstLet) # set global langId
|
128 |
+
for i in range(len(lines)):
|
129 |
+
l = lines[i].strip().split('\t')
|
130 |
+
g.symbolTable[i][1] = l[1]
|
131 |
+
g.symbolTable[i][0] = l[1 + g.langId]
|
132 |
+
|
133 |
+
return 1
|
134 |
+
|
135 |
+
# replacement for function in lines 52 - 59. Check if symbol is in symbolTable
|
136 |
+
def CheckSymbol(g : GLOBALS, input : str) -> int:
|
137 |
+
i = 0
|
138 |
+
for i in range(g.ROW):
|
139 |
+
if (g.symbolTable[i][1] == input):
|
140 |
+
return 1
|
141 |
+
return 0
|
142 |
+
|
143 |
+
# replacement for function in lines 249 - 276. Convert utf-8 to cps symbols
|
144 |
+
def ConvertToSymbols(g : GLOBALS, input : str) -> str:
|
145 |
+
str1 = input
|
146 |
+
|
147 |
+
g.words.syllabifiedWord = "&"
|
148 |
+
for j in range(len(str1)):
|
149 |
+
if (ord(str1[j]) < 8204):
|
150 |
+
g.words.syllabifiedWord += "&" + g.symbolTable[ord(str1[j])%128][1]
|
151 |
+
|
152 |
+
g.words.syllabifiedWord = g.words.syllabifiedWord[1:]
|
153 |
+
return g.words.syllabifiedWord
|
154 |
+
|
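`ConvertToSymbols` indexes the symbol table with `ord(ch) % 128`. This works because each of the Indic Unicode blocks handled here starts at a multiple of 128, so the remainder is the character's offset inside its block, and corresponding letters of different scripts share an offset. The short sketch below only demonstrates that arithmetic; the helper name is an assumption.

```python
def block_offset(ch: str) -> int:
    """Offset of a character inside its 128-code-point block."""
    return ord(ch) % 128


# The letter KA in Devanagari, Tamil and Malayalam.
for ch in ("\u0915", "\u0b95", "\u0d15"):
    print(hex(ord(ch)), block_offset(ch))
```

All three lines print the same offset, which is why a single 128-row table can serve every supported script.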
155 |
+
# function in lines 1278 - 1299. save answer in g.answer
|
156 |
+
def WriteFile(g : GLOBALS, text : str):
|
157 |
+
g.answer = f"(set! wordstruct '( {text}))"
|
158 |
+
|
159 |
+
# function in lines 588-597. check if vowel is in input. 'q' special case, 'rq' special case
|
160 |
+
def CheckVowel(input : str, q : int, rq : int) -> int:
|
161 |
+
if (input.find("a") != -1):
|
162 |
+
return 1
|
163 |
+
if (input.find("e") != -1):
|
164 |
+
return 1
|
165 |
+
if (input.find("i") != -1):
|
166 |
+
return 1
|
167 |
+
if (input.find("o") != -1):
|
168 |
+
return 1
|
169 |
+
if (input.find("u") != -1):
|
170 |
+
return 1
|
171 |
+
if (q and input.find("q") != -1):
|
172 |
+
return 1
|
173 |
+
if (rq and input.find("rq") != -1):
|
174 |
+
return 1
|
175 |
+
return 0
|
176 |
+
|
177 |
+
# function in lines 599-602.
|
178 |
+
def Checkeuv(input : str) -> int:
|
179 |
+
if (input.find("euv") != -1):
|
180 |
+
return 1
|
181 |
+
return 0
|
182 |
+
|
183 |
+
# function in lines 605-613
|
184 |
+
def CheckSingleVowel(input : str, q : int) -> int:
|
185 |
+
if (input in ['a', 'e', 'i', 'o', 'u']):
|
186 |
+
return 1
|
187 |
+
if (q != 0 and input == 'q'):
|
188 |
+
return 1
|
189 |
+
return 0
|
190 |
+
|
191 |
+
# function in lines 616 - 629. get the type of phone in the position
|
192 |
+
def GetPhoneType(g : GLOBALS, input : str, pos : int) -> int:
|
193 |
+
phone = input
|
194 |
+
phone = phone.split('&')
|
195 |
+
phone = list(filter(lambda x : x != '', phone))
|
196 |
+
pos = min(pos, len(phone))
|
197 |
+
pch = phone[pos - 1]
|
198 |
+
|
199 |
+
if (g.flags.DEBUG):
|
200 |
+
print(f'input : {input}')
|
201 |
+
print(f"str : {pch} {GetType(g, pch)}")
|
202 |
+
|
203 |
+
return GetType(g, pch)
|
204 |
+
|
205 |
+
# function in lines 631 - 637. get the type of given input
|
206 |
+
def GetType(g : GLOBALS, input : str):
|
207 |
+
for i in range(g.VOWELSSIZE):
|
208 |
+
if g.VOWELS[i] == input:
|
209 |
+
return 1
|
210 |
+
for i in range(g.CONSONANTSSIZE):
|
211 |
+
if g.CONSONANTS[i] == input:
|
212 |
+
return 2
|
213 |
+
for i in range(g.SEMIVOWELSSIZE):
|
214 |
+
if g.SEMIVOWELS[i] == input:
|
215 |
+
return 3
|
216 |
+
return 0
|
217 |
+
|
218 |
+
# function in lines 640 - 647. check if chillaksharas are there --for malayalam
|
219 |
+
def CheckChillu(input : str) -> int:
|
220 |
+
l = ["nwv", "nnv", "rwv", "lwv", "lnv"]
|
221 |
+
for x in l:
|
222 |
+
if (input.find(x) != -1):
|
223 |
+
return 1
|
224 |
+
|
225 |
+
return 0
|
226 |
+
|
227 |
+
# function in lines 650 - 660. get UTF-8 from CPS
|
228 |
+
def GetUTF(g : GLOBALS, input : str) -> str :
|
229 |
+
for i in range(g.ROW):
|
230 |
+
if (input == g.symbolTable[i][1]):
|
231 |
+
return g.symbolTable[i][0]
|
232 |
+
|
233 |
+
return 0
|
234 |
+
|
235 |
+
# function in lines 663 - 666. verify the letter is english char -- CLS
|
236 |
+
def isEngLetter(p : str) -> int:
|
237 |
+
if (ord(p) >= 97 and ord(p) <= 122):
|
238 |
+
return 1
|
239 |
+
return 0
|
240 |
+
|
241 |
+
# function in lines 669-682. remove unwanted Symbols from word
|
242 |
+
def CleanseWord(phone : str) -> str:
|
243 |
+
phonecopy = ""
|
244 |
+
for c in phone:
|
245 |
+
if (c != '&' and isEngLetter(c) == 0):
|
246 |
+
c = '#'
|
247 |
+
phonecopy += c
|
248 |
+
phonecopy = rec_replace(phonecopy, '$','')
|
249 |
+
phonecopy = rec_replace(phonecopy, '&&','&')
|
250 |
+
return phonecopy
|
251 |
+
|
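As a worked example of what `CleanseWord` produces, the sketch below restates its logic in a compact form. The function name is an assumption. One observation that follows from reading the original: any `$` has already been turned into `#` by the time the `$` removal runs, so that step has no effect and is omitted here.

```python
def cleanse(phone: str) -> str:
    # Keep '&' and lowercase ASCII letters; every other character becomes '#'.
    marked = "".join(c if c == "&" or "a" <= c <= "z" else "#" for c in phone)
    # Collapse repeated separators into a single '&'.
    while "&&" in marked:
        marked = marked.replace("&&", "&")
    return marked


print(cleanse("&k&A&&b$"))
```

Here the uppercase `A` and the `$` are both replaced by `#`, and the doubled `&&` is collapsed.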
252 |
+
# replacement for function in lines 321 - 356. Correct if there is a vowel in the middle
|
253 |
+
def MiddleVowel(g : GLOBALS, phone : str) -> str:
|
254 |
+
|
255 |
+
c1 = ''
|
256 |
+
c2 = ''
|
257 |
+
phonecopy = phone
|
258 |
+
for i in range(g.CONSONANTSSIZE):
|
259 |
+
for j in range(g.VOWELSSIZE):
|
260 |
+
c1 = f'&{g.CONSONANTS[i]}&{g.VOWELS[j]}&'
|
261 |
+
c2 = f'&{g.CONSONANTS[i]}&av&{g.VOWELS[j]}&'
|
262 |
+
|
263 |
+
phonecopy = phonecopy.replace(c1, c2)
|
264 |
+
|
265 |
+
for i in range(g.SEMIVOWELSSIZE):
|
266 |
+
for j in range(g.VOWELSSIZE):
|
267 |
+
c1 = f'&{g.SEMIVOWELS[i]}&{g.VOWELS[j]}&'
|
268 |
+
c2 = f'&{g.SEMIVOWELS[i]}&av&{g.VOWELS[j]}&'
|
269 |
+
|
270 |
+
phonecopy = phonecopy.replace(c1, c2)
|
271 |
+
|
272 |
+
return phonecopy
|
273 |
+
|
274 |
+
# replacement for function in lines 435 - 459. //can't use this as it breaks syllable rules.
|
275 |
+
# NOT USED ANYWHERE
|
276 |
+
def DoubleModifierCorrection(phone : str) -> str:
|
277 |
+
|
278 |
+
doubleModifierList = ["&nwv&","&nnv&","&rwv&","&lwv&","&lnv&","&aav&","&iiv&","&uuv&","&rqv&","&eev&",
|
279 |
+
"&eiv&","&ouv&","&axv&","&oov&","&aiv&","&auv&","&aev&",
|
280 |
+
"&iv&","&ov&","&ev&","&uv&"]
|
281 |
+
|
282 |
+
phonecopy = phone
|
283 |
+
for i in range(0,21):
|
284 |
+
for j in range(0,21):
|
285 |
+
c1 = f'{doubleModifierList[i]}#{doubleModifierList[j]}'
|
286 |
+
c2 = f'{doubleModifierList[i]}{doubleModifierList[j]}#&'
|
287 |
+
phonecopy = phonecopy.replace(c1, c2)
|
288 |
+
|
289 |
+
phonecopy = rec_replace(phonecopy, "&#&hq&","&hq&#&")
|
290 |
+
phonecopy = rec_replace(phonecopy, "&&","&")
|
291 |
+
return phonecopy
|
292 |
+
|
293 |
+
# replacement for function in lines 462 - 495. //for eu&C&C&V
|
294 |
+
def SchwaDoubleConsonent(phone : str) -> str:
|
295 |
+
consonentList = ["k","kh","lx","rx","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh","nx","t","th","d","dh","n","p","ph","b","bh","m","y","r","l","w","sh","sx","zh","y","s","h","f","dxq"]
|
296 |
+
vowelList = ["av&","nwv&","nnv&","rwv&","lwv&","lnv&","aav&","iiv&","uuv&","rqv&","eev&","eiv&","ouv&",
|
297 |
+
"axv&","oov&","aiv&","nnx&","nxx&","rrx&","llx&","lxx&",
|
298 |
+
"aa&","iv&","ov&","mq&","aa&","ii&","uu&","rq&",
|
299 |
+
"ee&","ei&","ou&","oo&","ax&","ai&","ev&","uv&",
|
300 |
+
"a&","e&","i&","o&","u&"]
|
301 |
+
|
302 |
+
phonecopy = phone
|
303 |
+
for i in range(0,39):
|
304 |
+
for j in range(0,39):
|
305 |
+
for k in range(0,42):
|
306 |
+
c1 = f'&euv&{consonentList[i]}&{consonentList[j]}&{vowelList[k]}'
|
307 |
+
c2 = f'&euv&{consonentList[i]}&av&{consonentList[j]}&{vowelList[k]}'
|
308 |
+
phonecopy = phonecopy.replace(c1, c2)
|
309 |
+
phonecopy = rec_replace(phonecopy, "$","")
|
310 |
+
return phonecopy
|
311 |
+
|
312 |
+
# replacement for function in lines 498 - 585. //halant specific correction for aryan langs
|
313 |
+
def SchwaSpecificCorrection(g : GLOBALS, phone : str) -> str:
|
314 |
+
schwaList = ["k","kh","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh",
|
315 |
+
"nx","t","th","d","dh","n","p","ph","b","bh","m","y",
|
316 |
+
"r","l","s","w","sh","sx","zh","h","lx","rx","f","dxq"]
|
317 |
+
|
318 |
+
vowelList = ["av&","nwv&","nnv&","rwv&","lwv&","lnv&","aav&","iiv&","uuv&","rqv&","eev&","eiv&","ouv&",
|
319 |
+
"axv&","oov&","aiv&","nnx&","nxx&","rrx&","llx&","lxx&",
|
320 |
+
"aa&","iv&","ov&","mq&","aa&","ii&","uu&","rq&",
|
321 |
+
"ee&","ei&","ou&","oo&","ax&","ai&","ev&","uv&",
|
322 |
+
"a&","e&","i&","o&","u&"]
|
323 |
+
|
324 |
+
if (g.flags.DEBUG):
|
325 |
+
print(f'{len(phone)}')
|
326 |
+
|
327 |
+
phonecopy = phone + '!'
|
328 |
+
|
329 |
+
if (g.flags.DEBUG):
|
330 |
+
print(f'phone cur - {phonecopy}')
|
331 |
+
|
332 |
+
# // for end correction &av&t&aav&. //dont want av
|
333 |
+
for i in range(0,38):
|
334 |
+
for j in range(1,42):
|
335 |
+
c1 = f'&av&{schwaList[i]}&{vowelList[j]}!'
|
336 |
+
c2 = f'&euv&{schwaList[i]}&{vowelList[j]}!'
|
337 |
+
phonecopy = phonecopy.replace(c1, c2)
|
338 |
+
|
339 |
+
phonecopy = rec_replace(phonecopy, '!', '')
|
340 |
+
|
341 |
+
for i in range(0,38):
|
342 |
+
c1 = f'&av&{schwaList[i]}&av&'
|
343 |
+
c2 = f'&euv$&{schwaList[i]}&av$&'
|
344 |
+
phonecopy = phonecopy.replace(c1, c2)
|
345 |
+
|
346 |
+
if(g.flags.DEBUG):
|
347 |
+
print(f"inside schwa {phonecopy}")
|
348 |
+
|
349 |
+
for i in range(0,38):
|
350 |
+
c1 = f'&av&{schwaList[i]}&'
|
351 |
+
c3 = f'&{schwaList[i]}&'
|
352 |
+
|
353 |
+
for j in range(0,41):
|
354 |
+
c4 = f'&euv&{c3}${vowelList[j]}'
|
355 |
+
c2 = f'{c1}{vowelList[j]}'
|
356 |
+
phonecopy = phonecopy.replace(c2, c4)
|
357 |
+
|
358 |
+
phonecopy = rec_replace(phonecopy, '$', '')
|
359 |
+
|
360 |
+
#//&q&w&eu& - CORRECTED TO 38 - CHECK
|
361 |
+
for i in range(0,38):
|
362 |
+
c1 = f'&q&{schwaList[i]}&euv&'
|
363 |
+
c2 = f'&q&{schwaList[i]}&av&'
|
364 |
+
phonecopy = phonecopy.replace(c1, c2)
|
365 |
+
|
366 |
+
return phonecopy
|
367 |
+
|
368 |
+
# replacement for function in lines . //correct the geminate syllabification ,isReverse --reverse correction
|
369 |
+
def GeminateCorrection(phone : str, isReverse : int) -> str:
|
370 |
+
geminateList = ["k","kh","lx","rx","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh","nx","t","th","d","dh","n","p","ph","b","bh","m","y",
|
371 |
+
"r","l","w","sh","sx","zh","y","s","h","f","dxq"]
|
372 |
+
|
373 |
+
phonecopy = phone
|
374 |
+
for i in range(0, 39):
|
375 |
+
c1 = f'&{geminateList[i]}&eu&{geminateList[i]}&'
|
376 |
+
c2 = f'&{geminateList[i]}&{geminateList[i]}&'
|
377 |
+
phonecopy = rec_replace(phonecopy, c2, c1) if isReverse != 0 else rec_replace(phonecopy, c1, c2)
|
378 |
+
|
379 |
+
return phonecopy
|
380 |
+
|
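The forward and reverse directions of `GeminateCorrection` can be illustrated with a small standalone sketch. The function name and the boolean `reverse` flag are assumptions; the original takes an integer `isReverse` and uses the fixed 39-entry list shown above.

```python
def geminate(phone: str, consonants, reverse: bool = False) -> str:
    for c in consonants:
        split = f"&{c}&eu&{c}&"   # doubled consonant with 'eu' in between
        joined = f"&{c}&{c}&"     # doubled consonant written directly
        old, new = (joined, split) if reverse else (split, joined)
        # Same effect as rec_replace: repeat until no match is left.
        while old in phone:
            phone = phone.replace(old, new)
    return phone


forward = geminate("&a&k&eu&k&a&", ["k"])
print(forward)
print(geminate(forward, ["k"], reverse=True))
```

Running the reverse direction on the forward result restores the input in this example, which is the round trip the parser relies on.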
381 |
+
# replacement for function in lines 356 - 430. //Syllabify the words
|
382 |
+
def Syllabilfy(phone : str) -> str:
|
383 |
+
|
384 |
+
phonecopy = phone
|
385 |
+
phonecopy = rec_replace(phonecopy, "&&","&")
|
386 |
+
phonecopy = phonecopy.replace("&eu&","&eu&#&")
|
387 |
+
phonecopy = phonecopy.replace("&euv&","&euv&#&")
|
388 |
+
phonecopy = rec_replace(phonecopy, "&avq","&q&av")
|
389 |
+
phonecopy = phonecopy.replace("&av&","&av&#&")
|
390 |
+
phonecopy = phonecopy.replace("&q","&q&#")
|
391 |
+
|
392 |
+
removeList = ["&nwv&","&nnv&","&rwv&","&lwv&","&lnv&","&aav&","&iiv&","&uuv&","&rqv&","&eev&",
|
393 |
+
"&eiv&","&ouv&","&axv&","&oov&","&aiv&","&auv&","&aev&",
|
394 |
+
"&nnx&","&nxx&","&rrx&","&llx&","&lxx&",
|
395 |
+
"&aa&","&iv&","&ov&","&mq&","&aa&","&ii&","&uu&","&rq&","&au&","&ee&",
|
396 |
+
"&ei&","&ou&","&oo&","&ax&","&ai&","&ev&","&uv&","&ae&",
|
397 |
+
"&a&","&e&","&i&","&o&","&u&"]
|
398 |
+
|
399 |
+
for i in range(0,45):
|
400 |
+
c1 = removeList[i]
|
401 |
+
c2 = c1 + '#&'
|
402 |
+
phonecopy = phonecopy.replace(c1, c2)
|
403 |
+
phonecopy = rec_replace(phonecopy, "&#&hq&","&hq&#&")
|
404 |
+
|
405 |
+
# //for vowel in between correction
|
406 |
+
pureVowelList = ["&a&","&e&","&i&","&o&","&u&"]
|
407 |
+
for i in range(0,5):
|
408 |
+
c1 = f'&#{pureVowelList[i]}'
|
409 |
+
phonecopy = phonecopy.replace(pureVowelList[i], c1)
|
410 |
+
|
411 |
+
consonantList = ["k","kh","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh",
|
412 |
+
"nx","t","th","d","dh","n","p","ph","b","bh","m","y",
|
413 |
+
"r","l","w","sh","sx","zh","y","s","h","lx","rx","f","dxq"]
|
414 |
+
|
415 |
+
# // &eu&#&r&eu&#& syllabification correction
|
416 |
+
|
417 |
+
for i in range(0,39):
|
418 |
+
c1 = f'&eu&#&{consonantList[i]}&euv&#&'
|
419 |
+
c2 = f'&eu&{consonantList[i]}&av&#&'
|
420 |
+
phonecopy = phonecopy.replace(c1, c2)
|
421 |
+
|
422 |
+
for i in range(0,39):
|
423 |
+
c1 = f'&euv&#&{consonantList[i]}&euv&#&'
|
424 |
+
c2 = f'&euv&{consonantList[i]}&av&#&'
|
425 |
+
phonecopy = phonecopy.replace(c1, c2)
|
426 |
+
|
427 |
+
phonecopy = phonecopy.replace("&eu&","&eu&#&")
|
428 |
+
return phonecopy
|
429 |
+
|
430 |
+
# replacement for function in lines 279 - 317. //check the word in Dict.
|
431 |
+
# REMOVED EXIT(1) ON ENGLISH. WAS USELESS
|
432 |
+
def CheckDictionary(g : GLOBALS, input : str) -> int:
|
433 |
+
|
434 |
+
fileName = GetFile(g, g.langId, 1)
|
435 |
+
if (g.flags.DEBUG):
|
436 |
+
print(f'dict : {fileName}')
|
437 |
+
try:
|
438 |
+
with open(fileName, 'r') as output:
|
439 |
+
cnts = output.readlines()
|
440 |
+
except (OSError, UnicodeError):
|
441 |
+
if g.flags.DEBUG:
|
442 |
+
print('Dict not found')
|
443 |
+
if(g.langId == g.ENGLISH):
|
444 |
+
exit(1)
|
445 |
+
return 0
|
446 |
+
|
447 |
+
if (g.langId == g.ENGLISH):
|
448 |
+
input1 = ''
|
449 |
+
for c in input:
|
450 |
+
if ord(c) < 97:
|
451 |
+
c = c.lower()
|
452 |
+
input1 += c
|
453 |
+
input = input1
|
454 |
+
|
455 |
+
for l in cnts:
|
456 |
+
l = l.strip().split('\t')
|
457 |
+
assert(len(l) == 3)
|
458 |
+
if g.flags.DEBUG:
|
459 |
+
print(f"word : {l[0]}")
|
460 |
+
if input == l[0]:
|
461 |
+
if g.flags.DEBUG:
|
462 |
+
print(f"match found")
|
463 |
+
print(f'Syllables : {l[1]}')
|
464 |
+
print(f'monophones : {l[2]}')
|
465 |
+
if g.flags.writeFormat == 1:
|
466 |
+
WriteFile(g, l[1])
|
467 |
+
if g.flags.writeFormat == 0:
|
468 |
+
WriteFile(g, l[2])
|
469 |
+
return 1
|
470 |
+
|
471 |
+
return 0
|
472 |
+
|
473 |
+
# replacement for function in lines 801-821.
|
474 |
+
def PositionCorrection(phone : str, left : str, right :str, isReverse:int) -> str:
|
475 |
+
geminateList = ["k","kh","lx","rx","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh","nx","t","th","d","dh",
|
476 |
+
"n","p","ph","b","bh","m","y","r","l","w","sh","sx","zh","y","s","h","f","dxq"]
|
477 |
+
phonecopy = phone
|
478 |
+
for i in range(0,39):
|
479 |
+
c1 = left
|
480 |
+
c2 = right
|
481 |
+
c1 = c1.replace('@', geminateList[i])
|
482 |
+
c2 = c2.replace('@', geminateList[i])
|
483 |
+
phonecopy = rec_replace(phonecopy, c2, c1) if isReverse != 0 else rec_replace(phonecopy, c1, c2)
|
484 |
+
return phonecopy
|
485 |
+
|
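In `PositionCorrection` the character `@` acts as a placeholder for "any consonant": the rule is instantiated once per entry of the consonant list before being applied. A minimal sketch of that expansion step follows. The helper name and the sample rule are hypothetical and are not taken from any rules file in the repository.

```python
def expand_rule(left: str, right: str, consonants):
    """Instantiate an '@' rule once for each consonant."""
    return [(left.replace("@", c), right.replace("@", c)) for c in consonants]


# Hypothetical rule: drop 'eu' after a word-initial consonant.
for old, new in expand_rule("^&@&eu&", "^&@&", ["k", "t"]):
    print(old, "->", new)
```

Each resulting pair is then applied with repeated replacement, in the forward or reverse direction depending on `isReverse`.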
486 |
+
# replacement for function in lines 711 - 713.
|
487 |
+
def CountChars(s : str, c : str) -> int:
|
488 |
+
count = 0
|
489 |
+
for x in s:
|
490 |
+
if x == c:
|
491 |
+
count += 1
|
492 |
+
return count
|
493 |
+
|
494 |
+
# replacement for function in lines 719 - 744.
|
495 |
+
def GenerateAllCombinations(g : GLOBALS, j : int, s : str, c : list, isRight : int):
|
496 |
+
t = ''
|
497 |
+
if (c[j][0][0] == '#'):
|
498 |
+
if isRight == 1:
|
499 |
+
g.combvars.rightStr[g.combvars.bi] = s + '&'
|
500 |
+
g.combvars.bi += 1
|
501 |
+
else:
|
502 |
+
g.combvars.leftStr[g.combvars.bi] = s + '&'
|
503 |
+
g.combvars.bi += 1
|
504 |
+
else:
|
505 |
+
i = 0
|
506 |
+
while (c[j][i][0] != '#'):
|
507 |
+
t = s + '&' + c[j][i]
|
508 |
+
GenerateAllCombinations(g, j+1, t, c, isRight)
|
509 |
+
i += 1
|
510 |
+
|
511 |
+
# replacement for function in lines 746 - 768.
|
512 |
+
def GenerateMatrix(g : GLOBALS, combMatrix : list, regex : str):
|
513 |
+
row, col, item = 0, 0, 0
|
514 |
+
for i in range(0, len(regex)):
|
515 |
+
if regex[i] == '&':
|
516 |
+
combMatrix[row][col+1] = '#'
|
517 |
+
row += 1
|
518 |
+
col = 0
|
519 |
+
item = 0
|
520 |
+
combMatrix[row][col] = ''
|
521 |
+
elif regex[i] == '|':
|
522 |
+
col += 1
|
523 |
+
item = 0
|
524 |
+
combMatrix[row][col] = ''
|
525 |
+
else:
|
526 |
+
combMatrix[row][col] = combMatrix[row][col][:item] + regex[i] + combMatrix[row][col][(item+1):]
|
527 |
+
item += 1
|
528 |
+
if g.flags.DEBUG:
|
529 |
+
print(f'{row} {col} {combMatrix[row][col]}')
|
530 |
+
|
531 |
+
combMatrix[row][col+1] = '#'
|
532 |
+
combMatrix[row+1][0] = '#'
|
533 |
+
|
534 |
+
# replacement for function in lines 770 - 799.
|
535 |
+
def CombinationCorrection(g : GLOBALS, phone : str, left : str, right : str, isReverse : int) -> str:
|
536 |
+
leftComb = [['' for _ in range(256)] for _ in range(256)]
|
537 |
+
rightComb = [['' for _ in range(256)] for _ in range(256)]
|
538 |
+
GenerateMatrix(g, leftComb, left)
|
539 |
+
GenerateMatrix(g, rightComb, right)
|
540 |
+
|
541 |
+
g.combvars.bi = 0
|
542 |
+
GenerateAllCombinations(g, 0, '', leftComb, 0)
|
543 |
+
g.combvars.bi = 0
|
544 |
+
GenerateAllCombinations(g, 0, '', rightComb, 1)
|
545 |
+
|
546 |
+
i = 0
|
547 |
+
phonecopy = phone
|
548 |
+
while g.combvars.leftStr[i] != '':
|
549 |
+
if isReverse != 0:
|
550 |
+
phonecopy = phonecopy.replace(g.combvars.rightStr[i], g.combvars.leftStr[i])
|
551 |
+
else:
|
552 |
+
phonecopy = phonecopy.replace(g.combvars.leftStr[i], g.combvars.rightStr[i])
|
553 |
+
|
554 |
+
if g.flags.DEBUG:
|
555 |
+
print(f'{g.combvars.leftStr[i]} {g.combvars.rightStr[i]}')
|
556 |
+
|
557 |
+
i += 1
|
558 |
+
|
559 |
+
g.combvars.refresh()
|
560 |
+
return phonecopy
|
561 |
+
|
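`GenerateMatrix` and `GenerateAllCombinations` together expand a pattern in which `&` separates positions and `|` separates alternatives into every concrete string, and `CombinationCorrection` then pairs the i-th left expansion with the i-th right expansion. The same expansion can be sketched with `itertools.product`. The function name and the sample patterns are assumptions for illustration, and this is an alternative formulation, not the repository's matrix-based code.

```python
from itertools import product


def expand(pattern: str):
    """Expand 'a|b&c|d' into ['&a&c&', '&a&d&', '&b&c&', '&b&d&']."""
    slots = [slot.split("|") for slot in pattern.split("&")]
    return ["&" + "&".join(choice) + "&" for choice in product(*slots)]


# Hypothetical left/right patterns with the same shape.
left, right = expand("k|g&y"), expand("k|g&i")
for old, new in zip(left, right):
    print(old, "->", new)
```

`product` iterates the last position fastest, which matches the order produced by the recursive version above, so pairing by index gives the same left/right correspondences.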
562 |
+
# replacement for function in lines 825 - 930. //Language specific corrections
|
563 |
+
def LangSpecificCorrection(g : GLOBALS, phone : str, langSpecFlag : int) -> str:
|
564 |
+
phonecopy = phone
|
565 |
+
if g.isSouth:
|
566 |
+
phonecopy = rec_replace(phonecopy,"&ei&","&ai&")
|
567 |
+
phonecopy = rec_replace(phonecopy,"&eiv&","&aiv&")
|
568 |
+
else:
|
569 |
+
phonecopy = rec_replace(phonecopy,"&oo&","&o&")
|
570 |
+
phonecopy = rec_replace(phonecopy,"&oov&","&ov&")
|
571 |
+
|
572 |
+
phonecopy = phonecopy.replace("&q&","&av&q&")
|
573 |
+
phonecopy = rec_replace(phonecopy, "&a&av&","&a&")
|
574 |
+
phonecopy = rec_replace(phonecopy, "&e&av&","&e&")
|
575 |
+
phonecopy = rec_replace(phonecopy, "&i&av&","&i&")
|
576 |
+
phonecopy = rec_replace(phonecopy, "&o&av&","&o&")
|
577 |
+
phonecopy = rec_replace(phonecopy, "&u&av&","&u&")
|
578 |
+
phonecopy = rec_replace(phonecopy,"&a&rqv&","&rq&")
|
579 |
+
phonecopy = rec_replace(phonecopy,"&aa&av&","&aa&")
|
580 |
+
phonecopy = rec_replace(phonecopy,"&ae&av&","&ae&")
|
581 |
+
phonecopy = rec_replace(phonecopy,"&ax&av&","&ax&")
|
582 |
+
phonecopy = rec_replace(phonecopy,"&ee&av&","&ee&")
|
583 |
+
phonecopy = rec_replace(phonecopy,"&ii&av&","&ii&")
|
584 |
+
phonecopy = rec_replace(phonecopy,"&ai&av&","&ai&")
|
585 |
+
phonecopy = rec_replace(phonecopy,"&au&av&","&au&")
|
586 |
+
phonecopy = rec_replace(phonecopy,"&oo&av&","&oo&")
|
587 |
+
phonecopy = rec_replace(phonecopy,"&uu&av&","&uu&")
|
588 |
+
phonecopy = rec_replace(phonecopy,"&rq&av&","&rq&")
|
589 |
+
phonecopy = rec_replace(phonecopy,"&av&av&","&av&")
|
590 |
+
phonecopy = rec_replace(phonecopy,"&ev&av&","&ev&")
|
591 |
+
phonecopy = rec_replace(phonecopy,"&iv&av&","&iv&")
|
592 |
+
phonecopy = rec_replace(phonecopy,"&ov&av&","&ov&")
|
593 |
+
phonecopy = rec_replace(phonecopy,"&uv&av&","&uv&")
|
594 |
+
|
595 |
+
phonecopy = rec_replace(phonecopy, "&av&rqv&","&rqv&")
|
596 |
+
phonecopy = rec_replace(phonecopy, "&aav&av&","&aav&")
|
597 |
+
phonecopy = rec_replace(phonecopy, "&aev&av&","&aev&")
|
598 |
+
phonecopy = rec_replace(phonecopy, "&auv&av&","&auv&")
|
599 |
+
phonecopy = rec_replace(phonecopy, "&axv&av&","&axv&")
|
600 |
+
phonecopy = rec_replace(phonecopy, "&aiv&av&","&aiv&")
|
601 |
+
phonecopy = rec_replace(phonecopy, "&eev&av&","&eev&")
|
602 |
+
phonecopy = rec_replace(phonecopy, "&eiv&av&","&eiv&")
|
603 |
+
phonecopy = rec_replace(phonecopy, "&iiv&av&","&iiv&")
|
604 |
+
phonecopy = rec_replace(phonecopy, "&oov&av&","&oov&")
|
605 |
+
phonecopy = rec_replace(phonecopy, "&ouv&av&","&ouv&")
|
606 |
+
phonecopy = rec_replace(phonecopy, "&uuv&av&","&uuv&")
|
607 |
+
phonecopy = rec_replace(phonecopy, "&rqv&av&","&rqv&")
|
608 |
+
|
609 |
+
if langSpecFlag == 0:
|
610 |
+
return phonecopy
|
611 |
+
|
612 |
+
fileName = GetFile(g, g.langId, 2)
|
613 |
+
with open(fileName, 'r') as output:
|
614 |
+
cnts = output.readlines()
|
615 |
+
|
616 |
+
left = ''
|
617 |
+
right = ''
|
618 |
+
phonecopy = '^' + phonecopy + '$'
|
619 |
+
|
620 |
+
if (g.flags.DEBUG):
|
621 |
+
print(f'phone : {phonecopy}')
|
622 |
+
|
623 |
+
for l in cnts:
|
624 |
+
l = l.strip()
|
625 |
+
if (l.find('#') != -1):
|
626 |
+
continue
|
627 |
+
|
628 |
+
l = l.split('\t')
|
629 |
+
assert(len(l) == 2)
|
630 |
+
left, right = l[0], l[1]
|
631 |
+
|
632 |
+
if left.find('|') != -1:
|
633 |
+
a1 = left[1:-1]
|
634 |
+
a2 = right[1:-1]
|
635 |
+
phonecopy = CombinationCorrection(g, phonecopy, a1, a2, 0)
|
636 |
+
if g.flags.DEBUG:
|
637 |
+
print(f'{a1}\t{a2}')
|
638 |
+
elif left.find('@') != -1:
|
639 |
+
phonecopy = PositionCorrection(phonecopy, left, right, 0)
|
640 |
+
else:
|
641 |
+
phonecopy = phonecopy.replace(left, right)
|
642 |
+
|
643 |
+
# //remove head and tail in phone
|
644 |
+
phonecopy = phonecopy.replace('^', '')
|
645 |
+
phonecopy = phonecopy.replace('$', '')
|
646 |
+
# //end correction
|
647 |
+
count = 0
|
648 |
+
for i in range(len(phonecopy)):
|
649 |
+
if phonecopy[i] == '&':
|
650 |
+
count = i
|
651 |
+
return phonecopy[:(count+1)]
|
652 |
+
|
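The rule-file loop in `LangSpecificCorrection` reads tab-separated `left`/`right` pairs, skips any line that contains `#`, and picks one of three handlers based on the characters in `left`. The sketch below isolates just that classification step. The function name, the `kind` labels, and the sample lines are assumptions; no real rules file is quoted.

```python
def parse_rules(lines):
    """Classify rule lines the way the loop above dispatches them."""
    rules = []
    for line in lines:
        line = line.strip()
        if "#" in line:          # any line containing '#' is skipped
            continue
        left, right = line.split("\t")
        if "|" in left:
            kind = "combination"  # handled by CombinationCorrection
        elif "@" in left:
            kind = "position"     # handled by PositionCorrection
        else:
            kind = "literal"      # plain str.replace
        rules.append((kind, left, right))
    return rules


sample = ["# comment", "&k|g&y&\t&k|g&i&", "&@&eu&\t&@&", "&ei&\t&ai&"]
for rule in parse_rules(sample):
    print(rule)
```

As in the original, a line that does not split into exactly two fields is treated as an error (here a `ValueError` from unpacking, there a failed `assert`).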
653 |
+
# Replacement for function in lines 934 - 991. //Reverse syllable correction for syllable parsing
|
654 |
+
def SyllableReverseCorrection(g : GLOBALS, phone : str, langSpecFlag : int) -> str:
|
655 |
+
phonecopy = phone
|
656 |
+
|
657 |
+
if g.isSouth:
|
658 |
+
phonecopy = rec_replace(phonecopy, "&ai&","&ei&")
|
659 |
+
phonecopy = rec_replace(phonecopy, "&aiv&","&eiv&")
|
660 |
+
else:
|
661 |
+
phonecopy = rec_replace(phonecopy, "&o&","&oo&")
|
662 |
+
phonecopy = rec_replace(phonecopy, "&ov&","&oov&")
|
663 |
+
|
664 |
+
if langSpecFlag == 0:
|
665 |
+
return phonecopy
|
666 |
+
|
667 |
+
fileName = GetFile(g, g.langId, 2)
|
668 |
+
with open(fileName, 'r') as output:
|
669 |
+
cnts = output.readlines()
|
670 |
+
|
671 |
+
left = ''
|
672 |
+
right = ''
|
673 |
+
# //update head and tail in phone
|
674 |
+
phonecopy = '^' + phonecopy + '$'
|
675 |
+
|
676 |
+
if g.flags.DEBUG:
|
677 |
+
print(f'before phone : {phonecopy}')
|
678 |
+
|
679 |
+
for l in cnts:
|
680 |
+
l = l.strip()
|
681 |
+
if (l.find('#') != -1):
|
682 |
+
continue
|
683 |
+
|
684 |
+
l = l.split('\t')
|
685 |
+
assert(len(l) == 2)
|
686 |
+
left, right = l[0], l[1]
|
687 |
+
|
688 |
+
if left.find('|') != -1:
|
689 |
+
a1 = left[1:-1]
|
690 |
+
a2 = right[1:-1]
|
691 |
+
phonecopy = CombinationCorrection(g, phonecopy, a1, a2, 1)
|
692 |
+
if g.flags.DEBUG:
|
693 |
+
print(f'{a1}\t{a2}')
|
694 |
+
elif left.find('@') != -1:
|
695 |
+
phonecopy = PositionCorrection(phonecopy, left, right, 1)
|
696 |
+
else:
|
697 |
+
phonecopy = phonecopy.replace(right, left)
|
698 |
+
|
699 |
+
# //remove head and tail in phone
|
700 |
+
phonecopy = phonecopy.replace('^', '')
|
701 |
+
phonecopy = phonecopy.replace('$', '')
|
702 |
+
# //end correction
|
703 |
+
if (g.flags.DEBUG):
|
704 |
+
print(f'after phone : {phonecopy}')
|
705 |
+
return phonecopy
|
706 |
+
|
707 |
+
# //language specific syllable correction
|
708 |
+
def LangSyllableCorrection(input : str) -> int:
|
709 |
+
if input == "&av&q&":
|
710 |
+
return 1
|
711 |
+
else:
|
712 |
+
return 0

# replacement for function in lines 1000 - 1160. //split into syllable array
def SplitSyllables(g : GLOBALS, input : str) -> int:
    incopy = input

    if g.flags.writeFormat == 2:
        i = 0
        j = 0
        fullList = ["k","kh","lx","rx","g","gh","ng","c","ch","j","jh","nj","tx","txh","dx","dxh","nx","t","th","d","dh","n","p","ph","b","bh","m","y","r","l","w","sh","sx","zh","y","s","h","f","dxq"]

        for i in range(0,39):
            for j in range(0,39):
                c1 = f'&{fullList[i]}&{fullList[j]}&'
                c2 = f'&{fullList[i]}&euv&#&{fullList[j]}&'
                incopy = incopy.replace(c1, c2)

        incopy = rec_replace(incopy, "&#&mq&","&mq&")
        incopy = rec_replace(incopy, "&#&q&","&q&")

    pch = incopy.split('#')
    g.syllableList = []
    for c in pch:
        if c != '&':
            g.syllableList.append(c)

    # ln -> len
    ln = len(g.syllableList)
    if (ln == 0):
        return 1

    if g.flags.DEBUG:
        for i in range(ln):
            print(f"initStack : {g.syllableList[i]}")

    # //south specific av addition
    if CheckVowel(g.syllableList[ln-1],1,0) == 0 and CheckChillu(g.syllableList[ln-1]) == 0:
        if g.isSouth:
            g.syllableList[ln-1] += '&av&'
        else:
            g.syllableList[ln-1] += '&euv&'

    # //round 2 correction
    if g.flags.writeFormat == 2:
        g.syllableCount = ln
        g.flags.writeFormat = 1
        return 1

    euFlag = 1
    if ln > 1:
        for i in range(ln-1,-1,-1):
            if LangSyllableCorrection(g.syllableList[i]) == 1:
                g.syllableList[i-1] += g.syllableList[i]
                g.syllableList[i] = ''

            if g.syllableList[i].find("&eu&") != -1:
                g.syllableList[i] = g.syllableList[i].replace("&eu&", "!")
                euFlag = 1

            if g.syllableList[i].find("&euv&") != -1:
                g.syllableList[i] = g.syllableList[i].replace("&euv&", "!")
                euFlag = 2

            if CheckVowel(g.syllableList[i],0,1) == 0:
                if i-1 >= 0:
                    g.syllableList[i-1] += g.syllableList[i]
                    g.syllableList[i] = ''
                else:
                    g.syllableList[i] += g.syllableList[i+1]
                    g.syllableList[i+1] = ''

            if i-1 > 0:
                if euFlag == 1:
                    g.syllableList[i-1] = g.syllableList[i-1].replace("!","&eu&")
                elif euFlag == 2:
                    g.syllableList[i-1] = g.syllableList[i-1].replace("!","&euv&")
                g.syllableList[i-1] = rec_replace(g.syllableList[i-1], "&&","&")

            if euFlag == 1:
                g.syllableList[i] = g.syllableList[i].replace("!","&eu&")
            elif euFlag == 2:
                g.syllableList[i] = g.syllableList[i].replace("!","&euv&")
    else:
        if (CheckVowel(g.syllableList[0],1,0) == 0 and g.flags.writeFormat != 3) or Checkeuv(g.syllableList[0]) != 0:
            g.syllableList[0] += '&av'

    if g.flags.DEBUG:
        for i in range(ln):
            print(f'syllablifiedStack : {g.syllableList[i]}')

    # //round 3 double syllable correction
    for i in range(ln):
        # //corrections
        g.syllableList[i] = g.syllableList[i].replace('1','')
        if g.flags.DEBUG:
            print(f'LenStack : {len(g.syllableList[i])}')

        if len(g.syllableList[i]) > 0:
            if g.syllableList[i].find("&eu&") != -1:
                g.syllableList[i] = g.syllableList[i].replace("&eu&", "!")
                euFlag = 1

            if g.syllableList[i].find("&euv&") != -1:
                g.syllableList[i] = g.syllableList[i].replace("&euv&", "!")
                euFlag = 2

            if CheckVowel(g.syllableList[i],0,1) == 0 and g.flags.writeFormat != 3:
                if g.flags.DEBUG:
                    print(f'Stack : {g.syllableList[i]}')
                g.syllableList[i] += '&av'

            if g.syllableList[i].find('!') != -1:
                if euFlag == 1:
                    g.syllableList[i] = g.syllableList[i].replace("!","&eu&")
                elif euFlag == 2:
                    g.syllableList[i] = g.syllableList[i].replace("!","&euv&")
                g.syllableList[i] = g.syllableList[i].replace('!', 'eu')

        g.syllableList[i] = rec_replace(g.syllableList[i], '&&', '&')
        g.syllableList[i] = GeminateCorrection(g.syllableList[i],1)

    if g.flags.DEBUG:
        for i in range(ln):
            print(f'syllablifiedStack1 : {g.syllableList[i]}')
        print(f'No of syllables : {ln}')

    g.syllableCount = ln
    if g.flags.writeFormat == 3:
        g.flags.writeFormat = 0
    return 1
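
The `writeFormat == 2` branch of `SplitSyllables` breaks consonant clusters by inserting `&euv&#&` between two adjacent consonant labels and then splitting on `#`. The sketch below isolates just that step. It is only an illustration: `split_clusters` and the three-entry `CONSONANTS` list are hypothetical names, the source uses its 39-entry `fullList`, and the follow-up `rec_replace` fixes for `mq` and `q` are left out.

```python
from typing import List

# Small illustrative subset (assumption); the source iterates over a 39-entry list.
CONSONANTS = ["k", "t", "r"]

def split_clusters(labels: str, consonants: List[str] = CONSONANTS) -> List[str]:
    """Insert '&euv&#&' between adjacent consonant labels, then split on '#'."""
    for first in consonants:
        for second in consonants:
            cluster = f"&{first}&{second}&"
            separated = f"&{first}&euv&#&{second}&"
            labels = labels.replace(cluster, separated)
    # '#' marks a syllable boundary; a bare '&' piece carries no label and is dropped.
    return [piece for piece in labels.split("#") if piece != "&"]

print(split_clusters("&k&t&r&"))
```

Each consonant that was part of a cluster ends up in its own piece, followed by the `euv` label.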

# replacement for function in lines 1164 - 1275. //make to write format
def WritetoFiles(g : GLOBALS) -> int:
    if g.flags.DEBUG:
        for i in range(0,g.syllableCount):
            print(f'syllablifiedStackfinal : {g.syllableList[i]}')

    validSyllable = 0
    for i in range(0,g.syllableCount):
        if g.syllableList[i] != '':
            validSyllable += 1

    if g.flags.DEBUG:
        print(f'a correction {g.syllableList[0]}')

    g.words.outputText = ''

    # //phone
    j = 0
    if g.flags.writeFormat == 0:
        syllablesPrint = 0
        for i in range(g.syllableCount):
            g.words.outputText += '(( '
            l = g.syllableList[i].split('&')
            for pch in l:
                if pch == '':
                    continue
                if g.flags.DEBUG:
                    print(f'syl {pch}')
                j = 1
                g.words.outputText += f'"{pch}" '
            if j != 0:
                if g.flags.syllTagFlag != 0:
                    if syllablesPrint == 0:
                        g.words.outputText += '_beg'
                    elif syllablesPrint == validSyllable - 1:
                        g.words.outputText += '_end'
                    else:
                        g.words.outputText += '_mid'
                    syllablesPrint += 1
                g.words.outputText += ') 0) '
            else:
                g.words.outputText = g.words.outputText[:(len(g.words.outputText) - 3)]
            j = 0

        g.words.outputText = g.words.outputText.replace('v', '')
        g.words.outputText = g.words.outputText.replace(" \"eu\"","")
        g.words.outputText = g.words.outputText.replace('!', '')

    # //syllable
    elif g.flags.writeFormat == 1:
        syllablesPrint = 0
        for i in range(g.syllableCount):
            g.syllableList[i] = rec_replace(g.syllableList[i], 'euv', 'eu')
            g.syllableList[i] = SyllableReverseCorrection(g, g.syllableList[i], g.flags.LangSpecificCorrectionFlag)
            if g.flags.DEBUG:
                print(f'{g.syllableList[i]}')
            g.words.outputText += '(( "'
            l = g.syllableList[i].split('&')
            for pch in l:
                if pch == '':
                    continue
                if g.flags.DEBUG:
                    print(f'syl {pch}')
                j = 1
                if CheckSymbol(g, pch) != 0:
                    g.words.outputText += GetUTF(g, pch)
                    if pch == 'av' and g.flags.DEBUG:
                        print('av found')
            if j != 0:
                if g.flags.syllTagFlag != 0:
                    if syllablesPrint == 0:
                        g.words.outputText += '_beg'
                    elif syllablesPrint == validSyllable - 1:
                        g.words.outputText += '_end'
                    else:
                        g.words.outputText += '_mid'
                    syllablesPrint += 1
                g.words.outputText += '" ) 0) '
            else:
                g.words.outputText = g.words.outputText[:(len(g.words.outputText) - 4)]
            j = 0

    g.words.outputText = g.words.outputText.replace('#', '')
    g.words.outputText = g.words.outputText.replace('  ', ' ')
    if g.flags.DEBUG:
        print(f'Print text : {g.words.outputText}')

    WriteFile(g, g.words.outputText)
    return 1


def load_mapping_file(g: GLOBALS):
    # open common file
    # NOTE: the path below is absolute and machine-specific.
    try:
        # print('1.entered')
        with open("/speech/utkarsh/tts_api/Unified_parser/common_hindi.map", 'r') as infile:
            lines = infile.readlines()
            # print(lines)
    except OSError:
        print("Couldn't open common file for reading")
        return 0

    # each row of the table is one tab-separated line of the map file
    table=[]
    for i in range(len(lines)):
        l = lines[i].strip().split('\t')
        table.append(l)

        # g.symbolTable[i][1] = l[1]
        # g.symbolTable[i][0] = l[1 + g.langId]

    return table

def set_lang_id(language):
    if language == "malayalam":
        lang_id=1
    elif language == "tamil":
        lang_id=2
    elif language == "telugu":
        lang_id=3
    elif language == "kannada":
        lang_id=4
    elif language == "hindi":
        lang_id=5
    elif language == "bengali":
        lang_id=6
    elif language == "gujrathi":
        lang_id=7
    elif language == "odiya":
        lang_id=8
    elif language == "punjabi":
        lang_id=9
    else:
        # previously an unknown name fell through and hit an unbound lang_id
        raise ValueError(f"unsupported language: {language!r}")
    return lang_id
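
The name-to-id mapping in `set_lang_id` can also be written as a table lookup. The sketch below is an alternative, not the code in this commit: `LANG_IDS` and `lang_id_for` are hypothetical names, and the keys copy the spellings used above (including `gujrathi` and `odiya`). In this sketch an unknown name raises `ValueError`.

```python
# Language name -> column id, copied from the if/elif chain above.
LANG_IDS = {
    "malayalam": 1,
    "tamil": 2,
    "telugu": 3,
    "kannada": 4,
    "hindi": 5,
    "bengali": 6,
    "gujrathi": 7,
    "odiya": 8,
    "punjabi": 9,
}

def lang_id_for(language: str) -> int:
    """Return the numeric id for a language name, or raise ValueError."""
    try:
        return LANG_IDS[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language!r}") from None

print(lang_id_for("telugu"))
```

Keeping the names in one dictionary makes it easy to list the supported languages or add a new one without touching control flow.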


def convert_to_main_lang(g : GLOBALS,input_str,final_lang:str):
    s= input_str
    # NOTE: this assignment overrides the final_lang argument.
    final_lang = "telugu"
    # print("input_str:",input_str)
    final_lang_id=set_lang_id(final_lang)
    c=1
    # print(s,final_lang_id)
    temp_string=''
    new_string='&'
    table=load_mapping_file(g)
    # print(final_lang_id)
    # print(table)
    for i in range(1,len(s)):
        if s[i]=="&":
            c=1
            continue
        if c==1:
            temp_string+=s[i]
            # a label is complete when the next character is '&'
            if s[i+1]=="&":
                c=0
                # print("new_string_1:",new_string)
                # print("old_string_1:",temp_string)
                if temp_string=="#":
                    new_string+=temp_string+"&"
                    temp_string=''
                    continue
                if temp_string =='av':
                    new_string+=temp_string+"&"
                    temp_string=''
                    # print("new_string_1-av/aiv:",new_string)
                    continue
                if temp_string =='eu' or temp_string =='euv'or temp_string =='aiv':
                    new_string+=temp_string+"&"
                    # print("new_string_1-eu:",new_string)
                    # print("old_string_1-euv:",s)
                    temp_string=''
                    continue

                # print("new_string_before_table:",new_string)
                # print("old_string_before_table:",s)
                for j in range(len(table)):
                    if table[j][1]==temp_string:
                        # print("2:",table[j][1],temp_string)
                        # print("3:",table[j][final_lang_id+1],ord(table[j][final_lang_id+1][0]))
                        if ord(table[j][final_lang_id+1][0]) < 122:
                            new_string=new_string+table[j][final_lang_id+1]+"&"
                            temp_string=''
                            # print("new string_2:",new_string)
                            break
                        else:
                            new_string+=temp_string+"&"
                            # print("new string_3:",new_string)
                            temp_string=''
                            break
    return new_string
Unified_parser/ply/__init__.py
ADDED
@@ -0,0 +1,5 @@
# PLY package
# Author: David Beazley (dave@dabeaz.com)
# https://github.com/dabeaz/ply

__version__ = '2022_01_02'
Unified_parser/ply/__pycache__/__init__.cpython-310.pyc
ADDED
Binary file (196 Bytes).
Unified_parser/ply/__pycache__/__init__.cpython-311.pyc
ADDED
Binary file (213 Bytes).
Unified_parser/ply/__pycache__/__init__.cpython-37.pyc
ADDED
Binary file (168 Bytes).
Unified_parser/ply/__pycache__/__init__.cpython-38.pyc
ADDED
Binary file (168 Bytes).
Unified_parser/ply/__pycache__/lex.cpython-310.pyc
ADDED
Binary file (5.17 kB).
Unified_parser/ply/__pycache__/lex.cpython-311.pyc
ADDED
Binary file (6.76 kB).
Unified_parser/ply/__pycache__/lex.cpython-37.pyc
ADDED
Binary file (5.16 kB).
Unified_parser/ply/__pycache__/lex.cpython-38.pyc
ADDED
Binary file (5.2 kB).
Unified_parser/ply/__pycache__/yacc.cpython-310.pyc
ADDED
Binary file (40.8 kB).
Unified_parser/ply/__pycache__/yacc.cpython-311.pyc
ADDED
Binary file (82.8 kB).
Unified_parser/ply/__pycache__/yacc.cpython-37.pyc
ADDED
Binary file (41.3 kB).
Unified_parser/ply/__pycache__/yacc.cpython-38.pyc
ADDED
Binary file (41.2 kB).
Unified_parser/ply/lex.py
ADDED
@@ -0,0 +1,110 @@
from typing import NamedTuple
import re

def t_kaki_c(t):
    r'(&)*(dxhq|txh|khq|dxq|dxh|zh|tx|th|sx|sh|rx|ph|nx|nj|ng|lx|kq|kh|jh|gq|gh|dx|dh|ch|bh|z|y|y|w|t|s|r|p|n|m|l|k|j|h|g|f|d|c|b)((&)(dxhq|txh|khq|dxq|dxh|zh|tx|th|sx|sh|rx|ph|nx|nj|ng|lx|kq|kh|jh|gq|gh|ex|dx|dh|ch|bh|z|y|w|t|s|r|p|n|m|l|k|j|h|g|f|d|c|b))*'
    s = t
    ans = ''
    i = 1
    if s[0] == '&':
        ans += '&'
    l = s.split('&')
    for pch in l:
        if pch == '':
            continue
        ans += f'{pch}&av&#&&'
        i += 1
    ans = ans[:(len(ans) - 7)]
    return ans

def t_conjsyll2_c(t):
    r'(eu)'
    return 'eu&#'

def t_fullvowel_b(t):
    r'(&)*(k|kh|g|gh|c|ch|j|jh|ng|nj|tx|txh|dx|dxh|nx|t|th|d|dh|n|p|ph|b|bh|m|y|r|l|w|sh|sx|s|lx|h|kq|khq|gq|z|dxq|dxhq|f|y)(&)(uu&mq|uu&hq|rq&mq|rq&hq|ou&mq|ou&hq|ii&mq|ii&hq|ei&mq|ei&hq|ee&mq|ee&hq|aa&mq|aa&hq|uu&q|u&mq|u&hq|rq&q|ou&q|o&mq|o&hq|ii&q|i&mq|i&hq|ei&q|ee&q|aa&q|a&mq|a&hq|u&q|o&q|i&q|a&q|uu|rq|ou|ii|ei|ee|ax|aa|u|o|i|a)'
    return t

def t_kaki_a(t):
    r'(&)*(dxhq|txh|khq|dxq|dxh|tx|th|sx|sh|ph|nx|nj|ng|lx|kq|kh|jh|gq|gh|dx|dh|ch|bh|z|y|w|t|s|r|p|n|m|l|k|j|h|g|f|d|c|b)(&)(uuv|rqv|ouv|iiv|eiv|eev|aev|aav|uv|ov|mq|iv|hq|ax|q)(&)(mq|hq|q)*'
    return t

def t_kaki_b(t):
    r'(&)*(dxq&uuv|dxq&rqv|dxq&ouv|dxq&iiv|dxq&eiv|dxq&eev|dxq&aav|dxq&uv|dxq&ov|dxq&mq|dxq&iv|dxq&hq|dxq&q|dxq)'
    return t

def t_conjsyll2_b(t):
    r'(&)*(txh&eu|dxh&eu|tx&eu|th&eu|sx&eu|sh&eu|ph&eu|nx&eu|nj&eu|ng&eu|lx&eu|kh&eu|jh&eu|gh&eu|dx&eu|dh&eu|ch&eu|bh&eu|y&eu|w&eu|t&eu|s&eu|r&eu|p&eu|n&eu|m&eu|l&eu|k&eu|j&eu|h&eu|g&eu|d&eu|c&eu|b&eu)'
    return t

def t_conjsyll2_a(t):
    r'(&)*(dxhq|khq|dxq|kq|gq|z|y|f)(&)eu'
    return t

def t_conjsyll1(t):
    r'(&)*(dxhq|txh|khq|dxq|dxh|tx|th|sx|sh|ph|nx|nj|ng|lx|kq|kh|jh|gq|gh|dx|dh|ch|bh|z|y|w|t|s|r|p|n|m|l|k|j|h|g|f|d|c|b)(&)(uu|rq|ou|ii|ei|ee|ax|aa|u|o|i)(&)(dxhq|uuv|txh|rqv|ouv|khq|iiv|eiv|eev|dxq|dxh|aev|aav|uv|uu|tx|th|sx|sh|rq|ph|ov|ou|nx|nj|ng|mq|kq|kh|jh|iv|ii|hq|gq|gh|ei|ee|dx|dh|ch|bh|ax|aa|z|y|w|u|t|s|r|q|p|o|n|m|l|k|j|i|h|g|f|d|c|b)(&)eu(&)(dxhq|txh|khq|dxq|dxh|tx|th|sx|sh|ph|nx|nj|ng|kq|kh|jh|gq|gh|dx|dh|ch|bh|z|y|y|w|t|s|r|p|n|m|l|k|j|h|g|f|d|c|b)'
    return t

def t_nukchan_b(t):
    r'(&)*(txh|dxh|tx|th|sx|sh|ph|nx|nj|ng|lx|kh|jh|gh|dx|dh|ch|bh|y|w|t|s|r|p|n|m|l|k|j|h|g|d|c|b)(&)(mq|hq|q)'
    return t

def t_nukchan_a(t):
    r'(&)*(dxhq|khq|dxq|kq|gq|z|y|f)(&)(mq|hq|q)'
    return t

def t_yarule(t):
    r'(&)*(uuv|rqv|iiv|uv|iv)(&)(y)'
    return t

def t_vowel(t):
    r'(&)*(uu&mq|uu&hq|rq&mq|rq&hq|ou&mq|ou&hq|ii&mq|ii&hq|ei&mq|ei&hq|ee&mq|ee&hq|aa&mq|aa&hq|uu&q|u&mq|u&hq|rq&q|ou&q|o&mq|o&hq|ii&q|i&mq|i&hq|ei&q|ee&q|aa&q|a&mq|a&hq|u&q|o&q|i&q|a&q|uu|rq|ou|ii|ei|ee|ax|aa|u|o|i|a)'
    return t

def t_fullvowel_a(t):
    r'.'
    return t

class Token(NamedTuple):
    type: str
    value: str

class Lexer:
    def __init__(self):
        # tokens identified by the lexer
        self.tokens = ('kaki_c', 'conjsyll2_c', 'fullvowel_b', 'kaki_a', 'kaki_b', 'conjsyll2_b', 'conjsyll2_a',
                        'conjsyll1', 'nukchan_b','nukchan_a', 'yarule', 'fullvowel_a', 'vowel')
        self.token_specification = []
        for tkn in self.tokens:
            # each rule function carries its regex as its docstring
            fn = globals()['t_' + tkn]
            self.token_specification += [(tkn, r'{}'.format(fn.__doc__), fn)]

        self.patterns = []
        for pr in self.token_specification:
            pn = re.compile(pr[1])
            self.patterns += [pn]
        self.tokencount = len(self.token_specification)
        self.data = ''
        self.idx = 0

    def input(self,data):
        self.data = data
        self.idx = 0

    def token(self):
        # pick the rule with the longest match at the current position
        maxlen = 0
        maxidx = -1
        maxmo = None
        for i in range(self.tokencount):
            mo = self.patterns[i].match(self.data, self.idx)
            if mo is not None:
                molen = mo.end() - mo.start()
                if molen > maxlen:
                    maxlen = molen
                    maxidx = i
                    maxmo = mo

        if maxlen == 0:
            return None
        self.idx += maxlen
        tok = self.token_specification[maxidx][2](maxmo.group())
        return Token(type = self.tokens[maxidx], value=tok)
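
`Lexer.token` tries every rule at the current position and keeps the one whose regex matches the longest prefix; because the comparison is strict, the earliest rule wins when lengths tie. The sketch below shows the same longest-match idea in isolation. The rule names and patterns (`CONS`, `VOWEL`, `SEP`) and the `tokenize` helper are toy assumptions, not the parser's rules, and unlike `Lexer.token` (which returns `None`) the sketch raises when nothing matches.

```python
import re
from typing import List, Tuple

# Toy rules (assumption). Python alternation is ordered, so longer
# alternatives such as 'kh' are listed before their prefixes.
RULES = [
    ("CONS", re.compile(r"kh|k|g")),
    ("VOWEL", re.compile(r"aa|a|i")),
    ("SEP", re.compile(r"&")),
]

def tokenize(data: str) -> List[Tuple[str, str]]:
    """Split data into (rule name, text) pairs using longest-match selection."""
    tokens = []
    idx = 0
    while idx < len(data):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            mo = pattern.match(data, idx)
            # Strict '>' keeps the earliest rule when match lengths tie.
            if mo is not None and mo.end() - idx > best_len:
                best_name, best_len = name, mo.end() - idx
        if best_name is None:
            raise ValueError(f"no rule matches at position {idx}")
        tokens.append((best_name, data[idx:idx + best_len]))
        idx += best_len
    return tokens

print(tokenize("kh&aa&k"))
```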
Unified_parser/ply/yacc.py
ADDED
The diff for this file is too large to render.
Unified_parser/punjabi/extract_punjabi.py
ADDED
@@ -0,0 +1,15 @@
words = set()
with open('text', 'r') as f:
    cnts = f.readlines()
for l in cnts:
    l = l.strip('\n').split(' ')
    for wd in l[1:]:
        wd = wd.strip('.,|? ')
        if wd != '':
            words.add(wd)

words = list(words)
words = sorted(words)
with open('punjabi_words.txt', 'w') as f:
    for w in words:
        f.write(f'{w}\n')
Unified_parser/punjabi/punjabi_asr_sample
ADDED
The diff for this file is too large to render.
Unified_parser/punjabi/punjabi_results.txt
ADDED
The diff for this file is too large to render.
Unified_parser/punjabi/punjabi_words.txt
ADDED
The diff for this file is too large to render.
Unified_parser/punjabi/runner_punjabi.py
ADDED
@@ -0,0 +1,13 @@
from parser import wordparse
from joblib import Parallel, delayed
from tqdm import tqdm

with open('punjabi_words.txt', 'r') as f:
    words = f.readlines()

words = [wd.strip() for wd in words]
anslist = Parallel(n_jobs=1)(delayed(wordparse)(wd, 0, 0) for wd in tqdm(words))

with open('punjabi_results.txt', 'w') as f:
    for i in range(len(words)):
        f.write(f'{words[i]} = {anslist[i]}\n')
Unified_parser/pypi_package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 vikram-kv

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Unified_parser/pypi_package/README.md
ADDED
@@ -0,0 +1,34 @@
# Python_Unified_Parser

This parser unifies Indian languages through the Common Label Set. Its design exploits the syllable structure that Indian languages share. The Unified Parser converts UTF-8 text to the common label set, applies letter-to-sound rules, and generates the corresponding phoneme sequences. The effort is a step towards a natural language understanding system that operates on Indian languages and generates parsed output. The method is structured and requires only basic knowledge of a language. With good lexicons, more than 95% of the words in a language can be parsed correctly. Because Indian languages share so much structure, the unified approach can be extended to other Indian languages with minimal time and effort.

Our Python parser, [uparser.py](src/indic-unified-parser/uparser.py), combines lex and yacc functionality in a single Python script using the [PLY](src/indic-unified-parser/ply) framework.

## Publications
[Baby, Arun, et al. "A unified parser for developing Indian language text to speech synthesizers." Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings 19. Springer International Publishing, 2016.](https://www.iitm.ac.in/donlab/tts/downloads/unified/unified.pdf)

## Installation

```bash
pip install indic_unified_parser
```

## How to use

```python
from indic_unified_parser.uparser import wordparse
parsed_output_string = wordparse(<word : str>, <lsflag : int>, <wfflag : int>, <clearflag : int>)
```

1. `lsflag`: always 0. Deprecated.
2. `wfflag`: 0 for monophone parsing, 1 for syllable parsing, 2 for akshara parsing.
3. `clearflag`: 1 to strip the Lisp-like output format and produce plain space-separated output; otherwise 0.

## Examples
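
The sketch below does not call the parser. It assumes that output produced with `clearflag=0` has the Lisp-like shape `(( "p1" "p2" ) 0) ` per syllable, and shows one way such a string could be flattened into plain tokens. The helper name `flatten_lisp_output` and the sample string are hypothetical; the sample is hand-written for illustration and is not real parser output.

```python
import re
from typing import List

def flatten_lisp_output(text: str) -> List[List[str]]:
    """Return one list of quoted tokens per '(( ... ) 0)' group."""
    groups = re.findall(r'\(\((.*?)\)\s*0\)', text)
    return [re.findall(r'"([^"]*)"', group) for group in groups]

# Hand-written sample in the assumed shape (not produced by the parser).
sample = '(( "k" "a" ) 0) (( "m" "a" "l" ) 0) '
syllables = flatten_lisp_output(sample)
print(" ".join(phone for syllable in syllables for phone in syllable))
```

Keeping the per-group lists preserves syllable boundaries, which a flat space-separated string would lose.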

## URLs
[Homepage](https://github.com/vikram-kv/Unified_Parser)

## Authors

Vikram K V, Dual Degree, Computer Science Dept, IIT Madras.