Committed by freemt
Commit • 844aef2
1 Parent(s): f469d4d
Update ubee uclas
- .gitignore +1 -1
- data/test_en.txt +69 -0
- data/test_zh.txt +74 -0
- packages.txt +3 -0
- requirements.txt +12 -1
- ubee/__main__.py +55 -20
- ubee/gradiobee.py +7 -0
- ubee/seg_text.py +121 -0
- ubee/ubee.py +39 -5
- ubee/uclas.py +81 -0
.gitignore
CHANGED
@@ -140,4 +140,4 @@ cython_debug/
|
|
140 |
*.swp
|
141 |
links/
|
142 |
# .gitignore
|
143 |
-
|
|
|
140 |
*.swp
|
141 |
links/
|
142 |
# .gitignore
|
143 |
+
node_modulescachedir
|
data/test_en.txt
ADDED
@@ -0,0 +1,69 @@
|
1 |
+
Wuthering Heights
|
2 |
+
|
3 |
+
|
4 |
+
--------------------------------------------------------------------------------
|
5 |
+
|
6 |
+
Chapter 2
|
7 |
+
|
8 |
+
Chinese
|
9 |
+
|
10 |
+
|
11 |
+
Yesterday afternoon set in misty and cold. I had half a mind to spend it by my study fire, instead of wading through heath and mud to Wuthering Heights. On coming up from dinner, however (N.B. I dine between twelve and one o'clock; the housekeeper, a matronly lady, taken as a fixture along with the house, could not, or would not, comprehend my request that I might be served at five), on mounting the stairs with this lazy intention, and stepping into the room, I saw a servant girl on her knees surrounded by brushes and coal-scuttles, and raising an infernal dust as she extinguished the flames with heaps of cinders. This spectacle drove me back immediately; I took my hat, and, after a four-miles' walk, arrived at Heathcliff's garden gate just in time to escape the first feathery flakes of a snow shower.
|
12 |
+
|
13 |
+
On that bleak hill top the earth was hard with a black frost, and the air made me shiver through every limb. Being unable to remove the chain, I jumped over, and, running up the flagged causeway bordered with straggling gooseberry bushes, knocked vainly for admittance, till my knuckles tingled and the dogs howled.
|
14 |
+
|
15 |
+
`Wretched inmates!' I ejaculated mentally, `you deserve perpetual isolation from your species for your churlish inhospitality. At least, I would not keep my doors barred in the day time. I don't care--I will get in!' So resolved, I grasped the latch and shook it vehemently. Vinegar-faced Joseph projected his head from a round window of the barn.
|
16 |
+
|
17 |
+
`Whet are ye for?' he shouted. `T' maister's dahn i' t' fowld. Go rahnd by th' end ut' laith, if yah went tuh spake tull him.'
|
18 |
+
|
19 |
+
`Is there nobody inside to open the door?' I hallooed, responsively.
|
20 |
+
|
21 |
+
`They's nobbut t' missis; and shoo'll nut oppen't an ye mak yer flaysome dins till neeght.'
|
22 |
+
|
23 |
+
`Why? Cannot you tell her who I am, eh, Joseph?'
|
24 |
+
|
25 |
+
`Nor-ne me! Aw'll hae noa hend wi't,' muttered the head, vanishing.
|
26 |
+
|
27 |
+
The snow began to drive thickly. I seized the handle to essay another trial; when a young man without coat, and shouldering a pitchfork, appeared in the yard behind. He hailed me to follow him, and, after marching through a wash-house, and a paved area containing a coal shed, pump, and pigeon cot, we at length arrived in the huge, warm, cheerful apartment, where I was formerly received. It glowed delightfully in the radiance of an immense fire, compounded of coal, peat, and wood; and near the table, laid for a plentiful evening meal, I was pleased to observe the `missis', an individual whose existence I had never previously suspected. I bowed and waited, thinking she would bid me take a seat. She looked at me, leaning back in her chair, and remained motionless and mute.
|
28 |
+
|
29 |
+
`Rough weather!' I remarked. `I'm afraid, Mrs Heathcliff, the door must bear the consequence of your servants' leisure attendance: I had hard work to make them hear me.'
|
30 |
+
|
31 |
+
She never opened her mouth. I stared--she stared also: at any rate, she kept her eyes on me in a cool, regardless manner, exceedingly embarrassing and disagreeable.
|
32 |
+
|
33 |
+
`Sit down,' said the young man gruffly. `He'll be in soon.'
|
34 |
+
|
35 |
+
I obeyed; and hemmed, and called the villain Juno, who deigned, at this second interview, to move the extreme tip of her tail, in token of owning my acquaintance.
|
36 |
+
|
37 |
+
`A beautiful animal!' I commenced again. `Do you intend parting with the little ones, madam?'
|
38 |
+
|
39 |
+
`They are not mine,' said the amiable hostess, more repellingly than Heathcliff himself could have replied.
|
40 |
+
|
41 |
+
`Ah, your favourites are among these?' I continued, turning to an obscure cushion full of something like cats.
|
42 |
+
|
43 |
+
`A strange choice of favourites!' she observed scornfully.
|
44 |
+
|
45 |
+
Unluckily, it was a heap of dead rabbits. I hemmed once more, and drew closer to the hearth, repeating my comment on the wildness of the evening.
|
46 |
+
|
47 |
+
`You should not have come out,' she said, rising and reaching from the chimney-piece two of the painted canisters.
|
48 |
+
|
49 |
+
Her position before was sheltered from the light; now, I had a distinct view of her whole figure and countenance. She was slender, and apparently scarcely past girlhood: an admirable form, and the most exquisite little face that I have ever had the pleasure of beholding; small features, very fair; flaxen ringlets, or rather golden, hanging loose on her delicate neck; and eyes, had they been agreeable in expression, they would have been irresistible: fortunately for my susceptible heart, the only sentiment they evinced hovered between scorn, and a kind of desperation, singularly unnatural to be detected there. The canisters were almost out of her reach; I made a motion to aid her; she turned upon me as a miser might turn if anyone attempted to assist him in counting his gold.
|
50 |
+
|
51 |
+
`I don't want your help,' she snapped; `I can get them for myself.'
|
52 |
+
|
53 |
+
`I beg your pardon!' I hastened to reply.
|
54 |
+
|
55 |
+
`Were you asked to tea?' she demanded, tying an apron over her neat black frock, and standing with a spoonful of the leaf poised over the pot.
|
56 |
+
|
57 |
+
`I shall be glad to have a cup,' I answered.
|
58 |
+
|
59 |
+
`Were you asked?' she repeated.
|
60 |
+
|
61 |
+
`No,' I said, half smiling. `You are the proper person to ask me.'
|
62 |
+
|
63 |
+
|
64 |
+
|
65 |
+
Contents PreviousChapter
|
66 |
+
NextChapter
|
67 |
+
|
68 |
+
|
69 |
+
Homepage
|
data/test_zh.txt
ADDED
@@ -0,0 +1,74 @@
|
1 |
+
呼啸山庄
|
2 |
+
|
3 |
+
--------------------------------------------------------------------------------
|
4 |
+
|
5 |
+
第二章
|
6 |
+
|
7 |
+
英文
|
8 |
+
|
9 |
+
|
10 |
+
昨天下午又冷又有雾。我想就在书房炉边消磨一下午,不想踩着杂草污泥到呼啸山庄了。
|
11 |
+
|
12 |
+
但是,吃过午饭(注意——我在十二点与一点钟之间吃午饭,而可以当作这所房子的附属物的管家婆,一位慈祥的太太却不能,或者并不愿理解我请求在五点钟开饭的用意),在我怀着这个懒惰的想法上了楼,迈进屋子的时候,看见一个女仆跪在地上,身边是扫帚和煤斗。她正在用一堆堆煤渣封火,搞起一片弥漫的灰尘。这景象立刻把我赶回头了。我拿了帽子,走了四里路,到达了希刺克厉夫的花园口口,刚好躲过了一场今年初降的鹅毛大雪。
|
13 |
+
|
14 |
+
在那荒凉的山顶上,土地由于结了一层黑冰而冻得坚硬,冷空气使我四肢发抖。我弄不开门链,就跳进去,顺着两边种着蔓延的醋栗树丛的石路跑去。我白白地敲了半天门,一直敲到我的手指骨都痛了,狗也狂吠起来。
|
15 |
+
|
16 |
+
“倒霉的人家!”我心里直叫,“只为你这样无礼待客,就该一辈子跟人群隔离。我至少还不会在白天把门闩住。我才不管呢——我要进去!”如此决定了。我就抓住门闩,使劲摇它。苦脸的约瑟夫从谷仓的一个圆窗里探出头来。
|
17 |
+
|
18 |
+
“你干吗?”他大叫。“主人在牛栏里,你要是找他说话,就从这条路口绕过去。”
|
19 |
+
|
20 |
+
“屋里没人开门吗?”我也叫起来。
|
21 |
+
|
22 |
+
“除了太太没有别人。你就是闹腾到夜里,她也不会开。”
|
23 |
+
|
24 |
+
“为什么?你就不能告诉她我是谁吗,呃,约瑟夫?”
|
25 |
+
|
26 |
+
“别找我!我才不管这些闲事呢,”这个脑袋咕噜着,又不见了。
|
27 |
+
|
28 |
+
雪开始下大了。我握住门柄又试一回。这时一个没穿外衣的年轻人,扛着一根草耙,在后面院子里出现了。他招呼我跟着他走,穿过了一个洗衣房和一片铺平的地,那儿有煤棚、抽水机和鸽笼,我们终于到了我上次被接待过的那间温暖的、热闹的大屋子。煤、炭和木材混合在一起燃起的熊熊炉火,使这屋子放着光彩。在准备摆上丰盛晚餐的桌旁,我很高兴地看到了那位“太太”,以前我从未料想到会有这么一个人存在的。我鞠躬等候,以为她会叫我坐下。她望望我,往她的椅背一靠,不动,也不出声。
|
29 |
+
|
30 |
+
“天气真坏!”我说,“希刺克厉夫太太,恐怕大门因为您的仆人偷懒而大吃苦头,我费了好大劲才使他们听见我敲门!”
|
31 |
+
|
32 |
+
她死不开口。我瞪眼——她也瞪眼。反正她总是以一种冷冷的、漠不关心的神气盯住我,使人十分窘,而且不愉快。
|
33 |
+
|
34 |
+
“坐下吧,”那年轻人粗声粗气地说,“他就要来了。”
|
35 |
+
|
36 |
+
我服从了;轻轻咳了一下,叫唤那恶狗朱诺。临到第二次会面,它总算赏脸,摇起尾巴尖,表示认我是熟人了。
|
37 |
+
|
38 |
+
“好漂亮的狗!”我又开始说话。“您是不是打算不要这些小的呢,夫人?”
|
39 |
+
|
40 |
+
“那些不是我的,”这可爱可亲的女主人说,比希刺克厉夫本人所能回答的腔调还要更冷淡些。
|
41 |
+
|
42 |
+
“啊,您所心爱的是在这一堆里啦!”我转身指着一个看不清楚的靠垫上那一堆像猫似的东西,接着说下去。
|
43 |
+
|
44 |
+
“谁会爱这些东西那才怪呢!”她轻蔑地说。
|
45 |
+
|
46 |
+
倒霉,原来那是堆死兔子。我又轻咳一声,向火炉凑近些,又把今晚天气不好的话评论一通。
|
47 |
+
|
48 |
+
“你本来就不该出来。”她说,站起来去拿壁炉台上的两个彩色茶叶罐。
|
49 |
+
|
50 |
+
她原先坐在光线被遮住的地方,现在我把她的全身和面貌都看得清清楚楚。她苗条,显然还没有过青春期。挺好看的体态,还有一张我生平从未有幸见过的绝妙的小脸蛋。五官纤丽,非常漂亮。淡黄色的卷发,或者不如说是金黄色的,松松地垂在她那细嫩的颈上。至于眼睛,要是眼神能显得和悦些,就要使人无法抗拒了。对我这容易动情的心说来倒是常事,因为它们所表现的只是在轻蔑与近似绝望之间的一种情绪,而在那张脸上看见那样的眼神是特别不自然的。
|
51 |
+
|
52 |
+
她简直够不到茶叶罐。我动了一动,想帮她一下。她猛地扭转身向我,像守财奴看见别人打算帮他数他的金子一样。
|
53 |
+
|
54 |
+
“我不要你帮忙,”她怒气冲冲地说,“我自己拿得到。”
|
55 |
+
|
56 |
+
“对不起!”我连忙回答。
|
57 |
+
|
58 |
+
“是请你来吃茶的吗?”她问,把一条围裙系在她那干净的黑衣服上,就这样站着,拿一匙茶叶正要往茶壶里倒。
|
59 |
+
|
60 |
+
“我很想喝杯茶。”我回答。
|
61 |
+
|
62 |
+
“是请你来的吗?”她又问。
|
63 |
+
|
64 |
+
“没有,”我说,勉强笑一笑。“您正好请我喝茶。”
|
65 |
+
|
66 |
+
|
67 |
+
|
68 |
+
|
69 |
+
目录
|
70 |
+
上一章
|
71 |
+
下一章
|
72 |
+
|
73 |
+
|
74 |
+
返回首页
|
packages.txt
ADDED
@@ -0,0 +1,3 @@
|
1 |
+
icu-doc
|
2 |
+
libicu-dev
|
3 |
+
pkg-config
|
requirements.txt
CHANGED
@@ -1,11 +1,22 @@
|
|
1 |
gradio
|
|
|
2 |
transformers
|
3 |
sentencepiece
|
4 |
sklearn
|
5 |
-
logzero
|
6 |
git+https://github.com/ffreemt/fast-langid
|
7 |
git+https://github.com/ffreemt/align-model-pool
|
8 |
sentence-transformers
|
9 |
sentence_splitter
|
|
|
10 |
icecream
|
11 |
alive-progress
|
|
1 |
gradio
|
2 |
+
install
|
3 |
transformers
|
4 |
sentencepiece
|
5 |
sklearn
|
|
|
6 |
git+https://github.com/ffreemt/fast-langid
|
7 |
git+https://github.com/ffreemt/align-model-pool
|
8 |
sentence-transformers
|
9 |
sentence_splitter
|
10 |
+
logzero
|
11 |
icecream
|
12 |
alive-progress
|
13 |
+
more_itertools
|
14 |
+
#
|
15 |
+
openpyxl
|
16 |
+
# --- seg_text
|
17 |
+
Morfessor
|
18 |
+
pyicu
|
19 |
+
pycld2
|
20 |
+
tqdm
|
21 |
+
polyglot
|
22 |
+
sentence_splitter
|
ubee/__main__.py
CHANGED
@@ -1,9 +1,14 @@
|
|
1 |
"""Gen ubee main."""
|
2 |
# pylint: disable=unused-import, wrong-import-position
|
3 |
|
|
|
|
|
|
|
4 |
import sys
|
5 |
-
from
|
6 |
-
|
|
|
|
|
7 |
|
8 |
import gradio as gr
|
9 |
|
@@ -18,6 +23,7 @@ if "." not in sys.path:
|
|
18 |
|
19 |
from ubee.ubee import ubee
|
20 |
|
|
|
21 |
ic_install()
|
22 |
ic.configureOutput(
|
23 |
includeContext=True,
|
@@ -32,7 +38,11 @@ def greet1(name):
|
|
32 |
return "Hello " + name + "!!"
|
33 |
|
34 |
|
35 |
-
def greet(
|
|
|
|
|
|
|
|
|
36 |
"""Take inputs, return outputs.
|
37 |
|
38 |
Args:
|
@@ -44,12 +54,25 @@ def greet(text1, text2) -> pd.DataFrame:
|
|
44 |
res1 = [elm.strip() for elm in text1.splitlines() if elm.strip()]
|
45 |
res2 = [elm.strip() for elm in text2.splitlines() if elm.strip()]
|
46 |
|
47 |
-
_ = pd.DataFrame(zip_longest(res1, res2), columns=["text1", "text2"])
|
48 |
-
return _
|
|
|
|
|
|
|
|
|
49 |
|
50 |
|
51 |
def main():
|
52 |
"""Create main entry."""
|
|
53 |
title = "Ultimatumbee Aligner"
|
54 |
theme = "dark-grass"
|
55 |
description = """WIP showcasing a novel aligner"""
|
@@ -62,20 +85,8 @@ def main():
|
|
62 |
|
63 |
lines = 15
|
64 |
placeholder = "Type or paste text here"
|
65 |
-
default1 =
|
66 |
-
|
67 |
-
test 1
|
68 |
-
abc
|
69 |
-
love you
|
70 |
-
"""
|
71 |
-
)
|
72 |
-
default2 = dedent(
|
73 |
-
"""
|
74 |
-
爱你
|
75 |
-
甲乙丙
|
76 |
-
测试 1
|
77 |
-
"""
|
78 |
-
)
|
79 |
label1 = "text1"
|
80 |
label2 = "text2"
|
81 |
inputs = [
|
@@ -85,6 +96,12 @@ def main():
|
|
85 |
gr.inputs.Textbox(
|
86 |
lines=lines, placeholder=placeholder, default=default2, label=label2
|
87 |
),
|
|
|
|
|
|
|
|
|
|
|
|
|
88 |
]
|
89 |
|
90 |
out_df = gr.outputs.Dataframe(
|
@@ -95,8 +112,26 @@ def main():
|
|
95 |
type="auto",
|
96 |
label="To be aligned",
|
97 |
)
|
98 |
-
|
|
99 |
out_df,
|
|
|
|
|
100 |
]
|
101 |
|
102 |
iface = gr.Interface(
|
|
|
1 |
"""Gen ubee main."""
|
2 |
# pylint: disable=unused-import, wrong-import-position
|
3 |
|
4 |
+
from typing import Tuple
|
5 |
+
|
6 |
+
from pathlib import Path
|
7 |
import sys
|
8 |
+
from random import shuffle
|
9 |
+
|
10 |
+
# from itertools import zip_longest
|
11 |
+
# from textwrap import dedent
|
12 |
|
13 |
import gradio as gr
|
14 |
|
|
|
23 |
|
24 |
from ubee.ubee import ubee
|
25 |
|
26 |
+
logzero.loglevel(10)
|
27 |
ic_install()
|
28 |
ic.configureOutput(
|
29 |
includeContext=True,
|
|
|
38 |
return "Hello " + name + "!!"
|
39 |
|
40 |
|
41 |
+
def greet(
|
42 |
+
text1,
|
43 |
+
text2,
|
44 |
+
thresh: float
|
45 |
+
) -> Tuple[pd.DataFrame, pd.DataFrame]:
|
46 |
"""Take inputs, return outputs.
|
47 |
|
48 |
Args:
|
|
|
54 |
res1 = [elm.strip() for elm in text1.splitlines() if elm.strip()]
|
55 |
res2 = [elm.strip() for elm in text2.splitlines() if elm.strip()]
|
56 |
|
57 |
+
# _ = pd.DataFrame(zip_longest(res1, res2), columns=["text1", "text2"])
|
58 |
+
# return _
|
59 |
+
|
60 |
+
res1_, res2_ = ubee(res1, res2, thresh)
|
61 |
+
|
62 |
+
return pd.DataFrame(res1_, columns=["text1", "text2", "likelihood"]), pd.DataFrame(res2_, columns=["text1", "text2"])
|
63 |
|
64 |
|
65 |
def main():
|
66 |
"""Create main entry."""
|
67 |
+
text_zh = Path("data/test_zh.txt").read_text("utf8")
|
68 |
+
text_en = [
|
69 |
+
elm.strip()
|
70 |
+
for elm in Path("data/test_en.txt").read_text("utf8").splitlines()
|
71 |
+
if elm.strip()
|
72 |
+
]
|
73 |
+
shuffle(text_en)
|
74 |
+
text_en = "\n\n".join(text_en)
|
75 |
+
|
76 |
title = "Ultimatumbee Aligner"
|
77 |
theme = "dark-grass"
|
78 |
description = """WIP showcasing a novel aligner"""
|
|
|
85 |
|
86 |
lines = 15
|
87 |
placeholder = "Type or paste text here"
|
88 |
+
default1 = text_zh
|
89 |
+
default2 = text_en
|
|
90 |
label1 = "text1"
|
91 |
label2 = "text2"
|
92 |
inputs = [
|
|
|
96 |
gr.inputs.Textbox(
|
97 |
lines=lines, placeholder=placeholder, default=default2, label=label2
|
98 |
),
|
99 |
+
gr.inputs.Slider(
|
100 |
+
minimum=0.0,
|
101 |
+
maximum=1.0,
|
102 |
+
step=0.1,
|
103 |
+
default=0.5,
|
104 |
+
),
|
105 |
]
|
106 |
|
107 |
out_df = gr.outputs.Dataframe(
|
|
|
112 |
type="auto",
|
113 |
label="To be aligned",
|
114 |
)
|
115 |
+
aligned = gr.outputs.Dataframe(
|
116 |
+
headers=None,
|
117 |
+
max_rows=lines, # 20
|
118 |
+
max_cols=None,
|
119 |
+
overflow_row_behaviour="paginate",
|
120 |
+
type="auto",
|
121 |
+
label="Aligned",
|
122 |
+
)
|
123 |
+
leftover = gr.outputs.Dataframe(
|
124 |
+
headers=None,
|
125 |
+
max_rows=lines, # 20
|
126 |
+
max_cols=None,
|
127 |
+
overflow_row_behaviour="paginate",
|
128 |
+
type="auto",
|
129 |
+
label="Leftover",
|
130 |
+
)
|
131 |
+
outputs = [ # tot. 3
|
132 |
out_df,
|
133 |
+
aligned,
|
134 |
+
leftover,
|
135 |
]
|
136 |
|
137 |
iface = gr.Interface(
|
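The new `main()` loads both sample files, shuffles the English paragraphs, and uses the results as textbox defaults, while `greet` keeps only non-blank, stripped lines. A minimal standalone sketch of that preprocessing follows, using only the standard library. The helper names `nonblank_lines` and `shuffled_paragraphs` and the `seed` parameter are hypothetical and do not appear in the commit.

```python
import random
from typing import List, Optional


def nonblank_lines(text: str) -> List[str]:
    """Return stripped lines, dropping blank ones (as greet does)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def shuffled_paragraphs(text: str, seed: Optional[int] = None) -> str:
    """Shuffle non-blank lines and rejoin them with a blank line between.

    `seed` is a hypothetical addition for reproducible demos; the commit
    calls random.shuffle without a seed.
    """
    paras = nonblank_lines(text)
    random.Random(seed).shuffle(paras)
    return "\n\n".join(paras)


if __name__ == "__main__":
    sample = "Chapter 2\n\n  first paragraph \n\nsecond paragraph\n"
    print(nonblank_lines(sample))
    print(shuffled_paragraphs(sample, seed=0))
```

Shuffling one side before alignment makes the demo harder than a line-by-line pairing, which is presumably why the commit does it; that reading is an inference, not something the commit states.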
ubee/gradiobee.py
ADDED
@@ -0,0 +1,7 @@
|
1 |
+
"""Function for gr's fn."""
|
2 |
+
# pylint: disable=
|
3 |
+
|
4 |
+
|
5 |
+
def ubee():
|
6 |
+
"""Gen a dummy."""
|
7 |
+
...
|
ubee/seg_text.py
ADDED
@@ -0,0 +1,121 @@
|
1 |
+
"""Split text to sentences.
|
2 |
+
|
3 |
+
Use sentence_splitter if supported,
|
4 |
+
else use polyglot.text.Text
|
5 |
+
|
6 |
+
!apt install libicu-dev
|
7 |
+
!pip install pyicu pycld2 Morfessor
|
8 |
+
!pip install polyglot sentence_splitter
|
9 |
+
"""
|
10 |
+
# pylint: disable=
|
11 |
+
|
12 |
+
from typing import List, Optional, Union
|
13 |
+
|
14 |
+
import re
|
15 |
+
from tqdm.auto import tqdm
|
16 |
+
from polyglot.detect.base import logger as polyglot_logger
|
17 |
+
from polyglot.text import Detector, Text
|
18 |
+
from sentence_splitter import split_text_into_sentences
|
19 |
+
|
20 |
+
from logzero import logger
|
21 |
+
|
22 |
+
# turn off polyglot.text.Detector warning
|
23 |
+
polyglot_logger.setLevel("ERROR")
|
24 |
+
|
25 |
+
|
26 |
+
# fmt: off
|
27 |
+
# use sentence_splitter if supported
|
28 |
+
LANG_S = ["ca", "cs", "da", "nl", "en", "fi", "fr", "de",
|
29 |
+
"el", "hu", "is", "it", "lv", "lt", "no", "pl",
|
30 |
+
"pt", "ro", "ru", "sk", "sl", "es", "sv", "tr"]
|
31 |
+
|
32 |
+
|
33 |
+
def _seg_text(
|
34 |
+
text: str,
|
35 |
+
lang: Optional[str] = None,
|
36 |
+
# qmode: bool = False,
|
37 |
+
maxlines: int = 1000
|
38 |
+
) -> List[str]:
|
39 |
+
# fmt: on
|
40 |
+
"""Split text to sentences.
|
41 |
+
|
42 |
+
Use sentence_splitter if supported,
|
43 |
+
else use polyglot.text.Text.sentences
|
44 |
+
Blank lines will be removed.
|
45 |
+
|
46 |
+
qmode: quick mode, skip split_text_into_sentences if True, default False
|
47 |
+
vectors for all books are based on qmode=False.
|
48 |
+
qmode=True is for quick test purpose only
|
49 |
+
|
50 |
+
maxlines (default 1000), threshold for turning on the tqdm progress bar
|
51 |
+
set to <1 or a large number to turn it off
|
52 |
+
"""
|
53 |
+
if lang is None:
|
54 |
+
try:
|
55 |
+
lang = Detector(text).language.code
|
56 |
+
except Exception as exc:
|
57 |
+
logger.info("text[:30]: %s", text[:30])
|
58 |
+
logger.warning(
|
59 |
+
"polyglot.text.Detector exc: %s, setting to 'en'",
|
60 |
+
exc
|
61 |
+
)
|
62 |
+
lang = "en"
|
63 |
+
|
64 |
+
# if not qmode and lang in LANG_S:
|
65 |
+
if lang in LANG_S:
|
66 |
+
_ = []
|
67 |
+
lines = text.splitlines()
|
68 |
+
# if maxlines > 1 and len(lines) > maxlines:
|
69 |
+
if len(lines) > maxlines > 1:
|
70 |
+
for para in tqdm(lines):
|
71 |
+
if para.strip():
|
72 |
+
_.extend(split_text_into_sentences(para, lang))
|
73 |
+
else:
|
74 |
+
for para in lines:
|
75 |
+
if para.strip():
|
76 |
+
_.extend(split_text_into_sentences(para, lang))
|
77 |
+
return _
|
78 |
+
|
79 |
+
# return split_text_into_sentences(text, lang)
|
80 |
+
|
81 |
+
# empty "" text or blank to avoid Exception
|
82 |
+
if not text.strip():
|
83 |
+
return []
|
84 |
+
|
85 |
+
return [elm.string for elm in Text(text, lang).sentences]
|
86 |
+
|
87 |
+
|
88 |
+
# fmt: off
|
89 |
+
def seg_text(
|
90 |
+
lst: Union[str, List[str]],
|
91 |
+
lang: Optional[str] = None,
|
92 |
+
maxlines: int = 1000,
|
93 |
+
extra: Optional[str] = None,
|
94 |
+
) -> List[str]:
|
95 |
+
# fmt:on
|
96 |
+
"""Split a list of text.
|
97 |
+
|
98 |
+
Arguments:
|
99 |
+
lst: text or text list
|
100 |
+
lang: optional lang code
|
101 |
+
maxlines: (default 1000), threshold for turning on the tqdm progress bar; set to <1 or a large number to turn it off
|
102 |
+
extra: regex; a newline is inserted after each match before splitting, i.e. re.sub(rf"({extra})", r"\1\n", text)
|
103 |
+
Returns:
|
104 |
+
list of split text.
|
105 |
+
"""
|
106 |
+
if isinstance(lst, str):
|
107 |
+
lst = [lst]
|
108 |
+
|
109 |
+
if extra:
|
110 |
+
# insert \n
|
111 |
+
lst = [re.sub(rf"({extra})", r"\1\n", elm) for elm in lst]
|
112 |
+
|
113 |
+
res = []
|
114 |
+
for elm in lst:
|
115 |
+
res.extend(_seg_text(
|
116 |
+
elm,
|
117 |
+
lang=lang,
|
118 |
+
maxlines=maxlines,
|
119 |
+
))
|
120 |
+
|
121 |
+
return res
|
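Two details of `seg_text` are easy to miss. The `extra` pattern does not split directly: it inserts a newline after each match. And the chained comparison `len(lines) > maxlines > 1` decides whether tqdm is shown. The sketch below isolates both with the standard library only and leaves the actual sentence splitters out. `split_on_extra` and `wants_progress` are hypothetical names used purely for illustration.

```python
import re
from typing import List


def split_on_extra(text: str, extra: str) -> List[str]:
    """Insert a newline after every match of `extra`, then split on lines."""
    marked = re.sub(rf"({extra})", r"\1\n", text)
    return [part.strip() for part in marked.splitlines() if part.strip()]


def wants_progress(n_lines: int, maxlines: int = 1000) -> bool:
    """Mirror the `len(lines) > maxlines > 1` test from _seg_text."""
    return n_lines > maxlines > 1


if __name__ == "__main__":
    print(split_on_extra("one; two; three", ";"))
    print(wants_progress(2000), wants_progress(2000, maxlines=0))
```

Because the pattern is interpolated into a regular expression unescaped, a literal such as `.` would need `re.escape` (or a manual backslash) to avoid matching every character.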
ubee/ubee.py
CHANGED
@@ -1,7 +1,41 @@
|
|
1 |
-
"""
|
2 |
-
# pylint
|
|
|
|
|
3 |
|
|
|
|
|
4 |
|
5 |
-
|
6 |
-
|
7 |
-
|
|
1 |
+
"""Align via ubee."""
|
2 |
+
# pylint: disable=
|
3 |
+
from typing import Iterable, List, Tuple
|
4 |
+
from itertools import zip_longest
|
5 |
|
6 |
+
from logzero import logger
|
7 |
+
from ubee.uclas import uclas
|
8 |
|
9 |
+
|
10 |
+
def ubee(
|
11 |
+
sents_zh: Iterable,
|
12 |
+
sents_en: Iterable,
|
13 |
+
thresh: float = 0.5,
|
14 |
+
) -> Tuple[List[Tuple[str, str, float]], List[Tuple[str, str]]]:
|
15 |
+
"""Align blocks.
|
16 |
+
|
17 |
+
Args:
|
18 |
+
sents_zh: list of text, can be any language supported by clas-l-user
|
19 |
+
sents_en: ditto
|
20 |
+
Returns:
|
21 |
+
list of 3-tuples (text1, text2, likelihood) of aligned blocks
|
22 |
+
leftovers (unaligned)
|
23 |
+
"""
|
24 |
+
res = []
|
25 |
+
labels = [*sents_en]
|
26 |
+
|
27 |
+
lo1 = []
|
28 |
+
lo2 = labels[:]
|
29 |
+
|
30 |
+
for seq in sents_zh:
|
31 |
+
label, likelihood = uclas(seq, labels, thresh=thresh)
|
32 |
+
if label:
|
33 |
+
res.append((seq, label, likelihood))
|
34 |
+
try:
|
35 |
+
lo2.remove(label)
|
36 |
+
except Exception as exc:
|
37 |
+
logger.error(exc)
|
38 |
+
logger.info("seq: %s, label: %s", seq, label)
|
39 |
+
else:
|
40 |
+
lo1.append(seq)
|
41 |
+
return res, [*zip_longest(lo1, lo2)]
|
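The control flow of `ubee` can be tried without loading any model by swapping `uclas` for a stub. The sketch below is one such reading of the loop, not the commit's code: `align`, `toy_classify`, and `TOY_DICT` are hypothetical, and the `try`/`except` around `remove` is replaced by a membership check.

```python
from itertools import zip_longest
from typing import Callable, Iterable, List, Tuple

# Hypothetical lookup table standing in for the real classifier.
TOY_DICT = {"一": "one", "二": "two"}


def toy_classify(seq: str, labels: List[str], thresh: float) -> Tuple[str, float]:
    """Stub for uclas: return (label, score), or ("", 0.0) if no match."""
    label = TOY_DICT.get(seq, "")
    if label in labels:
        return label, 1.0
    return "", 0.0


def align(
    sents_src: Iterable[str],
    sents_tgt: Iterable[str],
    classify: Callable[[str, List[str], float], Tuple[str, float]],
    thresh: float = 0.5,
):
    """Pair each source block with its best target; collect the leftovers."""
    aligned = []
    labels = list(sents_tgt)   # candidates offered to every source block
    left_src = []              # source blocks with no acceptable match
    left_tgt = labels[:]       # target blocks not yet claimed
    for seq in sents_src:
        label, score = classify(seq, labels, thresh)
        if label:
            aligned.append((seq, label, score))
            if label in left_tgt:  # the commit logs an error here instead
                left_tgt.remove(label)
        else:
            left_src.append(seq)
    return aligned, list(zip_longest(left_src, left_tgt))


if __name__ == "__main__":
    print(align(["一", "三", "二"], ["two", "one", "four"], toy_classify))
```

As in the commit, the candidate list is never shrunk, so two source blocks can claim the same target; in the original, the second `remove` then raises and is only logged.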
ubee/uclas.py
ADDED
@@ -0,0 +1,81 @@
|
1 |
+
"""Define uclas."""
|
2 |
+
# pylint: disable=invalid-name
|
3 |
+
|
4 |
+
from typing import List, Tuple, Union
|
5 |
+
import numpy as np
|
6 |
+
from sklearn.metrics.pairwise import cosine_similarity
|
7 |
+
from joblib import Memory
|
8 |
+
|
9 |
+
from model_pool import fetch_check_aux # pylint: disable=import-error
|
10 |
+
from model_pool.model_s import load_model_s # pylint: disable=import-error
|
11 |
+
from model_pool.load_model import load_model # pylint: disable=import-error
|
12 |
+
|
13 |
+
import logzero
|
14 |
+
from logzero import logger
|
15 |
+
|
16 |
+
logzero.loglevel(10)
|
17 |
+
|
18 |
+
fetch_check_aux("/home/user")
|
19 |
+
model_s = load_model_s()
|
20 |
+
clas = load_model("clas-l-user")
|
21 |
+
|
22 |
+
location = "./cachedir"
|
23 |
+
memory = Memory(location, verbose=0)
|
24 |
+
|
25 |
+
|
26 |
+
@memory.cache
|
27 |
+
def cached_clas(*args, **kw):
|
28 |
+
"""Cache clas-l-user."""
|
29 |
+
return clas(*args, **kw)
|
30 |
+
|
31 |
+
|
32 |
+
# cached_clas = memory.cache(cached_clas)
|
33 |
+
|
34 |
+
|
35 |
+
@memory.cache
|
36 |
+
def encode(*args, **kw):
|
37 |
+
"""Cache model_s.encode."""
|
38 |
+
return model_s.encode(*args, **kw)
|
39 |
+
|
40 |
+
|
41 |
+
def uclas(
|
42 |
+
seq: str,
|
43 |
+
labels: Union[List[str], np.ndarray, Tuple[str, ...]],
|
44 |
+
thresh: float = 0.5,
|
45 |
+
multi_label: bool = False,
|
46 |
+
) -> Tuple[str, Union[float, str]]:
|
47 |
+
"""Classify seq with a filter.
|
48 |
+
|
49 |
+
if clas > thresh, return
|
50 |
+
if clas * csim > thresh return
|
51 |
+
if csim > thresh / 2 return
|
52 |
+
return ""
|
53 |
+
"""
|
54 |
+
# _ = clas(seq, labels, multi_label=multi_label)
|
55 |
+
_ = cached_clas(seq, labels, multi_label=multi_label)
|
56 |
+
|
57 |
+
logger.debug("1 %s, %s", _.get("labels")[0], round(_.get("scores")[0], 2))
|
58 |
+
|
59 |
+
if _.get("scores")[0] > thresh:
|
60 |
+
return _.get("labels")[0], round(_.get("scores")[0], 2)
|
61 |
+
|
62 |
+
_ = dict(zip(_.get("labels"), _.get("scores")))
|
63 |
+
|
64 |
+
corr = np.array([_.get(elm) for elm in labels])
|
65 |
+
|
66 |
+
csim = cosine_similarity(encode([seq]), encode(labels))
|
67 |
+
|
68 |
+
corr = corr * csim
|
69 |
+
|
70 |
+
logger.debug("2 %s, %s", corr.argmax(), round(corr.max(), 2))
|
71 |
+
|
72 |
+
if corr.max() > thresh:
|
73 |
+
return labels[corr.argmax()], round(corr.max(), 2)
|
74 |
+
|
75 |
+
logger.debug("3 %s, %s, %s", csim.argmax(), round(csim.max(), 2), thresh / 2)
|
76 |
+
|
77 |
+
logger.debug("T or F: %s", csim.max() > (thresh / 2))
|
78 |
+
if csim.max() > (thresh / 2):
|
79 |
+
return labels[csim.argmax()], round(csim.max(), 2)
|
80 |
+
|
81 |
+
return "", ""
|
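The three-stage filter in `uclas` can be examined in isolation by feeding it precomputed scores instead of calling the models. The sketch below assumes both score lists are in the same order as `labels`; `cascade` and its parameters are hypothetical names, and plain lists stand in for the NumPy arrays used in the commit.

```python
from typing import Sequence, Tuple, Union


def cascade(
    labels: Sequence[str],
    clas_scores: Sequence[float],
    csim_scores: Sequence[float],
    thresh: float = 0.5,
) -> Tuple[str, Union[float, str]]:
    """Return (label, score) from the first stage that clears its threshold."""

    def best(scores):
        # Index and value of the largest score.
        idx = max(range(len(scores)), key=scores.__getitem__)
        return idx, scores[idx]

    # Stage 1: the classifier score alone.
    idx, top = best(clas_scores)
    if top > thresh:
        return labels[idx], round(top, 2)

    # Stage 2: classifier score weighted by cosine similarity.
    combined = [c * s for c, s in zip(clas_scores, csim_scores)]
    idx, top = best(combined)
    if top > thresh:
        return labels[idx], round(top, 2)

    # Stage 3: cosine similarity alone, against half the threshold.
    idx, top = best(csim_scores)
    if top > thresh / 2:
        return labels[idx], round(top, 2)

    return "", ""


if __name__ == "__main__":
    print(cascade(["a", "b"], [0.9, 0.1], [0.2, 0.3]))
    print(cascade(["a", "b"], [0.3, 0.2], [0.1, 0.4]))
```

One thing the sketch makes visible: if classifier scores lie in [0, 1] and cosine similarities never exceed 1, the product in stage 2 cannot exceed a threshold that stage 1 already failed to clear, so under those assumptions only stages 1 and 3 ever return a label.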