freemt committed on
Commit
844aef2
1 Parent(s): f469d4d

Update ubee uclas

.gitignore CHANGED
@@ -140,4 +140,4 @@ cython_debug/
140
  *.swp
141
  links/
142
  # .gitignore
143
- node_modules
 
140
  *.swp
141
  links/
142
  # .gitignore
143
+ node_modulescachedir
data/test_en.txt ADDED
@@ -0,0 +1,69 @@
1
+ Wuthering Heights
2
+
3
+
4
+ --------------------------------------------------------------------------------
5
+
6
+ Chapter 2
7
+
8
+ Chinese
9
+
10
+
11
+ Yesterday afternoon set in misty and cold. I had half a mind to spend it by my study fire, instead of wading through heath and mud to Wuthering Heights. On coming up from dinner, however (N.B. I dine between twelve and one o'clock; the housekeeper, a matronly lady, taken as a fixture along with the house, could not, or would not, comprehend my request that I might be served at five), on mounting the stairs with this lazy intention, and stepping into the room, I saw a servant girl on her knees surrounded by brushes and coal-scuttles, and raising an infernal dust as she extinguished the flames with heaps of cinders. This spectacle drove me back immediately; I took my hat, and, after a four-miles' walk, arrived at Heathcliff's garden gate just in time to escape the first feathery flakes of a snow shower.
12
+
13
+ On that bleak hill top the earth was hard with a black frost, and the air made me shiver through every limb. Being unable to remove the chain, I jumped over, and, running up the flagged causeway bordered with straggling gooseberry bushes, knocked vainly for admittance, till my knuckles tingled and the dogs howled.
14
+
15
+ `Wretched inmates!' I ejaculated mentally, `you deserve perpetual isolation from your species for your churlish inhospitality. At least, I would not keep my doors barred in the day time. I don't care--I will get in!' So resolved, I grasped the latch and shook it vehemently. Vinegar-faced Joseph projected his head from a round window of the barn.
16
+
17
+ `Whet are ye for?' he shouted. `T' maister's dahn i' t' fowld. Go rahnd by th' end ut' laith, if yah went tuh spake tull him.'
18
+
19
+ `Is there nobody inside to open the door?' I hallooed, responsively.
20
+
21
+ `They's nobbut t' missis; and shoo'll nut oppen't an ye mak yer flaysome dins till neeght.'
22
+
23
+ `Why? Cannot you tell her who I am, eh, Joseph?'
24
+
25
+ `Nor-ne me! Aw'll hae noa hend wi't,' muttered the head, vanishing.
26
+
27
+ The snow began to drive thickly. I seized the handle to essay another trial; when a young man without coat, and shouldering a pitchfork, appeared in the yard behind. He hailed me to follow him, and, after marching through a wash-house, and a paved area containing a coal shed, pump, and pigeon cot, we at length arrived in the huge, warm, cheerful apartment, where I was formerly received. It glowed delightfully in the radiance of an immense fire, compounded of coal, peat, and wood; and near the table, laid for a plentiful evening meal, I was pleased to observe the `missis', an individual whose existence I had never previously suspected. I bowed and waited, thinking she would bid me take a seat. She looked at me, leaning back in her chair, and remained motionless and mute.
28
+
29
+ `Rough weather!' I remarked. `I'm afraid, Mrs Heathcliff, the door must bear the consequence of your servants' leisure attendance: I had hard work to make them hear me.'
30
+
31
+ She never opened her mouth. I stared--she stared also: at any rate, she kept her eyes on me in a cool, regardless manner, exceedingly embarrassing and disagreeable.
32
+
33
+ `Sit down,' said the young man gruffly. `He'll be in soon.'
34
+
35
+ I obeyed; and hemmed, and called the villain Juno, who deigned, at this second interview, to move the extreme tip of her tail, in token of owning my acquaintance.
36
+
37
+ `A beautiful animal!' I commenced again. `Do you intend parting with the little ones, madam?'
38
+
39
+ `They are not mine,' said the amiable hostess, more repellingly than Heathcliff himself could have replied.
40
+
41
+ `Ah, your favourites are among these?' I continued, turning to an obscure cushion full of something like cats.
42
+
43
+ `A strange choice of favourites!' she observed scornfully.
44
+
45
+ Unluckily, it was a heap of dead rabbits. I hemmed once more, and drew closer to the hearth, repeating my comment on the wildness of the evening.
46
+
47
+ `You should not have come out,' she said, rising and reaching from the chimney-piece two of the painted canisters.
48
+
49
+ Her position before was sheltered from the light; now, I had a distinct view of her whole figure and countenance. She was slender, and apparently scarcely past girlhood: an admirable form, and the most exquisite little face that I have ever had the pleasure of beholding; small features, very fair; flaxen ringlets, or rather golden, hanging loose on her delicate neck; and eyes, had they been agreeable in expression, they would have been irresistible: fortunately for my susceptible heart, the only sentiment they evinced hovered between scorn, and a kind of desperation, singularly unnatural to be detected there. The canisters were almost out of her reach; I made a motion to aid her; she turned upon me as a miser might turn if anyone attempted to assist him in counting his gold.
50
+
51
+ `I don't want your help,' she snapped; `I can get them for myself.'
52
+
53
+ `I beg your pardon!' I hastened to reply.
54
+
55
+ `Were you asked to tea?' she demanded, tying an apron over her neat black frock, and standing with a spoonful of the leaf poised over the pot.
56
+
57
+ `I shall be glad to have a cup,' I answered.
58
+
59
+ `Were you asked?' she repeated.
60
+
61
+ `No,' I said, half smiling. `You are the proper person to ask me.'
62
+  
63
+
64
+
65
+ Contents PreviousChapter
66
+ NextChapter
67
+
68
+
69
+ Homepage
data/test_zh.txt ADDED
@@ -0,0 +1,74 @@
1
+ 呼啸山庄
2
+
3
+ --------------------------------------------------------------------------------
4
+
5
+ 第二章
6
+
7
+ 英文
8
+
9
+
10
+ 昨天下午又冷又有雾。我想就在书房炉边消磨一下午,不想踩着杂草污泥到呼啸山庄了。
11
+
12
+ 但是,吃过午饭(注意——我在十二点与一点钟之间吃午饭,而可以当作这所房子的附属物的管家婆,一位慈祥的太太却不能,或者并不愿理解我请求在五点钟开饭的用意),在我怀着这个懒惰的想法上了楼,迈进屋子的时候,看见一个女仆跪在地上,身边是扫帚和煤斗。她正在用一堆堆煤渣封火,搞起一片弥漫的灰尘。这景象立刻把我赶回头了。我拿了帽子,走了四里路,到达了希刺克厉夫的花园口口,刚好躲过了一场今年初降的鹅毛大雪。
13
+
14
+ 在那荒凉的山顶上,土地由于结了一层黑冰而冻得坚硬,冷空气使我四肢发抖。我弄不开门链,就跳进去,顺着两边种着蔓延的醋栗树丛的石路跑去。我白白地敲了半天门,一直敲到我的手指骨都痛了,狗也狂吠起来。
15
+
16
+ “倒霉的人家!”我心里直叫,“只为你这样无礼待客,就该一辈子跟人群隔离。我至少还不会在白天把门闩住。我才不管呢——我要进去!”如此决定了。我就抓住门闩,使劲摇它。苦脸的约瑟夫从谷仓的一个圆窗里探出头来。
17
+
18
+ “你干吗?”他大叫。“主人在牛栏里,你要是找他说话,就从这条路口绕过去。”
19
+
20
+ “屋里没人开门吗?”我也叫起来。
21
+
22
+ “除了太太没有别人。你就是闹腾到夜里,她也不会开。”
23
+
24
+ “为什么?你就不能告诉她我是谁吗,呃,约瑟夫?”
25
+
26
+ “别找我!我才不管这些闲事呢,”这个脑袋咕噜着,又不见了。
27
+
28
+ 雪开始下大了。我握住门柄又试一回。这时一个没穿外衣的年轻人,扛着一根草耙,在后面院子里出现了。他招呼我跟着他走,穿过了一个洗衣房和一片铺平的地,那儿有煤棚、抽水机和鸽笼,我们终于到了我上次被接待过的那间温暖的、热闹的大屋子。煤、炭和木材混合在一起燃起的熊熊炉火,使这屋子放着光彩。在准备摆上丰盛晚餐的桌旁,我很高兴地看到了那位“太太”,以前我从未料想到会有这么一个人存在的。我鞠躬等候,以为她会叫我坐下。她望望我,往她的椅背一靠,不动,也不出声。
29
+
30
+ “天气真坏!”我说,“希刺克厉夫太太,恐怕大门因为您的仆人偷懒而大吃苦头,我费了好大劲才使他们听见我敲门!”
31
+
32
+ 她死不开口。我瞪眼——她也瞪眼。反正她总是以一种冷冷的、漠不关心的神气盯住我,使人十分窘,而且不愉快。
33
+
34
+ “坐下吧,”那年轻人粗声粗气地说,“他就要来了。”
35
+
36
+ 我服从了;轻轻咳了一下,叫唤那恶狗朱诺。临到第二次会面,它总算赏脸,摇起尾巴尖,表示认我是熟人了。
37
+
38
+ “好漂亮的狗!”我又开始说话。“您是不是打算不要这些小的呢,夫人?”
39
+
40
+ “那些不是我的,”这可爱可亲的女主人说,比希刺克厉夫本人所能回答的腔调还要更冷淡些。
41
+
42
+ “啊,您所心爱的是在这一堆里啦!”我转身指着一个看不清楚的靠垫上那一堆像猫似的东西,接着说下去。
43
+
44
+ “谁会爱这些东西那才怪呢!”她轻蔑地说。
45
+
46
+ 倒霉,原来那是堆死兔子。我又轻咳一声,向火炉凑近些,又把今晚天气不好的话评论一通。
47
+
48
+ “你本来就不该出来。”她说,站起来去拿壁炉台上的两个彩色茶叶罐。
49
+
50
+ 她原先坐在光线被遮住的地方,现在我把她的全身和面貌都看得清清楚楚。她苗条,显然还没有过青春期。挺好看的体态,还有一张我生平从未有幸见过的绝妙的小脸蛋。五官纤丽,非常漂亮。淡黄色的卷发,或者不如说是金黄色的,松松地垂在她那细嫩的颈上。至于眼睛,要是眼神能显得和悦些,就要使人无法抗拒了。对我这容易动情的心说来倒是常事,因为它们所表现的只是在轻蔑与近似绝望之间的一种情绪,而在那张脸上看见那样的眼神是特别不自然的。
51
+
52
+ 她简直够不到茶叶罐。我动了一动,想帮她一下。她猛地扭转身向我,像守财奴看见别人打算帮他数他的金子一样。
53
+
54
+ “我不要你帮忙,”她怒气冲冲地说,“我自己拿得到。”
55
+
56
+ “对不起!”我连忙回答。
57
+
58
+ “是请你来吃茶的吗?”她问,把一条围裙系在她那干净的黑衣服上,就这样站着,拿一匙茶叶正要往茶壶里倒。
59
+
60
+ “我很想喝杯茶。”我回答。
61
+
62
+ “是请你来的吗?”她又问。
63
+
64
+ “没有,”我说,勉强笑一笑。“您正好请我喝茶。”
65
+
66
+  
67
+
68
+
69
+ 目录
70
+ 上一章
71
+ 下一章
72
+
73
+
74
+ 返回首页
packages.txt ADDED
@@ -0,0 +1,3 @@
1
+ icu-doc
2
+ libicu-dev
3
+ pkg-config
requirements.txt CHANGED
@@ -1,11 +1,22 @@
1
  gradio
 
2
  transformers
3
  sentencepiece
4
  sklearn
5
- logzero
6
  git+https://github.com/ffreemt/fast-langid
7
  git+https://github.com/ffreemt/align-model-pool
8
  sentence-transformers
9
  sentence_splitter
 
10
  icecream
11
  alive-progress
1
  gradio
2
+ install
3
  transformers
4
  sentencepiece
5
  sklearn
 
6
  git+https://github.com/ffreemt/fast-langid
7
  git+https://github.com/ffreemt/align-model-pool
8
  sentence-transformers
9
  sentence_splitter
10
+ logzero
11
  icecream
12
  alive-progress
13
+ more_itertools
14
+ #
15
+ openpyxl
16
+ # --- seg_text
17
+ Morfessor
18
+ pyicu
19
+ pycld2
20
+ tqdm
21
+ polyglot
22
+ sentence_splitter
ubee/__main__.py CHANGED
@@ -1,9 +1,14 @@
1
  """Gen ubee main."""
2
  # pylint: disable=unused-import, wrong-import-position
3
 
 
 
 
4
  import sys
5
- from itertools import zip_longest
6
- from textwrap import dedent
 
 
7
 
8
  import gradio as gr
9
 
@@ -18,6 +23,7 @@ if "." not in sys.path:
18
 
19
  from ubee.ubee import ubee
20
 
 
21
  ic_install()
22
  ic.configureOutput(
23
  includeContext=True,
@@ -32,7 +38,11 @@ def greet1(name):
32
  return "Hello " + name + "!!"
33
 
34
 
35
- def greet(text1, text2) -> pd.DataFrame:
 
 
 
 
36
  """Take inputs, return outputs.
37
 
38
  Args:
@@ -44,12 +54,25 @@ def greet(text1, text2) -> pd.DataFrame:
44
  res1 = [elm.strip() for elm in text1.splitlines() if elm.strip()]
45
  res2 = [elm.strip() for elm in text2.splitlines() if elm.strip()]
46
 
47
- _ = pd.DataFrame(zip_longest(res1, res2), columns=["text1", "text2"])
48
- return _
 
 
 
 
49
 
50
 
51
  def main():
52
  """Create main entry."""
 
 
 
 
 
 
 
 
 
53
  title = "Ultimatumbee Aligner"
54
  theme = "dark-grass"
55
  description = """WIP showcasing a novel aligner"""
@@ -62,20 +85,8 @@ def main():
62
 
63
  lines = 15
64
  placeholder = "Type or paste text here"
65
- default1 = dedent(
66
- """
67
- test 1
68
- abc
69
- love you
70
- """
71
- )
72
- default2 = dedent(
73
- """
74
- 爱你
75
- 甲乙丙
76
- 测试 1
77
- """
78
- )
79
  label1 = "text1"
80
  label2 = "text2"
81
  inputs = [
@@ -85,6 +96,12 @@ def main():
85
  gr.inputs.Textbox(
86
  lines=lines, placeholder=placeholder, default=default2, label=label2
87
  ),
 
 
 
 
 
 
88
  ]
89
 
90
  out_df = gr.outputs.Dataframe(
@@ -95,8 +112,26 @@ def main():
95
  type="auto",
96
  label="To be aligned",
97
  )
98
- outputs = [ # tot. 1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
99
  out_df,
 
 
100
  ]
101
 
102
  iface = gr.Interface(
 
1
  """Gen ubee main."""
2
  # pylint: disable=unused-import, wrong-import-position
3
 
4
+ from typing import Tuple
5
+
6
+ from pathlib import Path
7
  import sys
8
+ from random import shuffle
9
+
10
+ # from itertools import zip_longest
11
+ # from textwrap import dedent
12
 
13
  import gradio as gr
14
 
 
23
 
24
  from ubee.ubee import ubee
25
 
26
+ logzero.loglevel(10)
27
  ic_install()
28
  ic.configureOutput(
29
  includeContext=True,
 
38
  return "Hello " + name + "!!"
39
 
40
 
41
+ def greet(
42
+ text1,
43
+ text2,
44
+ thresh: float
45
+ ) -> Tuple[pd.DataFrame, pd.DataFrame]:
46
  """Take inputs, return outputs.
47
 
48
  Args:
 
54
  res1 = [elm.strip() for elm in text1.splitlines() if elm.strip()]
55
  res2 = [elm.strip() for elm in text2.splitlines() if elm.strip()]
56
 
57
+ # _ = pd.DataFrame(zip_longest(res1, res2), columns=["text1", "text2"])
58
+ # return _
59
+
60
+ res1_, res2_ = ubee(res1, res2, thresh)
61
+
62
+ return pd.DataFrame(res1_, columns=["text1", "text2", "likelihood"]), pd.DataFrame(res2_, columns=["text1", "text2"])
63
 
64
 
65
  def main():
66
  """Create main entry."""
67
+ text_zh = Path("data/test_zh.txt").read_text("utf8")
68
+ text_en = [
69
+ elm.strip()
70
+ for elm in Path("data/test_en.txt").read_text("utf8").splitlines()
71
+ if elm.strip()
72
+ ]
73
+ shuffle(text_en)
74
+ text_en = "\n\n".join(text_en)
75
+
76
  title = "Ultimatumbee Aligner"
77
  theme = "dark-grass"
78
  description = """WIP showcasing a novel aligner"""
 
85
 
86
  lines = 15
87
  placeholder = "Type or paste text here"
88
+ default1 = text_zh
89
+ default2 = text_en
 
 
 
 
 
 
 
 
 
 
 
 
90
  label1 = "text1"
91
  label2 = "text2"
92
  inputs = [
 
96
  gr.inputs.Textbox(
97
  lines=lines, placeholder=placeholder, default=default2, label=label2
98
  ),
99
+ gr.inputs.Slider(
100
+ minimum=0.0,
101
+ maximum=1.0,
102
+ step=0.1,
103
+ default=0.5,
104
+ ),
105
  ]
106
 
107
  out_df = gr.outputs.Dataframe(
 
112
  type="auto",
113
  label="To be aligned",
114
  )
115
+ aligned = gr.outputs.Dataframe(
116
+ headers=None,
117
+ max_rows=lines, # 20
118
+ max_cols=None,
119
+ overflow_row_behaviour="paginate",
120
+ type="auto",
121
+ label="Aligned",
122
+ )
123
+ leftover = gr.outputs.Dataframe(
124
+ headers=None,
125
+ max_rows=lines, # 20
126
+ max_cols=None,
127
+ overflow_row_behaviour="paginate",
128
+ type="auto",
129
+ label="Leftover",
130
+ )
131
+ outputs = [ # tot. 3
132
  out_df,
133
+ aligned,
134
+ leftover,
135
  ]
136
 
137
  iface = gr.Interface(
ubee/gradiobee.py ADDED
@@ -0,0 +1,7 @@
1
+ """Function for gr's fn."""
2
+ # pylint=disable=
3
+
4
+
5
+ def ubee():
6
+ """Gen a dummy."""
7
+ ...
ubee/seg_text.py ADDED
@@ -0,0 +1,121 @@
1
+ """Split text to sentences.
2
+
3
+ Use sentence_splitter if supported,
4
+ else use polyglot.text.Text
5
+
6
+ !apt install libicu-dev
7
+ !pip install pyicu pycld2 Morfessor
8
+ !pip install polyglot sentence_splitter
9
+ """
10
+ # pylint: disable=
11
+
12
+ from typing import List, Optional, Union
13
+
14
+ import re
15
+ from tqdm.auto import tqdm
16
+ from polyglot.detect.base import logger as polyglot_logger
17
+ from polyglot.text import Detector, Text
18
+ from sentence_splitter import split_text_into_sentences
19
+
20
+ from logzero import logger
21
+
22
+ # turn off polyglot.text.Detector warning
23
+ polyglot_logger.setLevel("ERROR")
24
+
25
+
26
+ # fmt: off
27
+ # use sentence_splitter if supported
28
+ LANG_S = ["ca", "cs", "da", "nl", "en", "fi", "fr", "de",
29
+ "el", "hu", "is", "it", "lv", "lt", "no", "pl",
30
+ "pt", "ro", "ru", "sk", "sl", "es", "sv", "tr"]
31
+
32
+
33
+ def _seg_text(
34
+ text: str,
35
+ lang: Optional[str] = None,
36
+ # qmode: bool = False,
37
+ maxlines: int = 1000
38
+ ) -> List[str]:
39
+ # fmt: on
40
+ """Split text to sentences.
41
+
42
+ Use sentence_splitter if supported,
43
+ else use polyglot.text.Text.sentences
44
+ Blank lines will be removed.
45
+
46
+ qmode: quick mode, skip split_text_into_sentences if True, default False
47
+ vectors for all books are based on qmode=False.
48
+ qmode=True is for quick test purpose only
49
+
50
+ maxlines (default 1000), threshold for turning on the tqdm progress bar
51
+ set to <1 or a large number to turn it off
52
+ """
53
+ if lang is None:
54
+ try:
55
+ lang = Detector(text).language.code
56
+ except Exception as exc:
57
+ logger.info("text[:30]: %s", text[:30])
58
+ logger.warning(
59
+ "polyglot.text.Detector exc: %s, setting to 'en'",
60
+ exc
61
+ )
62
+ lang = "en"
63
+
64
+ # if not qmode and lang in LANG_S:
65
+ if lang in LANG_S:
66
+ _ = []
67
+ lines = text.splitlines()
68
+ # if maxlines > 1 and len(lines) > maxlines:
69
+ if len(lines) > maxlines > 1:
70
+ for para in tqdm(lines):
71
+ if para.strip():
72
+ _.extend(split_text_into_sentences(para, lang))
73
+ else:
74
+ for para in lines:
75
+ if para.strip():
76
+ _.extend(split_text_into_sentences(para, lang))
77
+ return _
78
+
79
+ # return split_text_into_sentences(text, lang)
80
+
81
+ # empty "" text or blank to avoid Exception
82
+ if not text.strip():
83
+ return []
84
+
85
+ return [elm.string for elm in Text(text, lang).sentences]
86
+
87
+
88
+ # fmt: off
89
+ def seg_text(
90
+ lst: Union[str, List[str]],
91
+ lang: Optional[str] = None,
92
+ maxlines: int = 1000,
93
+ extra: Optional[str] = None,
94
+ ) -> List[str]:
95
+ # fmt:on
96
+ """Split a list of text.
97
+
98
+ Arguments:
99
+ lst: text or text list
100
+ lang: optional lang code
101
+ maxlines: (default 1000), threshold for turning on the tqdm progress bar; set to <1 or a large number to turn it off
102
+ extra: optional regex; a newline is inserted after each match, i.e. re.sub(rf"({extra})", r"\1\n", text), before splitting
103
+ Returns:
104
+ list of split sentences.
105
+ """
106
+ if isinstance(lst, str):
107
+ lst = [lst]
108
+
109
+ if extra:
110
+ # insert \n
111
+ lst = [re.sub(rf"({extra})", r"\1\n", elm) for elm in lst]
112
+
113
+ res = []
114
+ for elm in lst:
115
+ res.extend(_seg_text(
116
+ elm,
117
+ lang=lang,
118
+ maxlines=maxlines,
119
+ ))
120
+
121
+ return res
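The `extra` argument of `seg_text` can be illustrated in isolation. The sketch below is a minimal stand-in, not the committed code: `presplit` is a hypothetical helper name, and it only reproduces the `re.sub` step that inserts a newline after every match before any sentence splitter runs.

```python
import re
from typing import List


def presplit(text: str, extra: str) -> List[str]:
    """Insert a newline after each match of `extra`, then split on lines.

    `presplit` is a hypothetical name used for illustration only; it mirrors
    the re.sub call in seg_text and drops blank lines afterwards.
    """
    # Keep the matched delimiter (group 1) and append a line break after it.
    marked = re.sub(rf"({extra})", r"\1\n", text)
    return [line.strip() for line in marked.splitlines() if line.strip()]


chunks = presplit("one; two; three", ";")
print(chunks)
```

Because the delimiter is kept at the end of each chunk, a later sentence splitter still sees the original punctuation.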
ubee/ubee.py CHANGED
@@ -1,7 +1,41 @@
1
- """Function for gr's fn."""
2
- # pylint=disable=
 
 
3
 
 
 
4
 
5
- def ubee():
6
- """Gen a dummy."""
7
- ...
1
+ """Align via ubee."""
2
+ # pylint: disable=
3
+ from typing import Iterable, List, Tuple
4
+ from itertools import zip_longest
5
 
6
+ from logzero import logger
7
+ from ubee.uclas import uclas
8
 
9
+
10
+ def ubee(
11
+ sents_zh: Iterable,
12
+ sents_en: Iterable,
13
+ thresh: float = 0.5,
14
+ ) -> Tuple[List[Tuple[str, str, float]], List[Tuple[str, str]]]:
15
+ """Align blocks.
16
+
17
+ Args:
18
+ sents_zh: list of text, can be any language supported by clas-l-user
19
+ sents_en: ditto
20
+ Returns:
21
+ list of (text1, text2, likelihood) tuples for aligned blocks
22
+ list of (text1, text2) tuples for leftovers (unaligned)
23
+ """
24
+ res = []
25
+ labels = [*sents_en]
26
+
27
+ lo1 = []
28
+ lo2 = labels[:]
29
+
30
+ for seq in sents_zh:
31
+ label, likelihood = uclas(seq, labels, thresh=thresh)
32
+ if label:
33
+ res.append((seq, label, likelihood))
34
+ try:
35
+ lo2.remove(label)
36
+ except Exception as exc:
37
+ logger.error(exc)
38
+ logger.info("seq: %s, label: %s", seq, label)
39
+ else:
40
+ lo1.append(seq)
41
+ return res, [*zip_longest(lo1, lo2)]
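The alignment loop in `ubee` can be tried without loading any model. The following is a rough sketch under stated assumptions: `align` and `toy_classify` are hypothetical names, and `toy_classify` merely stands in for `uclas` by matching on a shared leading character (it returns `0.0` rather than `""` when nothing matches).

```python
from itertools import zip_longest
from typing import Callable, List, Sequence, Tuple


def align(
    sents_a: Sequence[str],
    sents_b: Sequence[str],
    classify: Callable[[str, List[str], float], Tuple[str, float]],
    thresh: float = 0.5,
):
    """Greedily pair each item of sents_a with the label picked by classify."""
    aligned = []
    leftover_a = []
    leftover_b = list(sents_b)
    for seq in sents_a:
        label, score = classify(seq, list(sents_b), thresh)
        if label:
            aligned.append((seq, label, score))
            # A label may be picked more than once; remove it only if present.
            if label in leftover_b:
                leftover_b.remove(label)
        else:
            leftover_a.append(seq)
    # Unmatched items from both sides are paired up positionally.
    return aligned, list(zip_longest(leftover_a, leftover_b))


def toy_classify(seq: str, labels: List[str], thresh: float) -> Tuple[str, float]:
    """Hypothetical stand-in for uclas: match on the first character."""
    for label in labels:
        if label[:1] == seq[:1]:
            return label, 1.0
    return "", 0.0


aligned, leftover = align(["1 甲", "2 乙", "9 丙"], ["2 b", "1 a", "3 c"], toy_classify)
print(aligned)
print(leftover)
```

Checking membership before `remove` avoids the exception path that the committed code logs when the same label is selected twice.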
ubee/uclas.py ADDED
@@ -0,0 +1,81 @@
1
+ """Define uclas."""
2
+ # pylint: disable=invalid-name
3
+
4
+ from typing import List, Tuple, Union
5
+ import numpy as np
6
+ from sklearn.metrics.pairwise import cosine_similarity
7
+ from joblib import Memory
8
+
9
+ from model_pool import fetch_check_aux # pylint: disable=import-error
10
+ from model_pool.model_s import load_model_s # pylint: disable=import-error
11
+ from model_pool.load_model import load_model # pylint: disable=import-error
12
+
13
+ import logzero
14
+ from logzero import logger
15
+
16
+ logzero.loglevel(10)
17
+
18
+ fetch_check_aux("/home/user")
19
+ model_s = load_model_s()
20
+ clas = load_model("clas-l-user")
21
+
22
+ location = "./cachedir"
23
+ memory = Memory(location, verbose=0)
24
+
25
+
26
+ @memory.cache
27
+ def cached_clas(*args, **kw):
28
+ """Cache clas-l-user."""
29
+ return clas(*args, **kw)
30
+
31
+
32
+ # cached_clas = memory.cache(cached_clas)
33
+
34
+
35
+ @memory.cache
36
+ def encode(*args, **kw):
37
+ """Cache model_s.encode."""
38
+ return model_s.encode(*args, **kw)
39
+
40
+
41
+ def uclas(
42
+ seq: str,
43
+ labels: Union[List[str], np.ndarray, Tuple[str, ...]],
44
+ thresh: float = 0.5,
45
+ multi_label: bool = False,
46
+ ) -> Tuple[str, Union[float, str]]:
47
+ """Classify seq with a filter.
48
+
49
+ if clas > thresh, return
50
+ if clas * csim > thresh return
51
+ if csim > thresh / 2 return
52
+ return "", ""
53
+ """
54
+ # _ = clas(seq, labels, multi_label=multi_label)
55
+ _ = cached_clas(seq, labels, multi_label=multi_label)
56
+
57
+ logger.debug("1 %s, %s", _.get("labels")[0], round(_.get("scores")[0], 2))
58
+
59
+ if _.get("scores")[0] > thresh:
60
+ return _.get("labels")[0], round(_.get("scores")[0], 2)
61
+
62
+ _ = dict(zip(_.get("labels"), _.get("scores")))
63
+
64
+ corr = np.array([_.get(elm) for elm in labels])
65
+
66
+ csim = cosine_similarity(encode([seq]), encode(labels))
67
+
68
+ corr = corr * csim
69
+
70
+ logger.debug("2 %s, %s", corr.argmax(), round(corr.max(), 2))
71
+
72
+ if corr.max() > thresh:
73
+ return labels[corr.argmax()], round(corr.max(), 2)
74
+
75
+ logger.debug("3 %s, %s, %s", csim.argmax(), round(csim.max(), 2), thresh / 2)
76
+
77
+ logger.debug("T or F: %s", csim.max() > (thresh / 2))
78
+ if csim.max() > (thresh / 2):
79
+ return labels[csim.argmax()], round(csim.max(), 2)
80
+
81
+ return "", ""
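The three-step filter in `uclas` can be sketched with plain arrays in place of the model outputs. This is an illustration under assumptions, not the committed implementation: `cascade` is a hypothetical name, `clas_scores` and `csim` are assumed to be precomputed and listed in the same order as `labels`, and the extra third return value (the step that fired) is added here only to make the behaviour visible.

```python
import numpy as np


def cascade(clas_scores, csim, labels, thresh=0.5):
    """Pick a label via classifier score, then score * similarity, then similarity.

    Hypothetical sketch: inputs are precomputed per-label scores, and the
    last element of the result tells which step (1, 2, 3, or 0) decided.
    """
    clas_scores = np.asarray(clas_scores, dtype=float)
    csim = np.asarray(csim, dtype=float)

    # Step 1: accept the classifier's top label if it clears the threshold.
    best = int(clas_scores.argmax())
    if clas_scores[best] > thresh:
        return labels[best], round(float(clas_scores[best]), 2), 1

    # Step 2: weight classifier scores by cosine similarity.
    corr = clas_scores * csim
    best = int(corr.argmax())
    if corr[best] > thresh:
        return labels[best], round(float(corr[best]), 2), 2

    # Step 3: fall back to similarity alone, with half the threshold.
    best = int(csim.argmax())
    if csim[best] > thresh / 2:
        return labels[best], round(float(csim[best]), 2), 3

    return "", "", 0


print(cascade([0.9, 0.1], [0.2, 0.3], ["a", "b"]))
print(cascade([0.3, 0.2], [0.1, 0.4], ["a", "b"]))
print(cascade([0.1, 0.1], [0.1, 0.2], ["a", "b"]))
```

If the classifier scores lie in [0, 1] and the similarities do not exceed 1, the product in step 2 cannot clear a threshold that step 1 already failed to clear, so in this sketch the second branch only matters when the inputs fall outside those ranges.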