---
language:
- zh
tags:
- search
---
# Cross Language Search
## Search Classical Chinese with Modern Chinese
* In some cases, Classical Chinese feels like another language; I even trained 2 translation models to prove this point.
* That's why, when people want to be savvy with their words, we choose to quote our ancestors. It's just like how Westerners like to quote Latin or Shakespeare; the difference is that we have a much bigger pool to choose from.
* This model helps you **find** text within **ancient Chinese** literature, while letting you **search with modern Chinese**.
# 跨语种搜索 (Cross-Language Search)
## 博古搜今 (Search the Ancient with the Modern)
```python
from unpackai.interp import CosineSearch
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
TAG = "raynardj/xlsearch-cross-lang-search-zh-vs-classicical-cn"
encoder = SentenceTransformer(TAG)
# all_lines is a list of all your sentences.
# It can be a single book split into sentences, or many, many books.
all_lines = ["句子1","句子2",...]
vec = encoder.encode(all_lines, batch_size=32, show_progress_bar=True)
# cosine distance searcher
cosine = CosineSearch(vec)
def search(text):
    enc = encoder.encode(text)  # encode the search key
    order = cosine(enc)  # array used below to pick the top 5 rows of all_lines
    sentence_df = pd.DataFrame({"sentence": np.array(all_lines)[order[:5]]})
    return sentence_df
```
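To make the ranking step concrete, here is a minimal NumPy sketch of a cosine-similarity ranking. It is an illustration only, not the `unpackai` implementation: `cosine_rank` and the toy vectors are hypothetical names and data, and it assumes `CosineSearch` does something comparable (returning indices ordered by similarity), which the snippet above suggests but does not confirm.

```python
import numpy as np


def cosine_rank(query_vec, corpus_vecs):
    """Return corpus row indices ordered from most to least similar to the query.

    Hypothetical helper for illustration; not part of unpackai.
    """
    corpus = np.asarray(corpus_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Normalize so that a plain dot product equals cosine similarity.
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    similarity = corpus_unit @ query_unit
    # argsort is ascending, so negate to put the best match first.
    return np.argsort(-similarity)


# Toy 2-D vectors standing in for sentence embeddings.
toy_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_order = cosine_rank(np.array([1.0, 0.1]), toy_vecs)
print(toy_order)  # [0 2 1]
```

The sketch does not guard against all-zero vectors, which would cause a division by zero during normalization.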
After splitting the 史记 (Records of the Grand Historian) into sentences, the search results look like this:
```python
>>> search("他是一个很慷慨的人")  # "He is a very generous person"
```
```
sentence
0 季布者,楚人也。为气任侠,有名於楚。
1 董仲舒为人廉直。
2 大将军为人仁善退让,以和柔自媚於上,然天下未有称也。
3 勃为人木彊敦厚,高帝以为可属大事。
4 石奢者,楚昭王相也。坚直廉正,无所阿避。
```
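The example above relies on the book already being split into sentences. A minimal sketch of one way to build `all_lines` is shown here, assuming a split on the sentence-final marks 。！？. The helper `split_sentences` is hypothetical and the sample string is assembled from the output rows above purely for illustration. The splitting rule actually used for the results above is not stated, and some rows contain more than one 。, so it was evidently coarser than this.

```python
import re


def split_sentences(text):
    """Split Chinese text into sentences, keeping the closing punctuation.

    Hypothetical helper; the splitting rule is an assumption.
    """
    # A run of non-terminator characters followed by any trailing terminators.
    pieces = re.findall(r"[^。！？]+[。！？]*", text)
    return [p.strip() for p in pieces if p.strip()]


# Sample text assembled from the output rows above, for illustration only.
sample = "董仲舒为人廉直。石奢者,楚昭王相也。坚直廉正,无所阿避。"
sample_lines = split_sentences(sample)
print(sample_lines)
```

The resulting list can be passed to `encoder.encode` in place of `all_lines`.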