---
language:
- zh
pipeline_tag: text-classification
---

# fasttext-med-en-zh-identification [[中文]](#chinese) [[English]](#english)

<a id="english"></a>

This model is an intermediate result of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project. It is primarily designed to accurately distinguish between Chinese and English samples in medical pretraining datasets. The model framework uses [fastText](https://github.com/facebookresearch/fastText).

## Usage Example

```python
import fasttext
from huggingface_hub import hf_hub_download

# English training samples were lowercased and stripped, so inputs
# should be normalized the same way before prediction.
def to_low(text):
    return text.strip().lower()

# Fetch the trained classifier from the Hugging Face Hub, then load it
# from the downloaded path (not from a hard-coded local file name).
model_path = hf_hub_download(repo_id="ytzfhqs/fasttext-med-en-zh-identification", filename="model.bin")
model = fasttext.load_model(model_path)
model.predict(to_low('Hello, world!'))
```
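A note on the return format: in the fastText Python API, `predict` returns a tuple of top labels and a NumPy array of probabilities. The label strings are not stated in this card, so treat the ones in the comment below as placeholders and check them with `get_labels()`:

```python
# predict() returns (labels, probabilities); label strings such as
# '__label__en' are placeholders here -- list the model's actual
# labels with model.get_labels().
labels, probs = model.predict(to_low('Hello, world!'))
print(labels[0], float(probs[0]))
```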

# fasttext-med-en-zh-identification [[中文]](#chinese) [[English]](#english)

<a id="chinese"></a>

This model is an intermediate result of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project, designed primarily to distinguish between Chinese and English samples in medical pretraining corpora. The model framework uses [fastText](https://github.com/facebookresearch/fastText).

# Data Composition

## Chinese General Pretraining Dataset
- [Skywork/SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B)

## Chinese Medical Pretraining Dataset
- [ticoAg/shibing624-medical-pretrain](https://huggingface.co/datasets/ticoAg/shibing624-medical-pretrain)

## English General Pretraining Dataset
- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)

## English Medical Pretraining Datasets
- [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc)
- [nlp-guild/medical-data](https://huggingface.co/datasets/nlp-guild/medical-data)

The above datasets are all high-quality, open-source datasets, which save a great deal of data-cleaning work. Thanks to their developers for supporting the open-source data community!

# Data Cleaning Pipeline
- Initial dataset preparation:
  - For the Chinese training data, split the pretraining corpus on `\n` and strip any leading and trailing whitespace.
  - For the English training data, split the pretraining corpus on `\n`, lowercase all letters, and strip any leading and trailing whitespace.
- Count words. Specifically:
  - For Chinese, tokenize with the [jieba](https://github.com/fxsjy/jieba) package, then use [jionlp](https://github.com/dongrixinyu/JioNLP) to further filter out stop words and non-Chinese characters.
  - For English, tokenize with the [nltk](https://github.com/nltk/nltk) package and filter with its built-in stop word list.
- Filter samples by word count (empirical thresholds):
  - For Chinese: keep only samples with more than 5 words.
  - For English: keep only samples with more than 5 words.
- Split the data into a training set (ratio 0.9) and a test set (ratio 0.1), as sketched in the code below.
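A minimal Python sketch of the steps above. The corpus file names, the `prepare` helper, and the use of a plain regex in place of jionlp's stop-word and non-Chinese-character filtering are illustrative assumptions, not the project's actual code:

```python
import random
import re

import jieba
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time nltk setup: nltk.download('punkt'); nltk.download('stopwords')
EN_STOPWORDS = set(stopwords.words('english'))
MIN_WORDS = 5  # empirical threshold: keep samples with more than 5 words

def zh_word_count(line):
    # Tokenize with jieba; a regex keeps Chinese-only tokens here, standing
    # in for the jionlp stop-word / non-Chinese-character filtering.
    return sum(1 for w in jieba.lcut(line) if re.fullmatch(r'[\u4e00-\u9fff]+', w))

def en_word_count(line):
    return sum(1 for w in word_tokenize(line) if w.isalpha() and w not in EN_STOPWORDS)

def prepare(corpus, lang):
    samples = []
    for line in corpus.split('\n'):
        # English text is lowercased; both languages are stripped.
        line = line.strip().lower() if lang == 'en' else line.strip()
        count = en_word_count(line) if lang == 'en' else zh_word_count(line)
        if count > MIN_WORDS:
            # fastText supervised format: one '__label__<class> <text>' per line
            samples.append(f'__label__{lang} {line}')
    return samples

# 'corpus_en.txt' and 'corpus_zh.txt' are hypothetical input files.
samples = prepare(open('corpus_en.txt', encoding='utf-8').read(), 'en')
samples += prepare(open('corpus_zh.txt', encoding='utf-8').read(), 'zh')

# 0.9 / 0.1 train/test split
random.shuffle(samples)
cut = int(0.9 * len(samples))
train, test = samples[:cut], samples[cut:]
```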

# Model Performance

| Dataset | Accuracy |
|---------|----------|
| Train   | 0.9994   |
| Test    | 0.9998   |
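These figures come from the project's own evaluation. As a sketch of how such numbers can be measured with fastText (the file names are hypothetical, and this card does not state the training hyperparameters):

```python
import fasttext

# 'train.txt' / 'test.txt': one '__label__<class> <text>' line per sample.
model = fasttext.train_supervised(input='train.txt')

# test() returns (sample count, precision@1, recall@1); with exactly one
# label per sample, precision@1 is the same as accuracy.
n, p_at_1, r_at_1 = model.test('test.txt')
print(f'accuracy over {n} samples: {p_at_1:.4f}')
```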
## Usage Example

```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    return text.strip().lower()

model_path = hf_hub_download(repo_id="ytzfhqs/fasttext-med-en-zh-identification", filename="model.bin")
model = fasttext.load_model(model_path)
model.predict(to_low('Hello, world!'))
```