ytzfhqs committed 9cb6f58 (parent: 2fa6a0e): Update README.md
Files changed (1): README.md (+55 -1)

@@ -11,7 +11,9 @@ language:
   - zh
 pipeline_tag: text-classification
 ---
-# fasttext-med-en-zh-identification
+# fasttext-med-en-zh-identification [[Chinese]](#chinese) [[English]](#english)
+
+<a id="english"></a>
 
 This model is an intermediate result of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project. It is primarily designed to accurately distinguish between Chinese and English samples in medical pretraining datasets. The model framework uses [fastText](https://github.com/facebookresearch/fastText).
 
@@ -60,6 +62,58 @@ The above datasets are high-quality, open-source datasets, which can save a lot
 import fasttext
 from huggingface_hub import hf_hub_download
 
+def to_low(text):
+    return text.strip().lower()
+
+model_path = hf_hub_download(repo_id="ytzfhqs/fasttext-med-en-zh-identification", filename="model.bin")
+model = fasttext.load_model(model_path)
+model.predict(to_low('Hello, world!'))
+```
+
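+Note: `model.predict` returns a tuple of predicted labels and their probabilities; pass `k=2` to see the scores for both languages. The exact label strings depend on how the training data was tagged and are not documented in this README.
+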
+# fasttext-med-en-zh-identification [[Chinese]](#chinese) [[English]](#english)
+
+<a id="chinese"></a>
+
+This model is an intermediate result of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project, designed mainly to distinguish between Chinese and English samples in medical pretraining corpora. The model framework uses [fastText](https://github.com/facebookresearch/fastText).
+
+# Data Composition
+
+## Chinese general pretraining dataset
+- [Skywork/SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B)
+## Chinese medical pretraining dataset
+- [ticoAg/shibing624-medical-pretrain](https://huggingface.co/datasets/ticoAg/shibing624-medical-pretrain)
+
+## English general pretraining dataset
+- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)
+## English medical pretraining datasets
+- [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc)
+- [nlp-guild/medical-data](https://huggingface.co/datasets/nlp-guild/medical-data)
+
+The above datasets are all high-quality open-source datasets, which save a great deal of data-cleaning work. Many thanks to their developers for supporting the open-source data community!
+
+# Data Cleaning Pipeline
+- Initial dataset preparation:
+  - For the Chinese training data, split the pretraining corpus on `\n` and strip any leading and trailing whitespace.
+  - For the English training data, split the pretraining corpus on `\n`, lowercase all letters, and strip any leading and trailing whitespace.
+- Count the words in each sample, specifically:
+  - For Chinese, tokenize with the [jieba](https://github.com/fxsjy/jieba) package, then filter out stopwords and non-Chinese characters with [jionlp](https://github.com/dongrixinyu/JioNLP).
+  - For English, tokenize with the [nltk](https://github.com/nltk/nltk) package and filter out its built-in stopwords.
+- Filter samples by word count, specifically (empirical values):
+  - For Chinese: keep only samples with more than 5 words.
+  - For English: keep only samples with more than 5 words.
+- Split the data into a training set (ratio 0.9) and a test set (ratio 0.1); see the sketch after this list.
+
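+The following is a minimal sketch of these cleaning steps, not the project's actual code: the helper names are made up for illustration, and it assumes jieba's `lcut`, jionlp's `remove_stopwords`, and nltk's `word_tokenize` plus its English stopword list.
+
+```python
+import jieba
+import jionlp as jio
+from nltk.corpus import stopwords
+from nltk.tokenize import word_tokenize
+
+EN_STOPS = set(stopwords.words('english'))  # needs nltk's 'stopwords' and 'punkt' data
+
+def count_zh_words(text: str) -> int:
+    # Tokenize with jieba, drop stopwords with jionlp, then keep only
+    # tokens that contain at least one Chinese character.
+    words = jio.remove_stopwords(jieba.lcut(text))
+    words = [w for w in words if any('\u4e00' <= ch <= '\u9fff' for ch in w)]
+    return len(words)
+
+def count_en_words(text: str) -> int:
+    # Tokenize with nltk and drop its built-in English stopwords.
+    words = word_tokenize(text)
+    return len([w for w in words if w.isalpha() and w not in EN_STOPS])
+
+def clean_corpus(raw_text: str, is_zh: bool, min_words: int = 5) -> list[str]:
+    # Split on '\n', strip whitespace (and lowercase English), then keep
+    # only samples with more than min_words words.
+    samples = [s.strip() for s in raw_text.split('\n') if s.strip()]
+    if not is_zh:
+        samples = [s.lower() for s in samples]
+    count = count_zh_words if is_zh else count_en_words
+    return [s for s in samples if count(s) > min_words]
+```
+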
+# Model Performance
+
+| Dataset | Accuracy |
+|---------|----------|
+| Train   | 0.9994   |
+| Test    | 0.9998   |
+
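+For reference, a classifier with this interface is typically trained with fastText's supervised mode. A minimal sketch (assumed, not from this repository; the `__label__` prefix is fastText's standard supervised input format, but the actual label names and hyperparameters for this model are not documented):
+
+```python
+import fasttext
+
+# Each line of train.txt/test.txt holds one sample prefixed with its label,
+# e.g. '__label__en hello world' or '__label__zh 你好 世界'.
+model = fasttext.train_supervised(input='train.txt')
+print(model.test('test.txt'))  # (sample count, precision@1, recall@1)
+model.save_model('model.bin')
+```
+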
+## Usage Example
+```python
+import fasttext
+from huggingface_hub import hf_hub_download
+
 def to_low(text):
     return text.strip().lower()