---
language:
- zh
pipeline_tag: text-classification
---

# fasttext-med-en-zh-identification [[中文]](#chinese) [[English]](#english)

<a id="english"></a>

This model is an intermediate result of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project. It is primarily designed to accurately distinguish between Chinese and English samples in medical pretraining datasets. The model framework uses [fastText](https://github.com/facebookresearch/fastText).

## Usage Example

```python
import fasttext
from huggingface_hub import hf_hub_download

# English training samples were lowercased and stripped, so inputs
# should be normalized the same way before prediction.
def to_low(text):
    return text.strip().lower()

# Fetch the trained classifier from the Hugging Face Hub, then load it
# from the downloaded path (not from a hard-coded local file name).
model_path = hf_hub_download(repo_id="ytzfhqs/fasttext-med-en-zh-identification", filename="model.bin")
model = fasttext.load_model(model_path)
model.predict(to_low('Hello, world!'))
```
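A note on the return format: in the fastText Python API, `predict` returns a tuple of top labels and a NumPy array of probabilities. The label strings are not stated in this card, so treat the ones in the comment below as placeholders and check them with `get_labels()`:

```python
# predict() returns (labels, probabilities); label strings such as
# '__label__en' are placeholders here -- list the model's actual
# labels with model.get_labels().
labels, probs = model.predict(to_low('Hello, world!'))
print(labels[0], float(probs[0]))
```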

# fasttext-med-en-zh-identification [[中文]](#chinese) [[English]](#english)

<a id="chinese"></a>

This model is an intermediate result of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project, designed primarily to distinguish between Chinese and English samples in medical pretraining corpora. The model framework uses [fastText](https://github.com/facebookresearch/fastText).

# Data Composition

## Chinese General Pretraining Dataset
- [Skywork/SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B)

## Chinese Medical Pretraining Dataset
- [ticoAg/shibing624-medical-pretrain](https://huggingface.co/datasets/ticoAg/shibing624-medical-pretrain)

## English General Pretraining Dataset
- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)

## English Medical Pretraining Datasets
- [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc)
- [nlp-guild/medical-data](https://huggingface.co/datasets/nlp-guild/medical-data)

The above datasets are all high-quality, open-source datasets, which save a great deal of data-cleaning work. Thanks to their developers for supporting the open-source data community!

# Data Cleaning Pipeline
- Initial dataset preparation:
  - For the Chinese training data, split the pretraining corpus on `\n` and strip any leading and trailing whitespace.
  - For the English training data, split the pretraining corpus on `\n`, lowercase all letters, and strip any leading and trailing whitespace.
- Count words. Specifically:
  - For Chinese, tokenize with the [jieba](https://github.com/fxsjy/jieba) package, then use [jionlp](https://github.com/dongrixinyu/JioNLP) to further filter out stop words and non-Chinese characters.
  - For English, tokenize with the [nltk](https://github.com/nltk/nltk) package and filter with its built-in stop word list.
- Filter samples by word count (empirical thresholds):
  - For Chinese: keep only samples with more than 5 words.
  - For English: keep only samples with more than 5 words.
- Split the data into a training set (ratio 0.9) and a test set (ratio 0.1), as sketched in the code below.
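A minimal Python sketch of the steps above. The corpus file names, the `prepare` helper, and the use of a plain regex in place of jionlp's stop-word and non-Chinese-character filtering are illustrative assumptions, not the project's actual code:

```python
import random
import re

import jieba
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time nltk setup: nltk.download('punkt'); nltk.download('stopwords')
EN_STOPWORDS = set(stopwords.words('english'))
MIN_WORDS = 5  # empirical threshold: keep samples with more than 5 words

def zh_word_count(line):
    # Tokenize with jieba; a regex keeps Chinese-only tokens here, standing
    # in for the jionlp stop-word / non-Chinese-character filtering.
    return sum(1 for w in jieba.lcut(line) if re.fullmatch(r'[\u4e00-\u9fff]+', w))

def en_word_count(line):
    return sum(1 for w in word_tokenize(line) if w.isalpha() and w not in EN_STOPWORDS)

def prepare(corpus, lang):
    samples = []
    for line in corpus.split('\n'):
        # English text is lowercased; both languages are stripped.
        line = line.strip().lower() if lang == 'en' else line.strip()
        count = en_word_count(line) if lang == 'en' else zh_word_count(line)
        if count > MIN_WORDS:
            # fastText supervised format: one '__label__<class> <text>' per line
            samples.append(f'__label__{lang} {line}')
    return samples

# 'corpus_en.txt' and 'corpus_zh.txt' are hypothetical input files.
samples = prepare(open('corpus_en.txt', encoding='utf-8').read(), 'en')
samples += prepare(open('corpus_zh.txt', encoding='utf-8').read(), 'zh')

# 0.9 / 0.1 train/test split
random.shuffle(samples)
cut = int(0.9 * len(samples))
train, test = samples[:cut], samples[cut:]
```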

# Model Performance

| Dataset | Accuracy |
|---------|----------|
| Train   | 0.9994   |
| Test    | 0.9998   |
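These figures come from the project's own evaluation. As a sketch of how such numbers can be measured with fastText (the file names are hypothetical, and this card does not state the training hyperparameters):

```python
import fasttext

# 'train.txt' / 'test.txt': one '__label__<class> <text>' line per sample.
model = fasttext.train_supervised(input='train.txt')

# test() returns (sample count, precision@1, recall@1); with exactly one
# label per sample, precision@1 is the same as accuracy.
n, p_at_1, r_at_1 = model.test('test.txt')
print(f'accuracy over {n} samples: {p_at_1:.4f}')
```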
## Usage Example

```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    return text.strip().lower()

model_path = hf_hub_download(repo_id="ytzfhqs/fasttext-med-en-zh-identification", filename="model.bin")
model = fasttext.load_model(model_path)
model.predict(to_low('Hello, world!'))
```