---
license: apache-2.0
datasets:
- Skywork/SkyPile-150B
- ticoAg/shibing624-medical-pretrain
- togethercomputer/RedPajama-Data-V2
- medalpaca/medical_meadow_wikidoc
- nlp-guild/medical-data
language:
- en
- zh
pipeline_tag: text-classification
---

# fasttext-med-en-zh-identification

This model is an intermediate result of the [EPCD (Easy-Data-Clean-Pipeline)](https://github.com/ytzfhqs/EDCP) project. It is designed to accurately distinguish between Chinese and English samples in medical pretraining datasets. The classifier is built with [fastText](https://github.com/facebookresearch/fastText).

## Data Composition

### General Chinese Pretraining Dataset

- [Skywork/SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B)

### Medical Chinese Pretraining Dataset

- [ticoAg/shibing624-medical-pretrain](https://huggingface.co/datasets/ticoAg/shibing624-medical-pretrain)

### General English Pretraining Dataset

- [togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)

### Medical English Pretraining Datasets

- [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc)
- [nlp-guild/medical-data](https://huggingface.co/datasets/nlp-guild/medical-data)

The datasets above are high-quality, open-source collections, which saved a great deal of data-cleaning effort. Many thanks to their developers for contributing to the open-source data community!

## Data Cleaning Process

- Initial dataset processing:
  - For the Chinese training datasets, the pretraining corpus is split by `\n` and leading/trailing whitespace is removed.
  - For the English training datasets, the pretraining corpus is split by `\n`, all letters are converted to lowercase, and leading/trailing whitespace is removed.
- Word count statistics:
  - For Chinese, the [jieba](https://github.com/fxsjy/jieba) package is used for tokenization, and stopwords and non-Chinese characters are further filtered using [jionlp](https://github.com/dongrixinyu/JioNLP).
  - For English, the [nltk](https://github.com/nltk/nltk) package is used for tokenization, with its built-in stopword list used for filtering.
- Sample filtering based on word count (heuristic thresholds):
  - For Chinese: keep only samples with more than 5 words.
  - For English: keep only samples with more than 5 words.
- Dataset splitting: 90% of the data is used for training and 10% for testing.

## Model Performance

| Dataset | Precision | Recall |
|---------|-----------|--------|
| Train   | 0.9987    | 0.9987 |
| Test    | 0.9962    | 0.9962 |

## Usage Example

```python
import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    return text.strip().lower()

# Download the model weights from the Hugging Face Hub and load them.
model_path = hf_hub_download(
    repo_id="ytzfhqs/fasttext-med-en-zh-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)
model.predict(to_low('Hello, world!'))
```
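`model.predict` returns a tuple of predicted labels and their probabilities. Below is a minimal sketch of unpacking that output, reusing `model` and `to_low` from the example above; it assumes fastText's standard `__label__` prefix, and the concrete class names (e.g. `en` / `zh`) are an assumption rather than something stated by this card, so check them against your own outputs:

```python
def detect_language(model, text):
    """Return the top predicted label and its probability for one sample."""
    labels, probs = model.predict(to_low(text))
    # fastText prepends '__label__' to every class name; stripping it here.
    # The actual class names (e.g. 'en' / 'zh') depend on how the training
    # data was labeled and are an assumption in this sketch.
    return labels[0].replace('__label__', ''), float(probs[0])

print(detect_language(model, 'Hello, world!'))
print(detect_language(model, '糖尿病患者应注意饮食控制。'))
```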