Introduction to the Lexical Analysis Library Functions
1. Word segmentation: segment
method options — Chinese: jieba_ac, jieba_all, hanlp, thulac, snownlp, ltp; English: spacy, nltk, split
2. Stemming: stem
method options: porter, lancester, snowball
3. Lemmatization: lemmatize_text
method options: spacy, nltk
4. Part-of-speech tagging: tagging
method options — Chinese: jieba, thulac, hanlp, npir, snownlp; English: nltk, spacy
5. Named entity recognition: named_entity_recognition
method options — Chinese: LTP (Nh person, Ni organization, Ns place name), Hanlp, spacy_ch; English: spacy_en, nltk
6. Stopword removal: remove_stopword
7. Word frequency counting: count_word_frequency
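Conceptually, the word-frequency step (item 7) tallies tokens after segmentation. A minimal sketch with the standard library, assuming segmentation has already produced a token list (the function name below is illustrative, not this library's API):

```python
from collections import Counter

def count_word_frequency(tokens):
    # Tally how often each token appears in the segmented text
    return Counter(tokens)

# English segmentation via the simple 'split' method listed above
tokens = "this is a test this is only a test".split()
freq = count_word_frequency(tokens)
print(freq.most_common(3))
```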
Fill in function with the desired feature and method with the corresponding method parameter listed above.
The required libraries must be installed in advance; they are listed in the require file.
In addition, run python -m spacy download zh_core_web_sm and python -m spacy download en_core_web_sm to install zh_core_web_sm==3.7.0 and en_core_web_sm==3.7.1.
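Assuming the dependency list is a pip-style requirements file (the filename requirements.txt below is an assumption; use whatever the repository's require file is actually named), setup looks like:

```shell
# Install the third-party libraries used by the lexical analysis functions
# (requirements.txt is an assumed filename)
pip install -r requirements.txt
# Install the spaCy models pinned by this project
python -m spacy download zh_core_web_sm
python -m spacy download en_core_web_sm
```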
Quick Start
A usage example for the lexical analysis library functions:
```python
from huggingface_hub import hf_hub_download
import importlib.util

def nlp(content, function, method):
    # Replace with your Hugging Face username and repository name
    repo_id = "epetery/my-new-model"
    filename = "divide_corpus.py"
    stopwords_filename = "stopwords-master/baidu_stopwords.txt"
    # Download the files from the Hub
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    stopwords_file_path = hf_hub_download(repo_id=repo_id, filename=stopwords_filename)
    # Import the downloaded file as a module
    spec = importlib.util.spec_from_file_location("divide_corpus", file_path)
    divide_corpus = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(divide_corpus)
    divide_corpus.STOPWORDS_FILE_PATH = stopwords_file_path
    # Use the class and methods defined in the module
    text_divider = getattr(divide_corpus, "NLP_Class")(content)
    if function != 'count_word_frequency':
        divided_text = getattr(text_divider, function)(method=method)
    else:
        # Word frequency counting requires segmented text as input
        seg_text = getattr(text_divider, 'segment')(method=method)
        freq_counter = getattr(divide_corpus, "NLP_Class")(seg_text)
        divided_text = freq_counter.count_word_frequency()
    return divided_text

# Call the wrapper function
text = "This is a test text."
divided_text = nlp(text, 'remove_stopword', 'nltk')
print(divided_text)
```
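The dynamic-import pattern used inside nlp() — load a .py file by path, execute it as a module, then set attributes on it — can be exercised on its own with the standard library. This sketch writes a throwaway module to a temporary file instead of downloading divide_corpus.py:

```python
import importlib.util
import os
import tempfile

# Write a stand-in module to disk, mimicking the downloaded divide_corpus.py
source = "def greet(name):\n    return 'hello ' + name\n"
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write(source)
    module_path = f.name

# Same three-step load as in nlp(): spec -> module -> exec
spec = importlib.util.spec_from_file_location("demo_module", module_path)
demo_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(demo_module)

# Attributes can be set on the loaded module object from outside,
# which is how nlp() injects STOPWORDS_FILE_PATH
demo_module.STOPWORDS_FILE_PATH = "/path/to/stopwords.txt"

print(demo_module.greet("world"))  # hello world
os.remove(module_path)
```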