Chinese Word Classification

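This model performs multi-label classification of Chinese words. Each input word is assigned to one or more of the following categories (the Chinese strings are the label names the model returns; English glosses are given in parentheses):

"1": "人文科学" (Humanities),
"2": "农林渔畜" (Agriculture, forestry, fishery, and animal husbandry),
"3": "医学" (Medicine),
"4": "城市信息大全" (City information),
"5": "娱乐" (Entertainment),
"6": "工程与应用科学" (Engineering and applied sciences),
"7": "生活" (Daily life),
"8": "电子游戏" (Video games),
"9": "社会科学" (Social sciences),
"10": "自然科学" (Natural sciences),
"11": "艺术" (Arts),
"12": "运动休闲" (Sports and leisure)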

The categories come from the word-type taxonomy of the Sogou (搜狗) vocabulary.
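To make the multi-label setup concrete, here is a minimal sketch of the decision rule in plain Python: each category gets its own sigmoid score, and every category above a threshold is kept. The logits are made up for illustration, and the assumption that output position i corresponds to the i-th category in the list above is ours; the authoritative mapping is `model.config.id2label`.

```python
import math

# Category names in the order listed above. Assumption: this order matches
# the model's output positions; check model.config.id2label to be sure.
CATEGORIES = [
    "人文科学", "农林渔畜", "医学", "城市信息大全", "娱乐", "工程与应用科学",
    "生活", "电子游戏", "社会科学", "自然科学", "艺术", "运动休闲",
]


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # numerically stable branch for negative inputs
    return e / (1.0 + e)


def decode_multilabel(logits, threshold=0.5):
    """Return every category whose sigmoid score exceeds the threshold."""
    if len(logits) != len(CATEGORIES):
        raise ValueError("expected one logit per category")
    return [name for name, z in zip(CATEGORIES, logits) if sigmoid(z) > threshold]


# Made-up logits for illustration only (not real model output).
example_logits = [-3.0, 1.2, -2.5, -4.0, -1.0, -3.3, 0.4, -2.0, -3.1, 2.2, -4.2, -0.7]
selected = decode_multilabel(example_logits)
print(selected)
```

Because the scores are independent rather than normalized against each other, several categories can pass the threshold for the same word, which is what distinguishes this from single-label softmax classification.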

Usage example

import torch
from transformers import AutoTokenizer, BertForSequenceClassification

model_path = "iioSnail/bert-base-chinese-word-classifier"

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)

words = ["2型糖尿病", "太古里", "跑跑卡丁车", "河豚"]
inputs = tokenizer(words, return_tensors="pt", padding=True)

# Gradients are not needed for inference.
with torch.no_grad():
    logits = model(**inputs).logits

# Multi-label: apply a sigmoid per category and keep every score above 0.5.
probs = logits.sigmoid()
preds = probs > 0.5
for word, pred in zip(words, preds):
    label_ids = torch.argwhere(pred).view(-1)  # indices of the selected categories
    labels = [model.config.id2label[int(label_id)] for label_id in label_ids]
    print(word, ":", labels)

Output:

2型糖尿病 : ['医学']
太古里 : ['城市信息大全']
跑跑卡丁车 : ['电子游戏']
河豚 : ['人文科学', '娱乐', '电子游戏', '自然科学']
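A word such as 河豚 can receive several labels at once, and the boolean output alone does not show how confident each one is. The following is a minimal sketch of one way to rank labels by score; `rank_labels`, its `fallback_top1` parameter, and the demo mapping and probabilities are hypothetical names and values introduced here for illustration, not part of the model or of `transformers`.

```python
def rank_labels(probs, id2label, threshold=0.5, fallback_top1=True):
    """Return (label, probability) pairs above the threshold, highest first.

    If nothing passes the threshold and fallback_top1 is True, the single
    best-scoring label is returned instead of an empty list.
    """
    scored = sorted(
        ((id2label[i], p) for i, p in enumerate(probs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    kept = [(label, p) for label, p in scored if p > threshold]
    if not kept and fallback_top1 and scored:
        kept = scored[:1]
    return kept


# Hypothetical three-label mapping and probabilities, for illustration only.
demo_id2label = {0: "医学", 1: "电子游戏", 2: "自然科学"}
demo_probs = [0.10, 0.62, 0.91]
ranked = rank_labels(demo_probs, demo_id2label)
print(ranked)
```

With the real model, one row of the sigmoid scores from the usage example, converted with `.tolist()`, can be passed as `probs` together with `model.config.id2label`. Raising `threshold` is one way to trim weaker labels when a word is assigned to many categories.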