功能介绍

T5Corrector:中文字音与字形纠错模型

这个模型是基于mengzi-t5-base进行文本纠错训练,使用2kw+句子,通过替换同音词、近音词和形近字来,对于句中词组随机添加词组、删除词组中的部分字,以及字词乱序操作构造纠错平行语料,共计2亿+句对,累计训练66000步。

Github项目地址

加载模型:

# 加载模型
from transformers import AutoTokenizer, T5ForConditionalGeneration
pretrained = "Maciel/T5Corrector-base-v2"
tokenizer = AutoTokenizer.from_pretrained(pretrained)
model = T5ForConditionalGeneration.from_pretrained(pretrained)

使用模型进行预测推理方法:

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def correct(text, max_length):
    model_inputs = tokenizer(text, 
                                max_length=max_length, 
                                truncation=True, 
                                return_tensors="pt").to(device)
    output = model.generate(**model_inputs, 
                              num_beams=5,
                              no_repeat_ngram_size=4,
                              do_sample=True, 
                              early_stopping=True,
                              max_length=max_length,
                              return_dict_in_generate=True,
                              output_scores=True)
    pred_output = tokenizer.batch_decode(output.sequences, skip_special_tokens=True)[0]
    return pred_output

text = "贵州毛台现在多少钱一瓶啊,想买两瓶尝尝味道。"
correction = correct(text, max_length=32)
print(correction)

案例展示

示例1:
input: 能不能帮我买点淇淋,好久没吃了。
output: 能不能帮我买点冰淇淋,好久没吃了。

示例2:
input: 脑子有点胡涂了,这道题冥冥学过还没有做出来
output: 脑子有点糊涂了,这道题明明学过还没有做出来

示例3:
input: 今天天气不太好,我的心情也不是很偷快
output: 今天天气不太好,我的心情也不是很愉快
Downloads last month
20
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.