---
language:
- zh
tags:
- chatglm
- pytorch
- Text-Generation
license: apache-2.0
widget:
- text: |-
    对下面中文拼写纠错:
    少先队员因该为老人让坐。
    答:
base_model: THUDM/chatglm3-6b
pipeline_tag: text-generation
library_name: peft
inference: false
---

# Chinese Spelling Correction LoRA Model
A LoRA model for Chinese spelling correction, based on ChatGLM3-6B.

Evaluation of `shibing624/chatglm3-6b-csc-chinese-lora` on the CSC **test** set:

|input_text|pred|
|:--- |:--- |
|对下面文本纠错:少先队员因该为老人让坐。|少先队员应该为老人让座。|

The model achieves high correction accuracy on the CSC test set. Because it is based on [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b), the results are often pleasantly surprising: besides correcting spelling errors, the model can also polish and rewrite sentences.


## Usage

This model is open-sourced as part of the [pycorrector](https://github.com/shibing624/pycorrector) project, which supports both the native ChatGLM model and the LoRA fine-tuned model. Call it as follows:

Install package:
```shell
pip install -U pycorrector
```

```python
from pycorrector import GptCorrector
model = GptCorrector("THUDM/chatglm3-6b", "chatglm", peft_name="shibing624/chatglm3-6b-csc-chinese-lora")
r = model.correct_batch(["少先队员因该为老人让坐。"])
print(r) # ['少先队员应该为老人让座。']
```

## Usage (HuggingFace Transformers)
Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model directly with `transformers`: pass your prompt through the base model with the LoRA adapter attached, then decode the generated tokens.

Install package:
```shell
pip install transformers peft
```

```python
import os

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load the base model in fp16 on GPU, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = PeftModel.from_pretrained(model, "shibing624/chatglm3-6b-csc-chinese-lora")

sents = ['对下面文本纠错\n\n少先队员因该为老人让坐。',
         '对下面文本纠错\n\n下个星期,我跟我朋唷打算去法国玩儿。']


def get_prompt(user_query):
    """Wrap the query in the vicuna-style prompt template used during training."""
    vicuna_prompt = "A chat between a curious user and an artificial intelligence assistant. " \
                    "The assistant gives helpful, detailed, and polite answers to the user's questions. " \
                    "USER: {query} ASSISTANT:"
    return vicuna_prompt.format(query=user_query)


for s in sents:
    q = get_prompt(s)
    input_ids = tokenizer(q).input_ids
    generation_kwargs = dict(max_new_tokens=128, do_sample=True, temperature=0.8)
    outputs = model.generate(input_ids=torch.as_tensor([input_ids]).to('cuda:0'), **generation_kwargs)
    # Strip the prompt tokens, keeping only the newly generated continuation.
    output_tensor = outputs[0][len(input_ids):]
    response = tokenizer.decode(output_tensor, skip_special_tokens=True)
    print(response)
```

output:
```shell
少先队员应该为老人让座。
下个星期,我跟我朋友打算去法国玩儿。
```


Model files:
```
chatglm3-6b-csc-chinese-lora
    ├── adapter_config.json
    └── adapter_model.bin
```

#### Training parameters

![loss](train_loss.png)

- num_epochs: 5
- per_device_train_batch_size: 6
- learning_rate: 2e-05
- best steps: 25100
- train_loss: 0.0834
- lr_scheduler_type: linear
- base model: THUDM/chatglm3-6b
- warmup_steps: 50
- save_strategy: steps
- save_steps: 500
- save_total_limit: 10
- bf16: false
- fp16: true
- optim: adamw_torch
- ddp_find_unused_parameters: false
- gradient_checkpointing: true
- max_seq_length: 512
- max_length: 512
- prompt_template_name: vicuna
- hardware: 6 × V100 32GB, ~48 hours of training
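The hyperparameters above can be mapped onto typical `peft`/`transformers` training objects roughly as follows. This is a hedged sketch, not the project's actual training script (which lives in the pycorrector repo); the LoRA rank/alpha are not listed in this card, and `output_dir` is a hypothetical path.

```python
# Sketch only: maps the listed hyperparameters onto peft/transformers configs.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    task_type="CAUSAL_LM",  # assumption: rank/alpha are not given in the card
)

training_args = TrainingArguments(
    output_dir="outputs-chatglm3-6b-csc",  # hypothetical path
    num_train_epochs=5,
    per_device_train_batch_size=6,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_steps=50,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=10,
    fp16=True,
    bf16=False,
    optim="adamw_torch",
    ddp_find_unused_parameters=False,
    gradient_checkpointing=True,
)
```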

### Training datasets
The training set consists of the following data:

- Chinese spelling correction dataset: https://huggingface.co/datasets/shibing624/CSC
- Chinese grammar correction dataset: https://github.com/shibing624/pycorrector/tree/llm/examples/data/grammar
- General GPT-4 Q&A dataset: https://huggingface.co/datasets/shibing624/sharegpt_gpt4
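Since the card states the vicuna prompt template was used, a correction pair from these datasets was presumably rendered into a single training string along those lines. A minimal sketch (the function and field names here are illustrative assumptions, not the project's actual preprocessing code):

```python
# Hedged sketch: wrap a (wrong, corrected) sentence pair into the
# vicuna-style template that the inference code in this card uses.
VICUNA_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: {query} ASSISTANT: {answer}"
)

def build_training_text(source: str, target: str) -> str:
    """Render one spelling-correction pair as a single training string."""
    query = f"对下面文本纠错\n\n{source}"
    return VICUNA_TEMPLATE.format(query=query, answer=target)

example = build_training_text("少先队员因该为老人让坐。", "少先队员应该为老人让座。")
print(example)
```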


To train your own text correction model, see [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector).



## Citation

```bibtex
@software{pycorrector,
  author = {Ming Xu},
  title = {pycorrector: Text Error Correction Tool},
  year = {2023},
  url = {https://github.com/shibing624/pycorrector},
}
```