Create README.md
README.md
ADDED
@@ -0,0 +1,80 @@
# RoFormer for PaddlePaddle

# Requires the latest PaddleNLP
`pip install git+https://github.com/PaddlePaddle/PaddleNLP.git`

## Converting pretrained weights

Pretrained weights can be converted from huggingface/transformers as follows (this applies to RoFormer models; adjust as needed for other models):

1. Download the RoFormer model weights from huggingface.co.
2. Set the arguments and run convert.py.
3. Example:
Suppose we want to convert the https://huggingface.co/junnyu/roformer_chinese_base weights.
(1) First download the pytorch_model.bin file from https://huggingface.co/junnyu/roformer_chinese_base/tree/main and save it as `./roformer_chinese_base/pytorch_model.bin`.
(2) Run convert.py:
```bash
python convert.py \
    --pytorch_checkpoint_path ./roformer_chinese_base/pytorch_model.bin \
    --paddle_dump_path ./roformer_chinese_base/model_state.pdparams
```
(3) This produces the converted weights at `./roformer_chinese_base/model_state.pdparams`.

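
convert.py itself is not reproduced in this README. As a rough, hypothetical sketch of what such a conversion typically involves: PyTorch's `nn.Linear` stores its weight as `[out_features, in_features]`, while Paddle's `nn.Linear` uses `[in_features, out_features]`, so 2-D linear weights are usually transposed and embeddings and 1-D parameters are copied unchanged. The helper `convert_state_dict`, the `"embeddings"` key test, and the toy parameter names are all assumptions for illustration; plain nested lists stand in for tensors so the snippet runs without torch or paddle. A real script may also need to rename keys, which is not shown.

```python
def transpose_2d(matrix):
    """Return the transpose of a rectangular list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def is_2d(value):
    """True if value looks like a non-empty list of rows."""
    return bool(value) and isinstance(value[0], list)


def convert_state_dict(pytorch_state):
    """Hypothetical sketch: map a PyTorch-style state dict to Paddle layout."""
    paddle_state = {}
    for name, value in pytorch_state.items():
        # Assumption: embedding tables keep their [vocab, hidden] layout,
        # while every other 2-D ".weight" belongs to a linear layer.
        is_linear_weight = (
            name.endswith(".weight") and is_2d(value) and "embeddings" not in name
        )
        paddle_state[name] = transpose_2d(value) if is_linear_weight else value
    return paddle_state


# Toy parameters with made-up names, only to show the layout change.
toy_state = {
    "embeddings.word_embeddings.weight": [[1, 2], [3, 4], [5, 6]],
    "encoder.dense.weight": [[1, 2, 3], [4, 5, 6]],  # [out=2, in=3]
    "encoder.dense.bias": [0.1, 0.2],
}
converted = convert_state_dict(toy_state)
print(converted["encoder.dense.weight"])  # [[1, 4], [2, 5], [3, 6]]
```
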
## Testing the pretrained MLM
# test_mlm.py
```python
import argparse

import paddle
from paddlenlp.transformers import RoFormerForPretraining, RoFormerTokenizer


def test_mlm(text, model_name):
    model = RoFormerForPretraining.from_pretrained(model_name)
    model.eval()
    tokenizer = RoFormerTokenizer.from_pretrained(model_name)

    # Tokenize the text around each [MASK] so the mask token is never split.
    tokens = ["[CLS]"]
    text_list = text.split("[MASK]")
    for i, t in enumerate(text_list):
        tokens.extend(tokenizer.tokenize(t))
        if i == len(text_list) - 1:
            tokens.extend(["[SEP]"])
        else:
            tokens.extend(["[MASK]"])

    input_ids_list = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = paddle.to_tensor([input_ids_list])

    with paddle.no_grad():
        # Index [0][0]: prediction scores for the single sequence in the batch.
        pd_outputs = model(input_ids)[0][0]

    mask_id = tokenizer.convert_tokens_to_ids(["[MASK]"])[0]
    pd_outputs_sentence = "paddle: "
    for i, token_id in enumerate(input_ids_list):
        if token_id == mask_id:
            # Show the five highest-scoring candidates for this masked position.
            top_tokens = tokenizer.convert_ids_to_tokens(pd_outputs[i].topk(5)[1].tolist())
            pd_outputs_sentence += "[" + "||".join(top_tokens) + "]"
        else:
            pd_outputs_sentence += "".join(
                tokenizer.convert_ids_to_tokens([token_id], skip_special_tokens=True)
            )
    print(pd_outputs_sentence)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name", default="roformer-chinese-base", type=str, help="Pretrained roformer name or path."
    )
    parser.add_argument(
        "--text", default="今天[MASK]很好,我想去公园玩!", type=str, help="MLM text."
    )
    args = parser.parse_args()
    test_mlm(text=args.text, model_name=args.model_name)
```
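
The way test_mlm.py assembles its input can be tried without downloading a model. The sketch below mirrors the split-on-`[MASK]` loop from the script. `build_tokens` is a hypothetical helper name, and `char_tokenize` is a hypothetical stand-in for `tokenizer.tokenize` that emits one token per character, so the tokens will differ from what the real `RoFormerTokenizer` produces.

```python
def char_tokenize(text):
    """Hypothetical stand-in for tokenizer.tokenize: one token per character."""
    return list(text)


def build_tokens(text, tokenize=char_tokenize):
    """Wrap text in [CLS]/[SEP] and keep every [MASK] as a single token."""
    tokens = ["[CLS]"]
    pieces = text.split("[MASK]")
    for i, piece in enumerate(pieces):
        tokens.extend(tokenize(piece))
        # A [MASK] sits between consecutive pieces; [SEP] closes the last one.
        tokens.append("[SEP]" if i == len(pieces) - 1 else "[MASK]")
    return tokens


print(build_tokens("a[MASK]b"))
# ['[CLS]', 'a', '[MASK]', 'b', '[SEP]']
```

Splitting first and tokenizing each piece separately avoids relying on the tokenizer to recognise the literal string `[MASK]` inside running text.
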
```bash
python test_mlm.py --model_name roformer-chinese-base --text '今天[MASK]很好,我想去公园玩!'
# paddle: 今天[天气||天||阳光||太阳||空气]很好,我想去公园玩!
python test_mlm.py --model_name roformer-chinese-base --text '北京是[MASK]的首都!'
# paddle: 北京是[中国||谁||中华人民共和国||我们||中华民族]的首都!
python test_mlm.py --model_name roformer-chinese-char-base --text '今天[MASK]很好,我想去公园玩!'
# paddle: 今天[天||气||都||风||人]很好,我想去公园玩!
python test_mlm.py --model_name roformer-chinese-char-base --text '北京是[MASK]的首都!'
# paddle: 北京是[谁||我||你||他||国]的首都!
```
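
Each masked position is printed as a bracketed list of five candidates joined by `||`. If those lines need to be consumed by another script, a small parser along these lines could recover the candidates. `parse_candidates` is a hypothetical helper, not part of test_mlm.py or PaddleNLP, and it assumes the surrounding text contains no other square brackets.

```python
import re


def parse_candidates(line):
    """Return one list of candidate tokens per bracketed [a||b||...] group."""
    groups = re.findall(r"\[([^\[\]]*)\]", line)
    return [group.split("||") for group in groups]


line = "paddle: 今天[天气||天||阳光||太阳||空气]很好"
print(parse_candidates(line))
# [['天气', '天', '阳光', '太阳', '空气']]
```
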