Update evaluation result
- README.md +7 -14
- pytorch_model-00001-of-00002.bin +2 -2
- pytorch_model-00002-of-00002.bin +2 -2
README.md
CHANGED
@@ -14,7 +14,7 @@ language:
 This is an attempt to replicate the RLHF pipeline
 
 ### Base Model
-
+
 We used [bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) because of its less-restricted license and multilingual ability.
 
 ### Supervised Finetune
@@ -34,8 +34,9 @@ This is an attempt to replicate the RLHF pipeline
 
 ### Reinforcement Learning
 
-For RL we used the code of [trlx](https://github.com/CarperAI/trlx)
-
+For RL we used the code of [trlx](https://github.com/CarperAI/trlx) with slight modifications.
+
+Instead of building the value network upon the policy network with a single linear layer, we add another hydra head upon the reference network's frozen bottom layers as the value network.
 
 ### Example
 
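The hydra value network added in this hunk is only described in prose, so here is a minimal PyTorch sketch of the idea under stated assumptions, not the repo's actual trlx modification: the value branch shares the reference model's frozen bottom blocks and trains only its own copy of the top blocks plus a scalar projection, instead of a single linear value head on the policy network. The block type, the `make_blocks` helper, `HydraValueNetwork`, and `n_branch_layers` are all hypothetical stand-ins.

```python
import copy
import torch
import torch.nn as nn

# Stand-in for the model's decoder block stack; in the real setup these would be
# the BLOOM blocks of bloomz-7b1-mt.  All names here are hypothetical.
def make_blocks(n_layers: int, hidden_size: int) -> nn.ModuleList:
    return nn.ModuleList([
        nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8, batch_first=True)
        for _ in range(n_layers)
    ])

class HydraValueNetwork(nn.Module):
    """Value network branching off the reference model's frozen bottom layers,
    instead of a single linear head sitting on the policy network."""

    def __init__(self, reference_blocks: nn.ModuleList, n_branch_layers: int, hidden_size: int):
        super().__init__()
        blocks = list(reference_blocks)
        # Bottom layers are shared with (and frozen like) the reference network.
        self.frozen_bottom = nn.ModuleList(blocks[:-n_branch_layers])
        for p in self.frozen_bottom.parameters():
            p.requires_grad = False
        # The "hydra head": a trainable copy of the top blocks, used only for values.
        self.value_branch = copy.deepcopy(nn.ModuleList(blocks[-n_branch_layers:]))
        self.v_head = nn.Linear(hidden_size, 1)  # scalar value per token

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        h = hidden_states
        with torch.no_grad():                     # no gradients through the shared bottom
            for block in self.frozen_bottom:
                h = block(h)
        for block in self.value_branch:
            h = block(h)
        return self.v_head(h).squeeze(-1)         # (batch, seq_len) value estimates

# Toy usage: a 6-layer stack, last 2 layers duplicated as the value branch.
ref_blocks = make_blocks(n_layers=6, hidden_size=64)
value_net = HydraValueNetwork(ref_blocks, n_branch_layers=2, hidden_size=64)
values = value_net(torch.randn(2, 10, 64))        # -> shape (2, 10)
```

Branching off the frozen reference network rather than the policy network means value training cannot disturb the policy's representations, which is presumably the motivation, though the commit itself does not elaborate.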
@@ -60,18 +61,10 @@ This is an attempt to replicate the RLHF pipeline
 
 ### Evaluations
 
-Result on the English [Vicuna eval set](https://github.com/lm-sys/FastChat/tree/main/fastchat/eval)
-
-ChatGPT score: 662.5; Bloomz score: 535.0 (81%)
-
-| category | generic | knowledge | roleplay | common-sense | fermi | counterfactual | coding | math | writing |
-| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| chatgpt avg score | 8.05 | 8.15 | 8.30 | 8.10 | 8.30 | 8.10 | 8.29 | 10.0 | 8.45 |
-| bloomz avg score | 7.95 | 8.05 | 6.80 | 6.95 | 4.20 | 6.95 | 6.14 | 3.33 | 7.30 |
-* We don't have access to the GPT-4 API, so the result comes from the GPT-4 interface, which may not be exactly the same.
-
 Result on the Chinese [BELLE eval set](https://github.com/LianjiaTech/BELLE/tree/main/eval)
 
 | others | rewrite | classification | generation | summarization | extract | open qa | brainstorming | closed qa | macro ave | macro ave w/o others |
 | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| 0.
+| 0.619 | 0.873 | 0.706 | 0.934 | 0.755 | 0.619 | 0.527 | 0.908 | 0.615 | 0.728 | 0.742 |
+
+* We found that in GPT-4 evaluation the order in which the responses are presented has a non-negligible effect on the final score, even with the well-designed Vicuna prompt, so we removed the score on the Vicuna eval set.
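As a quick sanity check (not part of the commit), the two macro averages in the new BELLE row follow directly from the nine per-category scores: "macro ave" is the plain mean over all nine categories, and "macro ave w/o others" drops the "others" column.

```python
# Reproduce the two macro averages from the per-category BELLE scores above.
scores = {
    "others": 0.619, "rewrite": 0.873, "classification": 0.706,
    "generation": 0.934, "summarization": 0.755, "extract": 0.619,
    "open qa": 0.527, "brainstorming": 0.908, "closed qa": 0.615,
}

macro_ave = sum(scores.values()) / len(scores)
macro_ave_wo_others = sum(v for k, v in scores.items() if k != "others") / (len(scores) - 1)

print(round(macro_ave, 3))            # 0.728 -- matches the "macro ave" column
print(round(macro_ave_wo_others, 3))  # 0.742 -- matches "macro ave w/o others"
```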
pytorch_model-00001-of-00002.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:82d67a77ce1b7f68d40c5a97de78feaa151e094be4938290d26e6ccc1e46ec1c
+size 18542818872
pytorch_model-00002-of-00002.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4d331cabaef2b2e58e833cf587ad12d4a1bb8085de7b27c08764ce1a21144ce8
+size 11561532465
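Both weight shards are stored through Git LFS, so the commit only swaps the pointer contents above (the sha256 object ID and byte size). A small, hedged sketch of how downloaded shards could be checked against these pointers; the helper below is illustrative and not part of the repo:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so multi-GB shards don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Expected oid/size values taken from the updated LFS pointers in this commit.
expected = {
    "pytorch_model-00001-of-00002.bin":
        ("82d67a77ce1b7f68d40c5a97de78feaa151e094be4938290d26e6ccc1e46ec1c", 18542818872),
    "pytorch_model-00002-of-00002.bin":
        ("4d331cabaef2b2e58e833cf587ad12d4a1bb8085de7b27c08764ce1a21144ce8", 11561532465),
}

for name, (oid, size) in expected.items():
    path = Path(name)
    ok = path.stat().st_size == size and sha256_of(path) == oid
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```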