Update evaluation result
- README.md +7 -14
- pytorch_model-00001-of-00002.bin +2 -2
- pytorch_model-00002-of-00002.bin +2 -2
README.md
CHANGED
@@ -14,7 +14,7 @@ language:
 This is an attempt to replicate the RLHF pipeline
 
 ### Base Model
-
+
 We used [bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) because of its less-restricted license and multilingual ability.
 
 ### Supervised Finetune
@@ -34,8 +34,9 @@ This is an attempt to replicate the RLHF pipeline
 
 ### Reinforcement Learning
 
-For RL we used the code of [trlx](https://github.com/CarperAI/trlx)
-
+For RL we used the code of [trlx](https://github.com/CarperAI/trlx) with slight modifications.
+
+Instead of building the value network upon the policy network with a single linear layer, we add another hydra head upon the reference network's frozen bottom layers as the value network.
 
 ### Example
 
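The hydra value network added in this hunk is only described in prose, so here is a minimal PyTorch sketch of the idea under stated assumptions, not the repo's actual trlx modification: the value branch shares the reference model's frozen bottom blocks and trains only its own copy of the top blocks plus a scalar projection, instead of a single linear value head on the policy network. The block type, the `make_blocks` helper, `HydraValueNetwork`, and `n_branch_layers` are all hypothetical stand-ins.

```python
import copy
import torch
import torch.nn as nn

# Stand-in for the model's decoder block stack; in the real setup these would be
# the BLOOM blocks of bloomz-7b1-mt.  All names here are hypothetical.
def make_blocks(n_layers: int, hidden_size: int) -> nn.ModuleList:
    return nn.ModuleList([
        nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8, batch_first=True)
        for _ in range(n_layers)
    ])

class HydraValueNetwork(nn.Module):
    """Value network branching off the reference model's frozen bottom layers,
    instead of a single linear head sitting on the policy network."""

    def __init__(self, reference_blocks: nn.ModuleList, n_branch_layers: int, hidden_size: int):
        super().__init__()
        blocks = list(reference_blocks)
        # Bottom layers are shared with (and frozen like) the reference network.
        self.frozen_bottom = nn.ModuleList(blocks[:-n_branch_layers])
        for p in self.frozen_bottom.parameters():
            p.requires_grad = False
        # The "hydra head": a trainable copy of the top blocks, used only for values.
        self.value_branch = copy.deepcopy(nn.ModuleList(blocks[-n_branch_layers:]))
        self.v_head = nn.Linear(hidden_size, 1)  # scalar value per token

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        h = hidden_states
        with torch.no_grad():                     # no gradients through the shared bottom
            for block in self.frozen_bottom:
                h = block(h)
        for block in self.value_branch:
            h = block(h)
        return self.v_head(h).squeeze(-1)         # (batch, seq_len) value estimates

# Toy usage: a 6-layer stack, last 2 layers duplicated as the value branch.
ref_blocks = make_blocks(n_layers=6, hidden_size=64)
value_net = HydraValueNetwork(ref_blocks, n_branch_layers=2, hidden_size=64)
values = value_net(torch.randn(2, 10, 64))        # -> shape (2, 10)
```

Branching off the frozen reference network rather than the policy network means value training cannot disturb the policy's representations, which is presumably the motivation, though the commit itself does not elaborate.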
@@ -60,18 +61,10 @@ This is an attempt to replicate the RLHF pipeline
 
 ### Evaluations
 
-Result on the English [Vicuna eval set](https://github.com/lm-sys/FastChat/tree/main/fastchat/eval)
-
-ChatGPT score: 662.5; Bloomz score: 535.0 (81%)
-
-| category | generic | knowledge | roleplay | common-sense | fermi | counterfactual | coding | math | writing |
-| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| chatgpt avg score | 8.05 | 8.15 | 8.30 | 8.10 | 8.30 | 8.10 | 8.29 | 10.0 | 8.45 |
-| bloomz avg score | 7.95 | 8.05 | 6.80 | 6.95 | 4.20 | 6.95 | 6.14 | 3.33 | 7.30 |
-* We don't have access to the GPT-4 API, so the result comes from the GPT-4 interface, which may not be exactly the same.
-
 Result on the Chinese [BELLE eval set](https://github.com/LianjiaTech/BELLE/tree/main/eval)
 
 | others | rewrite | classification | generation | summarization | extract | open qa | brainstorming | closed qa | macro ave | macro ave w/o others |
 | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| 0.
+| 0.619 | 0.873 | 0.706 | 0.934 | 0.755 | 0.619 | 0.527 | 0.908 | 0.615 | 0.728 | 0.742 |
+
+* We found that in GPT-4 evaluation the order in which the responses are presented has a non-negligible effect on the final score, even with the well-designed Vicuna prompt, so we removed the score on the Vicuna eval set.
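As a quick sanity check (not part of the commit), the two macro averages in the new BELLE row follow directly from the nine per-category scores: "macro ave" is the plain mean over all nine categories, and "macro ave w/o others" drops the "others" column.

```python
# Reproduce the two macro averages from the per-category BELLE scores above.
scores = {
    "others": 0.619, "rewrite": 0.873, "classification": 0.706,
    "generation": 0.934, "summarization": 0.755, "extract": 0.619,
    "open qa": 0.527, "brainstorming": 0.908, "closed qa": 0.615,
}

macro_ave = sum(scores.values()) / len(scores)
macro_ave_wo_others = sum(v for k, v in scores.items() if k != "others") / (len(scores) - 1)

print(round(macro_ave, 3))            # 0.728 -- matches the "macro ave" column
print(round(macro_ave_wo_others, 3))  # 0.742 -- matches "macro ave w/o others"
```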
pytorch_model-00001-of-00002.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:82d67a77ce1b7f68d40c5a97de78feaa151e094be4938290d26e6ccc1e46ec1c
+size 18542818872
pytorch_model-00002-of-00002.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4d331cabaef2b2e58e833cf587ad12d4a1bb8085de7b27c08764ce1a21144ce8
+size 11561532465
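Both weight shards are stored through Git LFS, so the commit only swaps the pointer contents above (the sha256 object ID and byte size). A small, hedged sketch of how downloaded shards could be checked against these pointers; the helper below is illustrative and not part of the repo:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so multi-GB shards don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Expected oid/size values taken from the updated LFS pointers in this commit.
expected = {
    "pytorch_model-00001-of-00002.bin":
        ("82d67a77ce1b7f68d40c5a97de78feaa151e094be4938290d26e6ccc1e46ec1c", 18542818872),
    "pytorch_model-00002-of-00002.bin":
        ("4d331cabaef2b2e58e833cf587ad12d4a1bb8085de7b27c08764ce1a21144ce8", 11561532465),
}

for name, (oid, size) in expected.items():
    path = Path(name)
    ok = path.stat().st_size == size and sha256_of(path) == oid
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```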