tianyuz committed on
Commit
f1e3d20
1 Parent(s): 86b9caf

Update README.md

Files changed (1)
  1. README.md +23 -0
README.md CHANGED
@@ -30,6 +30,13 @@ The model is based on [`rinna/bilingual-gpt-neox-4b-instruction-sft`](https://hu
30
  * The first SFT stage produces [`rinna/bilingual-gpt-neox-4b-instruction-sft`](https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft).
31
  * The second RL stage produces this model.
32

33
  * **Model Series**
34
 
35
  | Variant | Link |
@@ -50,6 +57,22 @@ The model is based on [`rinna/bilingual-gpt-neox-4b-instruction-sft`](https://hu
50
 
51
  ---
52

53
  # I/O Format
54
  A special format has been adopted to construct inputs.
55
  * An input prompt is formatted as a conversation between `ユーザー` and `システム`.
 
30
  * The first SFT stage produces [`rinna/bilingual-gpt-neox-4b-instruction-sft`](https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft).
31
  * The second RL stage produces this model.
32
 
33
+ * **Reinforcement learning**
34
+
35
+ We used [CarperAI/trlx](https://github.com/CarperAI/trlx) and its implementation of the PPO algorithm for the RL stage.
36
+
37
+ The RL data is a subset of the following dataset, translated into Japanese.
38
+ * [Anthropic HH RLHF data](https://huggingface.co/datasets/Anthropic/hh-rlhf)
39
+
40
  * **Model Series**
41
 
42
  | Variant | Link |
 
57
 
58
  ---
59
 
60
+ # Benchmarking
61
+
62
+ Our evaluation experiments suggest that PPO does not noticeably improve the model's performance on the Japanese LLM benchmark compared with [Bilingual GPT-NeoX 4B SFT](https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft), but we have observed a **better conversation experience** with the PPO model than with its SFT counterpart.
63
+ - *The 4-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, and JSQuAD.*
64
+ - *The 6-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, JSQuAD, XWinograd, and JAQKET-v2.*
65
+
66
+ | Model | 4-task average accuracy | 6-task average accuracy |
67
+ | :-- | :-- | :-- |
68
+ | **bilingual-gpt-neox-4b-instruction-ppo** | **61.01** | **61.16** |
69
+ | bilingual-gpt-neox-4b-instruction-sft | 61.02 | 61.69 |
70
+ | bilingual-gpt-neox-4b | 56.12 | 51.83 |
71
+ | japanese-gpt-neox-3.6b-instruction-ppo | 59.86 | 60.07 |
72
+ | japanese-gpt-neox-3.6b | 55.07 | 50.32 |
73
+
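The gaps between the rows above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the scores listed in the table; the `scores` dictionary and the `delta` helper are hypothetical names introduced for illustration and are not part of the model card or any evaluation harness.

```python
# Scores copied from the benchmark table: (4-task average, 6-task average).
scores = {
    "bilingual-gpt-neox-4b-instruction-ppo": (61.01, 61.16),
    "bilingual-gpt-neox-4b-instruction-sft": (61.02, 61.69),
    "bilingual-gpt-neox-4b": (56.12, 51.83),
}


def delta(model, baseline):
    """Return (model - baseline) for each column, rounded to 2 decimals."""
    return tuple(
        round(m - b, 2) for m, b in zip(scores[model], scores[baseline])
    )


ppo = "bilingual-gpt-neox-4b-instruction-ppo"
ppo_vs_sft = delta(ppo, "bilingual-gpt-neox-4b-instruction-sft")
ppo_vs_base = delta(ppo, "bilingual-gpt-neox-4b")

print(ppo_vs_sft)   # (-0.01, -0.53)
print(ppo_vs_base)  # (4.89, 9.33)
```

On these numbers the PPO model is within about half a point of the SFT model on both averages, while both sit several points above the pretrained base model, which is consistent with the statement that PPO does not noticeably change benchmark accuracy.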
74
+ ---
75
+
76
  # I/O Format
77
  A special format has been adopted to construct inputs.
78
  * An input prompt is formatted as a conversation between `ユーザー` and `システム`.