GitBag committed
Commit 508ba01
Parent: 1890755

Update README.md

Files changed (1)
  1. README.md +18 -16
README.md CHANGED
@@ -7,10 +7,10 @@ language:
---
This is a model released for our paper: [REBEL: Reinforcement Learning via Regressing Relative Rewards](https://arxiv.org/abs/2404.16767).

- # REBEL-Llama-3
+ # REBEL-Llama-3-epoch_2

This model is developed with REBEL based on [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) as the reward model and [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset.
- The training code is available at https://github.com/ZhaolinGao/REBEL. This is the checkpoint that achieves the highest AlpacaEval 2.0 scores.
+ The training code is available at https://github.com/ZhaolinGao/REBEL. We collect online generations during each iteration with a batch size of 32.

### Links to Other Model

@@ -18,13 +18,22 @@ The training code is available at https://github.com/ZhaolinGao/REBEL. This is t

[REBEL-Llama-3](https://huggingface.co/Cornell-AGI/REBEL-Llama-3)

- ### AlpacaEval 2.0 Evaluations
-
- | Model | AlpacaEval 2.0<br>LC Win Rate | AlpacaEval 2.0<br>Win Rate |
- | :--------: | :--------: | :--------: |
- | REBEL-OpenChat-3.5| 17.3 | 12.8 |
- | REBEL-Llama-3 | 30.1 | 32.6 |
- | REBEL-Llama-3-epoch_2| 31.33 | 34.22 |
+ [REBEL-Llama-3-Armo-iter_1](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_1)
+
+ [REBEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_2)
+
+ [REBEL-Llama-3-Armo-iter_3](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_3)
+
+ ### Evaluations
+
+ | Model | AlpacaEval 2.0<br>LC Win Rate | AlpacaEval 2.0<br>Win Rate | MT-Bench<br>Average | MMLU<br>(5-shot) | GSM8K<br>(5-shot) |
+ | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: |
+ | REBEL-OpenChat-3.5| 17.3 | 12.8 | 8.06 | 63.7 | 68.8 |
+ | REBEL-Llama-3 | 30.1 | 32.6 | 8.16 | 65.8 | 75.6 |
+ | REBEL-Llama-3-epoch_2| 31.3 | 34.2 | 7.83 | 65.4 | 75.4 |
+ | REBEL-Llama-3-Armo-iter_1| 48.3 | 41.8 | 8.13 | 66.3 | 75.8 |
+ | REBEL-Llama-3-Armo-iter_2| 50.0 | 48.5 | 8.07 | 65.9 | 75.4 |
+ | REBEL-Llama-3-Armo-iter_3| 49.7 | 48.1 | 8.01 | 66.0 | 75.7 |

## Citation
Please cite our paper if you use this model in your own work:
@@ -37,11 +46,4 @@ Please cite our paper if you use this model in your own work:
archivePrefix={arXiv},
primaryClass={cs.LG}
}
- ```
-
-
-
-
-
-
-
+ ```
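
For quick reference, below is a minimal usage sketch for the checkpoint described in the updated card. It is a sketch, not code from the commit: it assumes the repository id `Cornell-AGI/REBEL-Llama-3-epoch_2` (by analogy with the other linked checkpoints) and the standard Llama-3-Instruct chat template inherited from the base model.

```python
# Minimal usage sketch for the REBEL-Llama-3-epoch_2 checkpoint.
# Assumptions: the repo id below is correct, and the tokenizer carries the
# Llama-3-Instruct chat template inherited from Meta-Llama-3-8B-Instruct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cornell-AGI/REBEL-Llama-3-epoch_2"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build a chat prompt with the model's chat template and generate a reply.
messages = [{"role": "user", "content": "Explain RLHF in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Greedy decoding is used here only to keep the sketch deterministic; sampling parameters can be adjusted to match your evaluation setup.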