Update README.md
README.md
@@ -18,7 +18,7 @@ datasets:
 
 
 ## Description:
-Llama2-13B-RLHF-RM is a 13 billion parameter language model (with context of up to 4,096 tokens) used as the Reward Model in training [NV-Llama2-70B-RLHF](https://
+Llama2-13B-RLHF-RM is a 13 billion parameter language model (with context of up to 4,096 tokens) used as the Reward Model in training [NV-Llama2-70B-RLHF](https://huggingface.co/nvidia/NV-Llama2-70B-RLHF-Chat), which achieves 7.59 on MT-Bench and demonstrates strong performance on academic benchmarks.
 
 Starting from [Llama2-13B base model](https://huggingface.co/meta-llama/Llama-2-13b), it is first instruction-tuned with a combination of public and proprietary data and then trained on [HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf) with reward modeling objective. Given a conversation with multiple turns between user and assistant, it assigns a preference score on the last assistant turn.
 
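The README text in this diff says the model is trained on HH-RLHF "with reward modeling objective" but does not spell out the loss. As a rough illustration only, the sketch below assumes the common pairwise (Bradley-Terry style) formulation, where the score of the preferred ("chosen") last assistant turn is pushed above the score of the "rejected" one. The function names `pairwise_reward_loss` and `preferred_response` are hypothetical and are not part of the model's actual training code or API.

```python
import math


def pairwise_reward_loss(chosen_score: float, rejected_score: float) -> float:
    """Assumed pairwise objective: -log(sigmoid(chosen_score - rejected_score)).

    The loss is small when the chosen response scores well above the
    rejected one, and grows as the ordering is violated.
    """
    margin = chosen_score - rejected_score
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preferred_response(scores: dict[str, float]) -> str:
    """Return the candidate last assistant turn with the highest score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical scalar scores a reward model might assign to two replies.
    print(pairwise_reward_loss(chosen_score=1.5, rejected_score=-0.5))
    print(preferred_response({"reply_a": 1.5, "reply_b": -0.5}))
```

At inference time only the scalar score matters: candidate replies to the same conversation can be ranked by it, as `preferred_response` does here.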