Jellywibble committed
Commit 4617606
1 Parent(s): 71e3645

Update README.md

Files changed (1)
  1. README.md +22 -1
README.md CHANGED
@@ -25,7 +25,7 @@ Its training dataset consists of purely user-generated content [retry_and_contin
### Intended use
This reward model was developed primarily for commercial purposes. It learns an inner representation of response quality as rated by humans, which can be used to conduct best-of-N sampling and Reinforcement Learning with the PPO framework.

- In addition to scientific uses, you may also further fine-tune and adapt this reward model for deployment, as long as your use is in accordance with the cc-by-nc-4.0 license, i.e. non-commercial use. This model works with the Transformers library. If you decide to use this pre-trained reward model as a basis for your fine-tuned model, please note that you need to conduct your own risk and bias assessment.
+ In addition to scientific uses, you may also further fine-tune and adapt this reward model for deployment, as long as your use is in accordance with the Creative Commons Attribution Non Commercial 4.0 (cc-by-nc-4.0) license, i.e. non-commercial use. This model works with the Transformers library. If you decide to use this pre-trained reward model as a basis for your fine-tuned model, please note that you need to conduct your own risk and bias assessment.

### Out-of-scope use

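For reference, the best-of-N use described above can be sketched roughly as follows. This is a minimal, illustrative example: the hub name `ChaiML/gpt2_base_retry_and_continue_5m_reward_model`, the `gpt2` stand-in generator, the prompt format and the value of N are assumptions, not details taken from this README.

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer

# Assumed hub name for this reward model; the generator below is only a stand-in.
reward_name = "ChaiML/gpt2_base_retry_and_continue_5m_reward_model"
generator_name = "gpt2"

gen_tokenizer = AutoTokenizer.from_pretrained(generator_name)
generator = AutoModelForCausalLM.from_pretrained(generator_name)

rm_tokenizer = AutoTokenizer.from_pretrained(reward_name)
rm_tokenizer.pad_token = rm_tokenizer.eos_token
rm_tokenizer.padding_side = "right"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
reward_model.config.pad_token_id = rm_tokenizer.pad_token_id

prompt = "User: Hi, how was your day?\nBot:"  # placeholder conversation format
inputs = gen_tokenizer(prompt, return_tensors="pt")

# Draw N candidate completions from the generator.
outputs = generator.generate(
    **inputs,
    do_sample=True,
    num_return_sequences=8,
    max_new_tokens=64,
    pad_token_id=gen_tokenizer.eos_token_id,
)
candidates = gen_tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Score every candidate with the reward model and keep the highest-scoring one.
tokens = rm_tokenizer(candidates, return_tensors="pt", padding="longest", truncation=True, max_length=256)
with torch.no_grad():
    rewards = reward_model(**tokens).logits[:, 0]
print(candidates[rewards.argmax().item()])
```
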
@@ -47,3 +47,24 @@ tokenizer.padding_side = 'right'
tokens = tokenizer(candidates, return_tensors='pt', return_attention_mask=True, padding='longest', truncation=True, max_length=256)
reward = model(**tokens).logits
```
+
+ ## Model training
+ ### Training dataset
+ This model was trained by randomly sampling 5 million rows out of the [ChaiML/retry_and_continue_50m_reward_model](https://huggingface.co/datasets/ChaiML/retry_and_continue_50m_reward_model) dataset.
+ The original dataset contains over 50 million rows of completions (chatbot responses), along with the number of remaining user messages in the corresponding conversation and whether the user pressed the "retry" button (in which case the completion is rejected and resampled).
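The dataset itself can be pulled from the Hugging Face Hub with the `datasets` library. A minimal loading sketch (streaming, so the full 50 million rows are not downloaded; the `train` split name is an assumption):

```python
from datasets import load_dataset

# Stream the dataset and inspect one row's schema without downloading everything.
# The split name "train" is assumed here.
dataset = load_dataset("ChaiML/retry_and_continue_50m_reward_model", split="train", streaming=True)
print(next(iter(dataset)))
```
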
+
+ The completions were generated by an in-house variant of [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B), with the following sampling parameters:
+
+ <figure style="width:30em">
+
+ | Parameter          | Value   |
+ | ------------------ | ------- |
+ | temperature        | 0.72    |
+ | repetition_penalty | 1.13125 |
+ | max_new_tokens     | 64      |
+ | top_p              | 0.725   |
+ | top_k              | 0       |
+ | eos_token_id       | 198     |
+ | do_sample          | True    |
+
+ </figure>
+
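These values map directly onto `generate()` arguments in Transformers. A minimal sketch of sampling one completion with them, using the public GPT-J checkpoint as a stand-in for the in-house variant and a placeholder prompt:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The in-house model is not public; the base GPT-J checkpoint stands in here.
model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "User: Hi, how was your day?\nBot:"  # placeholder conversation format
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling parameters from the table above; token id 198 is the newline token
# in this vocabulary, so generation stops at the end of the bot's line.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.72,
    repetition_penalty=1.13125,
    max_new_tokens=64,
    top_p=0.725,
    top_k=0,
    eos_token_id=198,
    pad_token_id=198,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
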
+ ### Training procedure
+ The `gpt2_base_retry_and_continue_5m_reward_model` was trained from a [gpt2](https://huggingface.co/gpt2) base model with a single-output classification head, using a binary cross-entropy loss. Training ran on 4xA40 GPUs with a per-device batch size of 16 and gradient accumulation of 1 (an effective batch size of 64), at a learning rate of 1e-5 for 2 epochs, for a total of 156,240 steps. Tensor parallelism and pipeline parallelism were used to distribute the model across GPUs.
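
A rough sketch of this setup is shown below. It is not the original training script: the dataset column names (`text`, `label`) and the `train` split are placeholders, and the multi-GPU tensor/pipeline parallelism is omitted; only the base model, the single-output head, the binary cross-entropy loss, the per-device batch size, the learning rate and the epoch count follow the description above.

```python
import torch
from datasets import load_dataset
from transformers import GPT2ForSequenceClassification, GPT2Tokenizer

# gpt2 base model with a single-output classification head.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Random 5M-row sample of the 50M-row source dataset.
# NOTE: the split name and column names used below are placeholders.
dataset = load_dataset("ChaiML/retry_and_continue_50m_reward_model", split="train")
dataset = dataset.shuffle(seed=0).select(range(5_000_000))

def collate(rows):
    tokens = tokenizer([r["text"] for r in rows], return_tensors="pt",
                       padding="longest", truncation=True, max_length=256)
    labels = torch.tensor([float(r["label"]) for r in rows])
    return tokens, labels

loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()

model.train()
for epoch in range(2):
    for tokens, labels in loader:
        logits = model(**tokens).logits.squeeze(-1)  # single output per sequence
        loss = loss_fn(logits, labels)               # binary cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```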