Reward model on deberta-v2-xxlarge (1.5B)

A reward model for use in RLHF, trained on the webgpt and summarize-from-human-feedback datasets and the Open Assistant user-ranked dataset.

Model Details

Model Description

  • Repository: Open Assistant
  • Paper: InstructGPT. We try to replicate it as closely as we can with our hardware and existing datasets.

This model was trained on human feedback comparison examples, so bad or rude responses receive lower scores.

Direct Use

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'theblackcat102/deberta-v2-xxlarge-rm'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "I just got out of prison, any suggestion?"
good_helpful = "I am sorry to hear about it, it must be a hard time inside"
bad_text = "Stay away from me, you scumbag convict"
# Encode each (prompt, reply) pair as a single two-segment input
pos = tokenizer(prompt, good_helpful, return_tensors='pt')
neg = tokenizer(prompt, bad_text, return_tensors='pt')
# The single output logit is the reward; a higher value means a better reply
pos_score = model(**pos).logits[0]
neg_score = model(**neg).logits[0]
print(pos_score, neg_score)
>> tensor([-1.3449], grad_fn=<SelectBackward0>) tensor([-2.0942], grad_fn=<SelectBackward0>)
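
Because the model is trained on comparisons, the gap between two scores carries more meaning than either raw value. The following is a minimal sketch, assuming the pairwise sigmoid objective described in the InstructGPT paper, that turns the two scores printed above into the probability that the first reply is preferred. The function name preference_probability is hypothetical and not part of this model's API.

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that reply A is preferred over reply B: sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Scores copied from the example output above
p = preference_probability(-1.3449, -2.0942)
print(f"{p:.3f}")  # prints 0.679
```

Under this assumption, the helpful reply would be preferred roughly two times out of three.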

How to use it as a rank function

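import torch

# rank_model and rank_tokenizer are the model and tokenizer loaded as in Direct Use;
# rank_model is expected to already be on the GPU

def divide_chunks(l, n):
    # yield successive chunks of size n from l
    for i in range(0, len(l), n):
        yield l[i:i + n]

def rank_model_fn(samples, **kwargs):
    # each sample is a single string of the form "<prompt>[SEP]<reply>"
    output_scores = []
    for chunk_samples in divide_chunks(samples, 16):
        is_empty = []
        prefixes, postfixes = [], []
        for sample in chunk_samples:
            prefix, postfix = sample.split('[SEP]')
            postfix = postfix.strip()
            # flag empty or degenerate replies (3 or fewer distinct characters)
            is_empty.append(len(postfix) == 0 or len(set(postfix)) <= 3)
            prefixes.append(prefix)
            postfixes.append(postfix)
        is_empty = torch.tensor(is_empty, dtype=torch.bool)
        inputs = rank_tokenizer(prefixes, postfixes, return_tensors="pt", padding=True)
        inputs.pop("token_type_ids", None)
        inputs = {key: tensor.cuda() for key, tensor in inputs.items()}
        scores = rank_model(**inputs).logits[:, 0].detach().cpu()
        scores[is_empty] = -4  # fixed low reward for flagged replies
        output_scores += scores.tolist()
    return torch.tensor(output_scores)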
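
The rank function expects each sample as one string, with the prompt and the reply separated by '[SEP]'. Below is a minimal sketch of building samples in that format and picking the highest-scoring reply. The helper names build_samples and best_reply are hypothetical, and the scores are stand-in values for illustration; in practice they would come from rank_model_fn.

```python
from typing import List, Sequence

def build_samples(prompt: str, replies: Sequence[str]) -> List[str]:
    """Join a prompt with each candidate reply in the 'prefix[SEP]postfix' format."""
    return [f"{prompt}[SEP]{reply}" for reply in replies]

def best_reply(replies: Sequence[str], scores: Sequence[float]) -> str:
    """Return the reply with the highest reward score."""
    best_index = max(range(len(replies)), key=lambda i: scores[i])
    return replies[best_index]

prompt = "I just got out of prison, any suggestion?"
replies = [
    "I am sorry to hear about it, it must be a hard time inside",
    "Stay away from me, you scumbag convict",
]
samples = build_samples(prompt, replies)
# Stand-in scores for illustration only; replace with rank_model_fn(samples).tolist()
scores = [-1.3449, -2.0942]
print(best_reply(replies, scores))
```

Note that a prompt or reply that itself contains '[SEP]' would break the two-way split inside rank_model_fn, so such text would need to be filtered or escaped first.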

Training Details

Training Procedure

Check out our training repo here.

Training Hyperparameters

model_name: microsoft/deberta-v2-xxlarge
learning_rate: 2e-6
scheduler: cosine
gradient_checkpointing: false
gradient_accumulation_steps: 12
per_device_train_batch_size: 1
per_device_eval_batch_size: 4
warmup_steps: 600
eval_steps: 1000000
save_steps: 1000
max_length: 512
num_train_epochs: 2
datasets:
  - webgpt
  - hfsummary
  - anthropic_rlhf
  - oa_private

Trained on 8 A100 80GB GPUs. Because we use the same batch strategy as InstructGPT, a batch_size of 1 actually amounts to a batch of (N-1), where N refers to the number of negative examples. This is why I recommend training this model on the GPU with the most VRAM you can find.
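
To illustrate why a single prompt expands into several comparisons, here is a plain-Python sketch of a pairwise ranking loss in the style of InstructGPT, where ranked replies to the same prompt are compared against each other. This is an assumption about the general shape of the objective, not the training repo's actual code, and whether that code compares all pairs or only some of them is not stated here. The function name and the example scores are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence

def pairwise_ranking_loss(scores_best_to_worst: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_better - r_worse)) over all pairs of ranked replies."""
    # combinations() keeps input order, so the first item of each pair is ranked higher
    pairs = list(combinations(scores_best_to_worst, 2))
    # -log(sigmoid(x)) is the same as log(1 + exp(-x))
    losses = [math.log1p(math.exp(-(better - worse))) for better, worse in pairs]
    return sum(losses) / len(pairs)

# Hypothetical reward scores for 4 replies to one prompt, ranked best to worst
scores = [1.2, 0.3, -0.5, -2.0]
print(len(list(combinations(scores, 2))))  # 6 comparisons from a single prompt
print(pairwise_ranking_loss(scores))
```

Every reply in the group must pass through the model in the same step, which is where the memory pressure comes from even at a nominal batch size of 1.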

