
The v1 reward model doesn't distinguish between good and harmful responses: in the example below it actually assigns the rude reply a higher score than the helpful one.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'OpenAssistant/reward-model-deberta-v3-base'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "I just got out of prison, any suggestion?"
good_helpful = "I am sorry to hear about it, it must be a hard time inside"
bad_text = "Stay away from me, you scumbag convict"

# Score each (prompt, response) pair; a higher logit means a more preferred response
pos = tokenizer(prompt, good_helpful, return_tensors='pt')
neg = tokenizer(prompt, bad_text, return_tensors='pt')
pos_score = model(**pos).logits[0]
neg_score = model(**neg).logits[0]
print(pos_score, neg_score)
# >> tensor([-4.1652], grad_fn=<SelectBackward0>) tensor([-1.5923], grad_fn=<SelectBackward0>)
```
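
The logits are unnormalized scalar rewards, so they are only meaningful relative to each other. A minimal sketch of one common way to read a score pair, assuming the model was trained with the usual pairwise (Bradley-Terry style) ranking loss, is to take the sigmoid of the score difference as the probability that the first response is preferred; the helper below is hypothetical, not part of the model card:

```python
import torch

# Hypothetical helper: probability that response A is preferred over response B,
# assuming a pairwise (Bradley-Terry style) training objective for the reward model.
def preference_probability(score_a: torch.Tensor, score_b: torch.Tensor) -> float:
    return torch.sigmoid(score_a - score_b).item()

# With the v1 scores above: sigmoid(-4.1652 - (-1.5923)) ≈ 0.07,
# i.e. the v1 model prefers the rude reply in roughly 93% of comparisons.
print(preference_probability(pos_score, neg_score))
```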

This new version adds the Anthropic/hh-rlhf dataset to the training mix, which allows the resulting model to rank the rude response lower than the helpful one:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'theblackcat102/reward-model-deberta-v3-base-v2'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "I just got out of prison, any suggestion?"
good_helpful = "I am sorry to hear about it, it must be a hard time inside"
bad_text = "Stay away from me, you scumbag convict"

# With the v2 model the helpful response now receives the higher score
pos = tokenizer(prompt, good_helpful, return_tensors='pt')
neg = tokenizer(prompt, bad_text, return_tensors='pt')
pos_score = model(**pos).logits[0]
neg_score = model(**neg).logits[0]
print(pos_score, neg_score)
# >> tensor([-1.3449], grad_fn=<SelectBackward0>) tensor([-2.0942], grad_fn=<SelectBackward0>)
```
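
Beyond a single pairwise comparison, the same model can rank an arbitrary list of candidate replies for one prompt. This is a small sketch, assuming `model`, `tokenizer`, and `prompt` from the snippet above are still in scope; the candidate strings are made up purely for illustration:

```python
import torch

candidates = [
    "I am sorry to hear about it, it must be a hard time inside",
    "Stay away from me, you scumbag convict",
    "Congratulations on your release! A good first step is contacting a local re-entry program.",
]

# Score every (prompt, candidate) pair in one batch and sort by reward, highest first
with torch.no_grad():
    inputs = tokenizer([prompt] * len(candidates), candidates,
                       return_tensors='pt', padding=True, truncation=True)
    scores = model(**inputs).logits.squeeze(-1)

for score, reply in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{score:+.4f}  {reply}")
```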