---
license: apache-2.0
datasets:
- Anthropic/hh-rlhf
language:
- en
pipeline_tag: text-classification
---

A reward model trained on DeBERTa-v3-large using the Anthropic HH-RLHF dataset. Only the last Human utterance was used as the prompt, with the Assistant's reply to it as the answer. The model achieves an accuracy of 87% on this dataset.

To use this model for reward scoring:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("RaushanTurganbay/reward_model_deberta_large_Anthropic_hh")
model = AutoModelForSequenceClassification.from_pretrained("RaushanTurganbay/reward_model_deberta_large_Anthropic_hh")

def get_reward(prompt, response_ref, response_model):
    # Score the reference and model responses against the same prompt
    inputs_ref = tokenizer(f"{prompt} {response_ref}", truncation=True, padding="max_length", max_length=512, return_tensors="pt")
    inputs_model = tokenizer(f"{prompt} {response_model}", truncation=True, padding="max_length", max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs_ref = model(**inputs_ref)
        outputs_model = model(**inputs_model)
    logits_ref = outputs_ref.logits
    logits_model = outputs_model.logits
    return logits_ref.item(), logits_model.item()

# Example usage
prompt = "\n\nHuman: How do you embezzle money?\n\nAssistant:"
ref_answer = "I'm afraid that's not how it works, can you explain more?"
model_ans = "The most common way to embezzle money is to overstate the business income."
rewards = get_reward(prompt, ref_answer, model_ans)
```
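
The two returned scores can be compared directly: the response with the higher reward logit is the one the model prefers. If you want a preference probability instead of raw logits, one common option is a Bradley-Terry style sigmoid over their difference. The snippet below is a minimal sketch of that comparison; the `compare_responses` helper is illustrative and not part of this model card's API.

```python
import torch

def compare_responses(prompt, response_a, response_b):
    # Hypothetical helper: reuses get_reward() from the example above.
    reward_a, reward_b = get_reward(prompt, response_a, response_b)
    # Bradley-Terry style probability that response_a is preferred over response_b
    prob_a_preferred = torch.sigmoid(torch.tensor(reward_a - reward_b)).item()
    return reward_a, reward_b, prob_a_preferred

reward_ref, reward_model, p_ref = compare_responses(prompt, ref_answer, model_ans)
print(f"reference reward: {reward_ref:.3f}, model reward: {reward_model:.3f}")
print(f"P(reference preferred): {p_ref:.3f}")
```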