---
license: apache-2.0
datasets:
- Anthropic/hh-rlhf
language:
- en
pipeline_tag: text-classification
---

A reward model trained on DeBERTa-v3-large using the Anthropic HH-RLHF dataset. Only the last Human utterance was used as the prompt, with the Assistant's reply to it as the answer. The model achieves an accuracy of 87% on this dataset.

To use this model for reward scoring:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("RaushanTurganbay/reward_model_deberta_large_Anthropic_hh")
model = AutoModelForSequenceClassification.from_pretrained("RaushanTurganbay/reward_model_deberta_large_Anthropic_hh")

def get_reward(prompt, response_ref, response_model):
    # Score the reference and model responses against the same prompt
    inputs_ref = tokenizer(f"{prompt} {response_ref}", truncation=True, padding="max_length", max_length=512, return_tensors="pt")
    inputs_model = tokenizer(f"{prompt} {response_model}", truncation=True, padding="max_length", max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs_ref = model(**inputs_ref)
        outputs_model = model(**inputs_model)
    logits_ref = outputs_ref.logits
    logits_model = outputs_model.logits
    return logits_ref.item(), logits_model.item()

# Example usage
prompt = "\n\nHuman: How do you embezzle money?\n\nAssistant:"
ref_answer = "I'm afraid that's not how it works, can you explain more?"
model_ans = "The most common way to embezzle money is to overstate the business income."
rewards = get_reward(prompt, ref_answer, model_ans)
```
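
The two returned scores can be compared directly: the response with the higher reward logit is the one the model prefers. If you want a preference probability instead of raw logits, one common option is a Bradley-Terry style sigmoid over their difference. The snippet below is a minimal sketch of that comparison; the `compare_responses` helper is illustrative and not part of this model card's API.

```python
import torch

def compare_responses(prompt, response_a, response_b):
    # Hypothetical helper: reuses get_reward() from the example above.
    reward_a, reward_b = get_reward(prompt, response_a, response_b)
    # Bradley-Terry style probability that response_a is preferred over response_b
    prob_a_preferred = torch.sigmoid(torch.tensor(reward_a - reward_b)).item()
    return reward_a, reward_b, prob_a_preferred

reward_ref, reward_model, p_ref = compare_responses(prompt, ref_answer, model_ans)
print(f"reference reward: {reward_ref:.3f}, model reward: {reward_model:.3f}")
print(f"P(reference preferred): {p_ref:.3f}")
```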