
metadata

license: mit
datasets:
  - Anthropic/hh-rlhf
metrics:
  - accuracy

A GPT-2 large reward model trained on the harmless subset of the Anthropic/hh-rlhf dataset. It is intended for harmful-response detection and for use as a reward model in RLHF. It achieves an accuracy of 0.73698 on the test set, which nearly matches that of larger models.
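
As a rough illustration of what an accuracy figure like this usually means for a reward model, the sketch below assumes the standard pairwise-preference definition: a test pair counts as correct when the chosen response receives a higher reward than the rejected one. The model card does not spell out the metric, so this definition and the helper name `pairwise_accuracy` are assumptions, not the author's evaluation code.

```python
from typing import Sequence


def pairwise_accuracy(chosen_rewards: Sequence[float],
                      rejected_rewards: Sequence[float]) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one.

    Assumption: accuracy is the usual pairwise-preference accuracy;
    the model card does not define the metric explicitly.
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("reward lists must have the same length")
    if len(chosen_rewards) == 0:
        raise ValueError("at least one pair is required")
    # A pair is correct when reward(chosen) > reward(rejected).
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)
```

The rewards passed in would come from scoring each (prompt, chosen) and (prompt, rejected) pair with the reward model.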

Note: 1. Remember to format inputs the same way as the Anthropic/hh-rlhf dataset (`\n\nHuman: ... \n\nAssistant: ...`) at inference time. 2. This reward model differs from other open-source reward models, which are trained on the full Anthropic/hh-rlhf dataset.
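
A minimal sketch of building a prompt in that format, including earlier dialogue turns. The helper name `format_hh_prompt` and its arguments are hypothetical; the exact whitespace around the markers is an assumption based on the `\n\nHuman:` / `\n\nAssistant:` convention used in the dataset.

```python
from typing import List, Tuple


def format_hh_prompt(history: List[Tuple[str, str]], final_human: str) -> str:
    """Build an hh-rlhf style prompt ending with an open Assistant turn.

    history: earlier (human, assistant) turns, oldest first (hypothetical argument).
    final_human: the latest human message, which the scored response answers.
    """
    parts = []
    for human, assistant in history:
        # Each completed turn uses the "\n\nHuman: ...\n\nAssistant: ..." layout.
        parts.append(f"\n\nHuman: {human}\n\nAssistant: {assistant}")
    # Leave the last Assistant turn open; the response is passed separately.
    parts.append(f"\n\nHuman: {final_human}\n\nAssistant:")
    return "".join(parts)
```

The returned string plays the role of the prompt argument in the usage example, with the candidate response passed as the second argument to the tokenizer.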

Usage:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the reward model (a sequence classifier with a single logit).
rm_tokenizer = AutoTokenizer.from_pretrained('Ray2333/gpt2-large-harmless-reward_model')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/gpt2-large-harmless-reward_model',
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map=0,  # place the model on GPU 0
)

# Prompt in the Anthropic/hh-rlhf format, followed by the response to score.
q, a = "\n\nHuman: I just came out of from jail, any suggestion of my future? \n\nAssistant:", "Go back to jail you scum"
inputs = rm_tokenizer(q, a, return_tensors='pt', truncation=True)
with torch.no_grad():
    # The single logit is the scalar reward for this (prompt, response) pair.
    reward = reward_model(**(inputs.to(0))).logits[0].item()