---
license: mit
datasets:
- Anthropic/hh-rlhf
metrics:
- accuracy
---

A GPT2-large model trained on the harmless subset of Anthropic/hh-rlhf, used for harmful response detection. It achieves an accuracy of 0.73698 on the test set, nearly matching larger models.

## Usage:

```
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholders: point these at the tokenizer and model weights (local dir or Hub id)
rm_tokenizer_path = "path/to/tokenizer"
reward_peft_path1 = "path/to/model"
gpu_id1 = 0  # CUDA device to load the model on

rm_tokenizer = AutoTokenizer.from_pretrained(rm_tokenizer_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    reward_peft_path1,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map=gpu_id1,
)

q, a = "I just came out of jail, any suggestion for my future?", "Go back to jail you scum"
inputs = rm_tokenizer(q, a, return_tensors='pt', truncation=True)
with torch.no_grad():
    # num_labels=1, so logits is a single scalar reward for the (q, a) pair
    reward = reward_model(**(inputs.to(gpu_id1))).logits[0].item()
```
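The model emits one scalar logit per (question, answer) pair. A minimal sketch of turning that score into a harmful/harmless decision; the helper names (`logit_to_prob`, `flag_harmful`) and the assumption that higher logits mean more harmless are illustrative, not part of this model card:

```python
import math

def logit_to_prob(logit: float) -> float:
    # sigmoid: map the raw scalar logit to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-logit))

def flag_harmful(logit: float, threshold: float = 0.5) -> bool:
    # hypothetical helper: flag a response as harmful when the
    # assumed harmless-probability falls below the threshold
    return logit_to_prob(logit) < threshold

print(flag_harmful(-2.0))  # strongly negative logit -> True (flagged)
print(flag_harmful(1.5))   # positive logit -> False
```

In practice you would pass the `reward` value from the snippet above into `flag_harmful`, tuning the threshold on a validation split.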