Text Generation
Transformers
PyTorch
llama
Eval Results
text-generation-inference
Inference Endpoints
Edit model card

Paper | Github | Dataset| Model

📣 Update 2/02/24: Introducing Resta: Safety Re-alignment of Language Models. Paper Github Dataset

As a part of our research efforts to make LLMs safer, we created Starling. It is obtained by fine-tuning Vicuna-7B on HarmfulQA, a ChatGPT-distilled dataset that we collected using the Chain of Utterances (CoU) prompt. More details are in our paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

Image

Experimental results on several safety benchmark datasets indicate that Starling is a safer model compared to the baseline model, Vicuna.

Image

Experimental Results

Compared to Vicuna, Avg. 5.2% reduction in Attack Success Rate (ASR) on DangerousQA and HarmfulQA using three different prompts.**

Compared to Vicuna, Avg. 3-7% improvement in HHH score measured on BBH-HHH benchmark.**

Image

TruthfulQA (MC2): 48.90 vs Vicuna's 47.00

MMLU (5-shot): 46.69 vs Vicuna's 47.18

BBH (3-shot): 33.47 vs Vicuna's 33.05

Jailbreak Prompt for harmfulness eval using Red Eval as reported in the paper

This jailbreak prompt (termed as Chain of Utterances (CoU) prompt in the paper) shows a 65% Attack Success Rate (ASR) on GPT-4 and 72% on ChatGPT.

Image

HarmfulQA Data Collection

We also release our HarmfulQA dataset with 1,960 harmful questions (converting 10 topics-10 subtopics) for red-teaming as well as conversations based on them used in model safety alignment, more details here. The following figure describes the data collection process.

Image

Note: This model is referred to as Starling (Blue) in the paper. We shall soon release Starling (Blue-Red) which was trained on harmful data using an objective function that helps the model learn from the red (harmful) response data.

Citation

@misc{bhardwaj2023redteaming,
      title={Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment}, 
      author={Rishabh Bhardwaj and Soujanya Poria},
      year={2023},
      eprint={2308.09662},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 50.73
AI2 Reasoning Challenge (25-Shot) 51.02
HellaSwag (10-Shot) 76.77
MMLU (5-Shot) 47.75
TruthfulQA (0-shot) 48.18
Winogrande (5-shot) 70.56
GSM8k (5-shot) 10.08
Downloads last month
394
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Datasets used to train declare-lab/starling-7B

Space using declare-lab/starling-7B 1

Evaluation results