File size: 4,029 Bytes
e585ddb 9742e83 612dc8f 340e8d2 9742e83 f5e02b8 7dbbb52 aa409da eb0adc3 9742e83 3053553 9742e83 612dc8f 9742e83 3053553 9742e83 04beed3 3053553 9742e83 2e78e24 9742e83 3053553 9742e83 3053553 9742e83 3053553 9742e83 87cf99e e585ddb |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 |
---
license: cc-by-sa-4.0
---
This model is the RLHF version of `HuggingFaceH4/mistral-7b-sft-beta` without any external responses.
We perform GSHF algorithm on SFT baseline. The external signals include (1) Reward model; (2) AI-generated Prompts.
**We obtain 35.95% win-rate (34.79% LC win-rate) on Alpaca Eval v2.** The win-rate of the base model is only 4.63%.
For MT-bench, it obtained about 7.5, where the base model is only 5.3.
We have demonstrated the significant potential of the iterative RLHF algorithm for LLMs to deliver appropriate and well-structured responses,
even without any external responses.
## Model Details
We perform 3 iterations of GSHF algorithm on `HuggingFaceH4/mistral-7b-sft-beta` labeled by reward model, where prompts are generated by ChatGPT with self-instruct type prompt augmentation.
We use AI-generated 60K prompts in the training process.
Examples are as below,
```json
{"prompt": "Why is gold considered a good reserve asset for central banks?"}
{"prompt": "What are the top 5 yoga poses for stress relief?"}
{"prompt": "Craft a blog title about the health implications of eating avocados daily based on their caloric value."}
{"prompt": "Design a simple HTML chat interface that simulates a conversation between a user and a bot, displaying two messages from each."}
{"prompt": "List 10 names from different cultures that embody the meanings of peace, harmony, or compassion."}
```
## Uses
The usage and chat template format follow the SFT model `HuggingFaceH4/mistral-7b-sft-beta`.
```python
# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate
import torch
from transformers import pipeline
pipe = pipeline("text-generation", model="sfairXC/FsfairX-Zephyr-Chat-v0.1", torch_dtype=torch.bfloat16, device_map="auto")
# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
{"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
{"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# <|system|>
# You are a friendly chatbot who always responds in the style of a pirate.</s>
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# Ah, me hearty matey! But yer question be a puzzler! A human cannot eat a helicopter in one sitting, as helicopters are not edible. They be made of metal, plastic, and other materials, not food!
```
## Evaluation
The evaluation on Alpaca Eval v2 are provided as below,
| Model | Win Rate | LC Win Rate | Avg Length |
|-------------|----------|-------------|------------|
| Base | 4.63 | 8.01 | 916 |
| Iteration 1 | 13.26 | 20.81 | 1205 |
| Iteration 2 | 23.57 | 27.63 | 1623 |
| Iteration 3 | 35.95 | 34.79 | 2275 |
## Citation
If you found this helpful, please cite the following papers.
```bibtex
@article{dong2023raft,
title={Raft: Reward ranked finetuning for generative foundation model alignment},
author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
journal={arXiv preprint arXiv:2304.06767},
year={2023}
}
@misc{xiong2024iterative,
title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint},
author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},
year={2024},
eprint={2312.11456},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
``` |