---
license: bigscience-bloom-rail-1.0
datasets:
- OpenAssistant/oasst1
- RyokoAI/ShareGPT52K
- Dahoas/full-hh-rlhf
- liswei/rm-static-m2m100-zh
- fnlp/moss-002-sft-data
language:
- zh
- en
---

This is an attempt to replicate the RLHF pipeline: supervised fine-tuning, reward modeling, and reinforcement learning.

### Base Model

We used [bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) because of its less restrictive license and its multilingual ability.

### Supervised Fine-tuning

For SFT we used a combination of multiple datasets, including:
- [RyokoAI/ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)
- [GPTeacher](https://github.com/teknium1/GPTeacher)
- [Alpaca-GPT4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) (English and Chinese)
- A filtered subset of the ShareGPT dataset machine-translated into Chinese

### Reward Model

For the reward model we used the code of the [reward-modeling](https://github.com/Dahoas/reward-modeling) repo (a rough sketch of the pairwise ranking objective commonly used for this step is given at the end of this card) and the following datasets:
- [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
- [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf)
- [liswei/rm-static-m2m100-zh](https://huggingface.co/datasets/liswei/rm-static-m2m100-zh)

### Reinforcement Learning

For RL we used the code of [trlx](https://github.com/CarperAI/trlx) with a slight modification: instead of building the value network on top of the policy network with a single linear layer, we add another hydra head on top of the reference network's frozen bottom layers as the value network (see the sketch at the end of this card).

### Example

We used the Vicuna v1.1 template for model training:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "keyfan/bloomz-rlhf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).cuda()

# Vicuna v1.1 prompt template used during training
template = ("A chat between a curious human and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the human's questions. "
            "USER: {}\nASSISTANT:")

question = template.format("Who was the president of the United States in 1955?")
inputs = tokenizer.encode(question, return_tensors="pt").cuda()
outputs = model.generate(inputs, do_sample=True, top_p=0.8, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```

### Evaluations

Results on the Chinese [BELLE eval set](https://github.com/LianjiaTech/BELLE/tree/main/eval):

| others | rewrite | classification | generation | summarization | extract | open qa | brainstorming | closed qa | macro avg | macro avg w/o others |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0.619 | 0.873 | 0.706 | 0.934 | 0.755 | 0.619 | 0.527 | 0.908 | 0.615 | 0.728 | 0.742 |

* We found that in GPT-4 evaluation, the order in which the responses are presented has a non-negligible effect on the final score, even with the carefully designed Vicuna prompt. We therefore removed the score on the Vicuna eval set.
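For reference, below is a minimal sketch of the pairwise ranking objective commonly used for reward modeling, as mentioned in the Reward Model section above. This is an illustration only, not the actual training code from the reward-modeling repo; the base checkpoint, the example pair, and the single-sample "batch" are all assumptions.

```
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative only: the backbone and data below are placeholders, not the
# actual reward-model training setup.
checkpoint = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# A reward model is the backbone with a scalar (num_labels=1) head.
reward_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)

chosen = "USER: Hi\nASSISTANT: Hello! How can I help you today?"
rejected = "USER: Hi\nASSISTANT: Go away."

def score(text):
    # Map a full (prompt, response) string to a scalar reward.
    inputs = tokenizer(text, return_tensors="pt")
    return reward_model(**inputs).logits.squeeze(-1)

# Pairwise ranking loss: push the chosen response's score above the rejected one's.
loss = -F.logsigmoid(score(chosen) - score(rejected)).mean()
loss.backward()
```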
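The value-network modification described in the Reinforcement Learning section can be illustrated roughly as follows. This is a toy sketch with generic transformer layers, not the actual trlx hydra implementation; all class and parameter names (`HydraValueNetwork`, `num_head_layers`, ...) are made up for illustration.

```
import copy
import torch
import torch.nn as nn

class HydraValueNetwork(nn.Module):
    # Instead of a single linear layer on the policy network, the value branch
    # reuses the reference network's frozen bottom layers and adds its own
    # trainable copy of the top layers ("hydra head") plus a scalar projection.
    def __init__(self, reference_layers, num_head_layers=2, hidden_size=64):
        super().__init__()
        # Bottom layers are shared with the frozen reference network.
        self.frozen_bottom = reference_layers[:-num_head_layers]
        for p in self.frozen_bottom.parameters():
            p.requires_grad_(False)
        # The hydra head is a trainable copy of the reference network's top layers.
        self.head_layers = copy.deepcopy(reference_layers[-num_head_layers:])
        self.v_proj = nn.Linear(hidden_size, 1)  # per-token scalar value

    def forward(self, hidden_states):
        with torch.no_grad():
            for layer in self.frozen_bottom:
                hidden_states = layer(hidden_states)
        for layer in self.head_layers:
            hidden_states = layer(hidden_states)
        return self.v_proj(hidden_states).squeeze(-1)

# Toy usage with a 6-layer "reference network" of generic encoder layers.
hidden_size = 64
reference_layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True)
    for _ in range(6)
)
value_net = HydraValueNetwork(reference_layers, num_head_layers=2, hidden_size=hidden_size)
values = value_net(torch.randn(1, 10, hidden_size))
print(values.shape)  # torch.Size([1, 10]) -- one value estimate per token
```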
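Continuing the example above, multi-turn usage under the Vicuna v1.1 format would look roughly like the sketch below; it reuses `tokenizer` and `model` from the previous snippet. Note that terminating each completed assistant turn with `</s>` follows the general Vicuna v1.1 convention and is an assumption here, not something specified on this card.

```
# Hypothetical multi-turn prompt builder for the template above.
system = ("A chat between a curious human and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the human's questions. ")

def build_prompt(turns):
    # turns: list of (user_message, assistant_reply or None) pairs;
    # None marks the turn the model should generate next.
    prompt = system
    for user_msg, assistant_msg in turns:
        prompt += "USER: " + user_msg + "\nASSISTANT:"
        if assistant_msg is not None:
            prompt += " " + assistant_msg + "</s>"
    return prompt

prompt = build_prompt([
    ("Who was the president of the United States in 1955?", "Dwight D. Eisenhower."),
    ("In which state was he born?", None),
])
inputs = tokenizer.encode(prompt, return_tensors="pt").cuda()
outputs = model.generate(inputs, do_sample=True, top_p=0.8, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```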