Text2Text Generation
Chinese

Introduction:

We trained two Chinese reward models on top of the base model Baichuan-13B-Base, following the Llama 2 reward-model training recipe. rw_helpful_13b_wpack scores responses by how helpful they are to the prompt; rw_safe_13b_wpack scores responses by how safe they are.
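The training code itself is not included in this repository. For context, a Llama 2 style reward model is typically optimized with a pairwise ranking loss that pushes the score of the preferred response above that of the rejected one. The sketch below only illustrates that objective; the function and variable names are illustrative, not taken from this repo:

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_scores: torch.Tensor,
                          rejected_scores: torch.Tensor) -> torch.Tensor:
    # Llama-2-style reward objective: -log(sigmoid(r_chosen - r_rejected)),
    # averaged over a batch of preference pairs.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Dummy scalar scores for three preference pairs, purely for illustration
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(pairwise_ranking_loss(chosen, rejected))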

Loading and usage:

Under a GPU runtime, run scoring.py in the src folder. The prompt_list and good_ans_list variables in the script can be modified, or replaced with imported data, to fit your use case; each answer in good_ans_list is scored against its corresponding prompt for helpfulness or safety. A rough sketch of what the scoring might look like under the hood follows the hardware note below. An example invocation:

python scoring.py \
    --model_name_or_path PATH_TO_THE_REWARD_MODEL

where PATH_TO_THE_REWARD_MODEL is the path to either rw_helpful_13b_wpack_exported or rw_safe_13b_wpack_exported.

Multiple GPUs are expected; 8×A100, for example.
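The contents of scoring.py are not reproduced here; the sketch below only shows one plausible way to compute a scalar reward from a value-head checkpoint exported by LLaMA Efficient Tuning. The use of trl's AutoModelForCausalLMWithValueHead, the plain prompt + answer concatenation, and the last-token value readout are all assumptions, not the repository's actual code:

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

# Assumption: the exported checkpoint is a causal LM with a scalar value head,
# as produced by LLaMA Efficient Tuning's reward-model training stage.
path = "rw_safe_13b_wpack_exported"  # or "rw_helpful_13b_wpack_exported"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
model.eval()

prompt = "如何安全地给婴儿洗澡?"
answer = "先将水温调到37度左右,再……"

# Assumption: prompt and answer are simply concatenated; the real script may
# apply a chat template instead.
inputs = tokenizer(prompt + answer, return_tensors="pt").to(model.pretrained_model.device)
with torch.no_grad():
    _, _, values = model(**inputs)   # values has shape (batch, seq_len)
score = values[0, -1].item()         # reward read from the last token
print(score)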

Testing results

We evaluated the safety reward model and the helpfulness reward model on their respective held-out test sets.

# rw_safe_13b_wpack evaluation results
{
    "eval_accuracy": 0.8876339025592757,
    "eval_loss": 0.197021484375,
    "eval_runtime": 1131.4606,
    "eval_samples_per_second": 19.719,
    "eval_steps_per_second": 2.465
}

# rw_helpful_13b_wpack evaluation results
{
    "eval_accuracy": 0.6387571848594814,
    "eval_loss": 0.63916015625,
    "eval_runtime": 2188.8722,
    "eval_samples_per_second": 17.248,
    "eval_steps_per_second": 2.156
}
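eval_accuracy here is presumably the fraction of preference pairs in which the chosen response receives a higher reward than the rejected one; this is standard practice for reward-model evaluation but is an assumption, since the test script is not shown. A minimal sketch of that computation:

from typing import Sequence

def pairwise_accuracy(chosen_scores: Sequence[float],
                      rejected_scores: Sequence[float]) -> float:
    # Fraction of pairs where the preferred answer outscores the rejected one
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)

# Illustrative values only
print(pairwise_accuracy([1.8, 0.2, 1.1], [0.5, 0.7, 0.3]))  # 0.666...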

Citation

@Misc{Baichuan-13B-Base,
  title = {Baichuan-13B-Base},
  author = {Baichuan Intelligent Technology},
  howpublished = {\url{https://huggingface.co/baichuan-inc/}},
  year = {2023}
}
@Misc{llama-efficient-tuning,
  title = {LLaMA Efficient Tuning},
  author = {hiyouga},
  howpublished = {\url{https://github.com/hiyouga/LLaMA-Efficient-Tuning}},
  year = {2023}
}