
Description

Llama3-Instruct-8B model finetuned with hybrid WPO (GPT-4-turbo + on-policy sampling + Ultrafeedback). For details, see the paper "WPO: Enhancing RLHF with Weighted Preference Optimization".

Compared with the Llama3-Instruct-8B-WPO-HB model, this version uses a refined preference data construction method:

  1. The response with the minimum score is used as the rejected output.
  2. When multiple outputs share the highest score, the shortest one is selected as the chosen output.
  3. When multiple outputs share the minimum score, the one whose length is closest to that of the chosen output is selected as the rejected output.
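
The three rules above can be sketched as a small selection function. This is only an illustration, not the authors' code: the name `build_preference_pair` is hypothetical, length is assumed to be measured in characters (the card does not say whether characters or tokens are used), and skipping prompts where all scores are equal is also an assumption.

```python
from typing import List, Optional, Tuple


def build_preference_pair(
    responses: List[str], scores: List[float]
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) for one prompt, or None if no pair can be formed."""
    if not responses or len(responses) != len(scores):
        raise ValueError("responses and scores must be non-empty and the same length")

    best, worst = max(scores), min(scores)
    if best == worst:
        # Assumption: with no score gap there is no preference signal, so skip.
        return None

    # Rule 2: among the top-scored outputs, pick the shortest as chosen.
    top = [r for r, s in zip(responses, scores) if s == best]
    chosen = min(top, key=len)  # assumption: length in characters

    # Rules 1 and 3: among the lowest-scored outputs, pick the one whose
    # length is closest to the chosen output as rejected.
    bottom = [r for r, s in zip(responses, scores) if s == worst]
    rejected = min(bottom, key=lambda r: abs(len(r) - len(chosen)))

    return chosen, rejected


pair = build_preference_pair(
    ["a long good answer", "good answer", "bad", "a bad answer!"],
    [9, 9, 2, 2],
)
print(pair)
```

Here the two top-scored outputs tie, so the shorter one becomes the chosen output, and of the two lowest-scored outputs the one closer in length to it becomes the rejected output.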

License

This model is licensed under the Zoom software license and is permitted for use only for noncommercial, educational, or academic research purposes.

Model size: 8.03B params (Safetensors)
Tensor type: F32
