gx-ai-architect committed on
Commit
80a24c4
·
verified ·
1 Parent(s): f4ff397

Update README.md

  1. README.md +1 -1
README.md CHANGED
@@ -18,7 +18,7 @@ base_model: mistralai/Mistral-7B-v0.1
  # Model Card for Merlinite-7B-pt 🔥
 
  ### Overview
- We introduce **Merlinite-7B-pt**, a strong open-source chat model, aligned using AI feedback **without proprietary models or using any human annotation**.
+ We introduce **Merlinite-7B-pt**, a strong open-source chat model, preference aligned using AI feedback **without proprietary models or using any human annotation**.
  - **Merlinite-7B-pt** is first supervised-finetuned (SFT) via [LAB](https://arxiv.org/abs/2403.01081) using Mistral-7B-v0.1 as base model, and then preference-tuned via AI feedback.
  - Our preference tuning recipe uses the DPO reward from Mixtral-8x7B-Instruct-v0.1 as the proxy for human preferences, and applies iterative rejection sampling to finetune the SFT policy.
  - We show that DPO log-ratios can serve as a reliable reward signal, showing clear correlation between reward improvements and MT-Bench improvements.
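
The Overview bullets in this hunk mention using DPO log-ratios as a reward and applying rejection sampling. As a rough illustration only, the sketch below scores sampled responses with the standard implicit DPO reward, `beta * (log pi_tuned(y|x) - log pi_ref(y|x))`, and keeps the highest-scoring one. The function names, the `beta` default, and the toy log-probabilities are all hypothetical; the README does not state which reference model, scaling, or selection rule the authors actually use, so this is not their implementation.

```python
from typing import List, Sequence, Tuple


def dpo_reward(logp_tuned: float, logp_ref: float, beta: float = 1.0) -> float:
    """Implicit DPO reward: beta * (log pi_tuned(y|x) - log pi_ref(y|x)).

    logp_tuned and logp_ref are sequence log-probabilities of the same
    response under a DPO-tuned model and its reference model (assumed inputs).
    """
    return beta * (logp_tuned - logp_ref)


def select_best(
    candidates: Sequence[str],
    logps_tuned: Sequence[float],
    logps_ref: Sequence[float],
    beta: float = 1.0,
) -> Tuple[str, List[float]]:
    """Score every sampled response and return the highest-reward one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if not (len(candidates) == len(logps_tuned) == len(logps_ref)):
        raise ValueError("all inputs must have the same length")
    rewards = [dpo_reward(t, r, beta) for t, r in zip(logps_tuned, logps_ref)]
    best_index = max(range(len(rewards)), key=rewards.__getitem__)
    return candidates[best_index], rewards


# Toy example with made-up log-probabilities for three sampled responses.
best, rewards = select_best(
    ["response A", "response B", "response C"],
    logps_tuned=[-12.0, -9.5, -11.0],
    logps_ref=[-11.0, -10.5, -11.0],
)
print(best, rewards)
```

In an iterative setup, one would presumably sample several responses per prompt from the current policy, keep the selected ones as fine-tuning targets, and repeat; that loop is omitted here because the README does not describe its details.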