gx-ai-architect committed
Commit d8aed39 • 1 Parent(s): 3be16ca
Update README.md

README.md CHANGED
@@ -72,7 +72,7 @@ The prompts space for preference tuning were uniformly sampled by source from th
The preference-tuned version, Merlinite-7B-pt, shows overall performance enhancement across the board, with no alignment tax observed in our evaluation. Surprisingly, we find improvements in mathematical ability as measured by GSM8K and MT-Bench, which differs from studies observing decreased math/reasoning performance after RLHF alignment.

-We also observe a clear correlation between the Mixtral DPO reward scores and MT-Bench scores, as
+We also observe a clear correlation between the Mixtral DPO reward scores and MT-Bench scores, as shown in the chart above. The reward score of the Best-of-N sampled batch keeps improving until Rejection Sampling Round 2. The model saturates at Rejection Sampling Round 3, yielding no further improvement on either MT-Bench or Mixtral-DPO rewards.

The final Merlinite-7B-pt is the peak checkpoint as measured by both Batch-Reward and MT-Bench.
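For readers unfamiliar with the procedure the new paragraph describes, below is a minimal sketch of Best-of-N rejection sampling with batch-reward scoring and peak-checkpoint selection. This is an illustrative reconstruction only, not the actual Merlinite training code: `generate`, `reward_score`, and the checkpoint names are hypothetical placeholders.

```python
# Hypothetical sketch of Best-of-N rejection sampling and peak-checkpoint
# selection as described in the updated README paragraph. `generate` and
# `reward_score` are placeholder callables, not real Merlinite APIs.
from statistics import mean


def best_of_n(prompt, generate, reward_score, n=8):
    """Sample n candidate responses and keep the highest-reward one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: reward_score(prompt, resp))


def batch_reward(prompts, generate, reward_score, n=8):
    """Mean reward of the Best-of-N selections over a prompt batch."""
    return mean(
        reward_score(p, best_of_n(p, generate, reward_score, n))
        for p in prompts
    )


def pick_peak_checkpoint(checkpoints, prompts, reward_score, n=8):
    """Score each round's checkpoint on the same prompt batch and keep
    the one with the highest batch reward (e.g. the Round-2 checkpoint
    if Round 3 no longer improves the reward)."""
    # checkpoints: dict mapping a round name to that round's generate fn
    return max(
        checkpoints,
        key=lambda name: batch_reward(
            prompts, checkpoints[name], reward_score, n
        ),
    )
```

In this sketch the reward model (here `reward_score`, standing in for the Mixtral DPO reward) stays fixed across rounds, so the batch reward is comparable between checkpoints; the peak checkpoint is then the argmax over rounds, which mirrors how the README describes selecting the final Merlinite-7B-pt by Batch-Reward and MT-Bench.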