metadata
language:
  - en
license: apache-2.0
base_model:
  - mistralai/Mistral-7B-v0.1
datasets:
  - argilla/ultrafeedback-binarized-preferences-cleaned
pipeline_tag: text-generation
model-index:
  - name: Mistral-ORPO-β
    results:
      - task:
          type: text-generation
        dataset:
          name: AlpacaEval 1
          type: AlpacaEval
        metrics:
          - type: AlpacaEval 1.0
            value: 91.41%
            name: Win Rate
          - type: AlpacaEval 2.0
            value: 12.20%
            name: Win Rate
        source:
          url: https://github.com/tatsu-lab/alpaca_eval
          name: self-reported
      - task:
          type: text-generation
        dataset:
          name: MT-Bench
          type: MT-Bench
        metrics:
          - type: MT-Bench
            value: 7.322
            name: Score
        source:
          url: https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/
          name: self-reported

Mistral-ORPO-β (7B)

Mistral-ORPO is a version of mistralai/Mistral-7B-v0.1 fine-tuned with odds ratio preference optimization (ORPO). With ORPO, the model learns preferences directly, without a separate supervised fine-tuning warm-up phase. Mistral-ORPO-β is fine-tuned exclusively on the 61k instances of argilla/ultrafeedback-binarized-preferences-cleaned, the cleaned version of UltraFeedback released by Argilla.
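For intuition, the ORPO objective as formulated in the ORPO paper adds an odds-ratio penalty to the usual negative log-likelihood loss on the chosen response. The following is a minimal per-example sketch of that idea, not the training code used for this model. The function names are illustrative, the inputs are assumed to be length-normalized (per-token average) log-probabilities strictly below zero, and the weight `lam` is a placeholder rather than the value used here.

```python
import math


def log_odds(avg_logprob: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logprob).

    avg_logprob is assumed to be the per-token average log-probability of a
    response, so it must be strictly negative (p < 1).
    """
    # log p - log(1 - p); log1p(-exp(x)) evaluates log(1 - p)
    return avg_logprob - math.log1p(-math.exp(avg_logprob))


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def orpo_loss(chosen_avg_logprob: float,
              rejected_avg_logprob: float,
              lam: float = 0.1) -> float:
    """Sketch of a per-example ORPO loss: NLL term plus weighted odds-ratio term.

    lam is a hypothetical default, not a value taken from this model card.
    """
    # Supervised term: negative log-likelihood of the chosen response
    nll = -chosen_avg_logprob
    # Log odds ratio between the chosen and the rejected response
    log_odds_ratio = log_odds(chosen_avg_logprob) - log_odds(rejected_avg_logprob)
    # -log(sigmoid(x)) equals softplus(-x)
    odds_ratio_term = softplus(-log_odds_ratio)
    return nll + lam * odds_ratio_term
```

Because both terms are computed from the same forward passes over the chosen and rejected responses, a single training stage can cover both instruction following and preference alignment, which is what removes the separate warm-up phase.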

Model Performance

| Model Name | Size | Alignment | MT-Bench | AlpacaEval 1.0 (%) | AlpacaEval 2.0 (%) |
|---|---|---|---|---|---|
| Mistral-ORPO-α | 7B | ORPO | 7.23 | 87.92 | 11.33 |
| Mistral-ORPO-β | 7B | ORPO | 7.32 | 91.41 | 12.20 |
| Zephyr (β) | 7B | DPO | 7.34 | 90.60 | 10.99 |
| TULU-2-DPO | 13B | DPO | 7.00 | 89.5 | 10.12 |
| Llama-2-Chat | 7B | RLHF | 6.27 | 71.37 | 4.96 |
| Llama-2-Chat | 13B | RLHF | 6.65 | 81.09 | 7.70 |

Chat Template

<|user|>
Hi! How are you doing?</s>
<|assistant|>
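
The template above can be reproduced with a small helper. This is a minimal sketch rather than an official utility: `build_prompt` is a hypothetical name, the formatting of earlier assistant turns is assumed by analogy with the user turn, and the trailing newline after the final `<|assistant|>` tag is also an assumption.

```python
from typing import Dict, List

EOS_TOKEN = "</s>"  # end-of-sequence marker shown in the template above


def build_prompt(messages: List[Dict[str, str]]) -> str:
    """Format a list of {"role", "content"} messages into a single prompt string."""
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role!r}")
        # Each turn: role tag on its own line, then the content closed by EOS
        parts.append(f"<|{role}|>\n{content}{EOS_TOKEN}\n")
    # Open the assistant turn so the model continues from here
    # (the trailing newline is an assumption)
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_prompt([{"role": "user", "content": "Hi! How are you doing?"}])
print(prompt)
```

The resulting string can then be tokenized and passed to the model for generation. If the tokenizer shipped with the model bundles its own chat template, `tokenizer.apply_chat_template` from Transformers is likely the safer option, since it avoids hand-maintaining the format.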