---
license: apache-2.0
datasets:
- berkeley-nest/Nectar
base_model: openchat/openchat-3.5-0106
model-index:
- name: openchat-nectar-0.5
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 66.72
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=andysalerno/openchat-nectar-0.5
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 83.53
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=andysalerno/openchat-nectar-0.5
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.36
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=andysalerno/openchat-nectar-0.5
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 52.15
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=andysalerno/openchat-nectar-0.5
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 82.08
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=andysalerno/openchat-nectar-0.5
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 68.16
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=andysalerno/openchat-nectar-0.5
      name: Open LLM Leaderboard
---
This is openchat/openchat-3.5-0106, tuned with DPO on a subset of Nectar. This time, training ran for 5000 steps (a full epoch).

Careful attention was paid to ensure the chat template was followed properly.
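
For reference, here is a minimal sketch of applying the template with `transformers`; the tokenizer in this repo is assumed to carry the standard OpenChat template ("GPT4 Correct User: ...<|end_of_turn|>GPT4 Correct Assistant:"):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("andysalerno/openchat-nectar-0.5")

messages = [
    {"role": "user", "content": "Summarize DPO in one sentence."},
]

# Renders the conversation in the model's expected chat format and
# appends the generation prompt for the assistant's turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```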

Data selection and filtering (a code sketch follows the list):
- Filtered the dataset to include only examples with multiple turns, to preserve strength in multi-turn scenarios.
- Used the 4th-ranked response as the "rejected" instead of the 3rd. When I inspected the dataset, I frequently could not find any meaningful difference in quality between the 1st- and 3rd-ranked responses, so to make the chosen/rejected signal extra clear, I replaced the 3rd-ranked response with the 4th.
- Filtered out any examples with `good_natured == False`. Why? When I inspected examples with `good_natured == False` in the Nectar dataset, I noticed they frequently include refusals even from the top-ranked model. So, counter-intuitively, including "bad natured" entries might actually censor the model *more*, since the top-ranked responses (as judged by GPT-4) to these queries tend to be refusals. Not to mention, in my opinion the overall quality of the "bad natured" conversations tends to be worse.
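
Below is a rough sketch of that filtering with the `datasets` library. The field names (`turns`, `good_natured`, `answers`, `rank`) are my reading of the public Nectar schema; this is an illustration, not the exact training script:

```python
from datasets import load_dataset

ds = load_dataset("berkeley-nest/Nectar", split="train")

# Keep multi-turn, "good natured" examples with at least 4 ranked answers.
ds = ds.filter(
    lambda ex: ex["turns"] > 1 and ex["good_natured"] and len(ex["answers"]) >= 4
)

def to_preference_pair(ex):
    # Each example carries a list of ranked answers; rank 1 is the best.
    by_rank = {a["rank"]: a["answer"] for a in ex["answers"]}
    return {
        "prompt": ex["prompt"],
        "chosen": by_rank[1],    # top-ranked response
        "rejected": by_rank[4],  # 4th-ranked, for a clearer preference gap
    }

pairs = ds.map(to_preference_pair, remove_columns=ds.column_names)
```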

Differences from 0.4:
- Trained for 5000 steps instead of 500, with a lower learning rate and a slower warmup period (see the training sketch below).
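
For illustration, a hedged sketch of that run using TRL's `DPOTrainer`. Only the step count and learning rate come from this card; the warmup and batch-size values are placeholder assumptions, and argument names vary across TRL versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("openchat/openchat-3.5-0106")
tokenizer = AutoTokenizer.from_pretrained("openchat/openchat-3.5-0106")

# max_steps and learning_rate match this card; warmup and batch sizes
# are placeholders ("slower warmup" -- the exact value isn't stated).
args = DPOConfig(
    output_dir="openchat-nectar-0.5",
    max_steps=5000,
    learning_rate=5e-6,
    warmup_ratio=0.1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,  # prompt/chosen/rejected pairs from the sketch above
    tokenizer=tokenizer,
)
trainer.train()
```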

Summary of versions:

**[openchat-nectar-0.1](https://huggingface.co/andysalerno/openchat-nectar-0.1)**
- 200 steps, no filtering on the Nectar dataset, 5e-5 learning rate
  
**[openchat-nectar-0.2](https://huggingface.co/andysalerno/openchat-nectar-0.2)**
- Empty repo from a failed training run; ignore it.
  
**[openchat-nectar-0.3](https://huggingface.co/andysalerno/openchat-nectar-0.3)**
- 500 steps, no filtering on the Nectar dataset, 5e-5 learning rate (same as 0.1 but with more steps)
  
**[openchat-nectar-0.4](https://huggingface.co/andysalerno/openchat-nectar-0.4)**
- 500 steps; filtered the dataset to include only multi-turn examples; used the 4th-ranked response as the "rejected" instead of the 3rd; filtered out `good_natured == False`; 5e-5 learning rate

**[openchat-nectar-0.5](https://huggingface.co/andysalerno/openchat-nectar-0.5)**
- 5000 steps (a full epoch); filtered the dataset to include only multi-turn examples; used the 4th-ranked response as the "rejected" instead of the 3rd; filtered out `good_natured == False`; 5e-6 learning rate. Same as 0.4 but with 10x the steps and 1/10th the learning rate.

**[openchat-nectar-0.6](https://huggingface.co/andysalerno/openchat-nectar-0.6)**
- 500 steps; filtered the dataset to include only multi-turn examples; used the 4th-ranked response as the "rejected" instead of the 3rd; filtered out `good_natured == False`; 5e-5 learning rate. Same as 0.5 but with 1/10th the steps and 10x the learning rate.

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_andysalerno__openchat-nectar-0.5)

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |69.67|
|AI2 Reasoning Challenge (25-Shot)|66.72|
|HellaSwag (10-Shot)              |83.53|
|MMLU (5-Shot)                    |65.36|
|TruthfulQA (0-shot)              |52.15|
|Winogrande (5-shot)              |82.08|
|GSM8k (5-shot)                   |68.16|