Update README.md
README.md
CHANGED
@@ -19,13 +19,13 @@ datasets:

We have used our own [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) to fine-tune Llama2-13B using [OpenChat](https://huggingface.co/openchat) packing and conditional behavior cloning.
This dataset is our attempt to reproduce the dataset generated for Microsoft Research's [Orca Paper](https://arxiv.org/abs/2306.02707).

This second preview release is trained on a curated, filtered subset of most of our GPT-4 augmented data.

This release highlights that our dataset and training methods have surpassed performance parity with the Orca paper.
We measured this with BigBench-Hard and AGIEval results, using the same methods as the Orca paper, and find **~103%** of the original Orca's performance on average.
This is achieved with <1/10th the compute requirement and <20% of the dataset size used in the original Orca paper.

We have run extensive evaluations internally and expect this model to **place number 1** on both the HuggingFaceH4 Open LLM Leaderboard and the GPT4ALL Leaderboard for 13B models.

"One" of [OpenChat](https://huggingface.co/openchat) has joined our team, and we'd like to give them special thanks for training this model!
We have utilized OpenChat conditional behavior cloning and the [MultiPack algorithm](https://github.com/imoneoi/multipack_sampler), which achieves 99.85% bin-packing efficiency on our dataset; a rough sketch of the packing idea follows below.
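To make the bin-packing claim concrete, here is a minimal, illustrative first-fit-decreasing packer. This is our own sketch, not the MultiPack sampler itself; `max_len` and all names are assumptions. It shows what it means to fill fixed-length training bins with real tokens rather than padding:

```python
# Illustrative sketch only -- not the MultiPack sampler implementation.
# Packs tokenized examples into fixed-length bins (first-fit decreasing),
# which is the style of packing whose efficiency is quoted above.
from typing import List

def pack_first_fit_decreasing(lengths: List[int], max_len: int = 4096) -> List[List[int]]:
    """Group example indices into bins whose total token count stays <= max_len.

    Assumes every individual example is no longer than max_len.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []   # each bin holds example indices
    free: List[int] = []         # remaining capacity of each bin
    for i in order:
        for b in range(len(bins)):
            if lengths[i] <= free[b]:
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            bins.append([i])
            free.append(max_len - lengths[i])
    return bins

def packing_efficiency(lengths: List[int], bins: List[List[int]], max_len: int = 4096) -> float:
    """Fraction of total bin capacity filled with real tokens rather than padding."""
    used = sum(lengths[i] for b in bins for i in b)
    return used / (len(bins) * max_len)
```

The actual MultiPack sampler is implemented as a distributed sampler that also balances packed bins across ranks; refer to the linked repository for the real algorithm.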
@@ -54,46 +54,58 @@ We have evaluated **OpenOrcaxOpenChat-Preview2-13B** on hard reasoning tasks fro

Our average performance for BigBench-Hard: 0.488

Average for AGIEval: 0.447

In the Orca paper, they measured their score relative to Vicuna on these evals.
We have done the same, and find that our scores average **~103%** of the total improvement shown in the Orca paper, using the same evaluation methods as outlined in the paper.

So we are surpassing Orca performance with <20% of the dataset size and <1/10th the training budget!

As well, we have evaluated using the methodology and tools for the HuggingFace Leaderboard and GPT4ALL Leaderboard, and find that we place #1 on both for all 13B models at release time!

## AGIEval Performance

We present our results in two columns.
The column "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
The column "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with settings outlined by HuggingFace. These results are not comparable to the other columns, as the methods are different.

![OpenOrca Preview2 AGIEval Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2AGIEval.png "AGIEval Performance")

## BigBench-Hard Performance

We present our results in two columns.
The column "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
The column "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with settings outlined by HuggingFace. These results are not comparable to the other columns, as the methods are different.

![OpenOrca Preview2 BigBench-Hard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2BigBenchHardEval.png "BigBench-Hard Performance")

## HuggingFaceH4 Open LLM Leaderboard Performance

We have run our own tests using parameters matching the [HuggingFaceH4 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) evals (a sketch of such a run appears below).

We place #1 for all 13B models at release time!

![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2HuggingFaceLeaderboard.png "HuggingFace Leaderboard Performance")
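As a reference point for reproducing numbers of this kind, a leaderboard-style run with EleutherAI's lm-evaluation-harness looks roughly like the following. Treat it as a sketch under assumptions: it targets the 2023-era (v0.3) harness API, the few-shot counts are the leaderboard's published settings, and the batch size is arbitrary; task names differ in newer harness releases.

```python
# Sketch: Open LLM Leaderboard-style evaluation with EleutherAI's
# lm-evaluation-harness (v0.3-era API assumed; adjust names for newer versions).
from lm_eval import evaluator

MODEL_ARGS = "pretrained=Open-Orca/OpenOrcaxOpenChat-Preview2-13B"

# (task, num_fewshot) pairs per the HuggingFaceH4 leaderboard settings.
# MMLU corresponds to the hendrycksTest-* task family in this harness version.
LEADERBOARD_TASKS = [("arc_challenge", 25), ("hellaswag", 10), ("truthfulqa_mc", 0)]

for task, shots in LEADERBOARD_TASKS:
    results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=MODEL_ARGS,
        tasks=[task],
        num_fewshot=shots,
        batch_size=4,
    )
    print(task, results["results"][task])
```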
## GPT4ALL Leaderboard Performance

We have tested using parameters matching the GPT4ALL Benchmark Suite and report our results and placement versus their official reporting below.

We place #1 for all open models and come within range of `text-davinci-003`, a proprietary OpenAI model an order of magnitude larger.

![OpenOrca Preview2 GPT4ALL Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2GPT4ALL_Leaderboard.png "GPT4ALL Performance")
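For completeness, a GPT4ALL-suite-style pass with the same harness would look something like the sketch below. The zero-shot task list is our assumption of that suite's usual composition; verify the task names against the GPT4ALL benchmark's own reporting and your harness version.

```python
# Sketch: GPT4ALL-suite-style zero-shot evaluation (task list is an assumption;
# names follow the v0.3-era lm-evaluation-harness).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=Open-Orca/OpenOrcaxOpenChat-Preview2-13B",
    tasks=["boolq", "piqa", "hellaswag", "winogrande",
           "arc_easy", "arc_challenge", "openbookqa"],
    num_fewshot=0,
    batch_size=4,
)
print(results["results"])
```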
# Dataset

We used a curated, filtered selection of most of the GPT-4 augmented data from our OpenOrca dataset, which aims to reproduce the Orca Research Paper dataset.
Further details of our curation practices will be forthcoming with our full model releases.


# Training

We trained with 8x A100-80G GPUs for 46 hours, completing 5 epochs of full fine-tuning on our dataset in one training run.
This contrasts with the 20x A100-80G GPUs for 200 hours used in the Orca paper, for only 3 epochs, and requiring stacked training (which is known to suffer catastrophic forgetting).
Our compute requirement was <1/10th that of the original Orca (8 × 46 = 368 GPU-hours versus 20 × 200 = 4,000 GPU-hours).
Commodity cost was ~$600.
@@ -116,6 +128,18 @@ tokenize("User: Hello<|end_of_turn|>Assistant: Hi<|end_of_turn|>User: How are yo

# Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901, 6324, 32000, 4911, 29901, 1128, 526, 366, 9826, 29973, 32000, 4007, 22137, 29901]
```

For UIs with Prefix and Suffix fields, these will likely work:

Prefix (include a space after colon):
```
User: 
```

Suffix (space after colon):
```
<|end_of_turn|>\nAssistant: 
```
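Putting the template together, here is a minimal end-to-end generation sketch with `transformers`. It is illustrative only: the prompt string follows the format above, while the generation settings (and `device_map="auto"`, which needs `accelerate`) are assumptions rather than an official recipe.

```python
# Minimal sketch of prompting with the User/Assistant + <|end_of_turn|> format
# shown above. Generation settings are illustrative, not an official recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Open-Orca/OpenOrcaxOpenChat-Preview2-13B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "User: How are llamas different from alpacas?<|end_of_turn|>Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=32000,  # <|end_of_turn|>, per the tokenized example above
)
# Print only the newly generated assistant turn.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```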
# Serving