nguyenbh committed cfe82c6 (1 parent: a3c8c89)

Update REAMDE

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -195,7 +195,7 @@ More specifically, we do not change prompts, pick different few-shot examples, c
 
 The number of k–shot examples is listed per-benchmark.
 
-|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct<br>8b|GPT3.5-Turbo<br>version 1106|Gemini<br>Pro|GPT-4-Turbo<br>version 1106 (Chat)|
+|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct|GPT3.5-Turbo<br>version 1106|Gemini<br>Pro|GPT-4-Turbo<br>version 1106 (Chat)|
 |---------|-----------------------|--------|-------------|-------------------|-------------------|----------|------------------------|
 |AGI Eval<br>5-shot|49.7|50.1|54.0|56.9|48.4|49.0|59.6|
 |MMLU<br>5-shot|76.6|73.8|76.2|80.2|71.4|66.7|84.0|
@@ -220,7 +220,7 @@ The number of k–shot examples is listed per-benchmark.
 
 We take a closer look at different categories across 80 public benchmark datasets at the table below:
 
-|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct<br>8b|GPT3.5-Turbo<br>version 1106|Gemini<br>Pro|GPT-4-Turbo<br>version 1106 (Chat)|
+|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct|GPT3.5-Turbo<br>version 1106|Gemini<br>Pro|GPT-4-Turbo<br>version 1106 (Chat)|
 |--------|------------------------|--------|-------------|-------------------|-------------------|----------|------------------------|
 | Popular aggregated benchmark | 72.3 | 69.9 | 73.4 | 76.3 | 67.0 | 67.5 | 80.5 |
 | Reasoning | 83.2 | 79.3 | 81.5 | 86.7 | 78.3 | 80.4 | 89.3 |