Commit 053006a by lievan (1 parent: 733fd94)

Update README.md

Files changed (1)
  1. README.md +13 -13
README.md CHANGED
@@ -41,19 +41,19 @@ Eurux-8x22B-NCA is SFT and [NCA](https://arxiv.org/abs/2402.05369) fine-tuned fr
  It achieves superb reasoning performance as well as exellent chat & instruction-following capabilities.

  ## Evaluation
- We conducted overall coding, math, reasoning, knowledge, instruction-following and chat benchmarking. Results are shown below:
-
- | Models/Benchmarks| Coding | | | Math | | | Reasoning | Knowledge | Ins-Following | Chat |
- |-----------------|:---------:|:-----:|:--------:|:-------:|:-----:|:---------:|:---------:|:---------:|:-------------:|:--------:|
- | | HumanEval | MBPP | LeetCode | GSMPLUS | MATH | TheoremQA | BBH (CoT) | MMLU | IFEval | MT-Bench |
- | GPT-3.5-Turbo | 76.8 | 82.5 | 23.3 | 61.2 | 37.8 | 35.6 | 70.1 | 70.0 | 56.6 | 7.94 |
- | GPT-4 | 85.4 | 83.5 | 41.8 | 85.6 | 69.7 | 52.4 | 86.7 | 86.4 | 79.7 | 8.96 |
- | Mixtral-8x7B-Ins| 50.6 | 50.1 | 5.6 | 49.6 | 25.9 | 20.4 | 73.5 | 70.3 | 48.8 | 8.30 |
- | DS-LM-67B-Chat | 70.7 | 65.7 | 20.0 | 65.0 | 41.0 | 17.9 | 78.9 | 72.3 | 52.7 | - |
- | QWen-1.5-72B | 71.3 | 56.9 | 15.6 | 65.4 | 43.4 | 18.5 | 78.0 | 72.9 | 53.4 | 8.61 |
- | Eurus-70b-NCA | 79.3 | 71.9 | 33.3 | 62.8 | 41.7 | 32.6 | 80.0 | 59.4 | 49.2 | 7.54 |
- | Eurux-8x22b-KTO | 71.3 | 68.9 | 29.4 | 68.3 | 48.4 | 35.3 | 83.6 | 75.9 | 67.1 | 8.58 |
- | Eurux-8x22b-NCA | 75.0 | 69.7 | 35.0 | 68.1 | 49.0 | 35.5 | 83.5 | 75.6 | 67.1 | 8.46 |
+ We conducted overall coding, math, reasoning, knowledge, instruction-following and chat benchmarking. Results are shown below, with the best scores in open-source models **bolded**:
+
+ | Models/Benchmarks | Coding | | | Math | | | Reasoning | Knowledge | Ins-Following | Chat |
+ |-------------------|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-------------:|:---------:|
+ | | HumanEval | MBPP | LeetCode | GSMPLUS | MATH | TheoremQA | BBH (CoT) | MMLU | IFEval | MT-Bench |
+ | GPT-3.5-Turbo | 76.8 | 82.5 | 23.3 | 61.2 | 37.8 | 35.6 | 70.1 | 70.0 | 56.6 | 7.94 |
+ | GPT-4 | 85.4 | 83.5 | 41.8 | 85.6 | 69.7 | 52.4 | 86.7 | 86.4 | 79.7 | 8.96 |
+ | Mixtral-8x7B-Ins | 50.6 | 50.1 | 5.6 | 49.6 | 25.9 | 20.4 | 73.5 | 70.3 | 48.8 | 8.30 |
+ | DS-LM-67B-Chat | 70.7 | 65.7 | 20.0 | 65.0 | 41.0 | 17.9 | 78.9 | 72.3 | 52.7 | 8.35 |
+ | QWen-1.5-72B | 71.3 | 56.9 | 15.6 | 65.4 | 43.4 | 18.5 | 78.0 | 72.9 | 53.4 | **8.61 ** |
+ | Eurus-70b-NCA | **79.3 ** | **71.9 ** | 33.3 | 62.8 | 41.7 | 32.6 | 80.0 | 59.4 | 49.2 | 7.54 |
+ | Eurux-8x22b-KTO | 71.3 | 68.9 | 29.4 | **68.3 ** | 48.4 | 35.3 | **83.6 ** | **75.9 ** | **67.1 ** | 8.58 |
+ | Eurux-8x22b-NCA | 75.0 | 69.7 | **35.0 ** | 68.1 | **49.0 ** | **35.5 ** | 83.5 | 75.6 | **67.1 ** | 8.46 |

  ## Usage
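
The updated caption says the best score among open-source models is bolded in each column. A minimal sketch for re-deriving those cells from the new table is given here, assuming GPT-3.5-Turbo and GPT-4 are the only rows excluded as non-open-source. The names `OPEN_SOURCE_SCORES` and `best_per_benchmark` are illustrative and not part of the repository.

```python
# Scores copied from the "+" side of the diff. Only open-source rows are listed;
# GPT-3.5-Turbo and GPT-4 are left out (assumption: they are the proprietary rows).
BENCHMARKS = ["HumanEval", "MBPP", "LeetCode", "GSMPLUS", "MATH",
              "TheoremQA", "BBH (CoT)", "MMLU", "IFEval", "MT-Bench"]

OPEN_SOURCE_SCORES = {
    "Mixtral-8x7B-Ins": [50.6, 50.1, 5.6, 49.6, 25.9, 20.4, 73.5, 70.3, 48.8, 8.30],
    "DS-LM-67B-Chat":   [70.7, 65.7, 20.0, 65.0, 41.0, 17.9, 78.9, 72.3, 52.7, 8.35],
    "QWen-1.5-72B":     [71.3, 56.9, 15.6, 65.4, 43.4, 18.5, 78.0, 72.9, 53.4, 8.61],
    "Eurus-70b-NCA":    [79.3, 71.9, 33.3, 62.8, 41.7, 32.6, 80.0, 59.4, 49.2, 7.54],
    "Eurux-8x22b-KTO":  [71.3, 68.9, 29.4, 68.3, 48.4, 35.3, 83.6, 75.9, 67.1, 8.58],
    "Eurux-8x22b-NCA":  [75.0, 69.7, 35.0, 68.1, 49.0, 35.5, 83.5, 75.6, 67.1, 8.46],
}


def best_per_benchmark(scores, benchmarks):
    """Map each benchmark to (best score, sorted list of models achieving it)."""
    best = {}
    for col, name in enumerate(benchmarks):
        top = max(row[col] for row in scores.values())
        # Ties are kept, so a shared best score yields more than one model.
        winners = sorted(m for m, row in scores.items() if row[col] == top)
        best[name] = (top, winners)
    return best


BEST = best_per_benchmark(OPEN_SOURCE_SCORES, BENCHMARKS)

if __name__ == "__main__":
    for name, (top, winners) in BEST.items():
        print(f"{name:<10} {top:>5}  {', '.join(winners)}")
```

Ties are reported rather than broken, which matches the IFEval column, where both Eurux models score 67.1 and both cells are bolded in the diff.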