hunterhector committed on
Commit 751e527 (1 parent: aec46d9)

Update README.md

Files changed (1): README.md (+8, -4)
README.md CHANGED
@@ -20,7 +20,7 @@ Despite being trained on a smaller dataset of 1.4 trillion tokens—compared to
It demonstrates superior performance in benchmarks like MMLU, HumanEval, and MBPP.
Compared with other similar work, CrystalCoder is quite balanced on language and coding tasks.

- | Model | Trained Tokens | Avg. of Avg. | Language Avg. | Coding Avg. | ARC | HellaSwag | MMLU (5-shot) | TruthfulQA | HumanEval (pass@1) | MBPP (pass@1) |
+ | Model | Trained Tokens | Avg. of Avg. | Language Avg. | Coding Avg. | ARC | HellaSwag | MMLU | TruthfulQA | HumanEval (pass@1) | MBPP (pass@1) |
|:-------------------:|:--------------:|:------------:|:-------------:|:-----------:|:-----:|:---------:|:-------------:|:----------:|:------------------:|:-------------:|
| Mistral 7B | - | 48.68 | 62.40 | 33.95 | 59.98 | 83.31 | 64.16 | 42.15 | 29.12 | 38.78 |
| **CrystalCoder 7B** | 1.27T | 39.56 | 51.68 | 27.44 | 47.44 | 74.38 | 48.42 | 36.46 | 23.90 | 30.988 |
@@ -31,10 +31,14 @@ By comparing CrystalCoder with other similar work, CrystalCoder is quite balance
| LLaMA 2 7B | 2T | 34.98 | 53.39 | 16.57 | 53.07 | 77.74 | 43.80 | 38.98 | 13.05 | 20.09 |
| StarCoder-15B | 1.03T | - | - | 38.46 | - | - | - | - | 33.63 | 43.28 |

- ** Notes **
+ **Notes**
+ - We compute all evaluation metrics ourselves.
+ - Language benchmarks are computed following the convention of [the Huggingface Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which means
+ the AI2 Reasoning Challenge is evaluated 25-shot, HellaSwag 10-shot, MMLU 5-shot, and TruthfulQA 0-shot.
+ - As reported in prior work, the choice of temperature strongly affects the coding metrics, so we evaluate all models with the following temperatures:
+   - Scores for HumanEval are computed with a temperature of 0.2.
+   - Scores for MBPP are computed with a temperature of 0.1.
- For a detailed token breakdown of the CrystalCoder dataset, refer to the [CrystalCoder dataset repository](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).
- - Scores for HumanEval is computed with a temporature of 0.2
- - Scores for MBPP is computed with a temperature of 0.1


## About LLM360
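The aggregate columns in the table above are not defined in this commit; judging from the CrystalCoder 7B and LLaMA 2 7B rows they appear to be plain unweighted means (Language Avg. over ARC, HellaSwag, MMLU, and TruthfulQA; Coding Avg. over HumanEval and MBPP; Avg. of Avg. over those two). A minimal sketch under that assumption:

```python
# Minimal sketch of the table's aggregate columns, assuming plain unweighted
# means (this convention is not stated in the README itself).
from statistics import mean

def aggregate(arc, hellaswag, mmlu, truthfulqa, humaneval, mbpp):
    language_avg = mean([arc, hellaswag, mmlu, truthfulqa])  # "Language Avg."
    coding_avg = mean([humaneval, mbpp])                     # "Coding Avg."
    avg_of_avg = mean([language_avg, coding_avg])            # "Avg. of Avg."
    return language_avg, coding_avg, avg_of_avg

# CrystalCoder 7B scores from the table above
lang, code, overall = aggregate(47.44, 74.38, 48.42, 36.46, 23.90, 30.988)
print(f"{lang:.2f} {code:.2f} {overall:.2f}")  # ≈ 51.68, 27.44, 39.56 (the CrystalCoder 7B row, up to rounding)
```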
 
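As a usage note on the sampling settings added in this commit, here is a hedged sketch of how the HumanEval/MBPP temperatures would typically be applied when sampling completions with Hugging Face `transformers`. The model id, prompt, and generation length are illustrative placeholders, not the exact evaluation pipeline behind the table.

```python
# Hedged sketch: applying the benchmark-specific sampling temperatures from the
# notes above with Hugging Face transformers. The model id, prompt, and
# max_new_tokens are placeholders, not the exact pipeline used for the scores.
from transformers import AutoModelForCausalLM, AutoTokenizer

TEMPERATURES = {"humaneval": 0.2, "mbpp": 0.1}  # from the README notes

model_id = "LLM360/CrystalCoder"  # assumed repository id for this model card
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

def complete(prompt: str, benchmark: str, max_new_tokens: int = 256) -> str:
    """Sample one completion using the temperature associated with the benchmark."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=TEMPERATURES[benchmark],
        max_new_tokens=max_new_tokens,
        pad_token_id=tok.eos_token_id,
    )
    # Strip the prompt tokens and return only the newly generated text.
    return tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(complete("def fibonacci(n):\n    ", "humaneval"))
```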