LLM360-MBZUAI committed
Commit dbafa41 • 1 Parent(s): 6f13575

Update README.md

add some bullets emoji to model card

Files changed (1)
  1. README.md +16 -13
README.md CHANGED
@@ -25,14 +25,17 @@ By comparing CrystalCoder with other similar work, CrystalCoder is quite balance


  **Notes**
- - We compute all evaluation metrics ourselves.
- - Language benchmarks are computed following the convention of [the Huggingface Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which means
- AI2 Reasoning Challenge in 25-shot, HellaSwag in 10-shot, MMLU computed in 5-shot, TruthfulQA in 0-shot.
+
+ - We compute all evaluation metrics ourselves.
+
+ - Language benchmarks are computed following the convention of [the Huggingface Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which means AI2 Reasoning Challenge in 25-shot, HellaSwag in 10-shot, MMLU in 5-shot, and TruthfulQA in 0-shot.
+
  - As reported in prior work, the choice of temperature strongly affects the programming metrics, so we evaluate all models with the following temperatures:
    - Scores for HumanEval are computed with a temperature of 0.2.
    - Scores for MBPP are computed with a temperature of 0.1.
  - For a detailed token breakdown of the CrystalCoder dataset, refer to the [CrystalCoder dataset repository](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).

+

  ## About LLM360
  LLM360 is an initiative for comprehensive and fully open-sourced LLMs,
@@ -47,7 +50,7 @@ effort.

  Get access now at [LLM360 site](https://www.llm360.ai/)

- ## Model Description
+ ## 🟣 Model Description

  - **Model type:** Language model with the same architecture as LLaMA-7B
  - **Language(s) (NLP):** English
@@ -58,7 +61,7 @@ Get access now at [LLM360 site](https://www.llm360.ai/)
  - [Metrics](https://github.com/LLM360/Analysis360)
  - [Fully processed CrystalCoder pretraining data](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets)

- # Model Architecture
+ # 🟣 Model Architecture

  CrystalCoder leverages a GPT-like architecture, akin to LLaMA, but with the addition of maximal update parameterization (**muP**).

@@ -85,7 +88,7 @@ For other architecture choices:
  - Training sequence length is `2048`.
  - Embedding (vocabulary) size is `32032`.

- # Tokenization
+ # 🟣 Tokenization

  Our tokenizer is based on the LLaMA tokenizer, with 22 additional special tokens for the following usage:
  - 4 filling-in-middle (FIM) tokens such as `<|fim_prefix|>` to support FIM inference.
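
The FIM bullet above only names `<|fim_prefix|>`. The sketch below shows how such tokens are typically assembled into a fill-in-the-middle prompt; the other token names (`<|fim_suffix|>`, `<|fim_middle|>`) are assumptions based on common FIM conventions and may not match CrystalCoder's actual special tokens.

```python
# Hedged sketch of a fill-in-the-middle (FIM) prompt. Only <|fim_prefix|> is
# documented above; <|fim_suffix|> and <|fim_middle|> are assumed names.
prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"

# Prefix/suffix/middle layout: the model sees both sides of the gap and is
# asked to generate the missing middle after the <|fim_middle|> sentinel.
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
print(fim_prompt)
```

The resulting string would then be tokenized and passed to `model.generate()` the same way as the plain completion prompt shown later in the README.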
@@ -94,7 +97,7 @@ Our tokenizer is based on the LLaMA tokenizer, with 22 additional special tokens

  Therefore, we extended the LLaMA tokenizer vocabulary size from `32000` to `32032`. Some token ids are reserved and not used.

- # Training
+ # 🟣 Training

  Our training has 3 stages:
  - Stage 1: Pretraining on the first half of SlimPajama (50% x 690B = 345B tokens).
@@ -114,12 +117,12 @@ For hyperparameters used in each stage, please refer to the following table:

  For more details of training, please refer to [our paper](https://arxiv.org/pdf/2312.06550.pdf).

- # Dataset
+ # 🟣 Dataset

  Our tokenized datasets for all phases are available at [CrystalCoderDatasets](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).


- # Model Usage
+ # 🟣 Model Usage

  To load a specific checkpoint, use the revision argument as shown below, for example, `CrystalCoder_phase1_checkpoint_055500`. All the revisions can be seen from the branch dropdown in the "Files and versions" tab. If no revision argument is provided, it will load the phase 3 final checkpoint `CrystalCoder_phase3_checkpoint_027728`.
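
The README's own loading snippet is only partially visible in this diff, so here is a minimal sketch of loading a specific checkpoint revision with the `transformers` API. The repo id `LLM360/CrystalCoder`, the use of `trust_remote_code=True` for the custom architecture, and the sampling settings are assumptions, not taken from this diff; only the revision branch name comes from the paragraph above.

```python
# Minimal sketch (assumptions noted above): load a specific CrystalCoder
# checkpoint by passing its branch name as `revision`.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "LLM360/CrystalCoder"                     # assumed model repo id
revision = "CrystalCoder_phase1_checkpoint_055500"  # branch name from the prose above

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision, trust_remote_code=True)

# Generate a short completion; temperature 0.2 mirrors the HumanEval setting in the Notes.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
gen_tokens = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.2)
print(tokenizer.batch_decode(gen_tokens)[0])
```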
 
@@ -146,7 +149,7 @@ print("-"*20 + "Output for model" + 20 * '-')
  print(tokenizer.batch_decode(gen_tokens)[0])
  ```

- ## Completion Example:
+ ## 🟣 Completion Example:

  ### prompt:

@@ -185,7 +188,7 @@ def closest_pair(numbers: List[float], threshold: float) -> int:
  <unk> import torch
  import numpy as np
  ```
- # Training Logs and Evaluation Results
+ # 🟣 Training Logs and Evaluation Results

  Please refer to our [W&B project page](https://wandb.ai/llm360/CrystalCoder) for complete training logs and evaluation results.

@@ -204,11 +207,11 @@ Selected Metrics are displayed below.
  |<img src="cc-mmlu-1.png" alt="mmlu" width="400"/> | <img src="cc-truthful-1.png" alt="truthfulqa" width="400"/> |


- # CrystalCoder-Instruct
+ # 🟣 CrystalCoder-Instruct

  We also have instruction-tuned versions of CrystalCoder, based on the stage 2 and stage 3 final checkpoints. The Instruct versions will be released later.

- # Citation
+ # 🟣 Citation

  **BibTeX:**

 