# Jam-CGPT
Jam-CGPT is a GPT-2-like model. Following jam's pretraining procedure, we pretrain models ranging from 38 million to 350 million parameters and finetune them on comments generated by GPT-3.5, with finetuning set sizes ranging from 170k to 2.15m.
## Jam-CGPT Training Details
- We follow jam's pretraining procedure and use the same data to pretrain our 38m, 110m, and 350m parameter models.
- We finetune Jam-CGPT on the summaries generated by GPT-3.5 using four different Jam-CGPT dataset sizes.
- We finetune our models for 3 epochs (see the sketch after this list for how epochs map to optimizer steps).
- Our GitHub repo contains the code to reproduce these results using the same data.
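Since the effective finetuning batch is 128 sequences (batch size * accumulation steps, as noted under the tables below), the number of optimizer steps that 3 epochs correspond to can be estimated from the finetuning set size. The sketch below is a back-of-the-envelope estimate only, assuming the 170k and 2.15m figures count function-comment pairs; it is not code from this repo.

```python
import math

# Back-of-the-envelope estimate of finetuning optimizer steps for 3 epochs,
# assuming an effective batch of 128 sequences (batch size * accumulation steps)
# and that the dataset sizes count function-comment pairs. Illustrative only.
EFFECTIVE_BATCH = 128
EPOCHS = 3

for num_pairs in (170_000, 2_150_000):  # smallest and largest Jam-CGPT finetuning sets
    steps_per_epoch = math.ceil(num_pairs / EFFECTIVE_BATCH)
    print(f"{num_pairs:>9,} pairs -> ~{steps_per_epoch:,} steps/epoch, "
          f"~{EPOCHS * steps_per_epoch:,} steps for {EPOCHS} epochs")
```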
### Jam-CGPT 38 million parameter model
| Hyperparameter | Description | Value |
|----------------|-------------|-------|
| e | embedding dimensions | 512 |
| L | number of layers | 4 |
| h | attention heads | 4 |
| c | block size / context length | 256 |
| b | batch size | 64 |
| a | accumulation steps | 2 |
| d | dropout | 0.20 |
| r | learning rate | 3e-5 |
| y | weight decay | 1e-5 |
| iter | number of iterations after pretraining | 757,000 |
### Jam-CGPT 110 million parameter model
| Hyperparameter | Description | Value |
|----------------|-------------|-------|
| e | embedding dimensions | 768 |
| L | number of layers | 10 |
| h | attention heads | 8 |
| c | block size / context length | 256 |
| b | batch size | 32 |
| a | accumulation steps | 4 |
| d | dropout | 0.20 |
| r | learning rate | 3e-5 |
| y | weight decay | 1e-5 |
| iter | number of iterations after pretraining | 762,000 |
### Jam-CGPT 350 million parameter model
| Hyperparameter | Description | Value |
|----------------|-------------|-------|
| e | embedding dimensions | 1024 |
| L | number of layers | 24 |
| h | attention heads | 16 |
| c | block size / context length | 256 |
| b | batch size | 4 |
| a | accumulation steps | 32 |
| d | dropout | 0.20 |
| r | learning rate | 3e-5 |
| y | weight decay | 1e-5 |
| iter | iterations | 272,000 |
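As a rough sanity check, the 38m, 110m, and 350m labels can be reproduced from the tables with the usual GPT-2 estimate of about 12 * L * e^2 weights per transformer block plus the token and position embeddings. The vocabulary size below is an assumption (GPT-2's 50,257-token BPE vocabulary); the actual Jam-CGPT tokenizer may differ, so treat the output as an approximation.

```python
# Approximate parameter counts implied by the hyperparameter tables above.
# Per block: ~12 * e^2 weights (attention + MLP, ignoring biases and layer norms).
# The 50,257-token vocabulary is an assumed GPT-2 value; the real tokenizer may differ.
VOCAB_SIZE = 50_257
BLOCK_SIZE = 256  # context length c from the tables

configs = {"38m": (512, 4), "110m": (768, 10), "350m": (1024, 24)}

for name, (e, L) in configs.items():
    transformer = 12 * L * e * e                 # attention + MLP weights
    embeddings = (VOCAB_SIZE + BLOCK_SIZE) * e   # token + position embeddings
    print(f"{name:>4}: ~{(transformer + embeddings) / 1e6:.0f}M parameters")
```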
- Note that you can adjust the batch size and accumulation steps based on your GPU memory, but batch size * accumulation steps should remain 128.
- If you finetune your models with multiple GPUs, you can reduce the accumulation steps proportionally; for example, with 2 GPUs you will need to halve the accumulation steps (see the sketch after this list).
- We pretrained the 38m and 110m models for 3 epochs.
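The batch sizes and accumulation steps in the tables above all multiply out to the same effective batch of 128 sequences. The snippet below is a small illustration of that arithmetic under the constraint stated in the notes (per-GPU batch size * accumulation steps * number of GPUs = 128); it is not code from this repo.

```python
# Illustration of the effective-batch constraint described above:
# per-GPU batch size * accumulation steps * number of GPUs should equal 128.
# This is a standalone arithmetic helper, not code from the Jam-CGPT repo.
TARGET_EFFECTIVE_BATCH = 128

def accumulation_steps(batch_size: int, num_gpus: int = 1) -> int:
    """Accumulation steps needed to keep the effective batch at 128."""
    per_step = batch_size * num_gpus
    if TARGET_EFFECTIVE_BATCH % per_step != 0:
        raise ValueError("batch_size * num_gpus must evenly divide 128")
    return TARGET_EFFECTIVE_BATCH // per_step

print(accumulation_steps(64))              # -> 2   (38m table, 1 GPU)
print(accumulation_steps(32))              # -> 4   (110m table, 1 GPU)
print(accumulation_steps(4))               # -> 32  (350m table, 1 GPU)
print(accumulation_steps(32, num_gpus=2))  # -> 2   (110m setting on 2 GPUs)
```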