
The CGPT model uses rotary position embeddings, tied layer-normalization parameters, and RMSNorm in place of standard LayerNorm. It is trained on roughly 20 tokens per parameter, following the Cerebras and Chinchilla scaling approaches, with a batch size of 256 and a context length of 2048, giving 524,288 tokens per batch and a total of around 2.4 billion training tokens. The test loss on the Pile reaches approximately 3.2 when single examples are loaded with a batch size of 1 and truncated to 2048 tokens; when examples are instead loaded as they were during training, with two samples packed back-to-back and delimited by <|endoftext|>, the test loss drops to approximately 1.9. The accompanying Jupyter notebook contains both sampling code and the test-loss evaluation for further analysis.
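
The difference between the two test-loss figures comes down to how the evaluation examples are packed. Below is a minimal sketch of that packing step, assuming tiktoken's r50k_base encoding; the helper name, the truncation behaviour, and the model-side loss computation are illustrative assumptions, not taken from the notebook.

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # same tokenizer family as the model (vocab_size = 50257)

def pack_pair(doc_a: str, doc_b: str, block_size: int = 2048) -> list[int]:
    """Pack two documents back-to-back, delimited by <|endoftext|>, for evaluation.

    Mirrors the "loaded as they were during training" setting described above.
    The helper name and truncation behaviour are illustrative assumptions.
    """
    ids = enc.encode_ordinary(doc_a) + [enc.eot_token] + enc.encode_ordinary(doc_b)
    return ids[:block_size]  # truncate to the model's context length

# The resulting ids (as a batch of size 1) would then be fed to the model to
# compute the cross-entropy test loss, as the accompanying notebook does.
```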

Model parameters:

(
    block_size = 2048,
    vocab_size = 50257, # tiktoken r50k_base
    n_layer = 12,
    n_head = 12,
    n_embd = 768,
    bias = False,
)
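
As a back-of-the-envelope check, the configuration above reproduces the figures quoted in the description. The calculation below is purely illustrative; the variable names and the GPT-2-style parameter estimate are assumptions, not taken from the training code.

```python
# Illustrative arithmetic only; names and formulas are assumptions, not the training code.
n_layer, n_embd = 12, 768
vocab_size, block_size = 50257, 2048

# Roughly 12 * n_embd^2 weights per transformer block (attention + MLP) plus one
# token-embedding matrix, assuming the input/output embeddings are tied (which the
# ~124M figure implies); rotary embeddings add no learned position parameters and
# bias=False removes the bias terms.
approx_params = n_layer * 12 * n_embd**2 + vocab_size * n_embd
print(f"approx. parameters: {approx_params / 1e6:.0f}M")      # ~124M

batch_size = 256
tokens_per_batch = batch_size * block_size                     # 524,288 tokens per batch
total_tokens = 20 * approx_params                              # 20 tokens per parameter
print(f"tokens per batch:      {tokens_per_batch:,}")
print(f"total training tokens: {total_tokens / 1e9:.2f}B")     # ~2.47B, the ~2.4 billion quoted above
```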

Sample:

"""

Once upon a time, the world is still a long time. In the past few years, the world has been the subject of a new world, but it has been the subject of a new world. Today, we have seen an unprecedented change in the world in the first place. This change is a new trend, and it is a new way of life.

The world is now the world’s most important asset for all. There is a lot of things to do with the world, and the world is a place that is not just a city, but a country. As a society, there is a great way to go to the world. It is a place where people are not only going to be able to live.

"""

