Small-scale pretraining experiments of mine.
Note: this is a mid-training checkpoint of what is now smol_llama-220M.