Provided are the model files for my custom transformer implementation in Julia.

V2 has 12 layers, ctx=1024, hidden_dim=768, 12 attention heads (MHA), vocab_size=65535, and an up-proj dim of 3072.

Trained on a subset of a dataset mix containing:
- 60% FineWeb-Edu
- 20% Wikipedia
- 10% Gutenberg
- 5% 4chan
- 5% Alpaca Instruct/Code

In total, ~1.5 billion of the 5 billion total tokens were used for training. The model has a perplexity of ~30, or loss=~3.4. It scores a truly bewildering 26% on MMLU.

Training settings: lr=0.003-0.0015, batch_size=8-64, batches of 1024 tokens.
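For a rough sense of scale, the sketch below estimates the weight count from the hyperparameters listed above and checks that the reported loss and perplexity agree (perplexity is `exp(loss)` when the loss is in nats). It is written in Python rather than Julia and is not taken from the actual implementation. The names `Config`, `estimate_params`, and `perplexity` are hypothetical, and the estimate assumes a GPT-2-style block with four attention projections and a plain two-matrix MLP, ignoring biases, layer norms, and positional embeddings. Whether the output head shares weights with the token embedding is not stated here, so both cases are shown.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Values copied from the model card.
    n_layers: int = 12
    ctx: int = 1024
    hidden_dim: int = 768
    n_heads: int = 12
    vocab_size: int = 65535
    up_proj_dim: int = 3072


def estimate_params(cfg: Config, tied_embeddings: bool = True) -> int:
    """Weight-matrix count only (assumption: no biases, norms, or positional table)."""
    embed = cfg.vocab_size * cfg.hidden_dim          # token embedding
    attn = 4 * cfg.hidden_dim ** 2                   # Q, K, V, and output projections
    mlp = 2 * cfg.hidden_dim * cfg.up_proj_dim       # up-projection and down-projection
    blocks = cfg.n_layers * (attn + mlp)
    head = 0 if tied_embeddings else embed           # separate output head if untied
    return embed + blocks + head


def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss_nats)


cfg = Config()
print(estimate_params(cfg, tied_embeddings=True))    # 135265536
print(estimate_params(cfg, tied_embeddings=False))   # 185596416
print(round(perplexity(3.4), 2))                     # 29.96
```

Under these assumptions the model lands at roughly 135M weights with tied embeddings or 186M without, and a loss of 3.4 corresponds to a perplexity of about 30, matching the figures quoted above.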