Some very small and very simple models.
29,960,200 parameters.
"dim":256,"dim_head":32,"headcount":8,"ff_mult":4, "vocab_size":50304, "num_layers":4.
this is nonstandard (for tinystories): it reflects the full gpt-2 vocabulary size (which bloats the embedding layers) and the use of a swiglu activation function (which doubles the width of one of the feedforward layers' input projections). see the sketch below for how the parameter count falls out of this config.
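for reference, here is a minimal sketch (plain PyTorch, not the repo's actual code) of a swiglu feedforward block at these dimensions, with a back-of-envelope parameter tally. the class and field names are hypothetical, and the tally assumes untied input/output embeddings and no bias terms:

```python
import torch
import torch.nn as nn

class SwiGLUFeedForward(nn.Module):
    """Hypothetical swiglu block: two input projections (gate + up) is the
    'doubled width' mentioned above, plus one output projection."""
    def __init__(self, dim: int, ff_mult: int):
        super().__init__()
        hidden = dim * ff_mult
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu(gate) * up, then project back down to the model dimension
        return self.w_down(nn.functional.silu(self.w_gate(x)) * self.w_up(x))

if __name__ == "__main__":
    dim, ff_mult, vocab, layers = 256, 4, 50304, 4
    ff = sum(p.numel() for p in SwiGLUFeedForward(dim, ff_mult).parameters())
    attn = 4 * dim * dim                 # q, k, v, and output projections
    embed = 2 * vocab * dim              # assumes untied input + output embeddings
    # ~29.95M; norm weights etc. account for the remainder up to 29,960,200
    print(embed + layers * (attn + ff))
```

note how the embeddings alone (~25.8M of the ~30M parameters) dominate at this scale, which is the "bloat" the config comment refers to.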
training, inference, dataset preparation, and network definition sources are available at https://github.com/SQCU/attn_demo
training logs
(unprocessed! unfiltered! just a bunch of log prints of train and validation loss!)
and training loader source for each run are included with the demo models.