Scaling Laws for Floating Point Quantization Training Paper • 2501.02423 • Published 8 days ago • 23
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22, 2024 • 254
Post: Native tensor parallel has landed in transformers!!! https://github.com/huggingface/transformers/pull/34184 Thanks a lot to the torch team for their support! Contributions are welcome to support more models! 🔥
Small-scale proxies for large-scale Transformer training instabilities Paper • 2309.14322 • Published Sep 25, 2023 • 19
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets Paper • 2201.02177 • Published Jan 6, 2022 • 2
A failed experiment: Infini-Attention, and why we should keep trying? Article • Aug 14, 2024 • 57
Grokfast: Accelerated Grokking by Amplifying Slow Gradients Paper • 2405.20233 • Published May 30, 2024 • 6
Transformer Explainer: Interactive Learning of Text-Generative Models Paper • 2408.04619 • Published Aug 8, 2024 • 156