Post: Pre-training on only a single RTX 4090 is really slow, even for small language models! (JingzeShi/doge-slm-677fd879f8c4fd0f43e05458)
Post: 🤩 warmup -> stable -> decay learning rate scheduler: 😎 use the stable-phase checkpoints to continue training the model on any new dataset without training spikes! JingzeShi/Doge-20M-checkpoint JingzeShi/Doge-60M-checkpoint
Paper: Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture • arXiv:2412.11834 • Published Dec 16, 2024
Paper: Cheems: Wonderful Matrices More Efficient and More Effective Architecture • arXiv:2407.16958 • Published Jul 24, 2024