DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models • arXiv:2309.14509 • Published Sep 25, 2023 • 16 upvotes
LLM Augmented LLMs: Expanding Capabilities through Composition • arXiv:2401.02412 • Published Jan 4, 2024 • 35 upvotes
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models • arXiv:2401.06066 • Published Jan 11, 2024 • 35 upvotes
Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding • arXiv:2401.12954 • Published Jan 23, 2024 • 28 upvotes
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval • arXiv:2401.18059 • Published Jan 31, 2024 • 23 upvotes
Specialized Language Models with Cheap Inference from Limited Domain Data • arXiv:2402.01093 • Published Feb 2, 2024 • 45 upvotes
Repeat After Me: Transformers are Better than State Space Models at Copying • arXiv:2402.01032 • Published Feb 1, 2024 • 22 upvotes
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models • arXiv:2402.01739 • Published Jan 29, 2024 • 26 upvotes
Scaling Laws for Downstream Task Performance of Large Language Models • arXiv:2402.04177 • Published Feb 6, 2024 • 16 upvotes
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs • arXiv:2402.04291 • Published Feb 6, 2024 • 48 upvotes
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model • arXiv:2402.07827 • Published Feb 12, 2024 • 43 upvotes
Learning to Learn Faster from Human Feedback with Language Model Predictive Control • arXiv:2402.11450 • Published Feb 18, 2024 • 20 upvotes
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows • arXiv:2402.10379 • Published Feb 16, 2024 • 27 upvotes
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens • arXiv:2402.13753 • Published Feb 21, 2024 • 104 upvotes
User-LLM: Efficient LLM Contextualization with User Embeddings • arXiv:2402.13598 • Published Feb 21, 2024 • 17 upvotes
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping • arXiv:2402.14083 • Published Feb 21, 2024 • 43 upvotes
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs • arXiv:2402.15627 • Published Feb 23, 2024 • 31 upvotes
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits • arXiv:2402.17764 • Published Feb 27, 2024 • 565 upvotes
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL • arXiv:2403.03950 • Published Mar 6, 2024 • 11 upvotes
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection • arXiv:2403.03507 • Published Mar 6, 2024 • 172 upvotes
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM • arXiv:2403.07816 • Published Mar 12, 2024 • 37 upvotes
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU • arXiv:2403.06504 • Published Mar 11, 2024 • 52 upvotes
PERL: Parameter Efficient Reinforcement Learning from Human Feedback • arXiv:2403.10704 • Published Mar 15, 2024 • 54 upvotes