- Robust Mixture-of-Expert Training for Convolutional Neural Networks
  Paper • 2308.10110 • Published • 2
- Experts Weights Averaging: A New General Training Scheme for Vision Transformers
  Paper • 2308.06093 • Published • 2
- ConstitutionalExperts: Training a Mixture of Principle-based Prompts
  Paper • 2403.04894 • Published • 2
- Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models
  Paper • 2403.03432 • Published

Collections including paper arxiv:2401.06066

- Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts
  Paper • 2009.10622 • Published • 1
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 46
- MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
  Paper • 2401.04081 • Published • 68
- MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
  Paper • 2401.14361 • Published • 2

- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 50
- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 135
- Tuning Language Models by Proxy
  Paper • 2401.08565 • Published • 19
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 35

- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 22
- OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
  Paper • 2402.01739 • Published • 26
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 46
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 35

- Simple linear attention language models balance the recall-throughput tradeoff
  Paper • 2402.18668 • Published • 17
- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 74
- Repeat After Me: Transformers are Better than State Space Models at Copying
  Paper • 2402.01032 • Published • 22
- Zoology: Measuring and Improving Recall in Efficient Language Models
  Paper • 2312.04927 • Published • 2

- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 22
- OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
  Paper • 2402.01739 • Published • 26
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 35

- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 46
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 35
- SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
  Paper • 2312.07987 • Published • 39
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  Paper • 2101.03961 • Published • 13

- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
  Paper • 2401.09417 • Published • 51
- MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
  Paper • 2401.04081 • Published • 68
- SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
  Paper • 2312.07987 • Published • 39
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 35

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 135
- ReFT: Reasoning with Reinforced Fine-Tuning
  Paper • 2401.08967 • Published • 27
- Tuning Language Models by Proxy
  Paper • 2401.08565 • Published • 19
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 62

- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
  Paper • 1701.06538 • Published • 4
- Sparse Networks from Scratch: Faster Training without Losing Performance
  Paper • 1907.04840 • Published • 3
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  Paper • 1910.02054 • Published • 3
- A Mixture of h-1 Heads is Better than h Heads
  Paper • 2005.06537 • Published • 2