Collections including paper arxiv:2403.08295

- Measuring the Effects of Data Parallelism on Neural Network Training (Paper • 1811.03600 • Published • 2)
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (Paper • 1804.04235 • Published • 2)
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (Paper • 1905.11946 • Published • 3)
- Yi: Open Foundation Models by 01.AI (Paper • 2403.04652 • Published • 59)

- Nemotron-4 15B Technical Report (Paper • 2402.16819 • Published • 40)
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (Paper • 2402.19427 • Published • 50)
- RWKV: Reinventing RNNs for the Transformer Era (Paper • 2305.13048 • Published • 10)
- Reformer: The Efficient Transformer (Paper • 2001.04451 • Published)

- Neural Network Diffusion (Paper • 2402.13144 • Published • 93)
- Genie: Generative Interactive Environments (Paper • 2402.15391 • Published • 67)
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (Paper • 2402.17177 • Published • 87)
- VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (Paper • 2403.00522 • Published • 40)

- Rethinking Optimization and Architecture for Tiny Language Models (Paper • 2402.02791 • Published • 12)
- More Agents Is All You Need (Paper • 2402.05120 • Published • 46)
- Scaling Laws for Forgetting When Fine-Tuning Large Language Models (Paper • 2401.05605 • Published)
- Aligning Large Language Models with Counterfactual DPO (Paper • 2401.09566 • Published • 2)

- LoRA: Low-Rank Adaptation of Large Language Models (Paper • 2106.09685 • Published • 24)
- Instruct-Imagen: Image Generation with Multi-modal Instruction (Paper • 2401.01952 • Published • 29)
- mistralai/Mixtral-8x7B-Instruct-v0.1 (Text Generation • Updated • 423k • 3.86k)
- Gemma: Open Models Based on Gemini Research and Technology (Paper • 2403.08295 • Published • 43)

- Attention Is All You Need (Paper • 1706.03762 • Published • 36)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Paper • 1810.04805 • Published • 11)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (Paper • 1907.11692 • Published • 7)
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (Paper • 1910.01108 • Published • 11)

- SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling (Paper • 2312.15166 • Published • 55)
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (Paper • 2312.12456 • Published • 40)
- Cached Transformers: Improving Transformers with Differentiable Memory Cache (Paper • 2312.12742 • Published • 11)
- Mini-GPTs: Efficient Large Language Models through Contextual Pruning (Paper • 2312.12682 • Published • 7)