- Lost in the Middle: How Language Models Use Long Contexts
  Paper • 2307.03172 • Published • 31
- Efficient Estimation of Word Representations in Vector Space
  Paper • 1301.3781 • Published • 6
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 11
- Attention Is All You Need
  Paper • 1706.03762 • Published • 34
Collections
Collections including paper arxiv:2404.02258
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 566
- BitNet: Scaling 1-bit Transformers for Large Language Models
  Paper • 2310.11453 • Published • 94
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
  Paper • 2404.02258 • Published • 100
- TransformerFAM: Feedback attention is working memory
  Paper • 2404.09173 • Published • 42

- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
  Paper • 2404.02258 • Published • 100
- Jamba: A Hybrid Transformer-Mamba Language Model
  Paper • 2403.19887 • Published • 98
- EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba
  Paper • 2403.09977 • Published • 7
- SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series
  Paper • 2403.15360 • Published • 11

- JetMoE: Reaching Llama2 Performance with 0.1M Dollars
  Paper • 2404.07413 • Published • 32
- Rho-1: Not All Tokens Are What You Need
  Paper • 2404.07965 • Published • 79
- Jamba: A Hybrid Transformer-Mamba Language Model
  Paper • 2403.19887 • Published • 98
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
  Paper • 2404.02258 • Published • 100

- Jamba: A Hybrid Transformer-Mamba Language Model
  Paper • 2403.19887 • Published • 98
- Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order
  Paper • 2404.00399 • Published • 39
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
  Paper • 2404.02258 • Published • 100
- Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
  Paper • 2404.08801 • Published • 61