Models
Datasets
Spaces
Posts
Docs
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2401.18079

KV Cache Compression

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Paper • 2405.12981 • Published May 21 • 28
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Paper • 2401.18079 • Published Jan 31 • 7

Lee's RoPE Tricks / Context Extension Reads

Set of Long Context (RoPE or otherwise) I'm collecting off of HF

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

Paper • 2402.13753 • Published Feb 21 • 111
Data Engineering for Scaling Language Models to 128K Context

Paper • 2402.10171 • Published Feb 15 • 21
LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration

Paper • 2402.11550 • Published Feb 18 • 15
The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey

Paper • 2401.07872 • Published Jan 15 • 2

Quantization | Compression

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Paper • 2402.04291 • Published Feb 6 • 48
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Paper • 2401.18079 • Published Jan 31 • 7
Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Paper • 2402.08958 • Published Feb 14 • 3
OneBit: Towards Extremely Low-bit Large Language Models

Paper • 2402.11295 • Published Feb 17 • 22

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Paper • 2402.19427 • Published Feb 29 • 52
Simple linear attention language models balance the recall-throughput tradeoff

Paper • 2402.18668 • Published Feb 28 • 18
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Paper • 2402.15220 • Published Feb 23 • 19
Linear Transformers are Versatile In-Context Learners

Paper • 2402.14180 • Published Feb 21 • 6

Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon

Paper • 2401.03462 • Published Jan 7 • 26
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

Paper • 2305.07185 • Published May 12, 2023 • 9
YaRN: Efficient Context Window Extension of Large Language Models

Paper • 2309.00071 • Published Aug 31, 2023 • 65
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Paper • 2401.02669 • Published Jan 5 • 14

QuIP: 2-Bit Quantization of Large Language Models With Guarantees

Paper • 2307.13304 • Published Jul 25, 2023 • 2
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

Paper • 2306.03078 • Published Jun 5, 2023 • 3
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Paper • 2308.13137 • Published Aug 25, 2023 • 17
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Paper • 2306.00978 • Published Jun 1, 2023 • 8

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Paper • 2312.11514 • Published Dec 12, 2023 • 258
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Paper • 2312.12456 • Published Dec 16, 2023 • 41
Accelerating LLM Inference with Staged Speculative Decoding

Paper • 2308.04623 • Published Aug 8, 2023 • 23
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Paper • 2208.07339 • Published Aug 15, 2022 • 4

Company

© Hugging Face

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs