matlok's Collections
Papers - Attention
Linear Transformers with Learnable Kernel Functions are Better In-Context Models • arXiv:2402.10644 • 79 upvotes
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints • arXiv:2305.13245 • 5 upvotes (sketch below)
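A minimal NumPy sketch of the grouped-query idea in GQA (arXiv:2305.13245): each group of query heads shares one key/value head, realized here by repeating the KV heads. The function name, shapes, and the assumption that the query-head count divides evenly by the KV-head count are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d).
    Each group of n_q_heads // n_kv_heads query heads attends to one shared KV head."""
    n_q_heads, T, d = q.shape
    group = n_q_heads // n_kv_heads
    # Repeat each KV head across its query-head group (the "un-grouped" view of GQA).
    k_rep = np.repeat(k, group, axis=0)                   # (n_q_heads, T, d)
    v_rep = np.repeat(v, group, axis=0)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)    # (n_q_heads, T, T)
    return softmax(scores, axis=-1) @ v_rep               # (n_q_heads, T, d)

# Example: 8 query heads sharing 2 KV heads.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 64))
k = rng.standard_normal((2, 16, 64))
v = rng.standard_normal((2, 16, 64))
out = grouped_query_attention(q, k, v, n_kv_heads=2)      # (8, 16, 64)
```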
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition • arXiv:2402.15220 • 19 upvotes
Sequence Parallelism: Long Sequence Training from System Perspective • arXiv:2105.13120 • 5 upvotes
Ring Attention with Blockwise Transformers for Near-Infinite Context • arXiv:2310.01889 • 10 upvotes (sketch below)
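Ring Attention (arXiv:2310.01889) computes exact attention blockwise while key/value blocks circulate around a ring of devices, each device maintaining a running softmax. The sketch below shows only that running-softmax accumulation in single-process NumPy, with a plain loop standing in for the ring communication; the name `blockwise_attention` and the shapes are assumptions for illustration.

```python
import numpy as np

def blockwise_attention(q, k, v, block_size):
    """Exact attention computed one key/value block at a time with a running
    (online) softmax, so the full T x T score matrix is never materialized.
    In Ring Attention each block would sit on a different device and be passed
    around a ring; this single-process loop stands in for that communication."""
    T, d = q.shape
    out = np.zeros_like(q)
    row_max = np.full((T, 1), -np.inf)   # running per-query max of scores
    row_sum = np.zeros((T, 1))           # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        s = q @ kb.T / np.sqrt(d)                               # (T, block)
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)                       # rescale old accumulators
        p = np.exp(s - new_max)
        out = out * scale + p @ vb
        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        row_max = new_max
    return out / row_sum

# Agrees with dense softmax attention up to floating-point error.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 32)) for _ in range(3))
s = q @ k.T / np.sqrt(32)
p = np.exp(s - s.max(axis=-1, keepdims=True))
dense = (p / p.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v, block_size=32), dense, atol=1e-6)
```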
Striped Attention: Faster Ring Attention for Causal Transformers • arXiv:2311.09431 • 4 upvotes
Longformer: The Long-Document Transformer • arXiv:2004.05150 • 3 upvotes (sketch below)
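Longformer (arXiv:2004.05150) replaces full self-attention with a sliding-window pattern (plus global tokens, omitted here). The sketch below applies the window as a dense mask purely to show the pattern; the real implementation computes only the banded scores, which is where the memory saving comes from. Names and shapes are illustrative.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Each token attends only to tokens within `window` positions of itself
    (the local-attention component of Longformer; global tokens are omitted)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                         # (T, T)
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) > window   # True = outside the window
    scores = np.where(mask, -np.inf, scores)
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p = p / p.sum(axis=-1, keepdims=True)
    return p @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
out = sliding_window_attention(q, k, v, window=8)         # (64, 32)
```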
DeBERTa: Decoding-enhanced BERT with Disentangled Attention • arXiv:2006.03654 • 3 upvotes
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing • arXiv:2111.09543 • 2 upvotes
Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes • arXiv:2110.05909 • 2 upvotes
3D Medical Image Segmentation based on multi-scale MPU-Net • arXiv:2307.05799 • 2 upvotes
Attention Swin U-Net: Cross-Contextual Attention Mechanism for Skin Lesion Segmentation • arXiv:2210.16898 • 2 upvotes
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows • arXiv:2107.00652 • 2 upvotes
BOAT: Bilateral Local Attention Vision Transformer • arXiv:2201.13027 • 2 upvotes
MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition • arXiv:2209.01620 • 2 upvotes
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows • arXiv:2103.14030 • 4 upvotes (sketch below)
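Swin Transformer (arXiv:2103.14030) runs attention inside non-overlapping windows and, in alternating layers, cyclically shifts the feature map so windows straddle the previous layer's boundaries. A minimal NumPy sketch of just that partition-and-shift step, assuming a square map divisible by the window size; the attention itself and the mask that keeps wrapped-around regions from mixing are omitted.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping (ws, ws) windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)   # (num_windows, ws*ws, C)

def shifted_windows(x, ws):
    """Swin's shifted-window step: cyclically roll the map by ws // 2 so the next
    layer's windows straddle the previous layer's window boundaries."""
    shifted = np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))
    return window_partition(shifted, ws)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))     # toy 8x8 feature map, 16 channels
regular = window_partition(feat, ws=4)     # (4, 16, 16): attention runs within each window
shifted = shifted_windows(feat, ws=4)      # windows now cross the old boundaries
```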
Using Multi-scale SwinTransformer-HTC with Data augmentation in CoNIC Challenge • arXiv:2202.13588 • 2 upvotes
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small • arXiv:2211.00593 • 2 upvotes
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences • arXiv:2403.09347 • 20 upvotes
Vision Transformer with Quadrangle Attention • arXiv:2303.15105 • 2 upvotes
Lightweight Image Inpainting by Stripe Window Transformer with Joint Attention to CNN • arXiv:2301.00553 • 2 upvotes
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers • arXiv:2311.10642 • 23 upvotes
Code Completion using Neural Attention and Byte Pair Encoding • arXiv:2004.06343 • 2 upvotes
Recurrent Drafter for Fast Speculative Decoding in Large Language Models • arXiv:2403.09919 • 20 upvotes (sketch below)
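Recurrent Drafter (arXiv:2403.09919), like SpecInfer and BASS further down this list, builds on the draft-and-verify loop of speculative decoding. The sketch below shows only that generic loop in its greedy form with toy stand-in models, not the paper's recurrent draft head or any tree/batch verification scheme; all names here are made up for illustration.

```python
def speculative_step(draft_next, target_next, prefix, k):
    """One draft-and-verify step of speculative decoding (greedy variant).
    draft_next / target_next: callables mapping a token sequence to the next token id.
    The draft model proposes k tokens; the target keeps the longest prefix of the
    proposal matching its own greedy choices, plus one corrected token."""
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    accepted = list(prefix)
    for i in range(len(prefix), len(proposal)):
        t = target_next(accepted)          # in practice: scored in one batched target pass
        accepted.append(t)
        if t != proposal[i]:               # first mismatch: keep the target's token and stop
            break
    return accepted

# Toy integer "models": the draft is a cheap approximation of the target.
target_next = lambda seq: (3 * seq[-1] + 1) % 17
draft_next = lambda seq: (3 * seq[-1] + 1) % 17 if seq[-1] % 5 else 0
print(speculative_step(draft_next, target_next, prefix=[1, 2, 3], k=4))
```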
Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers • arXiv:2403.12943 • 14 upvotes
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis • arXiv:2403.13501 • 9 upvotes
Efficient Memory Management for Large Language Model Serving with PagedAttention • arXiv:2309.06180 • 25 upvotes (sketch below)
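PagedAttention (arXiv:2309.06180) stores the KV cache in fixed-size blocks of a shared pool and gives each sequence a block table instead of one contiguous buffer, so memory is allocated on demand and can be shared. A toy NumPy sketch of that bookkeeping, not vLLM's implementation; the class, block size, and method names are invented, and the attention kernel that reads directly from the block table is omitted.

```python
import numpy as np

BLOCK = 16  # tokens per KV-cache block

class PagedKVCache:
    """Toy block-table KV cache in the spirit of PagedAttention."""
    def __init__(self, num_blocks, d):
        self.k_pool = np.zeros((num_blocks, BLOCK, d))
        self.v_pool = np.zeros((num_blocks, BLOCK, d))
        self.free = list(range(num_blocks))
        self.block_table = {}   # seq_id -> list of block indices
        self.length = {}        # seq_id -> number of cached tokens

    def append(self, seq_id, k, v):
        n = self.length.get(seq_id, 0)
        if n % BLOCK == 0:       # current block full (or first token): grab a new block
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        blk = self.block_table[seq_id][-1]
        self.k_pool[blk, n % BLOCK] = k
        self.v_pool[blk, n % BLOCK] = v
        self.length[seq_id] = n + 1

    def gather(self, seq_id):
        """Reassemble the logical K/V tensors for one sequence from its block table."""
        blocks = self.block_table[seq_id]
        n = self.length[seq_id]
        k = self.k_pool[blocks].reshape(-1, self.k_pool.shape[-1])[:n]
        v = self.v_pool[blocks].reshape(-1, self.v_pool.shape[-1])[:n]
        return k, v

rng = np.random.default_rng(0)
cache = PagedKVCache(num_blocks=8, d=4)
for _ in range(20):              # 20 tokens -> this sequence occupies 2 blocks
    cache.append("seq0", rng.standard_normal(4), rng.standard_normal(4))
k, v = cache.gather("seq0")      # (20, 4) each
```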
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention • arXiv:2404.07143 • 103 upvotes
arXiv:2404.07821 • 11 upvotes
JetMoE: Reaching Llama2 Performance with 0.1M Dollars • arXiv:2404.07413 • 36 upvotes
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length • arXiv:2404.08801 • 63 upvotes
Hydragen: High-Throughput LLM Inference with Shared Prefixes • arXiv:2402.05099 • 19 upvotes (sketch below)
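Hydragen (arXiv:2402.05099) splits attention into a piece over the prefix shared by many sequences and a piece over each sequence's private suffix, then recombines the pieces exactly using per-segment log-sum-exp weights. The sketch below shows only that recombination for a single sequence; batching the shared-prefix matmul across sequences, which is where the throughput gain comes from, is not shown, and the function names are illustrative.

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention over one KV segment plus its log-sum-exp, so results over
    different segments can be merged exactly afterwards."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    lse = m + np.log(p.sum(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v, lse

def combine(parts):
    """Softmax attention over the concatenated segments equals the
    lse-weighted combination of the per-segment outputs."""
    outs, lses = zip(*parts)
    lse_total = np.log(sum(np.exp(l) for l in lses))
    return sum(o * np.exp(l - lse_total) for o, l in zip(outs, lses))

rng = np.random.default_rng(0)
d = 32
q = rng.standard_normal((4, d))                     # queries of one decoding step
k_prefix, v_prefix = rng.standard_normal((64, d)), rng.standard_normal((64, d))
k_suffix, v_suffix = rng.standard_normal((8, d)), rng.standard_normal((8, d))
out = combine([partial_attention(q, k_prefix, v_prefix),
               partial_attention(q, k_suffix, v_suffix)])

# Matches attention over the concatenated [prefix; suffix] keys and values.
k_all, v_all = np.vstack([k_prefix, k_suffix]), np.vstack([v_prefix, v_suffix])
s = q @ k_all.T / np.sqrt(d)
p = np.exp(s - s.max(axis=-1, keepdims=True))
assert np.allclose(out, (p / p.sum(axis=-1, keepdims=True)) @ v_all, atol=1e-6)
```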
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs • arXiv:2402.15627 • 34 upvotes
MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation • arXiv:2404.11565 • 14 upvotes
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification • arXiv:2305.09781 • 4 upvotes
GLIGEN: Open-Set Grounded Text-to-Image Generation • arXiv:2301.07093 • 3 upvotes
FlashSpeech: Efficient Zero-Shot Speech Synthesis • arXiv:2404.14700 • 29 upvotes
Multi-Head Mixture-of-Experts • arXiv:2404.15045 • 59 upvotes
Transformers Can Represent n-gram Language Models • arXiv:2404.14994 • 18 upvotes
BASS: Batched Attention-optimized Speculative Sampling • arXiv:2404.15778 • 8 upvotes
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation • arXiv:2404.19427 • 71 upvotes
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation • arXiv:2404.07129 • 3 upvotes
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality • arXiv:2405.21060 • 63 upvotes
VideoFACT: Detecting Video Forgeries Using Attention, Scene Context, and Forensic Traces • arXiv:2211.15775 • 1 upvote
Reasoning in Large Language Models: A Geometric Perspective • arXiv:2407.02678 • 1 upvote
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures • arXiv:2407.09468 • 1 upvote
arXiv:2405.15932 • 1 upvote
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters • arXiv:2408.04093 • 4 upvotes
Attention Heads of Large Language Models: A Survey • arXiv:2409.03752 • 88 upvotes
Differential Transformer • arXiv:2410.05258 • 166 upvotes
ThunderKittens: Simple, Fast, and Adorable AI Kernels • arXiv:2410.20399 • 1 upvote
HAT: Hybrid Attention Transformer for Image Restoration • arXiv:2309.05239 • 1 upvote
Unraveling the Gradient Descent Dynamics of Transformers • arXiv:2411.07538 • 2 upvotes