arxiv:2402.01032

Repeat After Me: Transformers are Better than State Space Models at Copying

Published on Feb 1
· Submitted by akhaliq on Feb 5

Abstract

Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models" (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context. We start with a theoretical analysis of the simple task of string copying and prove that a two layer transformer can copy strings of exponential length while GSSMs are fundamentally limited by their fixed-size latent state. Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context. Finally, we evaluate pretrained large language models and find that transformer models dramatically outperform state space models at copying and retrieving information from context. Taken together, these results suggest a fundamental gap between transformers and GSSMs on tasks of practical interest.
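To make the copy task concrete, here is a minimal sketch of how such a synthetic benchmark can be set up: sample uniformly random token strings and measure exact-match copy accuracy as a function of string length. The vocabulary, lengths, and the `model_copy` stand-in below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a synthetic copy benchmark.
# Assumptions (not from the paper): a 26-token vocabulary, exact-match scoring,
# and a `model_copy` callable that maps an input string to its attempted copy.
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz")

def sample_string(length):
    """Draw a string of i.i.d. uniform tokens, so copying cannot be memorized."""
    return [random.choice(VOCAB) for _ in range(length)]

def exact_copy_accuracy(model_copy, lengths, trials=100):
    """Fraction of strings the model reproduces exactly, per string length."""
    results = {}
    for n in lengths:
        correct = 0
        for _ in range(trials):
            s = sample_string(n)
            if model_copy(s) == s:
                correct += 1
        results[n] = correct / trials
    return results

if __name__ == "__main__":
    # A perfect copier scores 1.0 at every length; a fixed-size-state model is
    # expected to degrade once the string holds more information than its state.
    perfect_copier = lambda s: list(s)
    print(exact_copy_accuracy(perfect_copier, lengths=[16, 64, 256]))
```

The intuition behind the theoretical gap is a counting argument: a model whose state carries b bits can distinguish at most 2^b distinct prefixes, while there are |V|^n strings of length n, so copying arbitrary strings exactly requires roughly n * log2 |V| bits of state. A transformer's usable context grows with n; a GSSM's latent state does not.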

Community

(yet) Another paper on induction heads vs copying: https://arxiv.org/pdf/2205.10487.pdf (Anthropic, also with Catherine Olsson)

Pages 7-8 discuss how high levels of repetition within the pretraining data lead a (transformer) model to fail to learn induction heads, which they use to explain a significant degradation in copying performance. See the section titled "The disproportionate performance hit to copying coincides with a disproportionate degradation of induction heads."

Somewhat circularly, https://arxiv.org/pdf/2401.12973.pdf discusses how GSSMs fail to learn representations of induction heads (but doesn't dive into performance on copy tasks).
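For readers unfamiliar with the mechanism: an induction head implements a simple prefix-matching rule, namely attend to the previous occurrence of the current token and emit the token that followed it. Below is a toy sketch of that rule (not code from either paper); the point is that evaluating it exactly requires access to the full prefix, which attention provides but a fixed-size recurrent state in general cannot for long inputs.

```python
# Toy sketch of the induction-head copying rule (illustrative only).
def induction_head_predict(prefix):
    """Predict the next token: find the most recent earlier occurrence of the
    last token and return the token that followed that occurrence."""
    if len(prefix) < 2:
        return None
    last = prefix[-1]
    # Scan backwards over earlier positions for a previous occurrence of `last`.
    for i in range(len(prefix) - 2, -1, -1):
        if prefix[i] == last:
            return prefix[i + 1]  # the token that followed the earlier match
    return None

# Example: given the prefix "a b c | a b", the rule predicts "c".
print(induction_head_predict(list("abc|ab")))  # -> 'c'
```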

Was this tested on a random bag of words, or with random character strings?


Why Transformers Outshine State Space Models in Copying Tasks

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


