arxiv:2405.12250

Your Transformer is Secretly Linear

Published on May 19
· Submitted by akhaliq on May 22
#1 Paper of the day

Abstract

This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, due to a consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance. Moreover, in our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization improves performance on benchmarks like Tiny Stories and SuperGLUE and also successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
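
To make the abstract's linearity claim concrete, here is a minimal sketch (not the authors' code) of one way to score it, assuming the score is roughly the fraction of variance in one layer's hidden states explained by the best least-squares linear map from the previous layer's centered, normalized hidden states:

```python
# Minimal sketch (not the authors' code): score how well layer k+1's
# embeddings are a linear function of layer k's embeddings.
import numpy as np

def linearity_score(X, Y):
    """X, Y: (n_tokens, d) hidden states of two consecutive layers."""
    # Center and normalize both sets, so the score ignores translation
    # and global scale (in the spirit of a Procrustes-style comparison).
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc)
    Yc = Yc / np.linalg.norm(Yc)

    # Best linear map A minimizing ||Xc @ A - Yc||_F (ordinary least squares).
    A, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    residual = np.linalg.norm(Xc @ A - Yc) ** 2

    # 1.0 means the next layer is an exact linear function of the previous one.
    return 1.0 - residual

# Toy check: a perfectly linear "next layer" scores ~1.0.
X = np.random.randn(2048, 256)
print(linearity_score(X, X @ np.random.randn(256, 256)))
```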

Community

I wonder what this would imply for approximating Transformers / efficiency.


The proposed regularization technique makes training more efficient by giving control over the linearisation of embeddings.

Absolute chads at SberAI for still releasing after the war started. Regardless of one's political stance, I respect them a lot for not just cancelling their research division or stopping putting anything on arXiv.


However, the major work was done at AIRI 😉 We love science, and there are no limits for the job you love. Thank you for the kind words!

I'm a simple man, I see "secretly linear," I upvote.

Well, judging from the newer paper by MIT, it seems the features are not as linear as previously thought. https://huggingface.co/papers/2405.14860

In this case, if I understand both papers correctly, linearization can hurt the model by eliminating complex associations, such as days of the week, months, years, and many other implicit nonlinear features we cannot even know exist in the model, but which are directly tied to the model's understanding of the cyclic/curved/jagged parts of the world.


These are different papers: this one studies the linearity between two consecutive transformer block transformations, while the MIT paper studied embedding linearity within a single transformer layer.

MIT VS AIRI LMAO

Is that so? Or should I say: We will see about that!

Working on reproducing this and similar pruning criteria here:
https://github.com/melisa-writer/short-transformers
Linear approximation of the last token is there, along with angular distances, bi score, etc.

The goal of the library: choose your distance (layer importance metric), get a cropped model. :rocket:
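
To illustrate the kind of layer-importance metric involved (a generic sketch, not the short-transformers API), one option is the mean angular distance between each block's input and output hidden states; blocks that barely rotate the embeddings are the first candidates for cropping:

```python
# Generic sketch of one layer-importance metric (not the short-transformers
# API): mean angular distance between each block's input and output states.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_angular_distances(model, input_ids):
    out = model(input_ids, output_hidden_states=True)
    hs = out.hidden_states          # tuple of (batch, seq, d), length n_layers + 1
    dists = []
    for h_in, h_out in zip(hs[:-1], hs[1:]):
        cs = F.cosine_similarity(h_in, h_out, dim=-1).clamp(-1.0, 1.0)
        dists.append(torch.arccos(cs).mean().item() / math.pi)
    return dists  # low distance ~ nearly-linear block, candidate for removal
```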

The implications of this work are significant. There is so much to explore.
One thing that I can't quite grasp is how Cosine Similarity regularization manages to control linearity.


Actually, this is a surprising outcome: the hypothesis is that adding a cosine-similarity term that pushes embeddings to be more similar (CS -> 1) leads the training process to increase the non-linear part in the residual stream. We plan to investigate this effect further.
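
For readers wondering what this looks like in code, here is a rough sketch of a cosine-similarity term between consecutive layers' hidden states; the exact formulation in the paper may differ, and the weight and the push toward CS -> 1 are assumptions taken from the discussion above:

```python
# Rough sketch of a cosine-similarity regularizer between consecutive
# layers' hidden states (not necessarily the paper's exact formulation).
import torch.nn.functional as F

def cosine_reg(hidden_states, weight=0.1):
    """hidden_states: list/tuple of (batch, seq, d) tensors, one per layer."""
    reg = 0.0
    for h_prev, h_next in zip(hidden_states[:-1], hidden_states[1:]):
        cs = F.cosine_similarity(h_prev, h_next, dim=-1)  # (batch, seq)
        reg = reg + (1.0 - cs).mean()                     # pushes CS toward 1
    return weight * reg

# total_loss = lm_loss + cosine_reg(outputs.hidden_states)
# (e.g. with output_hidden_states=True in a Hugging Face forward pass)
```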

Your Transformer Might Be Linear! | Deep Dive

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

In the paper:

"Furthermore, our feature triggering regime hypothesis proposes that rare specific features on a few tokens with high non-linearity significantly influence model behavior; in Figure 9 one can see that some layers of OPT-1.3B have a long-tailed distribution of L2 errors, which means that there are still sparse spikes of non-linearity."

How is this L2 error in Figure 9 calculated?
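
One plausible reading (an assumption, not confirmed by the paper): the per-token residual of a fitted linear map between consecutive layers, e.g. the normalized L2 norm of the difference between the true next-layer embedding and its linear approximation:

```python
# Assumed reconstruction of a per-token L2 error: residual of a shared
# least-squares linear map between consecutive layers, measured per token.
import numpy as np

def per_token_l2_error(X, Y):
    """X, Y: (n_tokens, d) hidden states of two consecutive layers."""
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)   # one linear map for all tokens
    residual = Y - X @ A
    return np.linalg.norm(residual, axis=-1) / np.linalg.norm(Y, axis=-1)

# A long right tail in the histogram of these values would correspond to the
# sparse, highly non-linear tokens described in the quoted passage.
```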

