15 types of attention mechanisms

Attention mechanisms allow models to dynamically focus on specific parts of their input when performing tasks. In our recent article, we discussed Multi-Head Latent Attention (MLA) in detail, and now it's time to summarize the other existing types of attention.

Here is a list of 15 types of attention mechanisms used in AI models; minimal PyTorch sketches of the first seven follow right after the list:

1. Soft attention (Deterministic attention) -> Neural Machine Translation by Jointly Learning to Align and Translate (1409.0473)
Assigns a continuous weight distribution over all parts of the input. It produces a weighted sum of the input using attention weights that sum to 1.

2. Hard attention (Stochastic attention) -> Effective Approaches to Attention-based Neural Machine Translation (1508.04025)
Makes a discrete selection of some part of the input to focus on at each step, rather than attending to everything.

3. Self-attention -> Attention Is All You Need (1706.03762)
Each element in the sequence "looks" at other elements and "decides" how much to borrow from each of them for its new representation.

4. Cross-Attention (Encoder-Decoder attention) -> Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation (2104.08771)
The queries come from one sequence and the keys/values come from another sequence. It allows a model to combine information from two different sources.

5. Multi-Head Attention (MHA) -> Attention Is All You Need (1706.03762)
Multiple attention “heads” are run in parallel. The model computes several attention distributions, one per head, each with its own set of learned projections of queries, keys, and values.

6. Multi-Head Latent Attention (MLA) -> DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (2405.04434)
Compresses the keys and values into a compact shared latent vector, which is cached in place of the full per-head keys and values; this shrinks the KV cache substantially while keeping quality close to standard MHA.

7. Memory-Based attention -> End-To-End Memory Networks (1503.08895)
Involves an external memory and uses attention to read from and write to this memory.
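
Below are minimal PyTorch sketches of the seven mechanisms above; shapes, sizes, and layer names are illustrative rather than taken from the cited papers, and projections, masking, and training details are simplified. Soft attention first: a softmax gives continuous weights over every input position, and the output is their weighted sum (the cited paper scores with an additive network; a plain dot product is used here for brevity).

```python
import torch

query = torch.randn(64)          # current decoder state (illustrative size)
inputs = torch.randn(10, 64)     # 10 encoder annotations

scores = inputs @ query                   # one score per input position
weights = torch.softmax(scores, dim=0)    # continuous weights that sum to 1
context = weights @ inputs                # weighted sum over ALL positions
```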
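Hard attention, hedged sketch: sample a single position from the same kind of distribution instead of averaging (training this typically needs REINFORCE-style gradient estimators, omitted here).

```python
import torch

inputs = torch.randn(10, 64)
scores = inputs @ torch.randn(64)              # illustrative scores
probs = torch.softmax(scores, dim=0)

idx = torch.multinomial(probs, num_samples=1)  # discrete choice of one position
context = inputs[idx.item()]                   # attend only to that position
```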
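Self-attention, minimal single-head sketch: queries, keys, and values are all projections of the same sequence, so every token scores every other token.

```python
import torch
import torch.nn as nn

d = 64
x = torch.randn(5, d)                      # one sequence of 5 tokens

q_proj, k_proj, v_proj = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
Q, K, V = q_proj(x), k_proj(x), v_proj(x)  # all derived from x itself

scores = Q @ K.T / d ** 0.5                # (5, 5): token-to-token affinities
weights = torch.softmax(scores, dim=-1)
out = weights @ V                          # each token mixes in the others
```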
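Cross-attention, hedged sketch with the projections omitted: queries come from one sequence, keys and values from another.

```python
import torch

d = 64
decoder_states = torch.randn(3, d)    # queries come from here
encoder_states = torch.randn(7, d)    # keys and values come from here

scores = decoder_states @ encoder_states.T / d ** 0.5   # (3, 7)
weights = torch.softmax(scores, dim=-1)
context = weights @ encoder_states    # information pulled from the other source
```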
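Multi-Head Attention, a compact sketch of the split/merge mechanics (in practice you would use nn.MultiheadAttention; the final output projection is omitted here).

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 8
d_head = d_model // n_heads
x = torch.randn(5, d_model)

qkv = nn.Linear(d_model, 3 * d_model)(x)          # learned Q/K/V projections
q, k, v = qkv.chunk(3, dim=-1)

# Split into heads so each head attends with its own slice of the projections.
q = q.view(5, n_heads, d_head).transpose(0, 1)    # (heads, seq, d_head)
k = k.view(5, n_heads, d_head).transpose(0, 1)
v = v.view(5, n_heads, d_head).transpose(0, 1)

weights = torch.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)
out = (weights @ v).transpose(0, 1).reshape(5, d_model)   # concatenate heads back
```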
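Multi-Head Latent Attention, a rough single-head sketch of the core idea only: keys and values are reconstructed from a small shared latent, and it is that latent that would be cached (RoPE handling and the other DeepSeek-V2 details are left out).

```python
import torch
import torch.nn as nn

d_model, d_latent = 64, 16                # latent is much smaller than d_model
x = torch.randn(5, d_model)

down = nn.Linear(d_model, d_latent)       # compress token states into a latent
up_k = nn.Linear(d_latent, d_model)       # re-expand keys from the latent
up_v = nn.Linear(d_latent, d_model)       # re-expand values from the latent
q_proj = nn.Linear(d_model, d_model)

latent = down(x)                          # (5, d_latent): this is what gets cached
K, V = up_k(latent), up_v(latent)
Q = q_proj(x)

weights = torch.softmax(Q @ K.T / d_model ** 0.5, dim=-1)
out = weights @ V
```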
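Memory-based attention, a hedged single-hop sketch in the spirit of end-to-end memory networks (slot embeddings are random stand-ins; the write/update path is omitted).

```python
import torch

d = 64
memory_keys = torch.randn(20, d)     # input embeddings of 20 memory slots
memory_vals = torch.randn(20, d)     # output embeddings of the same slots
query = torch.randn(d)               # embedded question / controller state

weights = torch.softmax(memory_keys @ query, dim=0)   # address the memory
read = weights @ memory_vals                          # read by weighted sum
query = query + read                                  # updated state, ready for another hop
```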

See other types in the comments 👇
  8. Adaptive attention -> https://huggingface.co/papers/1612.01887
    Dynamically adjusts its attention behavior – when or whether to use attention, or how broad the attention should be (sketch below).

  9. Scaled Dot-Product attention -> https://huggingface.co/papers/2404.16629
    Attention scores are computed by the dot product between a query vector and a key vector, and then divided by the square root of the key dimension before applying softmax.

  10. Additive attention -> https://huggingface.co/papers/1409.0473
    Computes attention scores with a small feed-forward network that combines the query and key vectors (sketch below).

  11. Global attention -> https://huggingface.co/papers/1508.04025
    Is a form of soft attention that considers all possible positions in the input sequence.

  12. Local attention -> https://huggingface.co/papers/1508.04025
    A compromise between hard and soft attention: the model attends only to a restricted subset of input positions at a given step (sketch below).

  13. Sparse attention -> https://huggingface.co/papers/1602.02068
    Applies sparsity patterns that limit what each word can focus on (sketch below).

  14. Hierarchical attention -> https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf
    The model first applies attention at the word level to produce a sentence representation, then applies another attention at the sentence level to determine which sentences matter most for the document representation (sketch below).

  15. Temporal attention -> https://huggingface.co/papers/1502.08029
    Deals with time-series or sequential data, allowing a model to focus on particular time steps or time segments.
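
Sketches for some of the mechanisms above follow, again with illustrative shapes and simplified details. Adaptive attention first, a hedged sketch of the visual-sentinel idea from the cited paper: an extra "sentinel" slot competes with the real input positions, letting the model learn when not to rely on the input.

```python
import torch

d = 64
regions = torch.randn(10, d)     # e.g. image region features
h = torch.randn(d)               # decoder hidden state
sentinel = torch.randn(d)        # fallback "memory" vector

scores = regions @ h / d ** 0.5
sentinel_score = (sentinel @ h / d ** 0.5).unsqueeze(0)
weights = torch.softmax(torch.cat([scores, sentinel_score]), dim=0)

beta = weights[-1]                                  # how much to ignore the input
context = weights[:-1] @ regions + beta * sentinel  # adaptive mix
```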
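Additive (Bahdanau-style) attention: the score is v^T tanh(W_q q + W_k k_i), i.e. a tiny feed-forward network over the query and each key.

```python
import torch
import torch.nn as nn

d, d_attn = 64, 32
query = torch.randn(d)            # e.g. decoder state
keys = torch.randn(10, d)         # e.g. encoder states

W_q = nn.Linear(d, d_attn, bias=False)
W_k = nn.Linear(d, d_attn, bias=False)
v = nn.Linear(d_attn, 1, bias=False)

scores = v(torch.tanh(W_q(query) + W_k(keys))).squeeze(-1)  # score_i = v^T tanh(W_q q + W_k k_i)
weights = torch.softmax(scores, dim=0)
context = weights @ keys
```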
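Local attention, hedged sketch: scores are computed as usual, but everything outside a small window around a chosen position p is masked out (the cited paper also predicts p and applies a Gaussian falloff, skipped here).

```python
import torch

seq_len, d, window = 20, 64, 2
keys = torch.randn(seq_len, d)
values = torch.randn(seq_len, d)
query = torch.randn(d)

p = 10                                            # center of the attended window
scores = keys @ query / d ** 0.5
mask = torch.full((seq_len,), float("-inf"))
mask[max(0, p - window): p + window + 1] = 0.0    # keep only positions near p
weights = torch.softmax(scores + mask, dim=0)     # zero weight outside the window
context = weights @ values
```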
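Sparse attention, one common fixed pattern sketched below: a banded mask so each token can only attend to its immediate neighbors (note that the cited paper instead reaches sparsity by replacing softmax with sparsemax).

```python
import torch

seq_len, d = 8, 64
x = torch.randn(seq_len, d)
scores = x @ x.T / d ** 0.5

# Banded pattern: token i may only look at tokens j with |i - j| <= 1.
i = torch.arange(seq_len).unsqueeze(1)
j = torch.arange(seq_len).unsqueeze(0)
allowed = (i - j).abs() <= 1

weights = torch.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)
out = weights @ x
```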
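Hierarchical attention, a compact sketch of the two levels with the word/sentence encoders reduced to raw embeddings: word-level attention builds sentence vectors, sentence-level attention builds the document vector.

```python
import torch
import torch.nn as nn

d = 64
doc = torch.randn(3, 5, d)        # toy document: 3 sentences x 5 word vectors

word_scorer = nn.Linear(d, 1)     # stand-in for the word-level context vector
sent_scorer = nn.Linear(d, 1)     # stand-in for the sentence-level context vector

# Level 1: attention over the words of each sentence -> one vector per sentence.
word_w = torch.softmax(word_scorer(doc).squeeze(-1), dim=-1)        # (3, 5)
sentences = torch.einsum("sw,swd->sd", word_w, doc)                 # (3, d)

# Level 2: attention over the sentences -> one vector for the document.
sent_w = torch.softmax(sent_scorer(sentences).squeeze(-1), dim=0)   # (3,)
doc_vec = sent_w @ sentences                                        # (d,)
```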
