Jaward posted an update · Apr 20
# On Coding Your First Attention

While you don't necessarily have to code the attention block of a transformer from scratch to understand how it works, it is the closest you can get to a first-principles understanding of why and how transformers behave the way they do.

@karpathy covered attention in detail in his nanoGPT video (I strongly recommend watching it). Now I would like to share some thoughts and experience from writing my first attention block.

First, let's zoom out quickly and explain what attention is in transformers: it is a communication mechanism that allows the model to focus on different parts of the input sequence when making predictions.

It assigns weights to each input token based on its relevance to the current context, enabling the model to weigh information selectively. This mechanism helps transformers capture long-range dependencies and contextual information effectively.

The original Attention Is All You Need paper introduced two commonly used forms of attention: Scaled Dot-Product Attention (which, when the queries, keys, and values all come from the same sequence, is what we call self-attention) and Multi-Head Attention, which runs several scaled dot-product attention heads in parallel.
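For reference, the scaled dot-product attention equation from the paper is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension; the softmax over the scaled dot products gives the attention weights used to combine the values.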

# The Code

Now, attention, like most deep learning algorithms, boils down to a math equation, so writing the code is fairly straightforward, especially with a deep learning framework like PyTorch. Below is what's called Single-Head Attention.

(image 2: single-head attention code in PyTorch)

The code defines single-head attention in PyTorch: it transforms the input vectors into queries, keys, and values, computes the attention scores and weights, and then takes the weighted sum of the values using those weights (as per the attention equation).
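Since the article's code is shown as an image, here is a minimal sketch of what such a single-head attention module might look like; the names, dimensions, and the omission of masking and dropout are my own illustrative choices, not necessarily the article's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """One self-attention head (illustrative sketch; no causal mask or dropout)."""
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        # Linear projections that turn the input vectors into queries, keys, and values
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.key = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        q = self.query(x)                                        # (B, T, head_dim)
        k = self.key(x)
        v = self.value(x)
        # Attention scores: each query against every key, scaled by sqrt(d_k)
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)   # (B, T, T)
        weights = F.softmax(scores, dim=-1)                      # attention weights
        # Weighted sum of the values
        return weights @ v                                       # (B, T, head_dim)
```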

When you have multiple of those heads stacked in parallel, you get what's called Multi-Head Attention. The code gets much simpler if you build on the SingleHeadAttention class:

(image 3: multi-head attention code)
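Again as a sketch, here multi-head attention is built by composing SingleHeadAttention modules in an nn.ModuleList (rather than literally inheriting from the class), concatenating their outputs and projecting back to the model dimension; these choices are mine, not necessarily the article's:

```python
class MultiHeadAttention(nn.Module):
    """Several single heads run in parallel, outputs concatenated (illustrative sketch)."""
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [SingleHeadAttention(embed_dim, head_dim) for _ in range(num_heads)]
        )
        # Final projection back to the model dimension
        self.proj = nn.Linear(num_heads * head_dim, embed_dim)

    def forward(self, x):
        # Run each head on the same input and concatenate along the feature dimension
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        return self.proj(out)

# Quick shape check with hypothetical sizes
x = torch.randn(2, 8, 64)                     # (batch, seq_len, embed_dim)
mha = MultiHeadAttention(embed_dim=64, num_heads=4)
print(mha(x).shape)                           # torch.Size([2, 8, 64])
```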

Full Article here: https://huggingface.co/blog/Jaward/coding-your-first-attention

Nice! He also said to train a model on a simple text corpus first, to get it generating good text in the way you desire. So a set of documents for a personality can serve as the pretraining text-generation task; it's best to feed in as much simple text as possible.
The weights effectively act as word-to-word matrices and probability tables for predictive text.
When you then train on question and answer data, it will be able to understand the Q&A pairs as an expected sequence (with a larger window). After that it can be trained for a task (not multiple tasks, but single individual ones); later these tasks can be generalised with an INSTRUCT set of similar tasks. So after learning a few single, focused tasks, it is ready to learn generalised tasks.

So the personality is the corpus you begin with: with in-depth fitting on that set (for text generation), it will take on the style of the writer.