hesamation committed on
Commit 2b1309a · 1 Parent(s): 64f3286

Modify distill.js to change the publication date; enhance index.html with additional audio content, clearer paragraphs, and new asides for better user engagement.

assets/audio/podcast.wav ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:07f4c913f033e8144df5000718b35bd5f25a8b613989659ed478140ddd11c97d
3
+ size 31564844
assets/pdf/pdf_version.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:90a77509125812d76475748ca81a97291de539c0fefcf7415bc3e9fc0f5140b1
3
+ size 2525198
src/bibliography.bib CHANGED
@@ -67,6 +67,5 @@
67
  url={https://arxiv.org/abs/1301.3781},
68
  journal={arXiv.org},
69
  author={Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey},
70
- year={2013},
71
- month=jan
72
  }
 
67
  url={https://arxiv.org/abs/1301.3781},
68
  journal={arXiv.org},
69
  author={Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey},
70
+ year={2013}
 
71
  }
src/distill.js CHANGED
@@ -2102,12 +2102,12 @@ d-appendix > distill-appendix {
2102
  </div>
2103
  <div >
2104
  <h3>Published</h3>
2105
- <div>March 27, 2025</div>
2106
  </div>
2107
  </div>
2108
  <div class="side pdf-download">
2109
  <h3>Download</h3>
2110
- <a href="hf.co">
2111
  <img style="width: 32px;" src="../assets/images/256px-PDF.png" alt="PDF"></a>
2112
  </div>
2113
  `;
 
2102
  </div>
2103
  <div >
2104
  <h3>Published</h3>
2105
+ <div>March 28, 2025</div>
2106
  </div>
2107
  </div>
2108
  <div class="side pdf-download">
2109
  <h3>Download</h3>
2110
+ <a href="../assets/pdf/pdf_version.pdf">
2111
  <img style="width: 32px;" src="../assets/images/256px-PDF.png" alt="PDF"></a>
2112
  </div>
2113
  `;
src/index.html CHANGED
@@ -52,13 +52,24 @@
52
  </d-contents>
53
 
54
 <p>Embeddings are the semantic backbone of LLMs, the gateway through which raw text is transformed into vectors of numbers the model can understand. When you prompt an LLM to help you debug your code, your words and tokens are transformed into a high-dimensional vector space where semantic relationships become mathematical relationships.</p>
55
- <aside>Reading time: 12-15 minutes.</aside>
56
 
57
 <p>In this article we go through the fundamentals of embeddings. We will cover what embeddings are, how they evolved over time from statistical methods to modern techniques, how they're implemented in practice, some of the most important embedding techniques, and what the embeddings of an LLM (DeepSeek-R1-Distill-Qwen-1.5B) look like as a graph representation.</p>
58
 
59
 <p>This article includes interactive visualizations and hands-on code examples. It avoids verbosity and focuses on the core concepts, so it's a fast-paced read that gets straight to the point. The full code is available on my <a href="https://github.com/hesamsheikh/llm-mechanics">LLM Mechanics GitHub repository</a>.</p>
60
 <aside>To contribute to the article, point out any mistakes, or suggest improvements, please check out the <a href="https://huggingface.co/spaces/hesamation/primer-llm-embedding/discussions">Community</a>.</aside>
61
-
 
 
 
 
 
 
 
 
 
 
 
62
  <h2>What Are Embeddings?</h2>
63
 
64
 <p>Processing text for NLP tasks requires a numeric representation of each word. Most embedding methods come down to turning a word or token into a vector. What makes embedding techniques different from each other is how they approach this word → vector conversion.</p>
@@ -92,7 +103,7 @@
92
 
93
  <h2>Traditional Embedding Techniques</h2>
94
 
95
- <p>Almost every embedding technique relies on a large corpus of text data to extract the relationship of the word. Previously, embedding methods relied on statistic methods based on the occurence or co-occurence of words in a text. This was based on the assumption that if a pair of words often appear together then they must have a closer relationship. For us in the modern day who know how embeddings can be more sophisticated, this doesn't seem a reliable approach, but they are simple methods that are not as computation-heavy as other techniques. One of such methods is:</p>
96
 
97
  <h3>TF-IDF (Term Frequency-Inverse Document Frequency)</h3>
98
 
@@ -164,7 +175,7 @@
164
  <p>There are two things noticeable about this embedding space:</p>
165
 
166
  <ol>
167
- <li>The majority of the words are concentrated into one particular area. This means that the embedding of most words are similar in this approach. It signals a lack of expressiveness and uniqueness about these word embeddings.</li>
168
  <li>The embeddings have no semantic connection. The distance between words has nothing to do with their meaning.</li>
169
  </ol>
170
 
@@ -185,7 +196,16 @@
185
 
186
 <p>Aside from CBOW, another variant is Skip-gram, which works the other way around: it aims to predict the neighboring words, given a particular word as input.</p>
187
 
188
- <p>Let's see what happens in the case of a CBOW word2vec: after choosing a context window (e.g. 2 in the image above), we get the two words that appear before and two words after a particular word. The four words are encoded as one-hot vectors and passed through the hidden layer. The hidden layer has a linear activation function, it outputs the input without changing it. The outputs of the hidden layer are aggregated (e.g. using a lambda mean function) and then fed to the final layer which, using Softmax, predicts a probability for each possible word. The token with the highest probability is considered the output of the network.</p>
 
 
 
 
 
 
 
 
 
189
 
190
 <p>The hidden layer is where the embeddings are stored. It has a shape of <i>Vocabulary size x Embedding size</i>, and when we give the network the one-hot vector of a word (a vector that is all zeros except for one element set to 1), that specific <code>1</code> selects the embedding of that word and passes it on to the next layers. You can see a cool and simple implementation of the word2vec network in <d-cite bibtex-key="sarkar2018cbow"></d-cite>.</p>
191
 
@@ -198,7 +218,7 @@
198
 
199
 <p>Since the network relies on the relationships between words in a context, and not on the occurrence or co-occurrence of words as in TF-IDF, it is able to capture <strong>Semantic Relationships</strong> between words.</p>
200
 
201
- <p>You can download the pretrained version from Google's official page <d-cite bibtex-key="word2vec"></d-cite>. Let's see word2vec embeddings in action:</p>
202
 
203
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
204
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
@@ -209,9 +229,9 @@
209
  </div>
210
  </details>
211
 
212
- <p>To train word2vec efficiently, especially with large vocabularies, an optimization technique called negative sampling is used. Instead of computing the full softmax over the entire vocabulary (which is computationally expensive), negative sampling simplifies the task by updating only a small number of negative examples (i.e., randomly selected words not related to the context) along with the positive ones. This makes training faster and more scalable.</p>
213
 
214
- <p>The semantic relationship is a fun topic to explore and word2vec is a simple setup for your experiments. You can explore the biases of society or the data, or explore how words have evolved overtime by studying the embeddings of older manuscripts.</p>
215
 
216
  <p>You can actually visualize and play with word2vec embeddings with <a href="https://projector.tensorflow.org/">Tensorflow Embedding Projector</a>.</p>
217
 
@@ -224,7 +244,8 @@
224
 
225
 <h2>BERT (Bidirectional Encoder Representations from Transformers)</h2>
226
 
227
- <p>Wherever you look in the world of NLP, you will see BERT. It's a good idea to do yourself a favor and learn about BERT once and for all, as it is the source of many ideas and techniques when it comes to LLMs. Here's a good video to get started. <d-cite bibtex-key="codeemporium2020bert"></d-cite></p>
 
228
 
229
  <p>In summary, BERT is an encoder-only transformer model consisting of 4 main parts:</p>
230
 
@@ -259,7 +280,7 @@
259
 
260
  <h2>Embeddings in Modern LLMs</h2>
261
 
262
- <p>Embeddings are a foundational component in large language models and also a broad term. For the purpose of this article, we focus on "embeddings" as the module that transforms tokens into vector representations.</p>
263
 
264
  <h3>Where does the embedding fit into LLMs?</h3>
265
 
@@ -283,7 +304,7 @@
283
 
284
  <h3>torch.nn.Embedding</h3>
285
 
286
- <p>The embedding layer in LLMs works as a look-up table. Given a list of indices (token ids) it returns their embeddings. <d-cite bibtex-key="manning2024llm"></d-cite> shows this concept pretty well.</p>
287
 
288
  <figure class="fullscreen">
289
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
@@ -292,7 +313,7 @@
292
  <figcaption>Visualization of how the embedding layer works as a lookup table. (Image source: <d-cite bibtex-key="manning2024llm"></d-cite>)</figcaption>
293
  </figure>
294
 
295
- <p>The code implementation of an embedding layer in PyTorch is done using <code>torch.nn.Embedding</code> which acts as a simple look-up table. There is nothing more special about this layer than a simple Linear layer, rather than the fact that it can work with indices as input rather than one-hot encoding inputs. The Embedding layer is simply a Linear layer that works with indices.</p>
296
 
297
  <figure class="fullscreen">
298
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
@@ -309,7 +330,7 @@
309
 
310
 <p>What does the embedding layer in a large language model look like?</p>
311
 
312
- <p>Let's dissect the embeddings of the distilled version of DeepSeek-R1 in the Qwen model. Some parts of the following code is inspired by <d-cite bibtex-key="chrishayuk2024embeddings"></d-cite>.</p>
313
 
314
  <p>We begin by loading the <code>deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B</code> model from <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B">Hugging Face</a> and saving the embeddings.</p>
315
 
@@ -428,7 +449,7 @@
428
 
429
  <p>You can actually see a more comprehensive example at the beginning of the article in which 50 tokens and their closest tokens are mapped out.</p>
430
 
431
- <p>A token or word such as "list" may have many different variations with their own embeddings, such as "_list", "List", and many more:</p>
432
 
433
  <figure class="fullscreen">
434
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
@@ -440,13 +461,15 @@
440
 
441
  <h2>Let's Wrap Up</h2>
442
 
443
- <p>Embeddings remain as one of the fundamental parts in natural language processing and modern large language models. While the research in machine learning and LLMs rapidly uncovers new methods and techniques, embeddings haven't seen much change in large language models (and that has to mean something). They are essential, easy to understand, and easy to work with.</p>
444
-
 
 
445
 <p>In this blog post we went through the basics of what you need to know about embeddings, from traditional statistical methods to their use in today's LLMs. My hope is that this has been a comprehensive jump-start that helps you gain an intuitive understanding of word embeddings and what they represent.</p>
446
 
447
  <p><strong>Thank you for reading through this article.</strong> If you found this useful, consider following me on <a href="https://x.com/Hesamation">X (Twitter)</a> and <a href="https://huggingface.co/hesamation">Hugging Face</a> to be notified about my next projects.</p>
448
 
449
- <p>If you have any questions or feedback, please feel free to write in the <a href="https://huggingface.co/spaces/hesamation/primer-llm-embedding/discussions"></a>community</a>.</p>
450
  </d-article>
451
 
452
  <d-appendix>
 
52
  </d-contents>
53
 
54
 <p>Embeddings are the semantic backbone of LLMs, the gateway through which raw text is transformed into vectors of numbers the model can understand. When you prompt an LLM to help you debug your code, your words and tokens are transformed into a high-dimensional vector space where semantic relationships become mathematical relationships.</p>
55
+ <aside>Reading time: 12-15 minutes. This blog post is recommended for desktop users.</aside>
56
 
57
 <p>In this article we go through the fundamentals of embeddings. We will cover what embeddings are, how they evolved over time from statistical methods to modern techniques, how they're implemented in practice, some of the most important embedding techniques, and what the embeddings of an LLM (DeepSeek-R1-Distill-Qwen-1.5B) look like as a graph representation.</p>
58
 
59
 <p>This article includes interactive visualizations and hands-on code examples. It avoids verbosity and focuses on the core concepts, so it's a fast-paced read that gets straight to the point. The full code is available on my <a href="https://github.com/hesamsheikh/llm-mechanics">LLM Mechanics GitHub repository</a>.</p>
60
 <aside>To contribute to the article, point out any mistakes, or suggest improvements, please check out the <a href="https://huggingface.co/spaces/hesamation/primer-llm-embedding/discussions">Community</a>.</aside>
61
+
62
+ <div class="audio-container">
63
+ <audio controls>
64
+ <source src="assets/audio/podcast.wav" type="audio/wav">
65
+ Your browser does not support the audio element.
66
+ </audio>
67
+ <div class="figure-legend">
68
+ <p>If your ears are more important than your eyes, you can listen to the podcast version of this article generated by <a href="https://www.notebooklm.com/">NotebookLM</a>.</p>
69
+ </div>
70
+ </div>
71
+
72
+
73
  <h2>What Are Embeddings?</h2>
74
 
75
 <p>Processing text for NLP tasks requires a numeric representation of each word. Most embedding methods come down to turning a word or token into a vector. What makes embedding techniques different from each other is how they approach this word → vector conversion.</p>
 
103
 
104
  <h2>Traditional Embedding Techniques</h2>
105
 
106
+ <p>Almost every embedding technique relies on a large corpus of text data to extract the relationships between words. Early embedding methods relied on statistical measures of the occurrence or co-occurrence of words in a text, based on the assumption that if a pair of words often appears together, the two must be closely related. These are simple methods that are not as computation-heavy as other techniques. One such method is:</p>
107
 
108
  <h3>TF-IDF (Term Frequency-Inverse Document Frequency)</h3>
109
 
 
175
  <p>There are two things noticeable about this embedding space:</p>
176
 
177
  <ol>
178
+ <li>The majority of the words are concentrated in one particular area, meaning the embeddings of most words are similar in this approach. This signals a lack of expressiveness and uniqueness in these embeddings.</li>
179
  <li>The embeddings have no semantic connection. The distance between words has nothing to do with their meaning.</li>
180
  </ol>
181
 
 
196
 
197
 <p>Aside from CBOW, another variant is Skip-gram, which works the other way around: it aims to predict the neighboring words, given a particular word as input.</p>
198
 
199
+ <p>Here's how CBOW word2vec works step by step (a minimal code sketch follows the list):</p>
200
+ <ol>
201
+ <li>Choose a context window (e.g. 2 in the image above)</li>
202
+ <li>Take the two words before and two words after a particular word as input</li>
203
+ <li>Encode these four context words as one-hot vectors</li>
204
+ <li>Pass the encoded vectors through the hidden layer, which has a linear activation function that outputs the input unchanged</li>
205
+ <li>Aggregate the outputs of the hidden layer (e.g. using a lambda mean function)</li>
206
+ <li>Feed the aggregated output to the final layer which uses Softmax to predict probabilities for each possible word</li>
207
+ <li>Select the token with the highest probability as the final output of the network</li>
208
+ </ol>
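To make these steps concrete, here is a minimal CBOW-style sketch in PyTorch. The toy corpus, embedding size, and training loop are illustrative assumptions, not the original word2vec setup (which adds its own optimizations, such as the negative sampling discussed below):

```python
# Minimal CBOW sketch: predict a center word from the mean of its context embeddings.
import torch
import torch.nn as nn

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2idx = {w: i for i, w in enumerate(vocab)}

# Build (context ids, center id) pairs with a context window of 2.
window = 2
pairs = []
for i, center in enumerate(corpus):
    context = [corpus[j]
               for j in range(max(0, i - window), min(len(corpus), i + window + 1))
               if j != i]
    pairs.append(([word2idx[w] for w in context], word2idx[center]))

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # the hidden layer / embedding table
        self.output = nn.Linear(embed_dim, vocab_size)         # final layer (softmax comes from the loss)

    def forward(self, context_ids):
        # One-hot inputs are implicit: index the table and average the rows.
        return self.output(self.embeddings(context_ids).mean(dim=0))

model = CBOW(len(vocab), embed_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()  # applies softmax over the vocabulary internally

for epoch in range(50):
    for context_ids, center_id in pairs:
        logits = model(torch.tensor(context_ids))
        loss = loss_fn(logits.unsqueeze(0), torch.tensor([center_id]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

print(model.embeddings.weight.shape)  # (vocabulary size, embedding size)
```

After training, <code>model.embeddings.weight</code> is exactly the <i>Vocabulary size x Embedding size</i> matrix described in the next paragraph.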
209
 
210
 <p>The hidden layer is where the embeddings are stored. It has a shape of <i>Vocabulary size x Embedding size</i>, and when we give the network the one-hot vector of a word (a vector that is all zeros except for one element set to 1), that specific <code>1</code> selects the embedding of that word and passes it on to the next layers. You can see a cool and simple implementation of the word2vec network in <d-cite bibtex-key="sarkar2018cbow"></d-cite>.</p>
211
 
 
218
 
219
 <p>Since the network relies on the relationships between words in a context, and not on the occurrence or co-occurrence of words as in TF-IDF, it is able to capture <strong>Semantic Relationships</strong> between words.</p>
220
 
221
+ <p>You can download the pretrained version from Google's official page <d-cite bibtex-key="word2vec"></d-cite>. Here's the code to use word2vec in practice:</p>
222
 
223
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
224
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
 
229
  </div>
230
  </details>
231
 
232
+ <p>To train word2vec efficiently, especially with large vocabularies, an optimization technique called <strong>negative sampling</strong> is used. Instead of computing the full softmax over the entire vocabulary (which is computationally expensive), negative sampling simplifies the task by updating only a small number of negative examples (i.e., randomly selected words not related to the context) along with the positive ones. This makes training faster and more scalable.</p>
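To illustrate the idea, here is a minimal sketch of a single negative-sampling step in PyTorch, written in a skip-gram style for simplicity. The uniform sampling of negatives and the sizes used are simplifying assumptions (word2vec actually samples negatives from a smoothed unigram distribution):

```python
# Negative-sampling sketch: score the true (center, context) pair against a few
# randomly drawn "negative" words instead of computing a softmax over the vocabulary.
import torch
import torch.nn.functional as F

vocab_size, embed_dim, num_negatives = 10_000, 100, 5
center_emb = torch.nn.Embedding(vocab_size, embed_dim)   # input-word embeddings
context_emb = torch.nn.Embedding(vocab_size, embed_dim)  # output-word embeddings

def negative_sampling_loss(center_id, context_id):
    center = center_emb(torch.tensor([center_id]))            # (1, embed_dim)
    pos = context_emb(torch.tensor([context_id]))             # (1, embed_dim)
    neg_ids = torch.randint(0, vocab_size, (num_negatives,))  # uniform here; word2vec uses unigram^0.75
    neg = context_emb(neg_ids)                                # (num_negatives, embed_dim)

    pos_score = torch.sum(center * pos, dim=1)                # similarity to the true context word
    neg_score = neg @ center.squeeze(0)                       # similarities to the negative samples

    # Push the positive pair together and the negative pairs apart.
    return -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())

loss = negative_sampling_loss(center_id=42, context_id=7)
loss.backward()  # only the handful of sampled rows receive gradients
```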
233
 
234
+ <p>Semantic relationships are a fun topic to explore, and word2vec is a simple setup for your experiments. You can explore the biases of society or of the data, or see how words have evolved over time by studying the embeddings of older manuscripts.</p>
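As a starting point for such experiments, here is a small sketch using gensim and the pretrained Google News vectors mentioned above. The local file name and the probes chosen here are assumptions for illustration:

```python
# Probing analogies and bias with pretrained word2vec vectors via gensim.
# Assumes the GoogleNews binary has already been downloaded to the working directory.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Classic analogy: king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# A crude bias probe: which professions sit closer to "he" than to "she"?
for job in ["doctor", "nurse", "engineer", "teacher"]:
    gap = vectors.similarity(job, "he") - vectors.similarity(job, "she")
    print(f"{job}: similarity gap (he - she) = {gap:+.3f}")
```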
235
 
236
  <p>You can actually visualize and play with word2vec embeddings with <a href="https://projector.tensorflow.org/">Tensorflow Embedding Projector</a>.</p>
237
 
 
244
 
245
 <h2>BERT (Bidirectional Encoder Representations from Transformers)</h2>
246
 
247
+ <p>Wherever you look in the world of NLP, you will see BERT. Do yourself a favor and learn about BERT once and for all, as it is the source of many ideas and techniques used in LLMs.</p>
248
+ <aside>Here's a good video to get started on BERT. <d-cite bibtex-key="codeemporium2020bert"></d-cite></aside>
249
 
250
  <p>In summary, BERT is an encoder-only transformer model consisting of 4 main parts:</p>
251
 
 
280
 
281
  <h2>Embeddings in Modern LLMs</h2>
282
 
283
+ <p>Embeddings are a foundational component in large language models and also a broad term. For the purpose of this article, we focus on "embeddings" as the module that transforms tokens into vector representations, as opposed to the latent representations in the hidden layers.</p>
284
 
285
  <h3>Where does the embedding fit into LLMs?</h3>
286
 
 
304
 
305
  <h3>torch.nn.Embedding</h3>
306
 
307
+ <p>The embedding layer in LLMs works as a look-up table: given a list of indices (token ids), it returns their embeddings. <i>Build a Large Language Model (From Scratch)</i> <d-cite bibtex-key="manning2024llm"></d-cite> shows this concept comprehensively.</p>
308
 
309
  <figure class="fullscreen">
310
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
 
313
  <figcaption>Visualization of how the embedding layer works as a lookup table. (Image source: <d-cite bibtex-key="manning2024llm"></d-cite>)</figcaption>
314
  </figure>
315
 
316
+ <p>The code implementation of an embedding layer in PyTorch is done using <code>torch.nn.Embedding</code>, which acts as a simple look-up table. There is nothing more special about this layer than a simple Linear layer, other than the fact that it takes indices as input rather than one-hot encoded vectors: the Embedding layer is simply a Linear layer that works with indices.</p>
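As a quick illustration (the sizes below are arbitrary), the following sketch shows the look-up behaviour and its equivalence to a Linear-style matrix multiplication with one-hot vectors:

```python
# torch.nn.Embedding as a look-up table: token ids in, embedding rows out.
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 6, 3
embedding = torch.nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([2, 3, 5, 1])   # a small batch of token ids
print(embedding(token_ids).shape)        # torch.Size([4, 3])

# Equivalent to multiplying one-hot vectors with the same weight matrix:
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
manual = one_hot @ embedding.weight      # (4, 6) @ (6, 3) -> (4, 3)
print(torch.allclose(manual, embedding(token_ids)))  # True
```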
317
 
318
  <figure class="fullscreen">
319
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
 
330
 
331
 <p>What does the embedding layer in a large language model look like?</p>
332
 
333
+ <p>Let's get our hands dirty and dissect the embeddings of the Qwen-based distilled version of DeepSeek-R1. Some parts of the following code are inspired by <d-cite bibtex-key="chrishayuk2024embeddings"></d-cite>.</p>
334
 
335
  <p>We begin by loading the <code>deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B</code> model from <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B">Hugging Face</a> and saving the embeddings.</p>
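The full code is in the repository linked above; as a rough sketch (assuming the transformers library and enough memory for the 1.5B model), this step might look like the following:

```python
# Sketch: load the model and pull out its input-embedding matrix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# The embedding layer is a torch.nn.Embedding of shape (vocab_size, hidden_size).
embeddings = model.get_input_embeddings().weight.detach().cpu()
print(embeddings.shape)

torch.save(embeddings, "embeddings.pt")  # keep them around for later analysis
```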
336
 
 
449
 
450
  <p>You can actually see a more comprehensive example at the beginning of the article in which 50 tokens and their closest tokens are mapped out.</p>
451
 
452
+ <p>In the graph examples, we have ignored "token variations". A token or word may have many different variations, each with its own embedding. For example, the token "list" has variations such as "_list", "List", and many more. These variations often have embeddings that are very similar to that of the original token. The following is a graph of the token "list" and its close neighbors, including its variations.</p>
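To check this numerically, one can compare the embedding of "list" with a few variants via cosine similarity. The sketch below reuses the tokenizer and the saved embedding matrix from the loading step; the exact variant spellings are assumptions and depend on the tokenizer's vocabulary:

```python
# Cosine similarity between "list" and some of its variations.
# Reuses `tokenizer` and `embeddings` from the earlier loading step.
import torch
import torch.nn.functional as F

def first_token_embedding(text: str) -> torch.Tensor:
    # Take the first token id of the encoded text (enough for single-token words).
    token_id = tokenizer.encode(text, add_special_tokens=False)[0]
    return embeddings[token_id].float()

base = first_token_embedding("list")
for variant in [" list", "List", " List", "lists"]:
    sim = F.cosine_similarity(base, first_token_embedding(variant), dim=0)
    print(f"'list' vs '{variant}': {sim.item():.3f}")
```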
453
 
454
  <figure class="fullscreen">
455
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
 
461
 
462
  <h2>Let's Wrap Up</h2>
463
 
464
+ <p>Embeddings remain one of the fundamental parts of natural language processing and modern large language models. While research in machine learning and LLMs uncovers new methods and techniques every day, embeddings haven't seen much change in large language models (and that has to mean something). They are essential, easy to understand, and easy to work with.</p>
465
+
466
+ <aside>This blog post's template is inspired by <a href="https://distill.pub/">distill.pub</a>. If you want to use it for your own blog post, you can find the template <a href="https://huggingface.co/spaces/lvwerra/distill-blog-template">here</a>.</aside>
467
+
468
 <p>In this blog post we went through the basics of what you need to know about embeddings, from traditional statistical methods to their use in today's LLMs. My hope is that this has been a comprehensive jump-start that helps you gain an intuitive understanding of word embeddings and what they represent.</p>
469
 
470
  <p><strong>Thank you for reading through this article.</strong> If you found this useful, consider following me on <a href="https://x.com/Hesamation">X (Twitter)</a> and <a href="https://huggingface.co/hesamation">Hugging Face</a> to be notified about my next projects.</p>
471
 
472
+ <p>If you have any questions or feedback, please feel free to write in the <a href="https://huggingface.co/spaces/hesamation/primer-llm-embedding/discussions">community</a>.</p>
473
  </d-article>
474
 
475
  <d-appendix>