hesamation committed on
Commit 2b1309a · 1 Parent(s): 64f3286

Modify distill.js to change the publication date; enhance index.html with additional audio content, clearer paragraphs, and new asides for better user engagement.

assets/audio/podcast.wav ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:07f4c913f033e8144df5000718b35bd5f25a8b613989659ed478140ddd11c97d
3
+ size 31564844
assets/pdf/pdf_version.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:90a77509125812d76475748ca81a97291de539c0fefcf7415bc3e9fc0f5140b1
3
+ size 2525198
src/bibliography.bib CHANGED
@@ -67,6 +67,5 @@
67
  url={https://arxiv.org/abs/1301.3781},
68
  journal={arXiv.org},
69
  author={Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey},
70
- year={2013},
71
- month=jan
72
  }
 
67
  url={https://arxiv.org/abs/1301.3781},
68
  journal={arXiv.org},
69
  author={Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey},
70
+ year={2013}
 
71
  }
src/distill.js CHANGED
@@ -2102,12 +2102,12 @@ d-appendix > distill-appendix {
2102
  </div>
2103
  <div >
2104
  <h3>Published</h3>
2105
- <div>March 27, 2025</div>
2106
  </div>
2107
  </div>
2108
  <div class="side pdf-download">
2109
  <h3>Download</h3>
2110
- <a href="hf.co">
2111
  <img style="width: 32px;" src="../assets/images/256px-PDF.png" alt="PDF"></a>
2112
  </div>
2113
  `;
 
2102
  </div>
2103
  <div >
2104
  <h3>Published</h3>
2105
+ <div>March 28, 2025</div>
2106
  </div>
2107
  </div>
2108
  <div class="side pdf-download">
2109
  <h3>Download</h3>
2110
+ <a href="../assets/pdf/pdf_version.pdf">
2111
  <img style="width: 32px;" src="../assets/images/256px-PDF.png" alt="PDF"></a>
2112
  </div>
2113
  `;
src/index.html CHANGED
@@ -52,13 +52,24 @@
52
  </d-contents>
53
 
54
 <p>Embeddings are the semantic backbone of LLMs, the gateway through which raw text is transformed into vectors of numbers the model can understand. When you prompt an LLM to help you debug your code, your words and tokens are transformed into a high-dimensional vector space where semantic relationships become mathematical relationships.</p>
55
- <aside>Reading time: 12-15 minutes.</aside>
56
 
57
 <p>In this article we go through the fundamentals of embeddings. We will cover what embeddings are, how they evolved over time from statistical methods to modern techniques, how they're implemented in practice, some of the most important embedding techniques, and what the embeddings of an LLM (DeepSeek-R1-Distill-Qwen-1.5B) look like as a graph representation.</p>
58
 
59
 <p>This article includes interactive visualizations and hands-on code examples. It avoids verbosity and focuses on the core concepts, so it's a fast-paced read that gets straight to the point. The full code is available on my <a href="https://github.com/hesamsheikh/llm-mechanics">LLM Mechanics GitHub repository</a>.</p>
60
 <aside>To contribute to the article, point out any mistakes, or suggest improvements, please check out the <a href="https://huggingface.co/spaces/hesamation/primer-llm-embedding/discussions">Community</a>.</aside>
61
-
 
 
 
 
 
 
 
 
 
 
 
62
  <h2>What Are Embeddings?</h2>
63
 
64
 <p>Processing text for NLP tasks requires a numeric representation of each word. Most embedding methods come down to turning a word or token into a vector. What makes embedding techniques different from each other is how they approach this word → vector conversion.</p>
@@ -92,7 +103,7 @@
92
 
93
  <h2>Traditional Embedding Techniques</h2>
94
 
95
- <p>Almost every embedding technique relies on a large corpus of text data to extract the relationship of the word. Previously, embedding methods relied on statistic methods based on the occurence or co-occurence of words in a text. This was based on the assumption that if a pair of words often appear together then they must have a closer relationship. For us in the modern day who know how embeddings can be more sophisticated, this doesn't seem a reliable approach, but they are simple methods that are not as computation-heavy as other techniques. One of such methods is:</p>
96
 
97
  <h3>TF-IDF (Term Frequency-Inverse Document Frequency)</h3>
98
 
@@ -164,7 +175,7 @@
164
  <p>There are two things noticeable about this embedding space:</p>
165
 
166
  <ol>
167
- <li>The majority of the words are concentrated into one particular area. This means that the embedding of most words are similar in this approach. It signals a lack of expressiveness and uniqueness about these word embeddings.</li>
168
  <li>The embeddings have no semantic connection. The distance between words has nothing to do with their meaning.</li>
169
  </ol>
170
 
@@ -185,7 +196,16 @@
185
 
186
 <p>Aside from CBOW, another variant is Skip-gram, which works the other way around: it aims to predict the neighboring words, given a particular word as input.</p>
187
 
188
- <p>Let's see what happens in the case of a CBOW word2vec: after choosing a context window (e.g. 2 in the image above), we get the two words that appear before and two words after a particular word. The four words are encoded as one-hot vectors and passed through the hidden layer. The hidden layer has a linear activation function, it outputs the input without changing it. The outputs of the hidden layer are aggregated (e.g. using a lambda mean function) and then fed to the final layer which, using Softmax, predicts a probability for each possible word. The token with the highest probability is considered the output of the network.</p>
 
 
 
 
 
 
 
 
 
189
 
190
 <p>The hidden layer is where the embeddings are stored. It has a shape of <i>Vocabulary size x Embedding size</i>, and when we give the network the one-hot vector of a word (a vector that is all zeros except for one element set to 1), that specific <code>1</code> selects the embedding of that word and passes it on to the next layers. You can see a cool and simple implementation of the word2vec network in <d-cite bibtex-key="sarkar2018cbow"></d-cite>.</p>
191
 
@@ -198,7 +218,7 @@
198
 
199
 <p>Since the network relies on the relationships between words in a context, and not on the occurrence or co-occurrence of words as in TF-IDF, it is able to capture <strong>Semantic Relationships</strong> between words.</p>
200
 
201
- <p>You can download the pretrained version from Google's official page <d-cite bibtex-key="word2vec"></d-cite>. Let's see word2vec embeddings in action:</p>
202
 
203
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
204
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
@@ -209,9 +229,9 @@
209
  </div>
210
  </details>
211
 
212
- <p>To train word2vec efficiently, especially with large vocabularies, an optimization technique called negative sampling is used. Instead of computing the full softmax over the entire vocabulary (which is computationally expensive), negative sampling simplifies the task by updating only a small number of negative examples (i.e., randomly selected words not related to the context) along with the positive ones. This makes training faster and more scalable.</p>
213
 
214
- <p>The semantic relationship is a fun topic to explore and word2vec is a simple setup for your experiments. You can explore the biases of society or the data, or explore how words have evolved overtime by studying the embeddings of older manuscripts.</p>
215
 
216
  <p>You can actually visualize and play with word2vec embeddings with <a href="https://projector.tensorflow.org/">Tensorflow Embedding Projector</a>.</p>
217
 
@@ -224,7 +244,8 @@
224
 
225
 <h2>BERT (Bidirectional Encoder Representations from Transformers)</h2>
226
 
227
- <p>Wherever you look in the world of NLP, you will see BERT. It's a good idea to do yourself a favor and learn about BERT once and for all, as it is the source of many ideas and techniques when it comes to LLMs. Here's a good video to get started. <d-cite bibtex-key="codeemporium2020bert"></d-cite></p>
 
228
 
229
  <p>In summary, BERT is an encoder-only transformer model consisting of 4 main parts:</p>
230
 
@@ -259,7 +280,7 @@
259
 
260
  <h2>Embeddings in Modern LLMs</h2>
261
 
262
- <p>Embeddings are a foundational component in large language models and also a broad term. For the purpose of this article, we focus on "embeddings" as the module that transforms tokens into vector representations.</p>
263
 
264
  <h3>Where does the embedding fit into LLMs?</h3>
265
 
@@ -283,7 +304,7 @@
283
 
284
  <h3>torch.nn.Embedding</h3>
285
 
286
- <p>The embedding layer in LLMs works as a look-up table. Given a list of indices (token ids) it returns their embeddings. <d-cite bibtex-key="manning2024llm"></d-cite> shows this concept pretty well.</p>
287
 
288
  <figure class="fullscreen">
289
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
@@ -292,7 +313,7 @@
292
  <figcaption>Visualization of how the embedding layer works as a lookup table. (Image source: <d-cite bibtex-key="manning2024llm"></d-cite>)</figcaption>
293
  </figure>
294
 
295
- <p>The code implementation of an embedding layer in PyTorch is done using <code>torch.nn.Embedding</code> which acts as a simple look-up table. There is nothing more special about this layer than a simple Linear layer, rather than the fact that it can work with indices as input rather than one-hot encoding inputs. The Embedding layer is simply a Linear layer that works with indices.</p>
296
 
297
  <figure class="fullscreen">
298
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
@@ -309,7 +330,7 @@
309
 
310
 <p>What does the embedding layer in a large language model look like?</p>
311
 
312
- <p>Let's dissect the embeddings of the distilled version of DeepSeek-R1 in the Qwen model. Some parts of the following code is inspired by <d-cite bibtex-key="chrishayuk2024embeddings"></d-cite>.</p>
313
 
314
  <p>We begin by loading the <code>deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B</code> model from <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B">Hugging Face</a> and saving the embeddings.</p>
315
 
@@ -428,7 +449,7 @@
428
 
429
  <p>You can actually see a more comprehensive example at the beginning of the article in which 50 tokens and their closest tokens are mapped out.</p>
430
 
431
- <p>A token or word such as "list" may have many different variations with their own embeddings, such as "_list", "List", and many more:</p>
432
 
433
  <figure class="fullscreen">
434
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
@@ -440,13 +461,15 @@
440
 
441
  <h2>Let's Wrap Up</h2>
442
 
443
- <p>Embeddings remain as one of the fundamental parts in natural language processing and modern large language models. While the research in machine learning and LLMs rapidly uncovers new methods and techniques, embeddings haven't seen much change in large language models (and that has to mean something). They are essential, easy to understand, and easy to work with.</p>
444
-
 
 
445
 <p>In this blog post we went through the basics of what you need to know about embeddings, from traditional statistical methods to their use in today's LLMs. My hope is that this has been a comprehensive jump-start that helps you gain an intuitive understanding of word embeddings and what they represent.</p>
446
 
447
  <p><strong>Thank you for reading through this article.</strong> If you found this useful, consider following me on <a href="https://x.com/Hesamation">X (Twitter)</a> and <a href="https://huggingface.co/hesamation">Hugging Face</a> to be notified about my next projects.</p>
448
 
449
- <p>If you have any questions or feedback, please feel free to write in the <a href="https://huggingface.co/spaces/hesamation/primer-llm-embedding/discussions"></a>community</a>.</p>
450
  </d-article>
451
 
452
  <d-appendix>
 
52
  </d-contents>
53
 
54
 <p>Embeddings are the semantic backbone of LLMs, the gateway through which raw text is transformed into vectors of numbers the model can understand. When you prompt an LLM to help you debug your code, your words and tokens are transformed into a high-dimensional vector space where semantic relationships become mathematical relationships.</p>
55
+ <aside>Reading time: 12-15 minutes. This blog post is recommended for desktop users.</aside>
56
 
57
 <p>In this article we go through the fundamentals of embeddings. We will cover what embeddings are, how they evolved over time from statistical methods to modern techniques, how they're implemented in practice, some of the most important embedding techniques, and what the embeddings of an LLM (DeepSeek-R1-Distill-Qwen-1.5B) look like as a graph representation.</p>
58
 
59
 <p>This article includes interactive visualizations and hands-on code examples. It avoids verbosity and focuses on the core concepts, so it's a fast-paced read that gets straight to the point. The full code is available on my <a href="https://github.com/hesamsheikh/llm-mechanics">LLM Mechanics GitHub repository</a>.</p>
60
 <aside>To contribute to the article, point out any mistakes, or suggest improvements, please check out the <a href="https://huggingface.co/spaces/hesamation/primer-llm-embedding/discussions">Community</a>.</aside>
61
+
62
+ <div class="audio-container">
63
+ <audio controls>
64
+ <source src="assets/audio/podcast.wav" type="audio/wav">
65
+ Your browser does not support the audio element.
66
+ </audio>
67
+ <div class="figure-legend">
68
+ <p>If your ears are more important than your eyes, you can listen to the podcast version of this article generated by <a href="https://www.notebooklm.com/">NotebookLM</a>.</p>
69
+ </div>
70
+ </div>
71
+
72
+
73
  <h2>What Are Embeddings?</h2>
74
 
75
 <p>Processing text for NLP tasks requires a numeric representation of each word. Most embedding methods come down to turning a word or token into a vector. What makes embedding techniques different from each other is how they approach this word → vector conversion.</p>
 
103
 
104
  <h2>Traditional Embedding Techniques</h2>
105
 
106
+ <p>Almost every embedding technique relies on a large corpus of text data to extract the relationships between words. Early embedding methods relied on statistical measures of the occurrence or co-occurrence of words in a text, based on the assumption that if a pair of words often appears together, the two must be closely related. These are simple methods that are not as computation-heavy as other techniques. One such method is:</p>
107
 
108
  <h3>TF-IDF (Term Frequency-Inverse Document Frequency)</h3>
109
 
 
175
  <p>There are two things noticeable about this embedding space:</p>
176
 
177
  <ol>
178
+ <li>The majority of the words are concentrated in one particular area, meaning the embeddings of most words are similar in this approach. This signals a lack of expressiveness and uniqueness in these embeddings.</li>
179
  <li>The embeddings have no semantic connection. The distance between words has nothing to do with their meaning.</li>
180
  </ol>
181
 
 
196
 
197
 <p>Aside from CBOW, another variant is Skip-gram, which works the other way around: it aims to predict the neighboring words, given a particular word as input.</p>
198
 
199
+ <p>Here's how CBOW word2vec works step by step (a minimal code sketch follows the list):</p>
200
+ <ol>
201
+ <li>Choose a context window (e.g. 2 in the image above)</li>
202
+ <li>Take the two words before and two words after a particular word as input</li>
203
+ <li>Encode these four context words as one-hot vectors</li>
204
+ <li>Pass the encoded vectors through the hidden layer, which has a linear activation function that outputs the input unchanged</li>
205
+ <li>Aggregate the outputs of the hidden layer (e.g. using a lambda mean function)</li>
206
+ <li>Feed the aggregated output to the final layer which uses Softmax to predict probabilities for each possible word</li>
207
+ <li>Select the token with the highest probability as the final output of the network</li>
208
+ </ol>
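To make these steps concrete, here is a minimal CBOW-style sketch in PyTorch. The toy corpus, embedding size, and training loop are illustrative assumptions, not the original word2vec setup (which adds its own optimizations, such as the negative sampling discussed below):

```python
# Minimal CBOW sketch: predict a center word from the mean of its context embeddings.
import torch
import torch.nn as nn

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2idx = {w: i for i, w in enumerate(vocab)}

# Build (context ids, center id) pairs with a context window of 2.
window = 2
pairs = []
for i, center in enumerate(corpus):
    context = [corpus[j]
               for j in range(max(0, i - window), min(len(corpus), i + window + 1))
               if j != i]
    pairs.append(([word2idx[w] for w in context], word2idx[center]))

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # the hidden layer / embedding table
        self.output = nn.Linear(embed_dim, vocab_size)         # final layer (softmax comes from the loss)

    def forward(self, context_ids):
        # One-hot inputs are implicit: index the table and average the rows.
        return self.output(self.embeddings(context_ids).mean(dim=0))

model = CBOW(len(vocab), embed_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()  # applies softmax over the vocabulary internally

for epoch in range(50):
    for context_ids, center_id in pairs:
        logits = model(torch.tensor(context_ids))
        loss = loss_fn(logits.unsqueeze(0), torch.tensor([center_id]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

print(model.embeddings.weight.shape)  # (vocabulary size, embedding size)
```

After training, <code>model.embeddings.weight</code> is exactly the <i>Vocabulary size x Embedding size</i> matrix described in the next paragraph.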
209
 
210
 <p>The hidden layer is where the embeddings are stored. It has a shape of <i>Vocabulary size x Embedding size</i>, and when we give the network the one-hot vector of a word (a vector that is all zeros except for one element set to 1), that specific <code>1</code> selects the embedding of that word and passes it on to the next layers. You can see a cool and simple implementation of the word2vec network in <d-cite bibtex-key="sarkar2018cbow"></d-cite>.</p>
211
 
 
218
 
219
 <p>Since the network relies on the relationships between words in a context, and not on the occurrence or co-occurrence of words as in TF-IDF, it is able to capture <strong>Semantic Relationships</strong> between words.</p>
220
 
221
+ <p>You can download the pretrained version from Google's official page <d-cite bibtex-key="word2vec"></d-cite>. Here's the code to use word2vec in practice:</p>
222
 
223
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
224
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
 
229
  </div>
230
  </details>
231
 
232
+ <p>To train word2vec efficiently, especially with large vocabularies, an optimization technique called <strong>negative sampling</strong> is used. Instead of computing the full softmax over the entire vocabulary (which is computationally expensive), negative sampling simplifies the task by updating only a small number of negative examples (i.e., randomly selected words not related to the context) along with the positive ones. This makes training faster and more scalable.</p>
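To illustrate the idea, here is a minimal sketch of a single negative-sampling step in PyTorch, written in a skip-gram style for simplicity. The uniform sampling of negatives and the sizes used are simplifying assumptions (word2vec actually samples negatives from a smoothed unigram distribution):

```python
# Negative-sampling sketch: score the true (center, context) pair against a few
# randomly drawn "negative" words instead of computing a softmax over the vocabulary.
import torch
import torch.nn.functional as F

vocab_size, embed_dim, num_negatives = 10_000, 100, 5
center_emb = torch.nn.Embedding(vocab_size, embed_dim)   # input-word embeddings
context_emb = torch.nn.Embedding(vocab_size, embed_dim)  # output-word embeddings

def negative_sampling_loss(center_id, context_id):
    center = center_emb(torch.tensor([center_id]))            # (1, embed_dim)
    pos = context_emb(torch.tensor([context_id]))             # (1, embed_dim)
    neg_ids = torch.randint(0, vocab_size, (num_negatives,))  # uniform here; word2vec uses unigram^0.75
    neg = context_emb(neg_ids)                                # (num_negatives, embed_dim)

    pos_score = torch.sum(center * pos, dim=1)                # similarity to the true context word
    neg_score = neg @ center.squeeze(0)                       # similarities to the negative samples

    # Push the positive pair together and the negative pairs apart.
    return -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())

loss = negative_sampling_loss(center_id=42, context_id=7)
loss.backward()  # only the handful of sampled rows receive gradients
```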
233
 
234
+ <p>Semantic relationships are a fun topic to explore, and word2vec is a simple setup for your experiments. You can explore the biases of society or of the data, or see how words have evolved over time by studying the embeddings of older manuscripts.</p>
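As a starting point for such experiments, here is a small sketch using gensim and the pretrained Google News vectors mentioned above. The local file name and the probes chosen here are assumptions for illustration:

```python
# Probing analogies and bias with pretrained word2vec vectors via gensim.
# Assumes the GoogleNews binary has already been downloaded to the working directory.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Classic analogy: king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# A crude bias probe: which professions sit closer to "he" than to "she"?
for job in ["doctor", "nurse", "engineer", "teacher"]:
    gap = vectors.similarity(job, "he") - vectors.similarity(job, "she")
    print(f"{job}: similarity gap (he - she) = {gap:+.3f}")
```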
235
 
236
  <p>You can actually visualize and play with word2vec embeddings with <a href="https://projector.tensorflow.org/">Tensorflow Embedding Projector</a>.</p>
237
 
 
244
 
245
 <h2>BERT (Bidirectional Encoder Representations from Transformers)</h2>
246
 
247
+ <p>Wherever you look in the world of NLP, you will see BERT. Do yourself a favor and learn about BERT once and for all, as it is the source of many ideas and techniques used in LLMs.</p>
248
+ <aside>Here's a good video to get started on BERT. <d-cite bibtex-key="codeemporium2020bert"></d-cite></aside>
249
 
250
  <p>In summary, BERT is an encoder-only transformer model consisting of 4 main parts:</p>
251
 
 
280
 
281
  <h2>Embeddings in Modern LLMs</h2>
282
 
283
+ <p>Embeddings are a foundational component in large language models and also a broad term. For the purpose of this article, we focus on "embeddings" as the module that transforms tokens into vector representations, as opposed to the latent representations in the hidden layers.</p>
284
 
285
  <h3>Where does the embedding fit into LLMs?</h3>
286
 
 
304
 
305
  <h3>torch.nn.Embedding</h3>
306
 
307
+ <p>The embedding layer in LLMs works as a look-up table: given a list of indices (token ids), it returns their embeddings. <i>Build a Large Language Model (From Scratch)</i> <d-cite bibtex-key="manning2024llm"></d-cite> shows this concept comprehensively.</p>
308
 
309
  <figure class="fullscreen">
310
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
 
313
  <figcaption>Visualization of how the embedding layer works as a lookup table. (Image source: <d-cite bibtex-key="manning2024llm"></d-cite>)</figcaption>
314
  </figure>
315
 
316
+ <p>The code implementation of an embedding layer in PyTorch is done using <code>torch.nn.Embedding</code>, which acts as a simple look-up table. There is nothing more special about this layer than a simple Linear layer, other than the fact that it takes indices as input rather than one-hot encoded vectors: the Embedding layer is simply a Linear layer that works with indices.</p>
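As a quick illustration (the sizes below are arbitrary), the following sketch shows the look-up behaviour and its equivalence to a Linear-style matrix multiplication with one-hot vectors:

```python
# torch.nn.Embedding as a look-up table: token ids in, embedding rows out.
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 6, 3
embedding = torch.nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([2, 3, 5, 1])   # a small batch of token ids
print(embedding(token_ids).shape)        # torch.Size([4, 3])

# Equivalent to multiplying one-hot vectors with the same weight matrix:
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
manual = one_hot @ embedding.weight      # (4, 6) @ (6, 3) -> (4, 3)
print(torch.allclose(manual, embedding(token_ids)))  # True
```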
317
 
318
  <figure class="fullscreen">
319
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
 
330
 
331
 <p>What does the embedding layer in a large language model look like?</p>
332
 
333
+ <p>Let's get our hands dirty and dissect the embeddings of the Qwen-based distilled version of DeepSeek-R1. Some parts of the following code are inspired by <d-cite bibtex-key="chrishayuk2024embeddings"></d-cite>.</p>
334
 
335
  <p>We begin by loading the <code>deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B</code> model from <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B">Hugging Face</a> and saving the embeddings.</p>
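The full code is in the repository linked above; as a rough sketch (assuming the transformers library and enough memory for the 1.5B model), this step might look like the following:

```python
# Sketch: load the model and pull out its input-embedding matrix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# The embedding layer is a torch.nn.Embedding of shape (vocab_size, hidden_size).
embeddings = model.get_input_embeddings().weight.detach().cpu()
print(embeddings.shape)

torch.save(embeddings, "embeddings.pt")  # keep them around for later analysis
```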
336
 
 
449
 
450
  <p>You can actually see a more comprehensive example at the beginning of the article in which 50 tokens and their closest tokens are mapped out.</p>
451
 
452
+ <p>In the graph examples, we have ignored "token variations". A token or word may have many different variations, each with its own embedding. For example, the token "list" has variations such as "_list", "List", and many more. These variations often have embeddings that are very similar to that of the original token. The following is a graph of the token "list" and its close neighbors, including its variations.</p>
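To check this numerically, one can compare the embedding of "list" with a few variants via cosine similarity. The sketch below reuses the tokenizer and the saved embedding matrix from the loading step; the exact variant spellings are assumptions and depend on the tokenizer's vocabulary:

```python
# Cosine similarity between "list" and some of its variations.
# Reuses `tokenizer` and `embeddings` from the earlier loading step.
import torch
import torch.nn.functional as F

def first_token_embedding(text: str) -> torch.Tensor:
    # Take the first token id of the encoded text (enough for single-token words).
    token_id = tokenizer.encode(text, add_special_tokens=False)[0]
    return embeddings[token_id].float()

base = first_token_embedding("list")
for variant in [" list", "List", " List", "lists"]:
    sim = F.cosine_similarity(base, first_token_embedding(variant), dim=0)
    print(f"'list' vs '{variant}': {sim.item():.3f}")
```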
453
 
454
  <figure class="fullscreen">
455
  <div id="image-as-graph" style="display: grid; position: relative; justify-content: center;">
 
461
 
462
  <h2>Let's Wrap Up</h2>
463
 
464
+ <p>Embeddings remain one of the fundamental parts of natural language processing and modern large language models. While research in machine learning and LLMs uncovers new methods and techniques every day, embeddings haven't seen much change in large language models (and that has to mean something). They are essential, easy to understand, and easy to work with.</p>
465
+
466
+ <aside>This blog post's template is inspired by <a href="https://distill.pub/">distill.pub</a>. If you want to use it for your own blog post, you can find the template <a href="https://huggingface.co/spaces/lvwerra/distill-blog-template">here</a>.</aside>
467
+
468
 <p>In this blog post we went through the basics of what you need to know about embeddings, from traditional statistical methods to their use in today's LLMs. My hope is that this has been a comprehensive jump-start that helps you gain an intuitive understanding of word embeddings and what they represent.</p>
469
 
470
  <p><strong>Thank you for reading through this article.</strong> If you found this useful, consider following me on <a href="https://x.com/Hesamation">X (Twitter)</a> and <a href="https://huggingface.co/hesamation">Hugging Face</a> to be notified about my next projects.</p>
471
 
472
+ <p>If you have any questions or feedback, please feel free to write in the <a href="https://huggingface.co/spaces/hesamation/primer-llm-embedding/discussions">community</a>.</p>
473
  </d-article>
474
 
475
  <d-appendix>