im committed on
Commit 120ad45
1 Parent(s): 085ad06

refine formatting

Files changed (1)
  1. app.py +61 -57
app.py CHANGED
@@ -20,51 +20,53 @@ st.title("Transformers: Tokenisers and Embeddings")
  preface_image, preface_text, = st.columns(2)
  # preface_image.image("https://static.streamlit.io/examples/dice.jpg")
  # preface_image.image("""https://assets.digitalocean.com/articles/alligator/boo.svg""")
- preface_text.write("""*Transformers represent a revolutionary class of machine learning architectures that have sparked
- immense interest. While numerous insightful tutorials are available, the evolution of transformer architectures over
- the last few years has led to significant simplifications. These advancements have made it increasingly
- straightforward to understand their inner workings. In this series of articles, I aim to provide a direct, clear explanation of
- how and why modern transformers function, unburdened by the historical complexities associated with their inception.*
+ preface_text.write("""\
+ *Transformers represent a revolutionary class of machine learning architectures that have sparked
+ immense interest. While numerous insightful tutorials are available, the evolution of transformer architectures over
+ the last few years has led to significant simplifications. These advancements have made it increasingly
+ straightforward to understand their inner workings. In this series of articles, I aim to provide a direct, clear explanation of
+ how and why modern transformers function, unburdened by the historical complexities associated with their inception.*
  """)

  divider()

- st.write("""In order to understand the recent success in AI, we need to understand the Transformer architecture. Its
- rise in the field of Natural Language Processing (NLP) is largely attributed to a combination of several key
- advancements:
-
- - Tokenisers and Embeddings
- - Attention and Self-Attention
- - Encoder-Decoder architecture
-
- Understanding these foundational concepts is crucial to comprehending the overall structure and function of the
- Transformer model. They are the building blocks from which the rest of the model is constructed, and their roles
- within the architecture are essential to the model's ability to process and generate language. In my view,
- a comprehensive and simple explanation may give a reader a significant advantage in using LLMs. Feynman once said:
- "*I think I can safely say that nobody understands quantum mechanics.*" He admitted as much because he couldn't explain it to a freshman.
-
- Given the importance and complexity of these concepts, I have chosen to dedicate the first article in this series
- solely to tokenisation and embeddings. The decision to separate the topics into individual articles is driven by a
- desire to provide a thorough and in-depth understanding of each component of the Transformer model.
-
- Note: *HuggingFace provides an exceptional [tutorial on Transformer models](https://huggingface.co/docs/transformers/index).
- That tutorial is particularly beneficial for readers willing to dive into advanced topics.*
+ st.write("""\
+ In order to understand the recent success in AI, we need to understand the Transformer architecture. Its
+ rise in the field of Natural Language Processing (NLP) is largely attributed to a combination of several key
+ advancements:
+
+ - Tokenisers and Embeddings
+ - Attention and Self-Attention
+ - Encoder-Decoder architecture
+
+ Understanding these foundational concepts is crucial to comprehending the overall structure and function of the
+ Transformer model. They are the building blocks from which the rest of the model is constructed, and their roles
+ within the architecture are essential to the model's ability to process and generate language. In my view,
+ a comprehensive and simple explanation may give a reader a significant advantage in using LLMs. Feynman once said:
+ "*I think I can safely say that nobody understands quantum mechanics.*" He admitted as much because he couldn't explain it to a freshman.
+
+ Given the importance and complexity of these concepts, I have chosen to dedicate the first article in this series
+ solely to tokenisation and embeddings. The decision to separate the topics into individual articles is driven by a
+ desire to provide a thorough and in-depth understanding of each component of the Transformer model.
+
+ Note: *HuggingFace provides an exceptional [tutorial on Transformer models](https://huggingface.co/docs/transformers/index).
+ That tutorial is particularly beneficial for readers willing to dive into advanced topics.*
  """)

  with st.expander("Copernicus Museum in Warsaw"):
- st.write("""
- Have you ever visited the Copernicus Museum in Warsaw? It's an engaging interactive hub that allows
- you to familiarize yourself with various scientific topics. The experience is both entertaining and educational,
- providing the opportunity to explore different concepts firsthand. **They even feature a small neural network that
- illustrates the neuron activation process during the recognition of handwritten digits!**
-
- Taking inspiration from this approach, we'll embark on our journey into the world of Transformer models by first
- establishing a firm understanding of tokenisation and embeddings. This foundation will equip us with the knowledge
- needed to delve into the more complex aspects of these models later on.
-
- I encourage you not to hesitate to modify parameters or experiment with different models in the provided
- examples. This hands-on exploration can significantly enhance your learning experience. So, let's begin our journey
- through this virtual, interactive museum of AI. Enjoy the exploration!
+ st.write("""\
+ Have you ever visited the Copernicus Museum in Warsaw? It's an engaging interactive hub that allows
+ you to familiarize yourself with various scientific topics. The experience is both entertaining and educational,
+ providing the opportunity to explore different concepts firsthand. **They even feature a small neural network that
+ illustrates the neuron activation process during the recognition of handwritten digits!**
+
+ Taking inspiration from this approach, we'll embark on our journey into the world of Transformer models by first
+ establishing a firm understanding of tokenisation and embeddings. This foundation will equip us with the knowledge
+ needed to delve into the more complex aspects of these models later on.
+
+ I encourage you not to hesitate to modify parameters or experiment with different models in the provided
+ examples. This hands-on exploration can significantly enhance your learning experience. So, let's begin our journey
+ through this virtual, interactive museum of AI. Enjoy the exploration!
  """)
  st.image("https://i.pinimg.com/originals/04/11/2c/04112c791a859d07a01001ac4f436e59.jpg")

@@ -73,11 +75,12 @@ divider()

  st.header("Tokenisers and Tokenisation")

- st.write("""Tokenisation is the initial step in the data preprocessing pipeline for natural language processing (NLP)
- models. It involves breaking down a piece of text—whether a sentence, paragraph, or document—into smaller units,
- known as "tokens". In English and many other languages, a token often corresponds to a word, but it can also be a
- subword, character, or n-gram. The choice of token size depends on various factors, including the task at hand and
- the language of the text.
+ st.write("""\
+ Tokenisation is the initial step in the data preprocessing pipeline for natural language processing (NLP)
+ models. It involves breaking down a piece of text—whether a sentence, paragraph, or document—into smaller units,
+ known as "tokens". In English and many other languages, a token often corresponds to a word, but it can also be a
+ subword, character, or n-gram. The choice of token size depends on various factors, including the task at hand and
+ the language of the text.
  """)

  from transformers import AutoTokenizer
@@ -90,7 +93,7 @@ sentence_encode_bert = tokenizer.encode(sentence)
  sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))

  st.write(f"""\
- A basic word-level tokenisation, which splits a text by spaces, would produce the following tokens:
+ A basic word-level tokenisation, which splits a text by spaces, would produce the following tokens:
  """)
  st.code(f"""
  {sentence_split}
@@ -98,25 +101,26 @@ st.code(f"""


  st.write(f"""\
- However, we notice that the punctuation stays attached to the words. It is also suboptimal how the tokenization dealt with the word "Don't".
- "Don't" stands for "do not", so it would be better tokenized as ["Do", "n't"]. (Hint: try another sentence: "I mustn't tell lies. Don't do this.") This is where things start getting complicated,
- and it is part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text,
- a different tokenized output is generated for the same text.
- A more sophisticated algorithm, with several optimizations, might generate a different set of tokens: """)
+ However, we notice that the punctuation stays attached to the words. It is also suboptimal how the tokenization dealt with the word "Don't".
+ "Don't" stands for "do not", so it would be better tokenized as ["Do", "n't"]. (Hint: try another sentence: "I mustn't tell lies. Don't do this.") This is where things start getting complicated,
+ and it is part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text,
+ a different tokenized output is generated for the same text.
+ A more sophisticated algorithm, with several optimizations, might generate a different set of tokens:
+ """)
  st.code(f"""
  {sentence_tokenise_bert}
  """)

  with st.expander("click here to look at the Python code:"):
  st.code(f"""\
- from transformers import AutoTokenizer
-
- sentence = "{sentence}"
- sentence_split = sentence.split()
- tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
- sentence_tokenise_bert = tokenizer.tokenize(sentence)
- sentence_encode_bert = tokenizer.encode(sentence)
- sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))
+ from transformers import AutoTokenizer
+
+ sentence = "{sentence}"
+ sentence_split = sentence.split()
+ tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+ sentence_tokenise_bert = tokenizer.tokenize(sentence)
+ sentence_encode_bert = tokenizer.encode(sentence)
+ sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))
  """, language='python')
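The last two hunks compare a plain sentence.split() with the BERT tokeniser and then zip tokens with ids. A minimal sketch of that comparison outside Streamlit follows; the sentence reuses the hint from the text, the printed lists are indicative only, and convert_tokens_to_ids is used here so the token/id pairs stay aligned one-to-one (tokenizer.encode would additionally insert the special [CLS] and [SEP] ids at the ends of the sequence).

# Standalone sketch: naive whitespace splitting vs BERT's WordPiece tokenisation.
from transformers import AutoTokenizer

sentence = "I mustn't tell lies. Don't do this."

# 1) Word-level split: punctuation stays glued to the words.
print(sentence.split())

# 2) BERT tokeniser: lowercases, separates punctuation, splits rare words into pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize(sentence)
print(tokens)

# 3) Token -> id pairs, aligned one-to-one (no special [CLS]/[SEP] ids added here).
ids = tokenizer.convert_tokens_to_ids(tokens)
print(list(zip(tokens, ids)))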