import streamlit as st

# TODO: move to 'utils'
mystyle = '''
    <style>
        p {
            text-align: justify;
        }
    </style>
    '''
st.markdown(mystyle, unsafe_allow_html=True)


def divider():
    _, c, _ = st.columns(3)
    c.divider()

st.title("Transformers: Tokenisers and Embeddings")

preface_image, preface_text = st.columns(2)
# preface_image.image("https://static.streamlit.io/examples/dice.jpg")
# preface_image.image("""https://assets.digitalocean.com/articles/alligator/boo.svg""")
preface_text.write("""\
    *Transformers represent a revolutionary class of machine learning architectures that have sparked 
    immense interest. While numerous insightful tutorials are available, the evolution of transformer architectures over 
    the last few years has led to significant simplifications. These advancements have made it increasingly 
    straightforward to understand their inner workings. In this series of articles, I aim to provide a direct, clear explanation of 
    how and why modern transformers function, unburdened by the historical complexities associated with their inception.*
""")

divider()

st.write("""\
    In order to understand the recent success in AI we need to understand the Transformer architecture. Its 
    rise in the field of Natural Language Processing (NLP) is largely attributed to a combination of several key 
    advancements:
    
    - Tokenisers and Embeddings 
    - Attention and Self-Attention
    - Encoder-Decoder architecture
    
    Understanding these foundational concepts is crucial to comprehending the overall structure and function of the 
    Transformer model. They are the building blocks from which the rest of the model is constructed, and their roles 
    within the architecture are essential to the model's ability to process and generate language. In my view, 
    a comprehensive yet simple explanation may give a reader a significant advantage in using LLMs. Feynman once said: 
    "*I think I can safely say that nobody understands quantum mechanics.*" He said this because he couldn't explain it to a freshman.
    
    Given the importance and complexity of these concepts, I have chosen to dedicate the first article in this series 
    solely to Tokenisation and embeddings. The decision to separate the topics into individual articles is driven by a 
    desire to provide a thorough and in-depth understanding of each component of the Transformer model.
    
    Note: *HuggingFace provides an exceptional [tutorial on Transformer models](https://huggingface.co/docs/transformers/index). 
    That tutorial is particularly beneficial for readers willing to dive into advanced topics.*
""")

with st.expander("Copernicus Museum in Warsaw"):
    st.write("""\
    Have you ever visited the Copernicus Museum in Warsaw? It's an engaging interactive hub that allows 
    you to familiarize yourself with various scientific topics. The experience is both entertaining and educational, 
    providing the opportunity to explore different concepts firsthand. **They even feature a small neural network that 
    illustrates the neuron activation process during the recognition of handwritten digits!**
    
    Taking inspiration from this approach, we'll embark on our journey into the world of Transformer models by first 
    establishing a firm understanding of tokenisation and embeddings. This foundation will equip us with the knowledge 
    needed to delve into the more complex aspects of these models later on.
    
    I encourage you not to hesitate in modifying parameters or experimenting with different models in the provided 
    examples. This hands-on exploration can significantly enhance your learning experience. So, let's begin our journey 
    through this virtual, interactive museum of AI. Enjoy the exploration!
""")
    st.image("https://i.pinimg.com/originals/04/11/2c/04112c791a859d07a01001ac4f436e59.jpg")

divider()


st.header("Tokenisers and Tokenisation")

st.write("""\
    Tokenisation is the initial step in the data preprocessing pipeline for natural language processing (NLP) 
    models. It involves breaking down a piece of text—whether a sentence, paragraph, or document—into smaller units, 
    known as "tokens". In English and many other languages, a token often corresponds to a word, but it can also be a 
    subword, character, or n-gram. The choice of token size depends on various factors, including the task at hand and 
    the language of the text.
""")

from transformers import AutoTokenizer

sentence = st.text_input("Consider the sentence (you can change it):", value="Tokenising text is a fundamental step for NLP models.")
sentence_split = sentence.split()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence_tokenise_bert = tokenizer.tokenize(sentence)
# encode() would add the [CLS] and [SEP] ids and misalign the pairs, so convert the tokens directly.
sentence_encode_bert = tokenizer.convert_tokens_to_ids(sentence_tokenise_bert)
sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))

st.write(f"""\
    A basic word-level tokenisation, which splits a text by spaces, would produce the following tokens:
""")
st.code(f"""
{sentence_split}
""")


st.write(f"""\
    However, notice that punctuation may stay attached to the words. Splitting on spaces also handles contractions such as "Don't" poorly: 
    "Don't" stands for "do not", so it would be better tokenized as ["Do", "n't"]. (Hint: try another sentence: "I mustn't tell lies. Don't do this.") This is where things start getting complicated, 
    and it is part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, 
    a different tokenized output is generated for the same text. 
    A more sophisticated algorithm, with several optimizations, might generate a different set of tokens: 
""")
st.code(f"""
{sentence_tokenise_bert}
""")

with st.expander("click here to look at the Python code:"):
    st.code(f"""\
        from transformers import AutoTokenizer
        
        sentence = "{sentence}"
        sentence_split = sentence.split()
        tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
        sentence_tokenise_bert = tokenizer.tokenize(sentence)
        sentence_encode_bert = tokenizer.convert_tokens_to_ids(sentence_tokenise_bert)
        sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))
    """, language='python')


st.write("""
As machine learning models, including Transformers, work with numbers rather than words, each vocabulary 
entry is assigned a corresponding numerical value. Here is a key-value representation of the input that pairs each 
token with its vocabulary index (the so-called 'token ids'):
"""
)

st.code(f"""
{sentence_encode_bert}
""")
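

# A minimal sketch of the token -> id lookup described above, assuming a tiny
# hand-made vocabulary. `toy_vocab` and `tokens_to_ids` are hypothetical names
# used only for illustration; a real tokeniser ships a learned vocabulary with
# tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "token": 1, "##ising": 2, "text": 3, "is": 4, "fun": 5}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its vocabulary index, falling back to the unknown id."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


# Usage: tokens_to_ids(["token", "##ising", "text", "is", "hard"], toy_vocab)
# gives [1, 2, 3, 4, 0], because "hard" is missing from the toy vocabulary.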


st.write("""
What distinguishes subword Tokenisation is its reliance on statistical rules and algorithms, learned from 
the pretraining corpus. The resulting Tokeniser creates a vocabulary, which usually represents the most frequently 
used words and subwords. For example, Byte Pair Encoding (BPE) first encodes the most frequent words as single 
tokens, while less frequent words are represented by multiple tokens, each representing a word part.

There are numerous different Tokenisers available, including spaCy, Moses, Byte-Pair Encoding (BPE), 
Byte-level BPE, WordPiece, Unigram, and SentencePiece. It's crucial to choose a specific Tokeniser and stick with it. 
Changing the Tokeniser is akin to altering the model's language on the fly—imagine studying physics in English and 
then taking the exam in French or Spanish. You might get lucky, but it's a considerable risk.
""")
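

# A rough sketch of how BPE merge rules could be learned from word counts:
# count adjacent symbol pairs, merge the most frequent pair, repeat. This is an
# illustrative toy, not the implementation inside the `tokenizers` library;
# `learn_bpe_merges` is a hypothetical helper name.
from collections import Counter


def learn_bpe_merges(word_counts, num_merges):
    """Return the learned merges in the order they were chosen."""
    # Start with every word split into single characters.
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] = merged_words.get(tuple(out), 0) + count
        words = merged_words
    return merges


# Usage: learn_bpe_merges({"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}, 3)
# gives [("u", "g"), ("u", "n"), ("h", "ug")], the same three merges as the
# BPE walkthrough shown when "BPE" is selected below.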

training_dataset = """\
   Beautiful is better than ugly.
   Explicit is better than implicit.
   Simple is better than complex.
   Complex is better than complicated.
   Flat is better than nested.
   Sparse is better than dense.
   Readability counts.
   """

tokeniser_name = st.selectbox(label="Choose your tokeniser", options=["BPE", 'Unigram', 'WordPiece'])
if tokeniser_name == 'BPE':
    st.subheader("Byte-Pair Encoding (BPE)")
    st.write("""\
        Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword 
        Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the 
        training data into words. Pre-tokenization can be as simple as space tokenization, e.g. GPT-2, RoBERTa. More 
        advanced pre-tokenization includes rule-based tokenization, e.g. XLM and FlauBERT, which use Moses for most 
        languages, or GPT, which uses spaCy and ftfy. The pre-tokenizer's output is used to count the frequency of each word in the training corpus.
        
        After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the 
        training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the 
        set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so 
        until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a 
        hyperparameter to define before training the tokenizer.
        
        As an example, let’s assume that after pre-tokenization, the following set of words including their frequency has 
        been determined:
    """)
    st.code(""" ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5) """)
    st.write("""\
        Consequently, the base vocabulary is ["b", "g", "h", "n", "p", "s", "u"]. Splitting all words into symbols of the base vocabulary, we obtain:
    """)
    st.code(""" ("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5) """)
    st.write("""\
        BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs 
        most frequently. In the example above "h" followed by "u" is present 10 + 5 = 15 times (10 times in the 10 
        occurrences of "hug", 5 times in the 5 occurrences of "hugs"). However, the most frequent symbol pair is "u" 
        followed by "g", occurring 10 + 5 + 5 = 20 times in total. Thus, the first merge rule the tokenizer learns is to 
        group all "u" symbols followed by a "g" symbol together. Next, "ug" is added to the vocabulary. The set of words 
        then becomes
    """)
    st.code(""" ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5) """)
    st.write("""\
        BPE then identifies the next most common symbol pair. It’s "u" followed by "n", which occurs 16 
        times. "u", "n" is merged to "un" and added to the vocabulary. The next most frequent symbol pair is "h" followed 
        by "ug", occurring 15 times. Again the pair is merged and "hug" can be added to the vocabulary.
        
        At this stage, the vocabulary is ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"] and our set of unique words is represented as
    """)
    st.code(""" ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5) """)
    st.write("""\
        Assuming that the Byte-Pair Encoding training stopped at this point, the learned merge rules 
        would then be applied to new words (as long as those new words do not include symbols that were not in the base 
        vocabulary). For instance, the word "bug" would be tokenized as ["b", "ug"], but "mug" would be tokenized as 
        ["[UNK]", "ug"] since the symbol "m" is not in the base vocabulary. In general, single letters such as "m" are not 
        replaced by the "[UNK]" symbol because the training data usually includes at least one occurrence of each letter, 
        but it is likely to happen for very special characters like emojis.
        
        As mentioned earlier, the vocabulary size, i.e. the base vocabulary size + the number of merges, is a hyperparameter 
        to choose. For instance, GPT has a vocabulary size of 40,478 since it has 478 base characters and its authors chose to stop 
        training after 40,000 merges. 
    """)


    st.subheader(":green[Try Yourself:]")
    st.write(f"""\
        *Using the text area below, try to build a training dataset that gives the tokeniser a comprehensive vocabulary. 
        A better vocabulary eliminates unknown tokens, which makes the token sequence 
        easier to read and shorter (fewer ids).* 
      """)

    training_dataset = st.text_area("*Training Dataset - Vocabulary:*", value=training_dataset, height=200)
    training_dataset = training_dataset.split('\n')
    vocabulary_size = st.number_input("Vocabulary Size:", value=100000)
    sentence = st.text_input(label="*Text to tokenise:*",
                             value="[CLS]  Tokenising text is a fundamental step for NLP models. [SEP] [PAD] [PAD] [PAD]")


    from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=vocabulary_size)
    tokenizer.train_from_iterator(training_dataset, trainer=trainer)
    output = tokenizer.encode(sentence)

    st.write("*Tokens:*")
    st.code(f"""{output.tokens}""")
    st.code(f"""\
    ids: {output.ids}
    attention_mask: {output.attention_mask}
    """)

    st.write(""" *Well done if you get ids like these: [1, 57, 49, 28, 10, 58, 55, 52, 31, 54, 5, 2, 3, 3, 3]!*""")

    with st.expander("Python code:"):
        st.code(f"""
            from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
            
            tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
            tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
            tokenizer.decoder = decoders.ByteLevel()
            trainer = trainers.BpeTrainer(
                special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], 
                vocab_size={vocabulary_size})
            training_dataset = {training_dataset}
            tokenizer.train_from_iterator(training_dataset, trainer=trainer)
            output = tokenizer.encode("{sentence}")
                """, language='python')
elif tokeniser_name == 'Unigram':
    st.subheader("""Unigram""")
    st.write("""\
        Unigram is a subword tokenization algorithm introduced in [Subword Regularization: Improving Neural 
        Network Translation Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf). 
        In contrast to BPE or WordPiece, Unigram initializes its base vocabulary to a large number of symbols and 
        progressively trims down each symbol to obtain a smaller vocabulary. The base vocabulary could for instance 
        correspond to all pre-tokenized words and the most common substrings. Unigram is not used directly for any of the 
        models in the transformers library, but it’s used in conjunction with SentencePiece.
        
        At each training step, the Unigram algorithm defines a loss (often the log-likelihood) over the training 
        data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, 
        the algorithm computes how much the overall loss would increase if the symbol were removed from the vocabulary. 
        Unigram then removes the p percent of symbols (with p usually being 10 or 20) whose loss increase is the lowest, 
        i.e. those symbols that least affect the overall loss over the training data. This process is repeated until the 
        vocabulary has reached the desired size. The Unigram algorithm always keeps the base characters so that any word can 
        be tokenized.
        
        Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of 
        tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary:
    """)
    st.code(""" ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"] """)
    st.write("""\
        "hugs" could be tokenized as ["hug", "s"], ["h", "ug", "s"] or ["h", "u", "g", "s"]. So which 
        one to choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary 
        so that the probability of each possible tokenization can be computed after training. The algorithm simply picks 
        the most likely tokenization in practice, but also offers the possibility to sample a possible tokenization 
        according to their probabilities.
    """)

    st.subheader("Try Yourself:")
    st.write(f"""\
        *Using the text area below, try to build a training dataset that gives the tokeniser a comprehensive vocabulary. 
        A better vocabulary eliminates unknown tokens, which makes the token sequence 
        easier to read and shorter (fewer ids).* 
          """)

    training_dataset = st.text_area("*Training Dataset - Vocabulary (change it and look at the resulting tokens):*", value=training_dataset, height=200)
    training_dataset = training_dataset.split('\n')
    vocabulary_size = st.number_input("Vocabulary Size:", value=100000)
    sentence = st.text_input(label="*Text to tokenise:*",
                             value="[CLS]  Tokenising text is a fundamental step for NLP models. [SEP] [PAD] [PAD] [PAD]")

    from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.Unigram())
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.UnigramTrainer(
        vocab_size=vocabulary_size,
        unk_token="[UNK]",
        # initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    )
    tokenizer.train_from_iterator(training_dataset, trainer=trainer)
    output = tokenizer.encode(sentence)

    # TODO: make it more visible, container with a different color or something
    st.write("*Tokens:*")
    st.code(f"""{output.tokens}""")
    st.code(f"""\
        ids: {output.ids}
        attention_mask: {output.attention_mask}
        """)

    st.write(""" *Well done if you get ids like these: [1, 57, 49, 28, 10, 58, 55, 52, 31, 54, 5, 2, 3, 3, 3]!*""")
    with st.expander("Python code:"):
        st.code(f"""\
            from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
            
            tokenizer = Tokenizer(models.Unigram())
            tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
            tokenizer.decoder = decoders.ByteLevel()
            trainer = trainers.UnigramTrainer(
                vocab_size={vocabulary_size},
                unk_token="[UNK]",
                special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
            )
            training_dataset = {training_dataset}
            tokenizer.train_from_iterator(training_dataset, trainer=trainer)
            output = tokenizer.encode("{sentence}") 
    """, language='python')
elif tokeniser_name == 'WordPiece':
    st.subheader("""WordPiece""")
    st.write("""\
        WordPiece is the subword tokenization algorithm used for BERT, DistilBERT, and Electra. The 
        algorithm was outlined in [Japanese and Korean Voice Search (Schuster et al., 
        2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and is very 
        similar to BPE. WordPiece first initializes the vocabulary to include every character present in the training 
        data and progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the 
        most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the 
        vocabulary.
        
        So what does this mean exactly? Referring to the example from the BPE tokeniser, maximizing the likelihood of the training data is 
        equivalent to finding the symbol pair whose probability, divided by the product of the probabilities of its first and 
        second symbols, is the greatest among all symbol pairs. E.g. "u" followed by "g" would only have been merged if 
        the probability of "ug" divided by the probabilities of "u" and "g" had been greater than for any other symbol pair. Intuitively, 
        WordPiece differs slightly from BPE in that it evaluates what it loses by merging two symbols to ensure it’s worth 
        it. 
    """)

    st.subheader("Try Yourself:")
    st.write(f"""\
        *Using the text area below, try to build a training dataset that gives the tokeniser a comprehensive vocabulary. 
        A better vocabulary eliminates unknown tokens, which makes the token sequence 
        easier to read and shorter (fewer ids).* 
    """)

    training_dataset = st.text_area("*Training Dataset - Vocabulary (change it and look at the resulting tokens):*",
                                    value=training_dataset, height=200)
    training_dataset = training_dataset.split('\n')
    vocabulary_size = st.number_input("Vocabulary Size:", value=100000)
    sentence = st.text_input(label="*Text to tokenise:*",
                             value="[CLS]  Tokenising text is a fundamental step for NLP models. [SEP] [PAD] [PAD] [PAD]")

    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.WordPieceTrainer(
        vocab_size=vocabulary_size,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    )
    tokenizer.train_from_iterator(training_dataset, trainer=trainer)
    output = tokenizer.encode(sentence)

    # TODO: make it more visible, container with a different color or something
    st.write("*Tokens:*")
    st.code(f"""{output.tokens}""")
    st.code(f"""\
            ids: {output.ids}
            attention_mask: {output.attention_mask}
            """)

    st.write(""" *Well done if you get ids like these: [1, 76, 72, 50, 10, 77, 71, 68, 66, 78, 5, 2, 3, 3, 3]!*""")
    with st.expander("Python code:"):
        st.code(f"""\
            from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers
            
            tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
            tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
            trainer = trainers.WordPieceTrainer(
                vocab_size={vocabulary_size},
                special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
            )
            training_dataset = {training_dataset}
            tokenizer.train_from_iterator(training_dataset, trainer=trainer)
            output = tokenizer.encode("{sentence}") 
        """, language='python')
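

# A minimal sketch of the WordPiece merge criterion explained in the WordPiece
# branch above: score(pair) = count(pair) / (count(first) * count(second)).
# `wordpiece_pair_scores` is a hypothetical helper, and the "##" continuation
# prefix is left out to keep the toy short; this is not the library's code.
from collections import Counter


def wordpiece_pair_scores(word_counts):
    """Score every adjacent character pair in a {word: count} mapping."""
    symbol_counts, pair_counts = Counter(), Counter()
    for word, count in word_counts.items():
        for symbol in word:
            symbol_counts[symbol] += count
        for pair in zip(word, word[1:]):
            pair_counts[pair] += count
    return {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }


# Usage: with the counts {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
# the best-scoring pair is ("g", "s") with score 0.05, not the most frequent
# pair ("u", "g"), because "u" is so common on its own that merging it gains little.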


with st.expander("Special tokens meaning:"):
    st.write("""\
        \\#\\# prefix: It means that the preceding string is not whitespace, any token with this prefix should be 
        merged with the previous token when you convert the tokens back to a string.
        
        [UNK]: Stands for "unknown". This token is used to represent any word that is not in the model's vocabulary. Since 
        most models have a fixed-size vocabulary, it's not possible to have a unique token for every possible word. The [UNK] 
        token is used as a catch-all for any words the model hasn't seen before. E.g. in our example we 'decided' that Large 
        Language (LL) abbreviation is not part of the model's vocabulary.
        
        [CLS]: Stands for "classification". In models like BERT, this token is added at the beginning of every input 
        sequence. The representation (embedding) of this token is used as the aggregate sequence representation for 
        classification tasks. In other words, the model is trained to encode the meaning of the entire sequence into this token.
        
        [SEP]: Stands for "separator". This token is used to separate different sequences when the model needs to take more 
        than one input sequence. For example, in question-answering tasks, the model takes two inputs: a question and a 
        passage that contains the answer. The two inputs are separated by a [SEP] token.
        
        [MASK]: This token is specific to models like BERT, which are trained with a masked language modelling objective. 
        During training, some percentage of the input tokens are replaced with the [MASK] token, and the model's goal is to 
        predict the original value of the masked tokens.
        
        [PAD]: Stands for "padding". This token is used to fill in the extra positions when batching sequences of different 
        lengths together. Since models require input sequences in a batch to be the same length, shorter sequences are extended with 
        [PAD] tokens. In our example, we extended the length of the input sequence to 16 tokens.
""")
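

# A small sketch of padding and attention masks, assuming [PAD] has id 3, which
# follows from the special-token order passed to the trainers above.
# `pad_batch` is a hypothetical helper written for illustration only.
def pad_batch(batch_ids, pad_id=3):
    """Right-pad every id sequence to the longest one and build 0/1 masks."""
    max_len = max(len(ids) for ids in batch_ids)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    masks = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return padded, masks


# Usage: pad_batch([[1, 57, 2], [1, 2]]) gives
# ([[1, 57, 2], [1, 2, 3]], [[1, 1, 1], [1, 1, 0]]).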


with st.expander("References:"):
    st.write("""\
    - https://huggingface.co/docs/transformers/tokenizer_summary
    - https://huggingface.co/docs/tokenizers/training_from_memory
    """)

divider()
st.header("Embeddings")
st.caption("TBD...")

st.write("""\
    Following tokenization, each token is transformed into a vector of numeric characteristics, a process 
    known as 'embedding.' In this context, 'embedding' refers to the mapping of the discrete, categorical space of words 
    or tokens into a continuous, numeric space, which the model can manipulate more effectively.
    
    Each dimension in this high-dimensional space can encapsulate a different facet of the token's meaning. For instance, 
    one dimension might capture the tense of a token if it's a verb, while another dimension might capture the degree of 
    positivity or negativity if the token is an adjective expressing sentiment. As a simplified illustration: 
""")
st.code("""\
    "I" -> [noun, person]
    "love" -> [verb, feeling]
    "machine" -> [noun, automation]
    "learn" -> [verb, knowledge]
    "##ing" -> [gerund, continuous]
""")

st.write("""\
    The actual embeddings in a typical NLP model would be in a much higher-dimensional space (often several hundred dimensions), but the idea is the same.
    Embeddings are dynamically learned from the data, with the model adjusting these embeddings during 
    training to minimize the discrepancy between the predicted and actual outputs for a set of training examples. 
    Consequently, tokens with similar meanings often end up with similar embeddings.

    In the context of Transformers, these embeddings are the inputs that the model uses. Once again, we represent all the 
    characteristics using numbers, not words.
""")
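

# A minimal sketch of how "similar embeddings" can be measured, using cosine
# similarity on plain Python lists. `cosine_similarity` is a helper written
# for illustration; any vectors passed to it here would be made-up numbers,
# since real embeddings are learned during training.
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Usage: cosine_similarity([1.0, 0.0], [2.0, 0.0]) is 1.0 (same direction) and
# cosine_similarity([1.0, 0.0], [0.0, 1.0]) is 0.0 (unrelated directions).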

col1, col2 = st.columns(2)
token_king = col1.text_input("First word to compare embeddings:", value="king")
token_queen = col2.text_input("Second word to compare embeddings:", value="queen")

from transformers import AutoModel
from transformers import AutoTokenizer
import pandas as pd
import openai

model_ckpt = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
king_id = tokenizer(token_king, return_tensors="pt", add_special_tokens=False)
queen_id = tokenizer(token_queen, return_tensors="pt", add_special_tokens=False)

# Use the pretrained input-embedding table; a freshly constructed nn.Embedding would hold random weights.
token_emb = AutoModel.from_pretrained(model_ckpt).get_input_embeddings()
king_embeddings = token_emb(king_id.input_ids)
queen_embeddings = token_emb(queen_id.input_ids)
# Flatten to 1-D so the first 50 values can be plotted (multi-token words are concatenated).
king_emb_np = king_embeddings.reshape(-1).detach().numpy()
queen_emb_np = queen_embeddings.reshape(-1).detach().numpy()


openai.api_key = st.secrets["OPENAI_API_KEY"]
EMBEDDING_MODEL = 'text-embedding-ada-002'
EMBEDDING_CTX_LENGTH = 8191
EMBEDDING_ENCODING = 'cl100k_base'
# Note: openai.Embedding.create is the pre-1.0 client API and requires openai<1.0.
king = openai.Embedding.create(input=token_king, model=EMBEDDING_MODEL)["data"][0]["embedding"]
queen = openai.Embedding.create(input=token_queen, model=EMBEDDING_MODEL)["data"][0]["embedding"]

st.write("Google's 'bert-base-uncased' model embeddings:")
df = pd.DataFrame({f'"{token_king}" embeddings': king_emb_np[:50], f'"{token_queen}" embeddings': queen_emb_np[:50]})
st.line_chart(df)


st.write("OpenAI's 'text-embedding-ada-002' model embeddings:")
df = pd.DataFrame({f'"{token_king}" embeddings': king[:50], f'"{token_queen}" embeddings': queen[:50]})
st.line_chart(df)



with st.expander("References:"):
    st.write("""\
        - https://huggingface.co/blog/getting-started-with-embeddings
        - https://huggingface.co/blog/1b-sentence-embeddings
    """)