Commit: 92b778c
Parent: a6a7ec1

add vector database 3d space visualisation

Files changed (1)
  1. app.py +84 -12
app.py CHANGED
@@ -548,6 +548,7 @@ st.plotly_chart(fig, use_container_width=True)
  with st.expander("Python Code:"):
      st.code(f"""\
  import openai
+ import numpy as np

  EMBEDDING_MODEL = 'text-embedding-ada-002'

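The snippet in this expander only shows its imports and the model name; the rest of the displayed code is not part of this hunk. For reference, the kind of helper the page builds around (get_embeddings, used further down) might look like the sketch below. The app's actual get_embeddings() body is not in this diff, so treat this as an assumption based on the pre-1.0 openai client:

    import openai

    EMBEDDING_MODEL = 'text-embedding-ada-002'

    def get_embeddings(text):
        # Return the 1536-dimensional ada-002 embedding for `text`
        # (pre-1.0 openai client, matching the `import openai` style shown above).
        response = openai.Embedding.create(input=[text], model=EMBEDDING_MODEL)
        return response['data'][0]['embedding']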
 
@@ -582,7 +583,84 @@ fig.update_layout(coloraxis_showscale=False)
  fig.update_layout(width=6000)
  st.plotly_chart(fig, use_container_width=True)

- st.subheader(":green[Try Yourself:]")
+ st.subheader("Vector Databases")
+ st.write("""\
+ In a vector database, each item (e.g., a document) is represented as a point in a multidimensional
+ space. Each point is a vector that represents the features of the item. The goal is to place similar items close to
+ each other and dissimilar items farther apart. In the case of documents, the features could be derived from the words
+ in the document, and the similarity might be based on the overlapping words or concepts between the documents.
+
+ The retrieval of documents based on search terms involves two main steps:
+
+ Vectorization of the search query: The search query is converted into a vector using the same process used to vectorize the documents in the database.
+
+ Vector similarity search: The vector database then identifies the vectors that are closest to the query vector.
+ This is typically done using a distance metric like Euclidean distance or cosine similarity. The documents
+ corresponding to these vectors are returned as the search results.
+
+ As you correctly assumed, we leverage embedding algorithms to vectorise documents. Let's generate a 3D
+ visualization of the document vectors and a search query. For simplicity, let's assume we have a vector database
+ of documents that has been reduced to 3 dimensions, and we'll also have a 3D vector for a search query.
+
+ """)
+ with st.expander("The Euclidean distance between two points in 3D space is calculated as:"):
+     st.latex("""\\text{Distance}(A(x_1, y_1, z_1), B(x_2, y_2, z_2)) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}""")
+ st.write("""\
+ The document that corresponds to the vector with the smallest distance to the query vector is
+ considered the most relevant document. The 3D plot above now shows lines from the query vector (in red) to each
+ document vector (in blue). Each line represents the Euclidean distance from the query vector to a document vector.
+ """)
+ embeddings = st.text_input("vector space:", value="king queen prince princess counselor minister teacher")
+ embeddings = embeddings.split()
+ embeddings_query = st.text_input(label="search term", value='woman')
+
+ import numpy as np
+ import plotly.express as px
+ import plotly.graph_objects as go
+ from sklearn.manifold import TSNE
+
+ embeddings = {word: get_embeddings(word) for word in embeddings}
+ embeddings[embeddings_query] = get_embeddings(embeddings_query)
+
+ tsne = TSNE(n_components=3, perplexity=3, random_state=0)
+ embedding_matrix = np.array(list(embeddings.values()))
+ reduced_embeddings = tsne.fit_transform(embedding_matrix)
+
+ df = pd.DataFrame(reduced_embeddings, columns=["X", "Y", "Z"])
+ df["Word"] = list(embeddings.keys())
+ fig = px.scatter_3d(df, x="X", y="Y", z="Z", text="Word", title="Vector Space", width=800, height=800)
+
+ docs = reduced_embeddings[:-1]
+ query = reduced_embeddings[-1]
+ distances = np.linalg.norm(docs - query, axis=1)
+ closest_doc_index = np.argmin(distances)
+ closest_doc = docs[closest_doc_index]
+
+ for doc in docs:
+     fig.add_trace(go.Scatter3d(x=[query[0], doc[0]], y=[query[1], doc[1]], z=[query[2], doc[2]], mode='lines', line=dict(color='purple', width=2, dash='dash')))
+ fig.add_trace(go.Scatter3d(x=[query[0], closest_doc[0]], y=[query[1], closest_doc[1]], z=[query[2], closest_doc[2]], name='closest', mode='lines', line=dict(color='purple', width=2)))
+
+ st.plotly_chart(fig, use_container_width=True)
+
+ st.write("""\
+ This visualization represents the core concept of a vector database search. The database converts the
+ search query into a vector, then finds the document vectors that are closest to the query vector. Those documents are
+ considered the most relevant to the search query.
+
+ It's important to note that in a real-world application, the vectors would likely exist in much higher dimensional
+ space. However, the same principles apply: the search algorithm finds the document vectors that are nearest to the
+ query vector based on some distance metric.
+ """)
+ st.subheader(":green[Try Yourself]")
+
+ st.write("""\
+ *There is a vector database containing two words (documents): 'king' and 'queen'. Your task is to pinpoint search
+ terms that would yield either of these words. To facilitate this, use the previously presented similarity matrix to
+ seek out words that give a higher correlation with the word in question. For instance, you might want to explore
+ terms such as 'king', 'queen', 'dog', 'prince', 'man', 'minister', 'boy'.*
+ """)
+ embeddings_query = st.text_input(label="search term")

  from langchain.embeddings.openai import OpenAIEmbeddings
  from langchain.vectorstores import FAISS
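The added section describes retrieval as two steps: embed the query, then find the nearest document vectors under a distance metric. Stripped of the Streamlit and Plotly plumbing, that nearest-neighbour step reduces to the following; the toy 3-D vectors and the query below are illustrative stand-ins, not values from the app:

    import numpy as np

    # Toy document vectors, standing in for embeddings already reduced to 3-D.
    doc_vectors = np.array([
        [0.9, 0.1, 0.2],   # "king"
        [0.8, 0.3, 0.1],   # "queen"
        [0.1, 0.9, 0.7],   # "teacher"
    ])
    query_vector = np.array([0.85, 0.2, 0.15])  # the vectorised search term

    # Euclidean distance from the query to every document vector.
    distances = np.linalg.norm(doc_vectors - query_vector, axis=1)

    # The document with the smallest distance is returned as the top result.
    closest_index = int(np.argmin(distances))
    print(closest_index, distances.round(3))

Cosine similarity works the same way, just with np.argmax over normalised dot products instead of np.argmin over distances.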
@@ -599,13 +677,6 @@ def search_vector_database(term):
      docs = db.similarity_search_by_vector(embedding_vector)
      return docs

- st.write("""\
- *There is a vector database containing two words: 'king' and 'queen'. Your task is to pinpoint search
- terms that would yield either of these words. To facilitate this, use the previously presented similarity matrix to
- seek out words that give a higher correlation with the word in question. For instance, you might want to explore
- terms such as 'king', 'queen', 'dog', 'prince', 'man', 'minister', 'boy'.*
- """)
- embeddings_query = st.text_input(label="search term")
  if embeddings_query is not None and embeddings_query != '':
      docs = search_vector_database(embeddings_query)
      st.warning(docs[0].page_content)
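This hunk shows only the tail of search_vector_database() plus the LangChain imports above it. One plausible wiring of the surrounding pieces, using those same imports, is sketched here; the OpenAIEmbeddings instance and the FAISS.from_texts setup are assumptions (only similarity_search_by_vector() appears in the file itself), and running it needs OPENAI_API_KEY set:

    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import FAISS

    embeddings = OpenAIEmbeddings()                        # requires OPENAI_API_KEY
    db = FAISS.from_texts(["king", "queen"], embeddings)   # the two "documents" from the exercise

    def search_vector_database(term):
        # Embed the search term with the same model used for the documents,
        # then return the stored documents closest to that vector.
        embedding_vector = embeddings.embed_query(term)
        docs = db.similarity_search_by_vector(embedding_vector)
        return docs

    print(search_vector_database("prince")[0].page_content)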
@@ -623,7 +694,7 @@ if embeddings_query is not None and embeddings_query != '':
  """)

  divider()
- st.caption("Conclusion")
+ st.subheader("Conclusion")
  st.write("""\
  As embedding algorithms are trained on a vast corpus of data, they inherently encapsulate a rich
  tapestry of information about our language and even the world at large. Therefore, they can be used for:
@@ -643,10 +714,11 @@ with st.expander("References:"):
  - https://platform.openai.com/docs/guides/embeddings/use-cases
  """)

+
+ # *********************************************
  divider()
  st.header("Dimensionality Reduction (optional)")

-
  st.write("""\
  As was mentioned above, embedding vectors are learned in such a way that words with similar meanings
  are located close to each other in the space. However, this is an abstract concept that might be difficult to
@@ -728,7 +800,7 @@ elif dimensionality_name == 'PCA':
  """)
  embedding_dim = 1536
  embeddings = st.text_input("words to explore:",
-                            value="king queen man woman prince prince princess counselor minister teacher")
+                            value="king queen man woman prince princess counselor minister teacher")
  embeddings = embeddings.split()
  embeddings = {word: get_embeddings(word) for word in embeddings}

@@ -787,7 +859,7 @@ elif dimensionality_name == 't-SNE':
  """)
  embedding_dim = 1536
  embeddings = st.text_input("words to explore:",
-                            value="king queen man woman prince prince princess counselor minister teacher")
+                            value="king queen man woman prince princess counselor minister teacher")
  embeddings = embeddings.split()
  embeddings = {word: get_embeddings(word) for word in embeddings}

 