# ========== (c) JP Hwang 25/9/2022 ==========
import logging
import pandas as pd
import numpy as np
import streamlit as st
import plotly.express as px
from scipy import spatial
import random
# ===== SET UP LOGGER =====
logger = logging.getLogger(__name__)
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
sh = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
sh.setFormatter(formatter)
root_logger.addHandler(sh)
# ===== END LOGGER SETUP =====
desired_width = 320
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', desired_width)
sizes = [1, 20, 30]


def get_top_tokens(ser_in):
    from collections import Counter
    tkn_list = '_'.join(ser_in.tolist()).split('_')
    tkn_counts = Counter(tkn_list)
    common_tokens = [i[0] for i in tkn_counts.most_common(10)]
    return common_tokens


def build_chart(df_in):
    fig = px.scatter_3d(df_in, x='r', y='g', z='b',
                        template='plotly_white',
                        color=df_in['simple_name'],
                        color_discrete_sequence=df_in['rgb'],
                        size='size',
                        hover_data=['name'])
    fig.update_layout(
        showlegend=False,
        margin=dict(l=5, r=5, t=20, b=5)
    )
    return fig


def preproc_data():
    df = pd.read_csv('data/colors.csv', names=['simple_name', 'name', 'hex', 'r', 'g', 'b'])
    # Preprocessing
    df['rgb'] = df.apply(lambda x: f'rgb({x.r}, {x.g}, {x.b})', axis=1)
    # Get top 'basic' color names
    df = df.assign(category=df.simple_name.apply(lambda x: x.split('_')[-1]))
    # Set default size attribute
    df['size'] = sizes[0]
    return df


def get_top_colors(df):
    top_colors = df['category'].value_counts()[:15].index.tolist()
    top_colors = [c for c in top_colors if c in df.simple_name.values]
    return top_colors


def main():
    st.title('Colorful vectors')
    st.markdown("""
You might have heard that objects like
words or images can be represented by "vectors".
What does that mean, exactly? It seems like a tricky concept, but it doesn't have to be.
Let's start here, where colors are represented in 3-D space 🌈.
Each axis represents how much of one primary color (`red`, `green`, or `blue`)
a color contains.
For example, `Magenta` is represented by `(255, 0, 255)`,
and `(80, 200, 120)` represents `Emerald`.
That's all a *vector* is in this context - a sequence of numbers.
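
If you'd like to see that in code, here is a minimal sketch. It assumes colors arrive as hex codes such as `#FF00FF`; the `hex_to_rgb` helper is made up for illustration and is not part of this app.

```python
def hex_to_rgb(hex_code):
    # Convert a hex color string such as '#FF00FF' into an (r, g, b) tuple.
    hex_code = hex_code.lstrip('#')
    # Each pair of hex digits encodes one channel in the range 0-255.
    return tuple(int(hex_code[i:i + 2], 16) for i in (0, 2, 4))


magenta = hex_to_rgb('#FF00FF')
emerald = hex_to_rgb('#50C878')
print(magenta, emerald)
```

Each tuple is one point in the 3-D space described above.
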
Take a look at the resulting 3-D image below; it's kind of mesmerising!
(You can spin the image around, as well as zoom in/out.)
"""
)
    df = preproc_data()
    fig = build_chart(df)
    st.plotly_chart(fig)
    st.markdown("""
### Why does this matter?
You see here that similar colors are placed close to each other in space.
It seems obvious, but **this** is the crux of why a *vector representation* is so powerful.
These objects being located *in space* based on their key property (`color`)
enables an easy, objective assessment of similarity.
Let's take this further:
""")
    # ===== SCALAR SEARCH =====
    st.header('Searching in vector space')
    st.markdown("""
Imagine that you need to identify colors similar to a given color.
You could do it by name, for instance looking for colors containing matching words.
But remember that in the 3-D chart above, similar colors are physically close to each other.
So all you actually need to do is to calculate distances, and collect points based on a threshold!
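
For the curious, that distance-and-threshold idea might look something like the sketch below. The tiny `palette` and the `find_similar` helper are assumptions for illustration only; the app itself reads a much larger list of colors from a file.

```python
import numpy as np

# Hypothetical mini-palette of (r, g, b) vectors, for illustration only.
palette = {
    'red': (255, 0, 0),
    'scarlet': (255, 36, 0),
    'crimson': (220, 20, 60),
    'navy': (0, 0, 128),
    'emerald': (80, 200, 120),
}


def find_similar(query, colors, threshold=80):
    # Return the names of colors within `threshold` (Euclidean distance)
    # of the query color, nearest first. The query itself is included.
    q = np.array(colors[query], dtype=float)
    hits = []
    for name, rgb in colors.items():
        dist = np.linalg.norm(np.array(rgb, dtype=float) - q)
        if dist < threshold:
            hits.append((dist, name))
    return [name for dist, name in sorted(hits)]


similar = find_similar('red', palette)
print(similar)
```

Notice that `scarlet` and `crimson` turn up as neighbors of `red` even though neither name contains the word "red".
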
That's probably still a bit abstract - so pick a 'base' color, and we'll go from there.
In fact - try a few different colors while you're at it!
""")
    top_colors = get_top_colors(df)
    # def_choice = random.randrange(len(top_colors))
    query = st.selectbox('Pick a "base" color:', top_colors, index=5)
    match = df[df.simple_name == query].iloc[0]
    scalar_filter = df.simple_name.str.contains(query)
    st.markdown(f"""
The color `{match.simple_name}` is also represented
in our 3-D space by `({match.r}, {match.g}, {match.b})`.
Let's see what we can find using either of these properties.
(Oh, you can adjust the similarity threshold below as well.)
""")
    with st.expander("Similarity search options"):
        st.markdown(f"""
Do you want to find lots of similar colors, or
just a select few colors *very* similar to `{match.simple_name}`?
""")
        thresh_sel = st.slider('Select a similarity threshold',
                               min_value=20, max_value=160,
                               value=80, step=20)
    st.markdown("---")
    df['size'] = sizes[0]
    df.loc[scalar_filter, 'size'] = sizes[1]
    df.loc[df.simple_name == match.simple_name, 'size'] = sizes[2]
    scalar_fig = build_chart(df)
    scalar_hits = df[scalar_filter]['name'].values
    # ===== VECTOR SEARCH =====
    vector = match[['r', 'g', 'b']].values.tolist()
    dist_metric = 'euc'

    def get_dist(a, b, method):
        if method == 'euc':
            return np.linalg.norm(a-b)
        else:
            return spatial.distance.cosine(a, b)

    df['dist'] = df[['r', 'g', 'b']].apply(lambda x: get_dist(x, vector, dist_metric), axis=1)
    df['size'] = sizes[0]
    if dist_metric == 'euc':
        vec_filter = df['dist'] < thresh_sel
    else:
        vec_filter = df['dist'] < 0.05
    df.loc[vec_filter, 'size'] = sizes[1]
    df.loc[((df['r'] == vector[0]) &
            (df['g'] == vector[1]) &
            (df['b'] == vector[2])
            ),
           'size'] = sizes[2]
    vector_fig = build_chart(df)
    vector_hits = df[vec_filter].sort_values('dist')['name'].values
    # ===== OUTPUTS =====
    col1, col2 = st.columns(2)
    with col1:
        st.markdown(f"These colors contain the text `{match.simple_name}`:")
        st.plotly_chart(scalar_fig, use_container_width=True)
        st.markdown(f"Found {len(scalar_hits)} colors containing the string `{query}`.")
        with st.expander("Click to see the whole list"):
            st.markdown("- " + "\n- ".join(scalar_hits))
    with col2:
        st.markdown(f"These colors are close to the vector `({match.r}, {match.g}, {match.b})`:")
        st.plotly_chart(vector_fig, use_container_width=True)
        st.markdown(f"Found {len(vector_hits)} colors similar to `{query}` based on its `(R, G, B)` values.")
        with st.expander("Click to see the whole list"):
            st.markdown("- " + "\n- ".join(vector_hits))
    # ===== REFLECTIONS =====
    unique_hits = [c for c in vector_hits if c not in scalar_hits]
    st.markdown("---")
    st.header("So what?")
    st.markdown("""
What did you notice?
The thing that stood out to me is how *robust* and *consistent*
the vector search results are.
It manages to find a bunch of related colors
regardless of what they're called. It doesn't matter that the color
'scarlet' does not contain the word 'red';
it goes ahead and finds all the neighboring colors based on a consistent criterion.
It easily found these colors, which a search by name alone would have missed:
""")
    with st.expander("See list:"):
        st.markdown("- " + "\n- ".join(unique_hits))
    st.markdown("""
I think it's brilliant - think about how much of a pain word searching is,
and how inconsistent it is. This has so many advantages!
---
""")
    st.header("Generally speaking...")
    st.markdown("""
Obviously, this is a pretty simple, self-contained example.
Colors are particularly well suited to being represented by just a few
numbers, like our primary colors. One number represents how much
`red` each color contains, another how much `green`, and the last how much `blue`.
But that core concept of representing similarity along different
properties using numbers is exactly what happens in other domains.
The only differences are in *how many* numbers are used, and what
they represent. For example, words or documents might be represented by
hundreds (e.g. 300 or 768) of AI-derived numbers.
We'll take a look at those examples as well later on.
Techniques used to visualise those high-dimensional vectors are called
dimensionality reduction techniques. If you would like to see this in action, check out
[this app](https://huggingface.co/spaces/jphwang/reduce_dimensions).
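
To give a rough flavor of the idea, here is a minimal sketch of one such technique, principal component analysis (PCA), written with plain NumPy. The `pca_reduce` helper and the random stand-in data are assumptions for illustration; this is not necessarily the method the linked app uses.

```python
import numpy as np


def pca_reduce(vectors, n_components=2):
    # Center the data so each dimension has zero mean.
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Project every vector onto the top `n_components` directions.
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 300))  # made-up stand-in for real vectors
reduced = pca_reduce(fake_embeddings)
print(reduced.shape)
```

Each 300-number vector is squeezed down to 2 numbers, which is few enough to draw on a flat chart.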
""")
    st.markdown("""
---
If you liked this - [follow me (@_jphwang) on Twitter](https://twitter.com/_jphwang)!
""")


if __name__ == '__main__':
    main()