Spaces: digitiamosrl/recsys-and-customer-segmentation

tave-st committed on Oct 21, 2022

Commit

e618873

1 Parent(s): 86bb7fc

initial commit

Files changed (10)

.gitattributes +1 -0
.gitignore +1 -0
Data/OnlineRetail.csv +3 -0
README.md +54 -13
pages/clustering.py +373 -0
recommender.py +126 -0
recommender_system.py +366 -0
requirements.txt +11 -0
requirements_freezed.txt +68 -0
utils.py +45 -0

.gitattributes CHANGED

@@ -31,3 +31,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.csv filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

@@ -0,0 +1 @@

+__pycache__

Data/OnlineRetail.csv ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d07aec9960083af2339975a3f9d3b26313b342dcd9f86cce0b919b1cde639a44
+size 45580638

README.md CHANGED

@@ -1,13 +1,54 @@
----
-title: Demo Confindustria
-emoji: 🐨
-colorFrom: purple
-colorTo: blue
-sdk: streamlit
-sdk_version: 1.10.0
-app_file: app.py
-pinned: false
-license: mit
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Demo Confindustria
+Demo with recsys and clustering for the [online retail](https://www.kaggle.com/datasets/vijayuv/onlineretail?select=OnlineRetail.csv) dataset.
+## Objective
+Recommender system:
+    1. interactively select a user
+    2. show all the recommendations for the user
+    3. explain why we get these suggestions (which purchased object influences the most)
+    4. plot the purchases and suggested articles
+Clustering:
+    1. compute the user clustering
+    2. plot users and their clusters
+    3. explain the meaning of the clusters (compute the mean metrics or literally explain them)
+## Setup
+In your terminal run:
+```bash
+# Enable the env
+source .venv/bin/activate
+# Install the dependencies
+pip install -r requirements.txt
+# Or install the frozen dependencies from requirements_freezed.txt
+# You are ready to rock!
+```
+## Run
+In your terminal run:
+```bash
+streamlit run recommender_system.py
+# Your default browser will now open the
+# Streamlit page. If you want to customize how
+# Streamlit runs, refer to its documentation.
+```
+## Resources
+- [streamlit](https://streamlit.io/)
+- [implicit](https://github.com/benfred/implicit), recsys library
+- [t-sne guide](https://distill.pub/2016/misread-tsne/)
+- [RFM segmentation](https://www.omniconvert.com/blog/rfm-score/)

pages/clustering.py ADDED

@@ -0,0 +1,373 @@

+from collections import defaultdict
+import streamlit as st
+from utils import load_and_preprocess_data
+import pandas as pd
+import numpy as np
+import altair as alt
+from sklearn.mixture import GaussianMixture
+import plotly.express as px
+import itertools
+from typing import Dict, List
+SIDEBAR_DESCRIPTION = """
+# Client clustering
+To cluster a client, we adopt the RFM metrics. They stand for:
+- R = recency, that is the number of days since the last purchase
+    in the store
+- F = frequency, that is the number of times a customer has ordered something
+- M = monetary value, that is how much a customer has spent buying
+    from your business.
+Given these 3 metrics, we can cluster the customers and find a suitable
+"definition" based on the clusters they belong to. Since the dataset
+we're using right now has about 5000 distinct customers, we identify
+3 clusters for each metric.
+## How we compute the clusters
+We fit a Gaussian mixture model with 3 components to each metric separately.
+Each customer is assigned to the component they most probably belong to,
+so customers with similar values tend to end up in the same cluster.
+""".lstrip()
+FREQUENCY_CLUSTERS_EXPLAIN = """
+The **frequency** denotes how frequently a customer has ordered.
+There are 3 available clusters for this metric:
+- cluster 1: these customers purchased only once or a few times (range [{}, {}])
+- cluster 2: these customers placed a moderate number of orders (range [{}, {}])
+- cluster 3: these customers purchased many times (range [{}, {}])
+-------
+""".lstrip()
+RECENCY_CLUSTERS_EXPLAIN = """
+The **recency** refers to how recently a customer has bought.
+There are 3 available clusters for this metric:
+- cluster 1: the last order of these clients was a long time ago (range [{}, {}])
+- cluster 2: these clients purchased something, but not very recently (range [{}, {}])
+- cluster 3: the last order of these clients was a few days/weeks ago (range [{}, {}])
+-------
+""".lstrip()
+MONETARY_CLUSTERS_EXPLAIN = """
+The **revenue** refers to how much a customer has spent buying
+from your business.
+There are 3 available clusters for this metric:
+- cluster 1: these clients spent little money (range [{}, {}])
+- cluster 2: these clients spent a considerable amount of money (range [{}, {}])
+- cluster 3: these clients spent lots of money (range [{}, {}])
+-------
+""".lstrip()
+EXPLANATION_DICT = {
+    "Frequency_cluster": FREQUENCY_CLUSTERS_EXPLAIN,
+    "Recency_cluster": RECENCY_CLUSTERS_EXPLAIN,
+    "Revenue_cluster": MONETARY_CLUSTERS_EXPLAIN,
+}
+def create_features(df: pd.DataFrame):
+    """Creates a new dataframe with the RFM features for each client."""
+    # Compute frequency, the number of distinct dates on which a user purchased.
+    client_features = df.groupby("CustomerID")["InvoiceDate"].nunique().reset_index()
+    client_features.columns = ["CustomerID", "Frequency"]
+    # Add monetary value, the total revenue for each single user.
+    client_takings = df.groupby("CustomerID")["Price"].sum()
+    client_features["Revenue"] = client_takings.values
+    # Add recency, i.e. the days since the last purchase in the store.
+    max_date = df.groupby("CustomerID")["InvoiceDate"].max().reset_index()
+    max_date.columns = ["CustomerID", "LastPurchaseDate"]
+    client_features["Recency"] = (
+        max_date["LastPurchaseDate"].max() - max_date["LastPurchaseDate"]
+    ).dt.days
+    return client_features
+@st.cache
+def cluster_clients(df: pd.DataFrame):
+    """Computes the RFM features and clusters for each user based on the RFM metrics."""
+    df_rfm = create_features(df)
+    for to_cluster, order in zip(
+        ["Revenue", "Frequency", "Recency"], ["ascending", "ascending", "descending"]
+    ):
+        gmm = GaussianMixture(n_components=3, random_state=42)
+        labels = gmm.fit_predict(df_rfm[[to_cluster]])
+        df_rfm[f"{to_cluster}_cluster"] = _order_cluster(gmm, labels, order)
+    return df_rfm
+def _order_cluster(cluster_model: GaussianMixture, clusters, order="ascending"):
+    """Relabels the clusters from 1 to n_components, sorted by their mean in the given order."""
+    centroids = cluster_model.means_.sum(axis=1)
+    if order.lower() == "descending":
+        centroids *= -1
+    ascending_order = np.argsort(centroids)
+    lookup_table = np.zeros_like(ascending_order)
+    # Cluster will start from 1
+    lookup_table[ascending_order] = np.arange(cluster_model.n_components) + 1
+    return lookup_table[clusters]
+def show_purhcase_history(user: int, df: pd.DataFrame):
+    user_purchases = df.loc[df.CustomerID == user, ["Price", "InvoiceDate"]]
+    expenses = user_purchases.groupby(user_purchases.InvoiceDate).sum()
+    expenses.columns = ["Expenses"]
+    expenses = expenses.reset_index()
+    c = (
+        alt.Chart(expenses)
+        .mark_line(point=True)
+        .encode(
+            x=alt.X("InvoiceDate", timeUnit="yearmonthdate", title="Date"),
+            y="Expenses",
+        )
+        .properties(title="User expenses")
+    )
+    st.altair_chart(c, use_container_width=True)
+def show_user_info(user: int, df_rfm: pd.DataFrame):
+    """Prints some information about the user.
+    The main information are the total expenses, how
+    many times he purchases in the store, and the clusters
+    he belongs to.
+    """
+    user_row = df_rfm[df_rfm["CustomerID"] == user]
+    if len(user_row) == 0:
+        st.write(f"No user with id {user}")
+    output = []
+    output.append(f"The user purchased **{user_row['Frequency'].squeeze()} times**.\n")
+    output.append(
+        f"She/he spent **{user_row['Revenue'].squeeze()} dollars** in total.\n"
+    )
+    output.append(
+        f"The last time she/he bought something was **{user_row['Recency'].squeeze()} days ago**.\n"
+    )
+    output.append(f"She/he belongs to the clusters: ")
+    for cluster in [column for column in user_row.columns if "_cluster" in column]:
+        output.append(f"- {cluster} = {user_row[cluster].squeeze()}")
+    st.write("\n".join(output))
+    return (
+        user_row["Recency_cluster"].squeeze(),
+        user_row["Frequency_cluster"].squeeze(),
+        user_row["Revenue_cluster"].squeeze(),
+    )
+def explain_cluster(cluster_info):
+    """Displays an expandable panel explaining the meaning of the clusters."""
+    with st.expander("Show information about the clusters"):
+        st.write(
+            "**Note**: these values are valid for this dataset. "
+            "A different dataset will have a different number of clusters"
+            " and values."
+        )
+        for cluster, info in cluster_info.items():
+            st.write(EXPLANATION_DICT[cluster].format(*info))
+def categorize_user(recency_cluster, frequency_cluster, monetary_cluster):
+    """Describe the user with few words based on the cluster he belongs to."""
+    score = f"{recency_cluster}{frequency_cluster}{monetary_cluster}"
+    # @fixme: find a better approach. These elif chains don't scale at all.
+    description = ""
+    if score == "111":
+        description = "Tourist"
+    elif score.startswith("2"):
+        description = "Losing interest"
+    elif score == "133":
+        description = "Former lover"
+    elif score == "123":
+        description = "Former passionate client"
+    elif score == "113":
+        description = "Spent a lot, but never come back"
+    elif score.startswith("1"):
+        description = "About to dump"
+    elif score == "313":
+        description = "Potential lover"
+    elif score == "312":
+        description = "Interesting new client"
+    elif score == "311":
+        description = "New customer"
+    elif score == "333":
+        description = "Gold client"
+    elif score == "322":
+        description = "Lovers"
+    else:
+        description = "Average client"
+    st.write(f"The customer can be described as: **{description}**")
+def plot_rfm_distribution(df_rfm: pd.DataFrame, cluster_info: Dict[str, List[int]]):
+    """Plots 3 histograms for the RFM metrics."""
+    for x in ("Revenue", "Frequency", "Recency"):
+        fig = px.histogram(df_rfm, x=x, log_y=True, title=f"{x} metric")
+        # Get the max value in the cluster info. The cluster info is a list of min - max
+        # values per cluster.
+        values = cluster_info[f"{x}_cluster"]
+        for n_cluster, i in enumerate(range(1, len(values), 2)):
+            fig.add_vline(
+                x=values[i],
+                annotation_text=f"End of cluster {n_cluster+1}",
+                line_dash="dot",
+                annotation=dict(textangle=90, font_color="red"),
+            )
+        st.plotly_chart(fig, use_container_width=True)
+def display_dataframe_heatmap(df_rfm: pd.DataFrame):
+    """Displays a heatmap of how many clients fall in each combination of clusters.
+    This method uses some black magic coming from the dataframe
+    styling guide.
+    """
+    # Create a dataframe with the count of clients for each group
+    # of cluster.
+    count = (
+        df_rfm.groupby(["Recency_cluster", "Frequency_cluster", "Revenue_cluster"])[
+            "CustomerID"
+        ]
+        .count()
+        .reset_index()
+    )
+    count = count.rename(columns={"CustomerID": "Count"})
+    # Remove duplicates
+    count = count.drop_duplicates(
+        ["Revenue_cluster", "Frequency_cluster", "Recency_cluster"]
+    )
+    # Use the count column as values, then index with the clusters.
+    count = count.pivot(
+        index=["Revenue_cluster", "Frequency_cluster"],
+        columns="Recency_cluster",
+        values="Count",
+    )
+    # Style manipulation
+    cell_hover = {
+        "selector": "td",
+        "props": "font-size:1.5em",
+    }
+    index_names = {
+        "selector": ".index_name",
+        "props": "font-style: italic; color: Black; font-weight:normal;font-size:1.5em;",
+    }
+    headers = {
+        "selector": "th:not(.index_name)",
+        "props": "background-color: White; color: black; font-size:1.5em",
+    }
+    # Finally, display
+    # We cannot directly print the dataframe since the streamlit
+    # function removes the multiindex. Thus, we extract the html representation
+    # and then display it.
+    st.markdown("## Heatmap: how the clients are distributed between clusters")
+    st.write(
+        count.style.format(thousands=" ", precision=0, na_rep="Missing")
+        .set_table_styles([cell_hover, index_names, headers])
+        .background_gradient(cmap="coolwarm")
+        .to_html(),
+        unsafe_allow_html=True,
+    )
+def main():
+    st.sidebar.markdown(SIDEBAR_DESCRIPTION)
+    df, _, _ = load_and_preprocess_data()
+    df_rfm = cluster_clients(df)
+    st.markdown(
+        "# Dataset "
+        "\nThis is the processed dataset with information about the clients, such as"
+        " the RFM values and the clusters they belong to."
+        )
+    st.dataframe(df_rfm)
+    cluster_info_dict = defaultdict(list)
+    with st.expander("Show more details about the clusters"):
+        for cluster in [column for column in df_rfm.columns if "_cluster" in column]:
+            st.write(cluster)
+            cluster_info = (
+                df_rfm.groupby(cluster)[cluster.split("_")[0]]
+                .describe()
+                .reset_index(names="Cluster")
+            )
+            min_cluster = cluster_info["min"].astype(int)
+            max_cluster = cluster_info["max"].astype(int)
+            min_max_interleaved = list(itertools.chain(*zip(min_cluster, max_cluster)))
+            cluster_info_dict[cluster].extend(min_max_interleaved)
+            st.dataframe(cluster_info)
+    st.markdown("## RFM metric distribution")
+    plot_rfm_distribution(df_rfm, cluster_info_dict)
+    display_dataframe_heatmap(df_rfm)
+    st.markdown("## Interactive exploration")
+    filter_by_cluster = st.checkbox(
+        "Filter client: only one client per cluster type",
+        value=True,
+    )
+    client_to_select = (
+        df_rfm.groupby(["Recency_cluster", "Frequency_cluster", "Revenue_cluster"])["CustomerID"].first().values
+        if filter_by_cluster
+        else df["CustomerID"].unique()
+    )
+    # Let the user select the user to investigate
+    user = st.selectbox(
+        "Select a customer to show more information about him.",
+        client_to_select,
+    )
+    show_purhcase_history(user, df)
+    recency, frequency, revenue = show_user_info(user, df_rfm)
+    categorize_user(recency, frequency, revenue)
+    explain_cluster(cluster_info_dict)
+main()
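
The `@fixme` in `categorize_user` notes that the `elif` chain does not scale. One possible table-driven alternative is sketched below. The names `describe_score`, `EXACT_LABELS`, `PREFIX_LABELS` and `DEFAULT_LABEL` are hypothetical and not part of the commit; the labels are copied from the chain above. Exact scores are checked first and the recency prefix second, which should reproduce the precedence of the original chain.

```python
from typing import Dict

# Labels for exact scores. Each digit is a cluster in 1..3, in the order
# recency, frequency, revenue (same convention as categorize_user).
EXACT_LABELS: Dict[str, str] = {
    "111": "Tourist",
    "133": "Former lover",
    "123": "Former passionate client",
    "113": "Spent a lot, but never come back",
    "313": "Potential lover",
    "312": "Interesting new client",
    "311": "New customer",
    "333": "Gold client",
    "322": "Lovers",
}

# Fallback labels keyed by the recency digit only.
PREFIX_LABELS: Dict[str, str] = {
    "2": "Losing interest",
    "1": "About to dump",
}

DEFAULT_LABEL = "Average client"


def describe_score(recency_cluster, frequency_cluster, monetary_cluster) -> str:
    """Return the description for a score (hypothetical helper)."""
    score = f"{recency_cluster}{frequency_cluster}{monetary_cluster}"
    # An exact match wins over the prefix rule, as in the elif chain.
    if score in EXACT_LABELS:
        return EXACT_LABELS[score]
    return PREFIX_LABELS.get(score[0], DEFAULT_LABEL)


print(describe_score(1, 1, 1))
print(describe_score(2, 3, 3))
print(describe_score(3, 3, 1))
```

With this layout, adding a new segment means adding one dictionary entry instead of another branch.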

recommender.py ADDED

@@ -0,0 +1,126 @@

+from implicit.als import AlternatingLeastSquares
+from implicit.lmf import LogisticMatrixFactorization
+from implicit.bpr import BayesianPersonalizedRanking
+from implicit.nearest_neighbours import bm25_weight
+from scipy.sparse import csr_matrix
+from typing import Dict, Any
+MODEL = {
+    "lmf": LogisticMatrixFactorization,
+    "als": AlternatingLeastSquares,
+    "bpr": BayesianPersonalizedRanking,
+}
+def _get_sparse_matrix(values, user_idx, product_idx):
+    return csr_matrix(
+        (values, (user_idx, product_idx)),
+        shape=(len(user_idx.unique()), len(product_idx.unique())),
+    )
+def _get_model(name: str, **params):
+    model = MODEL.get(name)
+    if model is None:
+        raise ValueError("No model with name {}".format(name))
+    return model(**params)
+class InternalStatusError(Exception):
+    pass
+class Recommender:
+    def __init__(
+        self,
+        values,
+        user_idx,
+        product_idx,
+    ):
+        self.user_product_matrix = _get_sparse_matrix(values, user_idx, product_idx)
+        self.user_idx = user_idx
+        self.product_idx = product_idx
+        # This variable will be set during training phase
+        self.model = None
+        self.fitted = False
+    def create_and_fit(
+        self,
+        model_name: str,
+        weight_strategy: str = "bm25",
+        model_params: Dict[str, Any] = {},
+    ):
+        weight_strategy = weight_strategy.lower()
+        if weight_strategy == "bm25":
+            data = bm25_weight(
+                self.user_product_matrix,
+                K1=1.2,
+                B=0.75,
+            )
+        elif weight_strategy == "balanced":
+            # Balance the positive and negative (nan) entries
+            # http://stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf
+            total_size = (
+                self.user_product_matrix.shape[0] * self.user_product_matrix.shape[1]
+            )
+            total = self.user_product_matrix.sum()
+            num_zeros = total_size - self.user_product_matrix.count_nonzero()
+            data = self.user_product_matrix.multiply(num_zeros / total)
+        elif weight_strategy == "same":
+            data = self.user_product_matrix
+        else:
+            raise ValueError("Weight strategy not supported")
+        self.model = _get_model(model_name, **model_params)
+        self.model.fit(data)
+        # Mark the recommender as fitted only after training succeeds
+        self.fitted = True
+        return self
+    def recommend_products(
+        self,
+        user_id,
+        items_to_recommend = 5,
+    ):
+        """Finds the recommended items for the user.
+        Returns:
+            (items, scores) pair, where items holds the product indices of the suggested items.
+        """
+        if not self.fitted:
+            raise InternalStatusError(
+                "Cannot recommend products without previously fitting the model."
+                " Please, consider fitting the model before recommending products."
+            )
+        return self.model.recommend(
+            user_id,
+            self.user_product_matrix[user_id],
+            filter_already_liked_items=True,
+            N=items_to_recommend,
+        )
+    def explain_recommendation(
+        self,
+        user_id,
+        suggested_item_id,
+        recommended_items,
+    ):
+        _, items_score_contrib, _ = self.model.explain(
+            user_id,
+            self.user_product_matrix,
+            suggested_item_id,
+            N=recommended_items,
+        )
+        return items_score_contrib
+    def similar_users(self, user_id):
+        return self.model.similar_users(user_id)
+    @property
+    def item_factors(self):
+        return self.model.item_factors
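
The `"balanced"` branch of `create_and_fit` scales every observed entry by the number of empty cells divided by the sum of the observed values. The arithmetic can be sketched without any dependency as follows. This is only an illustration: the dictionary representation and the name `balanced_weights` are assumptions, while the module itself works on a `scipy.sparse.csr_matrix`.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (user index, product index)


def balanced_weights(
    interactions: Dict[Cell, float], n_users: int, n_items: int
) -> Dict[Cell, float]:
    """Scale observed entries by (empty cells) / (sum of observed values)."""
    # Explicit zeros do not count as observed interactions.
    nonzero = {cell: value for cell, value in interactions.items() if value != 0}
    total = sum(nonzero.values())
    num_zeros = n_users * n_items - len(nonzero)
    scale = num_zeros / total
    return {cell: value * scale for cell, value in nonzero.items()}


# Toy 2 x 3 user-product matrix with three purchases (made-up values).
toy = {(0, 0): 2.0, (0, 2): 1.0, (1, 1): 3.0}
weighted = balanced_weights(toy, n_users=2, n_items=3)
print(weighted)
```

In the toy matrix there are 3 empty cells and the quantities sum to 6, so every entry is halved. After scaling, the total weight of the observed entries equals the number of empty cells, which is the sense in which positives and negatives are balanced.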

recommender_system.py ADDED

@@ -0,0 +1,366 @@

+import streamlit as st
+import pandas as pd
+import altair as alt
+from recommender import Recommender
+from sklearn.decomposition import PCA
+from sklearn.manifold import TSNE
+from os import cpu_count
+import numpy as np
+import time
+import random
+from utils import load_and_preprocess_data
+import matplotlib.pyplot as plt
+from typing import Union, List, Dict, Any, TYPE_CHECKING
+import plotly.graph_objects as go
+COLUMN_NOT_DISPLAY = [
+    "StockCode",
+    "UnitPrice",
+    "Country",
+    "CustomerIndex",
+    "ProductIndex",
+]
+SIDEBAR_DESCRIPTION = """
+# Recommender system
+## What is it?
+A recommender system is a tool that suggests something new to a particular
+user that she/he might be interested in. It becomes really useful when
+the number of items that a user can choose from is high.
+## How does it work?
+A recommender system internally finds similar users and similar items,
+based on a suitable definition of "similarity".
+For example, users that purchased the same items can be considered similar.
+When we want to suggest new items to a user, a recommender system exploits
+the items bought by similar users as a starting point for the suggestion.
+The items bought by similar users are compared to the items that the user
+already bought. If they are new and similar, the model suggests them.
+## How we prepare the data
+For each user, we compute the quantity purchased for every single item.
+This quantity is the value the model uses to compute
+the similarity. Items that a user has never bought are
+left at zero. These zeros will be the subject of the recommendation.
+""".lstrip()
+@st.cache(allow_output_mutation=True)
+def create_and_fit_recommender(
+    model_name: str,
+    values: Union[pd.DataFrame, "np.ndarray"],
+    users: Union[pd.DataFrame, "np.ndarray"],
+    products: Union[pd.DataFrame, "np.ndarray"],
+) -> Recommender:
+    recommender = Recommender(
+        values,
+        users,
+        products,
+    )
+    recommender.create_and_fit(
+        model_name,
+        # Fine-tuned values
+        model_params=dict(
+            factors=190,
+            alpha=0.6,
+            regularization=0.06,
+        ),
+    )
+    return recommender
+def explain_recommendation(
+    recommender: Recommender,
+    user_id: int,
+    suggestions: List[int],
+    df: pd.DataFrame,
+):
+    output = []
+    n_recommended = len(suggestions)
+    for suggestion in suggestions:
+        explained = recommender.explain_recommendation(
+            user_id, suggestion, n_recommended
+        )
+        suggested_items_id = [id[0] for id in explained]
+        suggested_description = (
+            df.loc[df.ProductIndex == suggestion][["Description", "ProductIndex"]]
+            .drop_duplicates(subset=["ProductIndex"])["Description"]
+            .unique()[0]
+        )
+        similar_items_description = (
+            df.loc[df["ProductIndex"].isin(suggested_items_id)][
+                ["Description", "ProductIndex"]
+            ]
+            .drop_duplicates(subset=["ProductIndex"])["Description"]
+            .unique()
+        )
+        output.append(
+            f"The item **{suggested_description.strip()}** "
+            "has been suggested because it is similar to the following products"
+            " bought by the user:"
+        )
+        for description in similar_items_description:
+            output.append(f"- {description.strip()}")
+    with st.expander("See why the model recommended these products"):
+        st.write("\n".join(output))
+    st.write("------")
+def print_suggestions(suggestions: List[int], df: pd.DataFrame):
+    similar_items_description = (
+        df.loc[df["ProductIndex"].isin(suggestions)][["Description", "ProductIndex"]]
+        .drop_duplicates(subset=["ProductIndex"])["Description"]
+        .unique()
+    )
+    output = ["The model suggests the following products:"]
+    for description in similar_items_description:
+        output.append(f"- {description.strip()}")
+    st.write("\n".join(output))
+def display_user_char(user: int, data: pd.DataFrame):
+    subset = data[data.CustomerIndex == user]
+    # products = subset.groupby("ProductIndex").agg(
+    #     {"Description": lambda x: x.iloc[0], "Quantity": sum}
+    # )
+    st.write(
+        "The user {} bought {} distinct products. Here is the purchase history: ".format(
+            user, subset["Description"].nunique()
+        )
+    )
+    st.dataframe(
+        subset.sort_values("InvoiceDate").drop(
+            # Do not show the customer since we are displaying the
+            # information for a specific customer.
+            COLUMN_NOT_DISPLAY + ["CustomerID"],
+            axis=1,
+        )
+    )
+    st.write("-----")
+def _extract_description(df, products):
+    desc = df[df["ProductIndex"].isin(products)].drop_duplicates(
+        "ProductIndex", ignore_index=True
+    )[["ProductIndex", "Description"]]
+    return desc.set_index("ProductIndex")
+def display_recommendation_plots(
+    user_id: int,
+    suggestions: List[int],
+    df: pd.DataFrame,
+    model: Recommender,
+):
+    """Plots a t-SNE with the suggested items, together with the purchases of
+    similar users.
+    """
+    # Get the purchased items that contribute the most to the suggestions
+    contributions = []
+    n_recommended = len(suggestions)
+    for suggestion in suggestions:
+        items_and_score = model.explain_recommendation(
+            user_id, suggestion, n_recommended
+        )
+        contributions.append([t[0] for t in items_and_score])
+    contributions = np.unique(np.concatenate(contributions))
+    print("Contribution computed")
+    print(contributions)
+    print("=" * 80)
+    # Find the purchases of similar users
+    bought_by_similar_users = []
+    sim_users, _ = model.similar_users(user_id)
+    for u in sim_users:
+        _, sim_purchases = model.user_product_matrix[u].nonzero()
+        bought_by_similar_users.append(sim_purchases)
+    bought_by_similar_users = np.unique(np.concatenate(bought_by_similar_users))
+    print("Similar bought computed")
+    print(bought_by_similar_users)
+    print("=" * 80)
+    # Compute the t-sne
+    # Concatenate all the vectors so the decomposition is computed only once
+    to_decompose = np.concatenate(
+        (
+            model.item_factors[suggestions],
+            model.item_factors[contributions],
+            model.item_factors[bought_by_similar_users],
+        )
+    )
+    print(f"Shape to decompose: {to_decompose.shape}")
+    with st.spinner("Computing plots (this might take around 60 seconds)..."):
+        elapsed = time.time()
+        decomposed = _tsne_decomposition(
+            to_decompose,
+            dict(
+                perplexity=30,
+                metric="euclidean",
+                n_iter=1_000,
+                random_state=42,
+            ),
+        )
+    elapsed = time.time() - elapsed
+    print(f"TSNE computed in {elapsed}")
+    print("=" * 80)
+    # Extract the decomposed vectors
+    suggestion_dec = decomposed[: len(suggestions), :]
+    contribution_dec = decomposed[
+        len(suggestions) : len(suggestions) + len(contributions), :
+    ]
+    items_others_dec = decomposed[-len(bought_by_similar_users) :, :]
+    # Also, extract the description to create a nice hover in
+    # the final plot.
+    contribution_description = _extract_description(df, contributions)
+    items_other_description = _extract_description(df, bought_by_similar_users)
+    suggestion_description = _extract_description(df, suggestions)
+    # Plot the scatterplot
+    fig = go.Figure()
+    fig.add_trace(
+        go.Scatter(
+            x=contribution_dec[:, 0],
+            y=contribution_dec[:, 1],
+            mode="markers",
+            opacity=0.8,
+            name="Similar bought by user",
+            marker_symbol="square-open",
+            marker_color="darkviolet",
+            marker_size=10,
+            hovertext=contribution_description.loc[contributions].values.squeeze(),
+        )
+    )
+    fig.add_trace(
+        go.Scatter(
+            x=items_others_dec[:, 0],
+            y=items_others_dec[:, 1],
+            mode="markers",
+            name="Product bought by similar users",
+            opacity=0.7,
+            marker_symbol="circle-open",
+            marker_size=10,
+            hovertext=items_other_description.loc[
+                bought_by_similar_users
+            ].values.squeeze(),
+        )
+    )
+    fig.add_trace(
+        go.Scatter(
+            x=suggestion_dec[:, 0],
+            y=suggestion_dec[:, 1],
+            mode="markers",
+            name="Suggested",
+            marker_color="red",
+            marker_symbol="star",
+            marker_size=10,
+            hovertext=suggestion_description.loc[suggestions].values.squeeze(),
+        )
+    )
+    fig.update_xaxes(visible=False)
+    fig.update_yaxes(visible=False)
+    fig.update_layout(plot_bgcolor="white")
+    return fig
+def _tsne_decomposition(data: np.ndarray, tsne_args: Dict[str, Any]):
+    if data.shape[1] > 50:
+        print("Performing PCA...")
+        data = PCA(n_components=50).fit_transform(data)
+    return TSNE(
+        n_components=2,
+        n_jobs=cpu_count(),
+        **tsne_args,
+    ).fit_transform(data)
+def main():
+    # Load and process data
+    data, users, products = load_and_preprocess_data()
+    recommender = create_and_fit_recommender(
+        "als",
+        data["Quantity"],
+        users,
+        products,
+    )
+    st.markdown(
+        """# Recommender system
+The dataset used for these computations is the following:
+        """
+    )
+    st.sidebar.markdown(SIDEBAR_DESCRIPTION)
+    # Show the data
+    st.dataframe(
+        data.drop(
+            COLUMN_NOT_DISPLAY,
+            axis=1,
+        ),
+        use_container_width=True,
+    )
+    st.markdown("## Interactive suggestion")
+    with st.form("recommend"):
+        # Let the user select the user to investigate
+        user = st.selectbox(
+            "Select a customer to get his recommendations",
+            users.unique(),
+        )
+        items_to_recommend = st.slider("How many items to recommend?", 1, 10, 5)
+        print(items_to_recommend)
+        submitted = st.form_submit_button("Recommend!")
+        if submitted:
+            # show_purhcase_history(user, data)
+            display_user_char(user, data)
+            suggestions_and_score = recommender.recommend_products(
+                user, items_to_recommend
+            )
+            print_suggestions(suggestions_and_score[0], data)
+            explain_recommendation(recommender, user, suggestions_and_score[0], data)
+            st.markdown(
+                "## How the purchases of similar users influence the recommendation"
+            )
+            fig = display_recommendation_plots(
+                user, suggestions_and_score[0], data, recommender
+            )
+            st.plotly_chart(fig)
+main()
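
`display_recommendation_plots` stacks three groups of item vectors, runs one decomposition, and then slices the result back into the three groups. A minimal sketch of that split step is shown below, using cumulative offsets instead of hand-written slice bounds. The helper `split_by_lengths` is hypothetical and the strings stand in for rows of the decomposed array. It relies only on slicing, so it should work on a NumPy array as well.

```python
from itertools import accumulate
from typing import List, Sequence


def split_by_lengths(rows: Sequence, lengths: Sequence[int]) -> List[Sequence]:
    """Split stacked rows back into consecutive groups of the given lengths."""
    # offsets[i] is where group i starts; the last offset is the total length.
    offsets = [0, *accumulate(lengths)]
    if offsets[-1] != len(rows):
        raise ValueError("lengths must add up to the number of rows")
    return [rows[start:stop] for start, stop in zip(offsets, offsets[1:])]


# Placeholder rows: 2 suggestions, 1 contribution, 3 items from similar users.
stacked = ["s0", "s1", "c0", "o0", "o1", "o2"]
suggested, contributing, others = split_by_lengths(stacked, [2, 1, 3])
print(suggested, contributing, others)
```

Forward offsets also behave well for an empty group. A negative-offset slice such as `rows[-n:]` returns every row when `n` is 0, whereas the offset-based split returns an empty group.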

requirements.txt ADDED

@@ -0,0 +1,11 @@

+pandas
+scikit-learn
+streamlit
+implicit
+scipy
+tqdm
+numpy
+matplotlib
+seaborn
+mlxtend
+plotly==5.9.0

requirements_freezed.txt ADDED

@@ -0,0 +1,68 @@

+altair==4.2.0
+attrs==22.1.0
+black==22.10.0
+blinker==1.5
+cachetools==5.2.0
+certifi==2022.9.24
+charset-normalizer==2.1.1
+click==8.1.3
+commonmark==0.9.1
+contourpy==1.0.5
+cycler==0.11.0
+decorator==5.1.1
+entrypoints==0.4
+fonttools==4.37.4
+gitdb==4.0.9
+GitPython==3.1.29
+idna==3.4
+implicit==0.6.1
+importlib-metadata==5.0.0
+Jinja2==3.1.2
+joblib==1.2.0
+jsonschema==4.16.0
+kiwisolver==1.4.4
+MarkupSafe==2.1.1
+matplotlib==3.6.0
+mlxtend==0.21.0
+mypy-extensions==0.4.3
+numpy==1.23.4
+packaging==21.3
+pandas==1.5.0
+pathspec==0.10.1
+Pillow==9.2.0
+platformdirs==2.5.2
+plotly==5.9.0
+protobuf==3.20.3
+pyarrow==9.0.0
+pydeck==0.8.0b4
+Pygments==2.13.0
+Pympler==1.0.1
+pyparsing==3.0.9
+pyrsistent==0.18.1
+python-dateutil==2.8.2
+pytz==2022.5
+pytz-deprecation-shim==0.1.0.post0
+requests==2.28.1
+rich==12.6.0
+scikit-learn==1.1.2
+scipy==1.9.2
+seaborn==0.12.1
+semver==2.13.0
+six==1.16.0
+sklearn==0.0
+smmap==5.0.0
+streamlit==1.13.0
+tenacity==8.1.0
+threadpoolctl==3.1.0
+toml==0.10.2
+tomli==2.0.1
+toolz==0.12.0
+tornado==6.2
+tqdm==4.64.1
+typing_extensions==4.4.0
+tzdata==2022.5
+tzlocal==4.2
+urllib3==1.26.12
+validators==0.20.0
+watchdog==2.1.9
+zipp==3.9.0

utils.py ADDED

@@ -0,0 +1,45 @@

+import streamlit as st
+import pandas as pd
+@st.cache
+def load_and_preprocess_data():
+    df = pd.read_csv(
+        "Data/OnlineRetail.csv",
+        encoding="latin-1",
+    )
+    # Remove rows with NaN values
+    df = df.dropna()
+    # Use only positive quantities. This is not a robust approach,
+    # but it is good enough to keep things simple.
+    df = df[df["Quantity"] > 0]
+    # Parse the date column
+    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"]).dt.floor("d")
+    # Change customer id to int
+    df["CustomerID"] = df["CustomerID"].astype(int)
+    # Add price column
+    df["Price"] = df["Quantity"] * df["UnitPrice"]
+    # Get unique entries in the dataset of users and products
+    users = df["CustomerID"].unique()
+    products = df["StockCode"].unique()
+    # Create a categorical type for users and products. Use ordered categories
+    # to ensure reproducibility
+    user_cat = pd.CategoricalDtype(categories=sorted(users), ordered=True)
+    product_cat = pd.CategoricalDtype(categories=sorted(products), ordered=True)
+    # Transform and get the indexes of the columns
+    user_idx = df["CustomerID"].astype(user_cat).cat.codes
+    product_idx = df["StockCode"].astype(product_cat).cat.codes
+    # Add the categorical index to the starting dataframe
+    df["CustomerIndex"] = user_idx
+    df["ProductIndex"] = product_idx
+    return df, user_idx, product_idx
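
`load_and_preprocess_data` maps customer IDs and stock codes to contiguous integer indices by sorting the unique values and taking the categorical codes. The same idea can be sketched in plain Python as follows. `encode_ids` is a hypothetical helper and the IDs are made-up sample values; the commit itself uses `pd.CategoricalDtype` with `.cat.codes`.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def encode_ids(values: Iterable[Hashable]) -> Tuple[List[int], Dict[Hashable, int]]:
    """Map each value to its position among the sorted unique values."""
    values = list(values)
    # Sorting the categories makes the value-to-code mapping deterministic.
    lookup = {value: code for code, value in enumerate(sorted(set(values)))}
    return [lookup[value] for value in values], lookup


# Made-up customer IDs, one per purchase row.
codes, lookup = encode_ids([17850, 13047, 17850, 12583])
print(codes)   # [2, 1, 2, 0]
print(lookup)  # {12583: 0, 13047: 1, 17850: 2}
```

Because the codes run from 0 to the number of unique values minus one, they can be used directly as row and column indices of the user-product matrix.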