Spaces: NYU-DS-4-Everyone/Final

ziyaforbes committed on Apr 29

Commit dc607dd • 1 Parent(s): 889f021

Upload 10 files


Files changed (10)

Introduction.py +98 -0
README.md +1 -12
amazon-dash.png +0 -0
ifood-data.csv +0 -0
pages/01 📊 Data Visualization.py +79 -0
pages/02 🤖 Model Prediction.py +54 -0
pages/03 🧑‍💻 Explainable AI.py +9 -0
pages/04 🦦 MLflow.py +245 -0
pages/05 🧑‍💻 Insights.py +23 -0
requirements.txt +8 -0

Introduction.py ADDED

	@@ -0,0 +1,98 @@

+import streamlit as st
+import pandas as pd
+from PIL import Image
+url = "https://upload.wikimedia.org/wikipedia/commons/6/6a/DoorDash_Logo.svg"
+st.image(url,  output_format="PNG", width=300)
+st.title("Doordash CRM Data Analysis")
+st.markdown("##### Context")
+st.markdown("Doordash is one of the leading food delivery apps in the United States, present in over seven thousand cities.")
+st.markdown("Keeping a high customer engagement is key for growing and consolidating the company’s position as the market leader.")
+st.markdown("To expand their product offering, the company is currently looking to launch a physical button that customers can press to automatically place an order for their favorite meal. Doordash is looking to maximize marketing efforts for this new product.")
+st.image('amazon-dash.png', caption="DoorDash's new Insta-Order Button", width = 300)
+st.markdown("##### Objectives")
+st.markdown("The objective of the team is to build a predictive model that will produce the highest profit for the next direct marketing campaign, scheduled for next month. The new campaign, the sixth, aims at selling a new gadget to the customer database. The team is set on developing a model that predicts customer response to various marketing tactics, which will then be applied to the rest of the customer base.")
+st.markdown("Hopefully the model will allow the company to cherry pick the customers that are most likely to purchase the offer while leaving out the non-respondents, making the sixth campaign highly profitable. Moreover, other than maximizing the profit of this new campaign while reducing expenses, the CMO is interested in understanding the characteristic features of those customers who are responsive to purchasing the gadget.")
+st.markdown("### Key Goals:")
+st.markdown("1. Propose and describe a customer segmentation based on customers' behaviors.")
+st.markdown("2. Create a predictive model which allows the company to maximize profits while reducing the expenses of the sixth marketing campaign.")
+st.markdown("3. By examining which past campaigns drew the most responses, the team can implement the most successful strategies into the sixth campaign to increase customer retention.")
+st.markdown("##### Data Source")
+st.markdown("The data set contains socio-demographic and firmographic features from about 2,240 customers who were contacted. Additionally, it contains a flag for those customers who responded to the campaign by purchasing the product.")
+df = pd.read_csv("ifood-data.csv")
+num = st.number_input('No. of Rows', 5, 10)
+head = st.radio('View from top (head) or bottom (tail)', ('Head', 'Tail'))
+if head == 'Head':
+  st.dataframe(df.head(num))
+else:
+  st.dataframe(df.tail(num))
+st.text('(Rows,Columns)')
+st.write(df.shape)
+st.markdown("### Fields")
+st.markdown("- AcceptedCmp1 - 1 if customer accepted the offer in the 1st campaign, 0 otherwise")
+st.markdown("- AcceptedCmp2 - 1 if customer accepted the offer in the 2nd campaign, 0 otherwise")
+st.markdown("- AcceptedCmp3 - 1 if customer accepted the offer in the 3rd campaign, 0 otherwise")
+st.markdown("- AcceptedCmp4 - 1 if customer accepted the offer in the 4th campaign, 0 otherwise")
+st.markdown("- AcceptedCmp5 - 1 if customer accepted the offer in the 5th campaign, 0 otherwise")
+st.markdown("- Response (target) - 1 if customer accepted the offer in the last campaign, 0 otherwise")
+st.markdown("- Complain - 1 if customer complained in the last 2 years")
+st.markdown("- Dt_Customer - date of customer’s enrolment with the company")
+st.markdown("- Education - customer’s level of education")
+st.markdown("- Marital_Status - customer’s marital status")
+st.markdown("- Kidhome - number of small children in customer’s household")
+st.markdown("- Teenhome - number of teenagers in customer’s household")
+st.markdown("- Income - customer’s yearly household income")
+st.markdown("- MntFishProducts - amount spent on fish products in the last 2 years")
+st.markdown("- MntMeatProducts - amount spent on meat products in the last 2 years")
+st.markdown("- MntFruits - amount spent on fruit products in the last 2 years")
+st.markdown("- MntSweetProducts - amount spent on sweet products in the last 2 years")
+st.markdown("- MntWines - amount spent on wine products in the last 2 years")
+st.markdown("- MntGoldProds - amount spent on gold products in the last 2 years")
+st.markdown("- NumDealsPurchases - number of purchases made with discount")
+st.markdown("- NumCatalogPurchases - number of purchases made using catalogue")
+st.markdown("- NumStorePurchases - number of purchases made directly in stores")
+st.markdown("- NumWebPurchases - number of purchases made through company’s web site")
+st.markdown("- NumWebVisitsMonth - number of visits to company’s web site in the last month")
+st.markdown("- Recency - number of days since the last purchase")
+st.markdown("### Description of Data")
+st.dataframe(df.describe())
+st.markdown("### Missing Values")
+st.markdown("Null or NaN values.")
+dfnull = df.isnull().sum()/len(df)*100
+totalmiss = dfnull.sum().round(2)
+totalmiss = round(totalmiss/len(df.columns),2)
+st.write("Percentage of total missing values: ",totalmiss)
+st.write(dfnull)
+if totalmiss <= 30:
+    st.success("We have less than 30 percent missing values, which is good. With so few nulls, the missing values are unlikely to significantly affect our conclusions or bias the results.")
+else:
+    st.warning("Poor data quality: more than 30 percent of values are missing.")
+    st.markdown(" > In theory, 25 to 30 percent is the maximum share of missing values allowed, but there is no hard and fast rule for this threshold. It can vary from problem to problem.")
+st.markdown("### Completeness")
+st.markdown("The ratio of non-missing values to total values in the dataset, indicating how comprehensive the data is.")
+st.write("Total data length:", len(df))
+nonmissing = (df.notnull().sum().round(2))
+completeness= round(sum(nonmissing)/df.size,2)
+st.write("Completeness ratio:",completeness)
+st.write(nonmissing)
+if completeness >= 0.85:  # threshold aligned with the 0.85 stated in both messages
+    st.success("We have a completeness ratio of at least 0.85, which is good. It shows that the vast majority of the data is available for us to use and analyze.")
+else:
+    st.warning("Poor data quality due to low completeness ratio (less than 0.85).")
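
Review note on the missing-value and completeness figures in Introduction.py: both can be checked on a small frame. What follows is only a minimal sketch. The helper name `data_quality` and the toy data are hypothetical and not part of the commit.

```python
import pandas as pd


def data_quality(df: pd.DataFrame) -> dict:
    """Return the overall missing percentage and completeness ratio of df."""
    # Percentage of missing cells per column, as computed in the page.
    pct_missing_by_col = df.isnull().mean() * 100
    # Averaging the per-column percentages equals the overall cell-level
    # percentage, because every column has the same number of rows.
    total_missing_pct = round(pct_missing_by_col.mean(), 2)
    # Completeness: non-missing cells divided by all cells.
    completeness = round(df.notnull().sum().sum() / df.size, 2)
    return {"missing_pct": total_missing_pct, "completeness": completeness}


# Toy data (hypothetical): 8 cells, 2 of them missing.
toy = pd.DataFrame({"Income": [10.0, None, 30.0, 40.0],
                    "Recency": [1.0, 2.0, None, 4.0]})
quality = data_quality(toy)
print(quality)
```

The two numbers always sum to 100 percent (before rounding), so a single threshold on either one would carry the same information.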

README.md CHANGED

@@ -1,12 +1 @@
----
-title: Doordash CRM Data Analysis
-emoji: 🏢
-colorFrom: green
-colorTo: purple
-sdk: streamlit
-sdk_version: 1.33.0
-app_file: app.py
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference


+# doordash-data-analysis

amazon-dash.png ADDED

ifood-data.csv ADDED

The diff for this file is too large to render. See raw diff

pages/01 📊 Data Visualization.py ADDED

	@@ -0,0 +1,79 @@

+import pandas as pd
+import streamlit as st
+import numpy as np
+import seaborn as sns
+import matplotlib.pyplot as plt
+from PIL import Image
+url = "https://upload.wikimedia.org/wikipedia/commons/6/6a/DoorDash_Logo.svg"
+st.image(url, output_format="PNG", width=300)
+st.title("Data Visualization")
+df_unclean = pd.read_csv("ifood-data.csv")
+st.dataframe(df_unclean)
+st.header("Cleaning the Data")
+st.metric(value = df_unclean.shape[0], label = "Rows")
+st.write('After dropping NA and cleaning up outliers')
+df = df_unclean.dropna()
+df = df[df["Year_Birth"] > 1940].copy()  # copy so later column assignments do not warn
+st.metric(value = df.shape[0], label = "Rows")
+st.metric(value = df_unclean.shape[0] - df.shape[0], label = "Difference")
+# Education Levels
+st.bar_chart(df.groupby("Education").size(), color = "#FF3008")
+st.bar_chart(df.groupby("Year_Birth").size(), color = "#FF3008")
+accepted_cmp_dataset = df[["AcceptedCmp1","AcceptedCmp2","AcceptedCmp3","AcceptedCmp4","AcceptedCmp5"]]
+counts = accepted_cmp_dataset.sum()
+counts = counts.reset_index()
+counts.columns = ['Campaign', 'Frequency']
+st.bar_chart(counts.set_index('Campaign'), color = "#FF3008")
+df['Education'] = df['Education'].astype('category').cat.codes
+df['Marital_Status'] = df['Marital_Status'].astype('category').cat.codes
+df = df.drop(["Dt_Customer"], axis = 1)
+df = df.drop(["ID"], axis = 1)
+heatmap = plt.figure(figsize=(18, 10))
+sns.heatmap(df.corr().round(2), annot=True, cmap="Reds")
+st.pyplot(heatmap)
+# limit to just the most correlated vars
+recency_response = plt.figure()
+sns.boxplot(x=df['Response'], y=df['Recency'], color = "#ff3008")
+plt.xlabel('Response')
+plt.ylabel('Recency')
+plt.title('Box Plot of Response vs Recency')
+st.pyplot(recency_response)
+response_income = plt.figure()
+sns.barplot(x=df['Response'], y=df['Income'], color = "#ff3008")
+plt.xlabel('Response')
+plt.ylabel('Income')
+plt.title('Bar Plot of Response vs Income')
+st.pyplot(response_income)
+response_teenhome = plt.figure()
+sns.boxplot(x=df['Response'], y=df['Teenhome'], color = "#ff3008")
+plt.xlabel('Response')
+plt.ylabel('Teenhome')
+plt.title('Box Plot of Response vs Teenhome')
+st.pyplot(response_teenhome)
+response_kidhome = plt.figure()
+sns.boxplot(x=df['Response'], y=df['Kidhome'], color = "#ff3008")
+plt.xlabel('Response')
+plt.ylabel('Kidhome')
+plt.title('Box Plot of Response vs Kidhome')
+st.pyplot(response_kidhome)
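
Review note on the cleaning steps in this page (drop missing rows, filter by birth year, encode categorical columns): they can be gathered into one function. This is a sketch under assumptions. `clean_customers` and the toy frame are hypothetical, and only the column names come from the commit.

```python
import pandas as pd


def clean_customers(raw: pd.DataFrame, min_birth_year: int = 1940) -> pd.DataFrame:
    # Drop rows with any missing value, then keep birth years after the cutoff.
    df = raw.dropna()
    # .copy() makes the filtered frame independent, so the column
    # assignments below do not trigger pandas' SettingWithCopyWarning.
    df = df[df["Year_Birth"] > min_birth_year].copy()
    # Encode categorical columns as integer codes (categories sorted alphabetically).
    for col in ("Education", "Marital_Status"):
        df[col] = df[col].astype("category").cat.codes
    return df


# Toy data (hypothetical): one implausible birth year, one missing value.
toy = pd.DataFrame({
    "Year_Birth": [1900, 1975, 1988, 1990],
    "Education": ["PhD", "Master", "PhD", None],
    "Marital_Status": ["Single", "Married", "Single", "Married"],
    "Income": [50000.0, 62000.0, 48000.0, 51000.0],
})
cleaned = clean_customers(toy)
print(cleaned)
```

Sharing one such function across pages would keep the visualization, model, and insights pages working on identical rows.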

pages/02 🤖 Model Prediction.py ADDED

	@@ -0,0 +1,54 @@

+import streamlit as st
+import pandas as pd
+from PIL import Image
+import sklearn.metrics as sk_metrics
+from sklearn.model_selection import train_test_split
+from sklearn.neighbors import KNeighborsClassifier
+from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
+from sklearn.model_selection import train_test_split # Import train_test_split function
+from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
+import matplotlib.pyplot as plt
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import classification_report
+from codecarbon import EmissionsTracker
+url = "https://upload.wikimedia.org/wikipedia/commons/6/6a/DoorDash_Logo.svg"
+st.image(url,  output_format="PNG", width=300)
+st.title("Model Prediction")
+df_unclean = pd.read_csv("ifood-data.csv")
+df = df_unclean.dropna()
+df = df[df["Year_Birth"] > 1940].copy()  # copy so later column assignments do not warn
+df['Education'] = df['Education'].astype('category').cat.codes
+df['Marital_Status'] = df['Marital_Status'].astype('category').cat.codes
+df = df.drop(["Dt_Customer"], axis = 1)
+df = df.drop(["ID"], axis = 1)
+X = df.drop(labels = ['Response'], axis = 1)
+y = df["Response"]
+X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42)
+st.multiselect("Select Parameters", df.columns)
+tracker = EmissionsTracker()
+tracker.start()
+logmodel = LogisticRegression()
+logmodel.fit(X_train,y_train)
+log_results = logmodel.predict(X_test)
+clf = DecisionTreeClassifier()
+clf = clf.fit(X_train,y_train)
+tree_results = clf.predict(X_test)
+knn = KNeighborsClassifier()
+knn.fit(X_train, y_train)
+knn_results = knn.predict(X_test)
+emissions = tracker.stop()
+print(f"Estimated emissions for training the model: {emissions:.4f} kg of CO2")
+st.metric(label = "Log Accuracy", value = round(metrics.accuracy_score(y_test, log_results)*100, 2))
+st.metric(label = "Tree Accuracy", value = round(metrics.accuracy_score(y_test, tree_results)*100, 2))
+st.metric(label = "kNN Accuracy", value = round(metrics.accuracy_score(y_test, knn_results)*100, 2))
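
Review note on the three accuracy metrics above: accuracy is easier to read next to a majority-class baseline, especially if the `Response` classes turn out to be imbalanced. The sketch below uses synthetic data from `make_classification` as a stand-in, so the 85/15 class split is an assumption and not a property of the commit's dataset. The `max_iter=1000` setting is also an addition.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer data (assumption): about 85% negatives.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.85],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),  # always predicts the majority class
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

A model is only adding value where its accuracy clearly exceeds the baseline row.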

pages/03 🧑‍💻 Explainable AI.py ADDED

	@@ -0,0 +1,9 @@

+import streamlit as st
+import pandas as pd
+from PIL import Image
+url = "https://upload.wikimedia.org/wikipedia/commons/6/6a/DoorDash_Logo.svg"
+st.image(url,  output_format="PNG", width=300)
+st.title("Explainable AI")

pages/04 🦦 MLflow.py ADDED

	@@ -0,0 +1,245 @@

+import pandas as pd
+import seaborn as sns
+# Commented out IPython magic to ensure Python compatibility.
+from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
+from sklearn.model_selection import train_test_split # Import train_test_split function
+from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
+import matplotlib.pyplot as plt
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import classification_report
+# %matplotlib inline
+df = pd.read_csv("ifood-data.csv")
+df.head()
+df['Education'] = df['Education'].astype('category').cat.codes
+df['Marital_Status'] = df['Marital_Status'].astype('category').cat.codes
+df = df.drop(["Dt_Customer"], axis = 1)
+df = df.drop(["ID"], axis = 1)
+df = df.dropna()
+df.head()
+plt.figure(figsize=(16, 10))
+sns.heatmap(df.corr(), annot=True)
+plt.show()
+X = df.drop(labels = ['Response'], axis = 1)
+y = df["Response"]
+X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42)
+logmodel = LogisticRegression()
+logmodel.fit(X_train,y_train)
+prediction = logmodel.predict(X_test)
+print(classification_report(y_test,prediction))
+# Create Decision Tree classifier object
+clf = DecisionTreeClassifier()
+# Train Decision Tree classifier
+clf = clf.fit(X_train,y_train)
+#Predict the response for test dataset
+y_pred = clf.predict(X_test)
+print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
+feature_cols = X.columns
+feature_cols
+import graphviz  # needed before graphviz.Source is called below
+from sklearn.tree import export_graphviz
+dot_data = export_graphviz(clf, out_file=None,
+                         feature_names=feature_cols,
+                         class_names=['0','1'],
+                         filled=True, rounded=True,
+                         special_characters=True)
+graph = graphviz.Source(dot_data)
+graph
+# Create Decision Tree classifier object
+clf = DecisionTreeClassifier(max_depth=3)
+# Train Decision Tree classifier
+clf = clf.fit(X_train,y_train)
+#Predict the response for test dataset
+y_pred = clf.predict(X_test)
+# Model Accuracy, how often is the classifier correct?
+print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
+import graphviz
+from sklearn.tree import export_graphviz
+feature_names = X.columns
+dot_data = export_graphviz(clf, out_file=None,
+                         feature_names=feature_cols,
+                         class_names=['0','1'],
+                         filled=True, rounded=True,
+                         special_characters=True)
+graph = graphviz.Source(dot_data)
+graph
+from shapash.explainer.smart_explainer import SmartExplainer
+xpl = SmartExplainer(clf)
+y_pred = pd.Series(y_pred)
+X_test = X_test.reset_index(drop=True)
+xpl.compile(x=X_test, y_pred=y_pred)
+xpl.plot.features_importance()
+from sklearn.neighbors import KNeighborsClassifier
+knn = KNeighborsClassifier()
+knn.fit(X_train, y_train)
+results = knn.predict(X_test)
+print("Accuracy:",metrics.accuracy_score(y_test, results))
+# Import necessary libraries
+import numpy as np # a Python library used for working with arrays
+import pandas as pd # it allows us to analyze big data and make conclusions based on statistical theories
+from pycaret.datasets import get_data #  allows you to easily access and load built-in datasets for machine learning experimentation
+from pycaret.classification import * #  imports all the classification-related functions
+from sklearn.model_selection import train_test_split # This function is commonly used to split a dataset into training and testing subsets.
+import mlflow # MLflow is an open-source platform for managing the machine learning lifecycle
+from sklearn import metrics as sk_metrics # imports scikit-learn's metrics module, which provides evaluation metrics and scoring functions
+# Split data into training and testing sets
+loan_train, loan_test = train_test_split(df, test_size=0.2, random_state=42)
+# Initialize PyCaret setup with the training set
+cls1 = setup(data = loan_train, target = 'Response')
+# Compare all models and select top 3
+top3 = compare_models(include=['lr', 'knn', 'dt'], n_select=3)
+# Log each model into mlflow separately
+for i, model in enumerate(top3, 1):
+    with mlflow.start_run(run_name = f"Model: {model}"):
+        model_name = "model_" + str(i)
+        # Log model
+        mlflow.sklearn.log_model(model, model_name)
+        # Log parameters
+        params = model.get_params()
+        for key, value in params.items():
+            mlflow.log_param(key, value)
+        # Predict on the testing set and log metrics
+        y_pred = predict_model(model, data=loan_test.drop('Response', axis=1))
+        y_test = loan_test['Response']
+        # Calculate metrics
+        accuracy = sk_metrics.accuracy_score(y_test, y_pred["prediction_label"])
+        precision = sk_metrics.precision_score(y_test, y_pred["prediction_label"], average='weighted')
+        recall = sk_metrics.recall_score(y_test, y_pred["prediction_label"], average='weighted')
+        f1 = sk_metrics.f1_score(y_test, y_pred["prediction_label"], average='weighted')
+        # Log metrics
+        mlflow.log_metric("Accuracy", accuracy)
+        mlflow.log_metric("Precision", precision)
+        mlflow.log_metric("Recall", recall)
+        mlflow.log_metric("F1 Score", f1)
+        mlflow.end_run()
+# Split data into training and testing sets
+loan_train, loan_test = train_test_split(df, test_size=0.2, random_state=42)
+# Define the list of max_depth values to try
+max_depth_values = [2, 4, 6, 8, 10]
+# Loop over each max_depth value
+for depth in max_depth_values:
+    with mlflow.start_run(run_name=f"Decision Tree (Max Depth: {depth})"):
+        # Initialize and train the decision tree model
+        model = DecisionTreeClassifier(max_depth=depth)
+        model.fit(loan_train.drop('Response', axis=1), loan_train['Response'])
+        # Log model parameters
+        mlflow.log_param("max_depth", depth)
+        # Predict on the testing set and log metrics
+        y_pred = model.predict(loan_test.drop('Response', axis=1))
+        y_test = loan_test['Response']
+        # Calculate metrics
+        accuracy = sk_metrics.accuracy_score(y_test, y_pred)
+        precision = sk_metrics.precision_score(y_test, y_pred, average='weighted')
+        recall = sk_metrics.recall_score(y_test, y_pred, average='weighted')
+        f1 = sk_metrics.f1_score(y_test, y_pred, average='weighted')
+        # Log metrics
+        mlflow.log_metric("Accuracy", accuracy)
+        mlflow.log_metric("Precision", precision)
+        mlflow.log_metric("Recall", recall)
+        mlflow.log_metric("F1 Score", f1)
+        # Log the trained model
+        mlflow.sklearn.log_model(model, "decision_tree_model")
+        mlflow.end_run()
+from sklearn.neighbors import KNeighborsClassifier
+# Split data into training and testing sets
+loan_train, loan_test = train_test_split(df, test_size=0.2, random_state=42)
+# Define the list of n_neighbors values to try
+n_neighbors_values = [3, 5, 7, 9, 11]
+# Loop over each n_neighbors value
+for n_neighbors in n_neighbors_values:
+    with mlflow.start_run(run_name=f"KNN (n_neighbors: {n_neighbors})"):
+        # Initialize and train the KNN model
+        model = KNeighborsClassifier(n_neighbors=n_neighbors)
+        model.fit(loan_train.drop('Response', axis=1), loan_train['Response'])
+        # Log model parameters
+        mlflow.log_param("n_neighbors", n_neighbors)
+        # Predict on the testing set and log metrics
+        y_pred = model.predict(loan_test.drop('Response', axis=1))
+        y_test = loan_test['Response']
+        # Calculate metrics
+        accuracy = sk_metrics.accuracy_score(y_test, y_pred)
+        precision = sk_metrics.precision_score(y_test, y_pred, average='weighted')
+        recall = sk_metrics.recall_score(y_test, y_pred, average='weighted')
+        f1 = sk_metrics.f1_score(y_test, y_pred, average='weighted')
+        # Log metrics
+        mlflow.log_metric("Accuracy", accuracy)
+        mlflow.log_metric("Precision", precision)
+        mlflow.log_metric("Recall", recall)
+        mlflow.log_metric("F1 Score", f1)
+        # Log the trained model
+        mlflow.sklearn.log_model(model, "knn_model")
+        mlflow.end_run()
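
Review note on the `average='weighted'` argument used for the logged precision, recall, and F1: it computes the metric per class and then averages with each class weighted by its number of true samples. The pure-Python sketch below illustrates this for precision. The function `weighted_precision` and the toy labels are hypothetical, and treating a never-predicted class as precision 0 is an assumed convention.

```python
from collections import Counter


def weighted_precision(y_true, y_pred):
    """Support-weighted mean of per-class precision."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        # Assumed convention: precision is 0 when a class is never predicted.
        precision = correct / predicted if predicted else 0.0
        score += precision * count / total
    return score


# Toy labels (hypothetical): class 0 precision is 3/4, class 1 precision is 1/2.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
result = weighted_precision(y_true, y_pred)
print(result)
```

Under the same weighting, recall reduces to plain accuracy, so the logged `Recall` and `Accuracy` values will coincide in every run.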

pages/05 🧑‍💻 Insights.py ADDED

	@@ -0,0 +1,23 @@

+import streamlit as st
+import pandas as pd
+from PIL import Image
+url = "https://upload.wikimedia.org/wikipedia/commons/6/6a/DoorDash_Logo.svg"
+st.image(url,  output_format="PNG", width=300)
+df_unclean = pd.read_csv("ifood-data.csv")
+df = df_unclean.dropna()
+df = df[df["Year_Birth"] > 1940]
+st.title("Insights")
+st.header("Our Ideal Customer")
+response_by_age = df.groupby('Year_Birth')['Response'].mean()
+# Finding the birth year with the highest proportion of Response equal to 1
+ideal_age = response_by_age.idxmax()
+highest_response_proportion = response_by_age.max()
+st.metric(value = ideal_age, label = "Ideal Birth Year")
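
Review note on the "ideal birth year" metric: a birth year with very few customers can reach the highest mean response by chance. One possible guard is to require a minimum group size before taking the maximum. This is a sketch only. `best_birth_year`, the `min_customers` cutoff of 3, and the toy data are hypothetical.

```python
import pandas as pd


def best_birth_year(df: pd.DataFrame, min_customers: int = 3):
    # Response rate and customer count per birth year.
    stats = df.groupby("Year_Birth")["Response"].agg(["mean", "count"])
    # Ignore years with too few customers to give a stable rate.
    stats = stats[stats["count"] >= min_customers]
    year = stats["mean"].idxmax()
    return year, stats.loc[year, "mean"]


# Toy data (hypothetical): 1970 has a 100% rate but only one customer.
toy = pd.DataFrame({
    "Year_Birth": [1970, 1980, 1980, 1980, 1980, 1990, 1990, 1990],
    "Response":   [1,    1,    0,    1,    0,    0,    0,    1],
})
year, rate = best_birth_year(toy)
print(year, rate)
```

Without the size filter, the single 1970 customer would be reported as the ideal birth year.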

requirements.txt ADDED

	@@ -0,0 +1,8 @@

+codecarbon
+mlflow
+numpy
+pandas
+scikit-learn
+tensorflow
+matplotlib
+streamlit