Spaces:
Running
Running
andrewicus
committed on
Commit
•
a62d129
1
Parent(s):
a8ac252
Draft of all Pages
Browse files- Introduction.py +98 -0
- ifood-data.csv +0 -0
- pages/01 📊 Data Visualization.py +4 -1
- pages/02 🤖 Model Prediction.py +47 -20
- requirements.txt +1 -0
Introduction.py
ADDED
@@ -0,0 +1,98 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import streamlit as st
|
2 |
+
import pandas as pd
|
3 |
+
from PIL import Image
|
4 |
+
|
5 |
+
url = "https://upload.wikimedia.org/wikipedia/commons/6/6a/DoorDash_Logo.svg"
|
6 |
+
st.image(url, output_format="PNG", width=300)
|
7 |
+
|
8 |
+
st.title("Doordash CRM Data Analysis")
|
9 |
+
|
10 |
+
st.markdown("##### Context")
|
11 |
+
st.markdown("Doordash is one of the leading food delivery apps in the United States, present in over seven thousand cities.")
|
12 |
+
st.markdown("Keeping a high customer engagement is key for growing and consolidating the companyβs position as the market leader.")
|
13 |
+
st.markdown("To expand their product offering, the company is currently looking to launch a physical button that customers can press to automatically place an order for their favorite meal. Doordash is looking to maximize marketing efforts for this new product.")
|
14 |
+
|
15 |
+
st.image('amazon-dash.png', caption="DoorDash's new Insta-Order Button", width = 300)
|
16 |
+
|
17 |
+
st.markdown("##### Objectives")
|
18 |
+
st.markdown("The objective of the team is to build a predictive model that will produce the highest profit for the next direct marketing campaign, scheduled for next month. The new campaign, sixth, aims at selling a new gadget to the Customer Database. The team is set on developing a model that predicts customer response to various marketing tactics, which will then be applied to the rest of the customer base.")
|
19 |
+
st.markdown("Hopefully the model will allow the company to cherry pick the customers that are most likely to purchase the offer while leaving out the non-respondents, making the sixth campaign highly profitable. Moreover, other than maximizing the profit of this new campaign while reducing expenses, the CMO is interested in understanding the characteristic features of those customers who are responsive to purchasing the gadget.")
|
20 |
+
|
21 |
+
st.markdown("### Key Goals:")
|
22 |
+
st.markdown("1. Propose and describe a customer segmentation based on customers behaviors.")
|
23 |
+
st.markdown("2. Create a predictive model which allows the company to maximize the profits while reducing expenses of the sixth marketing campaign.")
|
24 |
+
st.markdown("3. By examining which past campaigns were the most responsive, the team can implement the most successful strategies into the sixth campaign to increase customer retention.")
|
25 |
+
|
26 |
+
st.markdown("##### Data Source")
|
27 |
+
st.markdown("The data set contains socio-demographic and firmographic features from about 2.240 customers who were contacted. Additionally, it contains a flag for those customers who responded to the campaign by purchasing the product.")
|
28 |
+
|
29 |
+
df = pd.read_csv("ifood-data.csv")
|
30 |
+
|
31 |
+
num = st.number_input('No. of Rows', 5, 10)
|
32 |
+
|
33 |
+
head = st.radio('View from top (head) or bottom (tail)', ('Head', 'Tail'))
|
34 |
+
if head == 'Head':
|
35 |
+
st.dataframe(df.head(num))
|
36 |
+
else:
|
37 |
+
st.dataframe(df.tail(num))
|
38 |
+
|
39 |
+
st.text('(Rows,Columns)')
|
40 |
+
st.write(df.shape)
|
41 |
+
|
42 |
+
st.markdown("### Fields")
|
43 |
+
st.markdown("- AcceptedCmp1 - 1 if customer accepted the offer in the 1st campaign, 0 otherwise")
|
44 |
+
st.markdown("- AcceptedCmp2 - 1 if customer accepted the offer in the 2nd campaign, 0 otherwise")
|
45 |
+
st.markdown("- AcceptedCmp3 - 1 if customer accepted the offer in the 3rd campaign, 0 otherwise")
|
46 |
+
st.markdown("- AcceptedCmp4 - 1 if customer accepted the offer in the 4th campaign, 0 otherwise")
|
47 |
+
st.markdown("- AcceptedCmp5 - 1 if customer accepted the offer in the 5th campaign, 0 otherwise")
|
48 |
+
st.markdown("- Response (target) - 1 if customer accepted the offer in the last campaign, 0 otherwise")
|
49 |
+
st.markdown("- Complain - 1 if customer complained in the last 2 years")
|
50 |
+
st.markdown("- DtCustomer - date of customerβs enrolment with the company")
|
51 |
+
st.markdown("- Education - customerβs level of education")
|
52 |
+
st.markdown("- Marital - customerβs marital status")
|
53 |
+
st.markdown("- Kidhome - number of small children in customerβs household")
|
54 |
+
st.markdown("- Teenhome - number of teenagers in customerβs household")
|
55 |
+
st.markdown("- Income - customerβs yearly household income")
|
56 |
+
st.markdown("- MntFishProducts - amount spent on fish products in the last 2 years")
|
57 |
+
st.markdown("- MntMeatProducts - amount spent on meat products in the last 2 years")
|
58 |
+
st.markdown("- MntFruits - amount spent on fruits products in the last 2 years")
|
59 |
+
st.markdown("- MntSweetProducts - amount spent on sweet products in the last 2 years")
|
60 |
+
st.markdown("- MntWines - amount spent on wine products in the last 2 years")
|
61 |
+
st.markdown("- MntGoldProds - amount spent on gold products in the last 2 years")
|
62 |
+
st.markdown("- NumDealsPurchases - number of purchases made with discount")
|
63 |
+
st.markdown("- NumCatalogPurchases - number of purchases made using catalogue")
|
64 |
+
st.markdown("- NumStorePurchases - number of purchases made directly in stores")
|
65 |
+
st.markdown("- NumWebPurchases - number of purchases made through companyβs web site")
|
66 |
+
st.markdown("- NumWebVisitsMonth - number of visits to companyβs web site in the last month")
|
67 |
+
st.markdown("- Recency - number of days since the last purchase")
|
68 |
+
|
69 |
+
|
70 |
+
st.markdown("### Description of Data")
|
71 |
+
st.dataframe(df.describe())
|
72 |
+
|
73 |
+
st.markdown("### Missing Values")
|
74 |
+
st.markdown("Null or NaN values.")
|
75 |
+
|
76 |
+
dfnull = df.isnull().sum()/len(df)*100
|
77 |
+
totalmiss = dfnull.sum().round(2)
|
78 |
+
totalmiss = round(totalmiss/len(df.columns),2)
|
79 |
+
st.write("Percentage of total missing values: ",totalmiss)
|
80 |
+
st.write(dfnull)
|
81 |
+
if totalmiss <= 30:
|
82 |
+
st.success("We have less then 30 percent of missing values, which is good. This provides us with more accurate data as the null values will not significantly affect the outcomes of our conclusions. And no bias will steer towards misleading results. ")
|
83 |
+
else:
|
84 |
+
st.warning("Poor data quality due to greater than 30 percent of missing value.")
|
85 |
+
st.markdown(" > Theoretically, 25 to 30 percent is the maximum missing values are allowed, there's no hard and fast rule to decide this threshold. It can vary from problem to problem.")
|
86 |
+
|
87 |
+
st.markdown("### Completeness")
|
88 |
+
st.markdown(" The ratio of non-missing values to total records in dataset and how comprehensive the data is.")
|
89 |
+
|
90 |
+
st.write("Total data length:", len(df))
|
91 |
+
nonmissing = (df.notnull().sum().round(2))
|
92 |
+
completeness= round(sum(nonmissing)/df.size,2)
|
93 |
+
st.write("Completeness ratio:",completeness)
|
94 |
+
st.write(nonmissing)
|
95 |
+
if completeness >= 0.80:
|
96 |
+
st.success("We have completeness ratio greater than 0.85, which is good. It shows that the vast majority of the data is available for us to use and analyze. ")
|
97 |
+
else:
|
98 |
+
st.success("Poor data quality due to low completeness ratio( less than 0.85).")
|
ifood-data.csv
CHANGED
The diff for this file is too large to render.
See raw diff
|
|
pages/01 📊 Data Visualization.py
CHANGED
@@ -26,15 +26,17 @@ st.metric(value = df_unclean.shape[0] - df.shape[0], label = "Difference")
|
|
26 |
|
27 |
# Education Levels
|
28 |
|
|
|
29 |
st.bar_chart(df.groupby("Education").size(), color = "#FF3008")
|
30 |
|
|
|
31 |
st.bar_chart(df.groupby("Year_Birth").size(), color = "#FF3008")
|
32 |
|
|
|
33 |
accepted_cmp_dataset = df[["AcceptedCmp1","AcceptedCmp2","AcceptedCmp3","AcceptedCmp4","AcceptedCmp5"]]
|
34 |
counts = accepted_cmp_dataset.sum()
|
35 |
counts = counts.reset_index()
|
36 |
counts.columns = ['Campaign', 'Frequency']
|
37 |
-
|
38 |
st.bar_chart(counts.set_index('Campaign'), color = "#FF3008")
|
39 |
|
40 |
df['Education'] = df['Education'].astype('category').cat.codes
|
@@ -43,6 +45,7 @@ df['Marital_Status'] = df['Marital_Status'].astype('category').cat.codes
|
|
43 |
df = df.drop(["Dt_Customer"], axis = 1)
|
44 |
df = df.drop(["ID"], axis = 1)
|
45 |
|
|
|
46 |
heatmap = plt.figure(figsize=(18, 10))
|
47 |
sns.heatmap(df.corr().round(2), annot=True, cmap="Reds")
|
48 |
st.pyplot(heatmap)
|
|
|
26 |
|
27 |
# Education Levels
|
28 |
|
29 |
+
st.header("Education Distribution")
|
30 |
st.bar_chart(df.groupby("Education").size(), color = "#FF3008")
|
31 |
|
32 |
+
st.header("Birth Year Distribution")
|
33 |
st.bar_chart(df.groupby("Year_Birth").size(), color = "#FF3008")
|
34 |
|
35 |
+
st.header("Birth Year Distribution")
|
36 |
accepted_cmp_dataset = df[["AcceptedCmp1","AcceptedCmp2","AcceptedCmp3","AcceptedCmp4","AcceptedCmp5"]]
|
37 |
counts = accepted_cmp_dataset.sum()
|
38 |
counts = counts.reset_index()
|
39 |
counts.columns = ['Campaign', 'Frequency']
|
|
|
40 |
st.bar_chart(counts.set_index('Campaign'), color = "#FF3008")
|
41 |
|
42 |
df['Education'] = df['Education'].astype('category').cat.codes
|
|
|
45 |
df = df.drop(["Dt_Customer"], axis = 1)
|
46 |
df = df.drop(["ID"], axis = 1)
|
47 |
|
48 |
+
st.header("Heatmap")
|
49 |
heatmap = plt.figure(figsize=(18, 10))
|
50 |
sns.heatmap(df.corr().round(2), annot=True, cmap="Reds")
|
51 |
st.pyplot(heatmap)
|
pages/02 🤖 Model Prediction.py
CHANGED
@@ -11,6 +11,7 @@ import matplotlib.pyplot as plt
|
|
11 |
from sklearn.linear_model import LogisticRegression
|
12 |
from sklearn.metrics import classification_report
|
13 |
from codecarbon import EmissionsTracker
|
|
|
14 |
|
15 |
url = "https://upload.wikimedia.org/wikipedia/commons/6/6a/DoorDash_Logo.svg"
|
16 |
st.image(url, output_format="PNG", width=300)
|
@@ -24,31 +25,57 @@ df['Education'] = df['Education'].astype('category').cat.codes
|
|
24 |
df['Marital_Status'] = df['Marital_Status'].astype('category').cat.codes
|
25 |
df = df.drop(["Dt_Customer"], axis = 1)
|
26 |
df = df.drop(["ID"], axis = 1)
|
|
|
|
|
27 |
|
28 |
-
|
29 |
-
|
30 |
-
|
31 |
|
32 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
33 |
|
34 |
-
|
35 |
-
|
|
|
|
|
|
|
|
|
|
|
36 |
|
37 |
-
|
38 |
-
|
39 |
-
log_results = logmodel.predict(X_test)
|
40 |
|
41 |
-
clf
|
42 |
-
clf = clf.fit(X_train,y_train)
|
43 |
-
tree_results = clf.predict(X_test)
|
44 |
|
45 |
-
|
46 |
-
|
47 |
-
|
|
|
|
|
|
|
|
|
|
|
48 |
|
49 |
-
|
50 |
-
|
51 |
|
52 |
-
|
53 |
-
|
54 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
11 |
from sklearn.linear_model import LogisticRegression
|
12 |
from sklearn.metrics import classification_report
|
13 |
from codecarbon import EmissionsTracker
|
14 |
+
import time
|
15 |
|
16 |
url = "https://upload.wikimedia.org/wikipedia/commons/6/6a/DoorDash_Logo.svg"
|
17 |
st.image(url, output_format="PNG", width=300)
|
|
|
25 |
df['Marital_Status'] = df['Marital_Status'].astype('category').cat.codes
|
26 |
df = df.drop(["Dt_Customer"], axis = 1)
|
27 |
df = df.drop(["ID"], axis = 1)
|
28 |
+
params = st.multiselect("Select Parameters", df.columns, default = ["Year_Birth"])
|
29 |
+
model = st.selectbox("Select Model", ["Logistic Regression", "K-Nearest Neighbors", "Decision Tree"])
|
30 |
|
31 |
+
if not params:
|
32 |
+
st.warning("Please select at least one parameter.")
|
33 |
+
else:
|
34 |
|
35 |
+
X = df.drop(labels = ['Response'], axis = 1)
|
36 |
+
X = df[params]
|
37 |
+
y = df["Response"]
|
38 |
+
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42)
|
39 |
+
model_start_time = time.time()
|
40 |
+
tracker = EmissionsTracker()
|
41 |
+
tracker.start()
|
42 |
+
if(model == "Logistic Regression"):
|
43 |
+
logmodel = LogisticRegression()
|
44 |
+
logmodel.fit(X_train,y_train)
|
45 |
+
model_accuracy = logmodel.predict(X_test)
|
46 |
+
elif(model == "K-Nearest Neighbors"):
|
47 |
|
48 |
+
knn = KNeighborsClassifier()
|
49 |
+
knn.fit(X_train, y_train)
|
50 |
+
model_accuracy = knn.predict(X_test)
|
51 |
+
else:
|
52 |
+
clf = DecisionTreeClassifier(max_depth=3)
|
53 |
+
clf = clf.fit(X_train,y_train)
|
54 |
+
model_accuracy = clf.predict(X_test)
|
55 |
|
56 |
+
import graphviz
|
57 |
+
from sklearn.tree import export_graphviz
|
|
|
58 |
|
59 |
+
# Assuming `clf` and `X` are defined somewhere in your code
|
|
|
|
|
60 |
|
61 |
+
# Your code for exporting the decision tree graph
|
62 |
+
feature_names = X.columns
|
63 |
+
feature_cols = X.columns
|
64 |
+
dot_data = export_graphviz(clf, out_file=None,
|
65 |
+
feature_names=feature_cols,
|
66 |
+
class_names=['0', '1'],
|
67 |
+
filled=True, rounded=True,
|
68 |
+
special_characters=True)
|
69 |
|
70 |
+
# Display the graph using streamlit_graphviz
|
71 |
+
st.graphviz_chart(dot_data)
|
72 |
|
73 |
+
model_end_time = time.time()
|
74 |
+
model_execution_time = model_end_time - model_start_time
|
75 |
+
|
76 |
+
|
77 |
+
emissions = tracker.stop()
|
78 |
+
print(f"Estimated emissions for training the model: {emissions:.4f} kg of CO2")
|
79 |
+
|
80 |
+
st.metric(label = "Accuracy", value = str(round(metrics.accuracy_score(y_test, model_accuracy)*100, 2)) + "%")
|
81 |
+
st.metric(label = "Execution time:", value = str(model_execution_time) + "s")
|
requirements.txt
CHANGED
@@ -7,3 +7,4 @@ tensorflow
|
|
7 |
matplotlib
|
8 |
streamlit
|
9 |
seaborn
|
|
|
|
7 |
matplotlib
|
8 |
streamlit
|
9 |
seaborn
|
10 |
+
graphviz
|