diff --git "a/P2G7_Allen.ipynb" "b/P2G7_Allen.ipynb" new file mode 100644--- /dev/null +++ "b/P2G7_Allen.ipynb" @@ -0,0 +1,4715 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Graded Challenge 7" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# McDonald's Store Reviews" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Name: Allen\n", + "\n", + "Batch: 003\n", + "\n", + "Dataset: https://www.kaggle.com/datasets/nelgiriyewithana/mcdonalds-store-reviews\n", + "\n", + "Deployment: " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 1: Introduction" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Background" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This NLP project works with a dataset of over 33,000 anonymized reviews of McDonald's stores in the United States, scraped from Google reviews. As a data scientist, I want to explore these reviews and predict the ratings customers gave. The dataset provides valuable insight into customer experiences and opinions about various McDonald's locations across the country." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Problem Statement" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a data scientist working in IT at McDonald's, I have been asked by my manager to analyze reviews of McDonald's stores and to build a model that predicts, from anonymous Google reviews, what kind of rating customers gave. This matters because it lets the company evaluate its service and past mistakes, regain customer trust, and increase customer satisfaction so that customers come back, which in turn increases business revenue." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EoPJoCkCNz7d" + }, + "source": [ + "# Chapter 2: Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "OUAjNzh_Nz7e", + "outputId": "faf71de6-0e5d-474e-ea96-1f74d8f69b19" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[nltk_data] Downloading package stopwords to\n", + "[nltk_data] C:\\Users\\user\\AppData\\Roaming\\nltk_data...\n", + "[nltk_data] Package stopwords is already up-to-date!\n", + "[nltk_data] Downloading package punkt to\n", + "[nltk_data] C:\\Users\\user\\AppData\\Roaming\\nltk_data...\n", + "[nltk_data] Package punkt is already up-to-date!\n", + "[nltk_data] Downloading package wordnet to\n", + "[nltk_data] C:\\Users\\user\\AppData\\Roaming\\nltk_data...\n", + "[nltk_data] Package wordnet is already up-to-date!\n" + ] + }, + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Import Libraries\n", + "\n", + "import re\n", + "import nltk\n", + "import string\n", + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "import pickle\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "from sklearn.preprocessing import LabelEncoder\n", + "import tensorflow as tf\n", + "import tensorflow_hub as tf_hub\n", + "from nltk.tokenize import word_tokenize\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.feature_extraction.text import CountVectorizer\n", + "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", + "from sklearn.metrics import classification_report, precision_score, recall_score, accuracy_score, f1_score\n", + 
"from tensorflow.keras.layers import Embedding\n", + "from tensorflow.keras.models import Sequential\n", + "from tensorflow.keras.layers import Dense, LSTM, Bidirectional, GRU, Dropout, Reshape\n", + "from collections import Counter\n", + "from nltk.stem import WordNetLemmatizer\n", + "from nltk.probability import FreqDist\n", + "from nltk.tokenize import word_tokenize\n", + "nltk.download('stopwords')\n", + "nltk.download('punkt')\n", + "nltk.download('wordnet')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wx505m6HNz7j" + }, + "source": [ + "# Chapter 3: Data Loading" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 486 + }, + "id": "IK1fxoQnNz7k", + "outputId": "907e38f7-2b4d-4aff-ee8b-a90c21a0f177" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " reviewer_id store_name category \\\n", + "0 1 McDonald's Fast food restaurant \n", + "1 2 McDonald's Fast food restaurant \n", + "2 3 McDonald's Fast food restaurant \n", + "3 4 McDonald's Fast food restaurant \n", + "4 5 McDonald's Fast food restaurant \n", + "\n", + " store_address latitude longitude \\\n", + "0 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 -97.792874 \n", + "1 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 -97.792874 \n", + "2 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 -97.792874 \n", + "3 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 -97.792874 \n", + "4 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 -97.792874 \n", + "\n", + " rating_count review_time \\\n", + "0 1,240 3 months ago \n", + "1 1,240 5 days ago \n", + "2 1,240 5 days ago \n", + "3 1,240 a month ago \n", + "4 1,240 2 months ago \n", + "\n", + " review rating \n", + "0 Why does it look like someone spit on my food?... 1 star \n", + "1 It'd McDonalds. It is what it is as far as the... 4 stars \n", + "2 Made a mobile order got to the speaker and che... 1 star \n", + "3 My mc. Crispy chicken sandwich was ���ï¿... 5 stars \n", + "4 I repeat my order 3 times in the drive thru, a... 1 star " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Load dataset\n", + "df_ori=pd.read_csv('McDonald_s_Reviews.csv', encoding=\"latin-1\")\n", + "\n", + "\n", + "# create duplicate df_ori\n", + "df=df_ori.copy()\n", + "\n", + "# Show top 5 data\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 590 + }, + "id": "H3rtdXHNNz7l", + "outputId": "9db68151-1837-458a-91ad-0852e927bdb9" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " reviewer_id store_name category \\\n", + "33391 33392 McDonald's Fast food restaurant \n", + "33392 33393 McDonald's Fast food restaurant \n", + "33393 33394 McDonald's Fast food restaurant \n", + "33394 33395 McDonald's Fast food restaurant \n", + "33395 33396 McDonald's Fast food restaurant \n", + "\n", + " store_address latitude \\\n", + "33391 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.81 \n", + "33392 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.81 \n", + "33393 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.81 \n", + "33394 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.81 \n", + "33395 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.81 \n", + "\n", + " longitude rating_count review_time \\\n", + "33391 -80.189098 2,810 4 years ago \n", + "33392 -80.189098 2,810 a year ago \n", + "33393 -80.189098 2,810 a year ago \n", + "33394 -80.189098 2,810 5 years ago \n", + "33395 -80.189098 2,810 2 years ago \n", + "\n", + " review rating \n", + "33391 They treated me very badly. 1 star \n", + "33392 The service is very good 5 stars \n", + "33393 To remove hunger is enough 4 stars \n", + "33394 It's good, but lately it has become very expen... 
5 stars \n", + "33395 they took good care of me 5 stars " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show the last 5 rows\n", + "df.tail()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "s7hBcYukNz7m", + "outputId": "fbffd88a-534d-46ef-816b-9711d6f02baa" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 33396 entries, 0 to 33395\n", + "Data columns (total 10 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 reviewer_id 33396 non-null int64 \n", + " 1 store_name 33396 non-null object \n", + " 2 category 33396 non-null object \n", + " 3 store_address 33396 non-null object \n", + " 4 latitude 32736 non-null float64\n", + " 5 longitude 32736 non-null float64\n", + " 6 rating_count 33396 non-null object \n", + " 7 review_time 33396 non-null object \n", + " 8 review 33396 non-null object \n", + " 9 rating 33396 non-null object \n", + "dtypes: float64(2), int64(1), object(7)\n", + "memory usage: 2.5+ MB\n" + ] + } + ], + "source": [ + "# Show summary information about the dataset\n", + "df.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P5V5dmtQNz7o" + }, + "source": [ + "The `latitude` and `longitude` columns have fewer non-null values than the other columns, so the data has missing values. Let's check." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "O9mUPzDPNz7p", + "outputId": "2f0753de-35ff-482e-9ce8-7c2b48f710bc" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "reviewer_id 0\n", + "store_name 0\n", + "category 0\n", + "store_address 0\n", + "latitude 660\n", + "longitude 660\n", + "rating_count 0\n", + "review_time 0\n", + "review 0\n", + "rating 0\n", + "dtype: int64" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + 
], + "source": [ + "# Check missing values\n", + "df.isnull().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "h7ZQ6bLpNz7q" + }, + "outputs": [], + "source": [ + "# Remove rows with missing latitude (the raw column name has a trailing space)\n", + "df.dropna(subset=['latitude '], inplace= True)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "B0hvqVKrNz7s" + }, + "outputs": [], + "source": [ + "# Remove rows with missing longitude\n", + "df.dropna(subset=['longitude'], inplace= True)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0DlbE81cNz7t", + "outputId": "a3568cb1-b0af-439b-9e0a-63c5969cac3a" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "reviewer_id 0\n", + "store_name 0\n", + "category 0\n", + "store_address 0\n", + "latitude 0\n", + "longitude 0\n", + "rating_count 0\n", + "review_time 0\n", + "review 0\n", + "rating 0\n", + "dtype: int64" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Check missing values again\n", + "df.isnull().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d6zaRPLqNz7t", + "outputId": "1206d79c-0a9a-4798-861b-9031c8982e24" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['reviewer_id', 'store_name', 'category', 'store_address', 'latitude ',\n", + " 'longitude', 'rating_count', 'review_time', 'review', 'rating'],\n", + " dtype='object')" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show all column names\n", + "df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "fQT2YI-gNz7u" + }, + "outputs": [], + "source": [ + "# Strip whitespace from column names (fixes 
'latitude ')\n", + "df.columns=df.columns.str.strip()" + ] + }, + 
{ + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rzYkoDL1Nz7w", + "outputId": "a2aff493-3817-46f9-d48e-adfc15961e21" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['reviewer_id', 'store_name', 'category', 'store_address', 'latitude',\n", + " 'longitude', 'rating_count', 'review_time', 'review', 'rating'],\n", + " dtype='object')" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Check again white space in columns\n", + "df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "w_UOTLulNz7x", + "outputId": "bee20b74-d737-470e-9405-ebcb3d527e6e" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(32736, 10)" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show number of columns and rows\n", + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "RoYjWQJjNz7x", + "outputId": "ae52f785-97e0-45c2-c6b5-b512305a5910" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "reviewer_id 32736\n", + "store_name 2\n", + "category 1\n", + "store_address 39\n", + "latitude 39\n", + "longitude 39\n", + "rating_count 50\n", + "review_time 39\n", + "review 21634\n", + "rating 5\n", + "dtype: int64" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show how many unique value in every columns of dataset\n", + "df.nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "RsiZ3f1QNz7y", + "outputId": "a220ff2f-40ce-455a-dfcd-3b5ca679fb64" + }, + "outputs": [ + { + "data": { + "text/plain": [ + 
"array(['Why does it look like someone spit on my food?\\nI had a normal transaction, everyone was chill and polite, but now i dont want to eat this. Im trying not to think about what this milky white/clear substance is all over my food, i d*** sure am not coming back.',\n", + " \"It'd McDonalds. It is what it is as far as the food and atmosphere go. The staff here does make a difference. They are all friendly, accommodating and always smiling. Makes for a more pleasant experience than many other fast food places.\",\n", + " 'Made a mobile order got to the speaker and checked it in.\\nLine was not moving so I had to leave otherwise I���������������������������d be late for work.\\nNever got the refund in the app.\\nI called them and they said I could only get my money back in person because it was stuck in the system.\\nWent there in person the next day and the manager told me she wasnï¿',\n", + " ..., 'To remove hunger is enough',\n", + " \"It's good, but lately it has become very expensive.\",\n", + " 'they took good care of me'], dtype=object)" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show unique values of the review column\n", + "df.review.unique()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dnFfuMagNz7y" + }, + "source": [ + "The output above shows that the raw reviews contain newlines, inconsistent casing, and mis-encoded characters. This is useful input for `Text Preprocessing`: cleaning the text, removing stopwords, and reducing the vocabulary should make model training faster and is expected to improve the results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "LKZmS01ANz7z", + "outputId": "9fdc22c8-426f-4980-d672-753e335eea5c" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['1 star', '4 stars', '5 stars', '2 stars', '3 stars'], dtype=object)" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show unique values of the target column\n", + "df.rating.unique()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Bu2VkyfBNz7z" + }, + "source": [ + "Later we will group the ratings into classes: 1 star and 2 stars go into class `Negative`, 3 stars goes into class `Neutral`, and 4 stars and 5 stars go into class `Positive`." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "o3AobjNcNz7z", + "outputId": "ef44b2c8-6d03-4478-c3b3-727f5b674f23" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([\"McDonald's\", \"ýýýMcDonald's\"], dtype=object)" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show unique store names\n", + "df.store_name.unique()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "OsLJv2IbNz70" + }, + "outputs": [], + "source": [ + "# Fix the mis-encoded store name\n", + "df['store_name'] = df['store_name'].str.replace(\"ýýýMcDonald's\", \"McDonald's\")" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "zfXS-AvZNz70", + "outputId": "7d3de49b-8689-4e91-85e7-234a342a978e" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([\"McDonald's\"], dtype=object)" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# 
Checking the store name unique value\n", + "df.store_name.unique()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "y1WhyZUYNz70", + "outputId": "cd9b56cd-5dfa-48ae-c44f-b6d70a914459" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show data duplication\n", + "df.duplicated().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "78VywmGUNz71", + "outputId": "73fe1aa3-171d-48e0-be3b-26da96c97c29" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " reviewer_id latitude longitude\n", + "count 32736.000000 32736.000000 32736.000000\n", + "mean 16580.605022 34.442546 -90.647033\n", + "std 9700.796513 5.344116 16.594844\n", + "min 1.000000 25.790295 -121.995421\n", + "25% 8184.750000 28.655350 -97.792874\n", + "50% 16368.500000 33.931261 -81.471414\n", + "75% 25202.250000 40.727401 -75.399919\n", + "max 33396.000000 44.981410 -73.459820" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Summary statistics: count, mean, std, min, quartiles, max\n", + "df.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UVpe6iezNz71" + }, + "source": [ + "# Chapter 4: Exploratory Data Analysis (EDA)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 472 + }, + "id": "D1I7q7N0Nz71", + "outputId": "302d4156-eddb-4940-8cb8-195cfe7ca990" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Plot a bar chart of rating counts\n", + "df['rating'].value_counts().nlargest(10).plot(kind='barh')\n", + "plt.title('Lowest to Highest Rating')\n", + "plt.xlabel('Count')\n", + "plt.ylabel('Rating')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the chart above, most McDonald's customers gave a 5-star rating, which suggests they were satisfied with the menu or the service. However, the second most common rating is 1 star, a strongly negative response. Overall, customers still appear to be satisfied: the third most common rating is 4 stars, which indicates that positive ratings outnumber negative ones." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 70 + }, + "id": "PqAVnt4jNz72", + "outputId": "f5719b97-3d10-477b-ccdb-0694cb09d5c8" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'Why does it look like someone spit on my food?\\nI had a normal transaction, everyone was chill and polite, but now i dont want to eat this. 
Im trying not to think about what this milky white/clear substance is all over my food, i d*** sure am not coming back.'" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# example of 1 star review\n", + "df[df.rating == '1 star'].iloc[0].review" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + }, + "id": "PwnyJMngNz72", + "outputId": "f12bae77-cf04-4598-8e70-5d79ba535b4d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'The staff are very friendly and they do their job perfectly'" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# example of 5 star review\n", + "df[df.rating == '5 stars'].iloc[1].review" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 428 + }, + "id": "3TaDhK_LNz73", + "outputId": "04c15a71-3e08-4a22-c170-9f737606b402" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Pie Plot Percentages of Platforms\n", + "df['category'].value_counts().plot(kind='pie', autopct='%1.1f%%')\n", + "plt.title('Percentages of Platforms')\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create the label distribution plot\n", + "le = LabelEncoder()\n", + "\n", + "df['label'] = le.fit_transform(df['label'])\n", + "\n", + "plt.figure(figsize=(12, 5))\n", + "\n", + "plt.subplot(1, 2, 2)\n", + "sns.countplot(x='label', data=df)\n", + "plt.title('label Distribution Last 3 Months')\n", + "plt.xlabel('label')\n", + "plt.ylabel('Frequency')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the plot above, the most frequent class among customer reviews is label 2, the `Positive Response` class; label 0 is `Negative Response` and label 1 is `Neutral Response`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K3S-6jd6Nz74" + }, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZBOz5-4KNz74" + }, + "source": [ + "# Chapter 5: Feature Engineering / Text Preprocessing" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qlBW_K0ANz75" + }, + "source": [ + "### Removing Stopwords & Lemmatization" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "dGuhGt2pNz75", + "outputId": "7f0da7dd-a068-42ef-a530-8bba10960bd6" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Stopwords from NLTK\n", + "179 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 
'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n", + "\n", + "Out Final Stopwords\n", + "181 ['she', 'these', 'm', 'few', 'am', 'aren', 'ain', 'did', 'have', 'above', 'doesn', 'its', 'mustn', 'doing', 'been', 'himself', \"that'll\", 'any', 'theirs', 'by', 'only', 'just', 'he', 'didn', 'itself', 'has', \"needn't\", \"you're\", 'be', 'with', 'against', 'were', 't', 'some', 'ours', 'once', 'own', 'shouldn', 'wouldn', 'more', 'ma', 'mine', 'each', 're', 'until', 'it', \"shan't\", 'that', \"didn't\", 'very', 'in', \"she's\", 'out', 'both', \"won't\", 'you', 'from', 'will', 'there', 'too', 'hasn', 'up', 'her', 'are', 's', 'i', 'nor', 'had', \"doesn't\", 'a', 'having', 'yourselves', 'further', 'was', \"wasn't\", 'same', 'all', \"weren't\", 'to', 'him', 'then', 'their', 'why', 'we', 'themselves', 'again', 'hers', 'after', \"should've\", 'd', 'this', 'his', 'won', \"mightn't\", 'below', 'on', 'ourselves', 'they', 'now', 'shan', 'our', 'while', 'here', 'under', 've', 'do', 'should', \"aren't\", 'don', 'down', 
\"mustn't\", 'me', 'such', 'most', 'isn', 'into', 'hadn', 'other', 'where', 'wasn', 'weren', \"hasn't\", 'when', \"it's\", 'them', 'for', \"you've\", 'or', 'what', 'an', 'yours', 'off', 'haven', 'as', 'll', 'myself', 'because', 'which', 'being', 'your', 'than', 'herself', \"shouldn't\", 'and', 'does', 'who', 'if', 'through', 'the', \"wouldn't\", \"couldn't\", 'about', 'whom', 'y', 'is', 'but', 'how', 'between', 'not', 'no', 'so', 'can', \"you'd\", 'my', 'before', \"haven't\", 'during', 'of', 'couldn', 'at', 'yourself', 'over', 'aye', 'needn', \"hadn't\", 'o', \"isn't\", \"you'll\", 'mightn', \"don't\", 'those']\n" + ] + } + ], + "source": [ + "# Define Stopwords\n", + "## Load Stopwords from NLTK\n", + "from nltk.corpus import stopwords\n", + "stop_words_en = stopwords.words(\"english\")\n", + "\n", + "print('Stopwords from NLTK')\n", + "print(len(stop_words_en), stop_words_en)\n", + "print('')\n", + "\n", + "## Create A New Stopwords\n", + "new_stop_words = ['aye', 'mine', 'have']\n", + "\n", + "# Define Lemmatizer\n", + "lemmatizer = WordNetLemmatizer()\n", + "\n", + "## Merge Stopwords\n", + "stop_words_en = stop_words_en + new_stop_words\n", + "stop_words_en = list(set(stop_words_en))\n", + "print('Out Final Stopwords')\n", + "print(len(stop_words_en), stop_words_en)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fT2U3ncclVSY" + }, + "source": [ + "### Cleaning Text, Lemmatization, and Tokenization" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "id": "Im-ymtFjNz75" + }, + "outputs": [], + "source": [ + "# Create A Function for review Preprocessing\n", + "\n", + "def review_preprocessing(review):\n", + " # Case folding\n", + " review = review.lower()\n", + "\n", + " # Mention removal\n", + " review = re.sub(\"@[A-Za-z0-9_]+\", \" \", review)\n", + "\n", + " # Hashtags removal\n", + " review = re.sub(\"#[A-Za-z0-9_]+\", \" \", review)\n", + "\n", + " # Newline removal (\\n)\n", + " review = re.sub(r\"\\\\n\", \" 
\",review)\n", + "\n", + " # Whitespace removal\n", + " review = review.strip()\n", + "\n", + " # URL removal\n", + " review = re.sub(r\"http\S+\", \" \", review)\n", + " review = re.sub(r\"www\.\S+\", \" \", review)\n", + "\n", + " # Non-letter removal (digits, punctuation, apostrophes, emoticons, and symbols such as μ, $, 兀)\n", + " # One pattern also covers the mis-encoded characters (ï, ¿, ½, ý)\n", + " review = re.sub(r\"[^A-Za-z\s]\", \" \", review)\n", + "\n", + " # Tokenization\n", + " tokens = word_tokenize(review)\n", + "\n", + " # Stopwords removal\n", + " tokens = [word for word in tokens if word not in stop_words_en]\n", + "\n", + " # Lemmatization\n", + " tokens = [lemmatizer.lemmatize(word) for word in tokens]\n", + "\n", + " # Combining Tokens\n", + " review = ' '.join(tokens)\n", + "\n", + " return review" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pdh-h-PFlYaZ" + }, + "source": [ + "Compare the original `review` column with its cleaned version in the new `preprocessing_review` column." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 964 + }, + "id": "eolbr2EfNz8F", + "outputId": "8fe585bc-4f47-49d1-ac06-a5ebf9b358ce" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " reviewer_id store_name category \\\n", + "0 1 McDonald's Fast food restaurant \n", + "1 2 McDonald's Fast food restaurant \n", + "2 3 McDonald's Fast food restaurant \n", + "3 4 McDonald's Fast food restaurant \n", + "4 5 McDonald's Fast food restaurant \n", + "... ... ... ... \n", + "33391 33392 McDonald's Fast food restaurant \n", + "33392 33393 McDonald's Fast food restaurant \n", + "33393 33394 McDonald's Fast food restaurant \n", + "33394 33395 McDonald's Fast food restaurant \n", + "33395 33396 McDonald's Fast food restaurant \n", + "\n", + " store_address latitude \\\n", + "0 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 \n", + "1 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 \n", + "2 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 \n", + "3 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 \n", + "4 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 \n", + "... ... ... \n", + "33391 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.810000 \n", + "33392 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.810000 \n", + "33393 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.810000 \n", + "33394 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.810000 \n", + "33395 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.810000 \n", + "\n", + " longitude rating_count review_time \\\n", + "0 -97.792874 1,240 3 months ago \n", + "1 -97.792874 1,240 5 days ago \n", + "2 -97.792874 1,240 5 days ago \n", + "3 -97.792874 1,240 a month ago \n", + "4 -97.792874 1,240 2 months ago \n", + "... ... ... ... \n", + "33391 -80.189098 2,810 4 years ago \n", + "33392 -80.189098 2,810 a year ago \n", + "33393 -80.189098 2,810 a year ago \n", + "33394 -80.189098 2,810 5 years ago \n", + "33395 -80.189098 2,810 2 years ago \n", + "\n", + " review rating \\\n", + "0 Why does it look like someone spit on my food?... 1 star \n", + "1 It'd McDonalds. It is what it is as far as the... 
4 stars \n", + "2 Made a mobile order got to the speaker and che... 1 star \n", + "3 My mc. Crispy chicken sandwich was ���ï¿... 5 stars \n", + "4 I repeat my order 3 times in the drive thru, a... 1 star \n", + "... ... ... \n", + "33391 They treated me very badly. 1 star \n", + "33392 The service is very good 5 stars \n", + "33393 To remove hunger is enough 4 stars \n", + "33394 It's good, but lately it has become very expen... 5 stars \n", + "33395 they took good care of me 5 stars \n", + "\n", + " preprocessing_review \n", + "0 look like someone spit food normal transaction... \n", + "1 mcdonalds far food atmosphere go staff make di... \n", + "2 made mobile order got speaker checked line mov... \n", + "3 mc crispy chicken sandwich customer service qu... \n", + "4 repeat order time drive thru still manage mess... \n", + "... ... \n", + "33391 treated badly \n", + "33392 service good \n", + "33393 remove hunger enough \n", + "33394 good lately become expensive \n", + "33395 took good care \n", + "\n", + "[32736 rows x 11 columns]" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Apply text preprocessing to the dataset\n", + "\n", + "df['preprocessing_review'] = df['review'].apply(review_preprocessing)\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z7_T-iQEk2uU" + }, + "source": [ + "The data frame above shows each review before (`review`) and after (`preprocessing_review`) cleaning. The cleaned text is lowercased, free of punctuation and stopwords, and lemmatized, so the preprocessing step works as intended." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q36ix8vcNz8G" + }, + "source": [ + "## Target Conversion" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "PBF0CQjYNz8G", + "outputId": "c97e090d-efb1-43c6-cf0a-ca563c929dfb" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['1 star', '4 stars', '5 
stars', '2 stars', '3 stars'], dtype=object)" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Display Target\n", + "df.rating.unique()" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 964 + }, + "id": "4s-97QRFNz8H", + "outputId": "2164688f-bd22-4960-f0c0-6726f0103264" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " reviewer_id store_name category \\\n", + "0 1 McDonald's Fast food restaurant \n", + "1 2 McDonald's Fast food restaurant \n", + "2 3 McDonald's Fast food restaurant \n", + "3 4 McDonald's Fast food restaurant \n", + "4 5 McDonald's Fast food restaurant \n", + "... ... ... ... \n", + "33391 33392 McDonald's Fast food restaurant \n", + "33392 33393 McDonald's Fast food restaurant \n", + "33393 33394 McDonald's Fast food restaurant \n", + "33394 33395 McDonald's Fast food restaurant \n", + "33395 33396 McDonald's Fast food restaurant \n", + "\n", + " store_address latitude \\\n", + "0 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 \n", + "1 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 \n", + "2 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 \n", + "3 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 \n", + "4 13749 US-183 Hwy, Austin, TX 78750, United States 30.460718 \n", + "... ... ... \n", + "33391 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.810000 \n", + "33392 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.810000 \n", + "33393 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.810000 \n", + "33394 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.810000 \n", + "33395 3501 Biscayne Blvd, Miami, FL 33137, United St... 25.810000 \n", + "\n", + " longitude rating_count review_time \\\n", + "0 -97.792874 1,240 3 months ago \n", + "1 -97.792874 1,240 5 days ago \n", + "2 -97.792874 1,240 5 days ago \n", + "3 -97.792874 1,240 a month ago \n", + "4 -97.792874 1,240 2 months ago \n", + "... ... ... ... \n", + "33391 -80.189098 2,810 4 years ago \n", + "33392 -80.189098 2,810 a year ago \n", + "33393 -80.189098 2,810 a year ago \n", + "33394 -80.189098 2,810 5 years ago \n", + "33395 -80.189098 2,810 2 years ago \n", + "\n", + " review rating \\\n", + "0 Why does it look like someone spit on my food?... 1 star \n", + "1 It'd McDonalds. It is what it is as far as the... 
4 stars \n", + "2 Made a mobile order got to the speaker and che... 1 star \n", + "3 My mc. Crispy chicken sandwich was ���ï¿... 5 stars \n", + "4 I repeat my order 3 times in the drive thru, a... 1 star \n", + "... ... ... \n", + "33391 They treated me very badly. 1 star \n", + "33392 The service is very good 5 stars \n", + "33393 To remove hunger is enough 4 stars \n", + "33394 It's good, but lately it has become very expen... 5 stars \n", + "33395 they took good care of me 5 stars \n", + "\n", + " preprocessing_review label \n", + "0 look like someone spit food normal transaction... 0 \n", + "1 mcdonalds far food atmosphere go staff make di... 2 \n", + "2 made mobile order got speaker checked line mov... 0 \n", + "3 mc crispy chicken sandwich customer service qu... 2 \n", + "4 repeat order time drive thru still manage mess... 0 \n", + "... ... ... \n", + "33391 treated badly 0 \n", + "33392 service good 2 \n", + "33393 remove hunger enough 2 \n", + "33394 good lately become expensive 2 \n", + "33395 took good care 2 \n", + "\n", + "[32736 rows x 12 columns]" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Map star ratings to three classes: 1-2 stars -> 0, 3 stars -> 1, 4-5 stars -> 2\n", + "\n", + "df['label'] = df['rating'].map({'1 star': 0, '2 stars': 0, '3 stars': 1, '4 stars': 2, '5 stars': 2})\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ip0D4ORCNz8H" + }, + "source": [ + "After preprocessing the reviews and adding the `label` column, we create a new data frame that contains only the target (`label`) and a single feature, the cleaned `preprocessing_review` column." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "id": "bPmNF3jANz8I" + }, + "outputs": [], + "source": [ + "# Create a new data frame that keeps only the cleaned text and the label\n", + "df_new = df.drop(['reviewer_id', 'store_name', 'category', 'store_address', 'latitude',\n", + " 'longitude', 'rating_count', 'review_time', 'review', 'rating'], axis=1)" + ] + }, + { + "cell_type": 
"code", + "execution_count": 31, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "RvyZKMlaNz8I", + "outputId": "efc18a5a-2484-4004-8556-c66ab437d5e8" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " preprocessing_review label\n", + "0 look like someone spit food normal transaction... 0\n", + "1 mcdonalds far food atmosphere go staff make di... 2\n", + "2 made mobile order got speaker checked line mov... 0\n", + "3 mc crispy chicken sandwich customer service qu... 2\n", + "4 repeat order time drive thru still manage mess... 0" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Display 5 top data\n", + "df_new.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "j7FqRZ48Nz8I", + "outputId": "ff8c8b04-d454-437d-b19a-2cdccf3b68aa" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "2 15705\n", + "0 12325\n", + "1 4706\n", + "Name: label, dtype: int64" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Target Distribution\n", + "\n", + "df_new['label'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jSkF-4p9mUaC" + }, + "source": [ + "### Split X and y" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "id": "6SkG6FIqmXN_" + }, + "outputs": [], + "source": [ + "#split Feature and target\n", + "X= df_new['preprocessing_review']\n", + "y= df_new['label']" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kuaDQDLsmihs" + }, + "source": [ + "### Split dataset Train, Validation, and Test" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "g21e0FydNz8J", + "outputId": "4ed51d58-4493-4e68-c74c-daa69401cafd" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train Size : (25042,)\n", + "Val Size : (2783,)\n", + "Test Size : (4911,)\n" + ] + } + ], + "source": [ + "# df_new Splitting\n", + "\n", + "X_train_val, 
X_test, y_train_val, y_test = train_test_split(df_new.preprocessing_review,\n", + " df_new.label,\n", + " test_size=0.15,\n", + " random_state=20,\n", + " stratify=df_new.label)\n", + "\n", + "X_train, X_val, y_train, y_val = train_test_split(X_train_val,\n", + " y_train_val,\n", + " test_size=0.10,\n", + " random_state=20,\n", + " stratify=y_train_val)\n", + "\n", + "print('Train Size : ', X_train.shape)\n", + "print('Val Size : ', X_val.shape)\n", + "print('Test Size : ', X_test.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1O0M7WBSmheE" + }, + "source": [ + "### Encoder" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "YGBW_HmCNz8J", + "outputId": "9aae1656-ca79-400a-a717-d47a01750712" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[1., 0., 0.],\n", + " [1., 0., 0.],\n", + " [1., 0., 0.],\n", + " ...,\n", + " [0., 0., 1.],\n", + " [0., 0., 1.],\n", + " [0., 0., 1.]], dtype=float32)" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Change Target to One Hot Encoding\n", + "\n", + "from tensorflow.keras.utils import to_categorical\n", + "\n", + "y_train_ohe = to_categorical(y_train)\n", + "y_val_ohe = to_categorical(y_val)\n", + "y_test_ohe = to_categorical(y_test)\n", + "y_train_ohe" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PM59peHgNz8K" + }, + "source": [ + "# Bab 6: Model Building" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Y7LKrVJNz8K" + }, + "source": [ + "### Text Vectorization" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "jiETdM5zNz8K", + "outputId": "ddc364ff-8ed8-4e9a-e73a-cbee8bd8d5c9" + }, + "outputs": [ + { + "data": { + 
"text/plain": [ + "<25042x10639 sparse matrix of type ''\n", + "\twith 257012 stored elements in Compressed Sparse Row format>" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Get Vocabularies\n", + "\n", + "Vectorize = CountVectorizer()\n", + "X_train_vec = Vectorize.fit_transform(X_train)\n", + "X_test_vec = Vectorize.transform(X_test)\n", + "\n", + "X_train_vec" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wXmLWYNDNz8L", + "outputId": "30bd92b1-5f48-4637-826d-575283ef5e91" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total Vocab : 10639\n", + "Maximum Sentence Length : 254 tokens\n" + ] + } + ], + "source": [ + "# Finding the Number of Vocabs and Max Token Length in One Document\n", + "\n", + "total_vocab = len(Vectorize.vocabulary_.keys())\n", + "max_sen_len = max([len(i.split(\" \")) for i in X_train])\n", + "\n", + "print('Total Vocab : ', total_vocab)\n", + "print('Maximum Sentence Length : ', max_sen_len, 'tokens')" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "id": "Urg2qpgqNz8L" + }, + "outputs": [], + "source": [ + "# Text Vectorization\n", + "\n", + "from tensorflow.keras.layers import TextVectorization\n", + "\n", + "text_vectorization = TextVectorization(max_tokens=total_vocab,\n", + " standardize=\"lower_and_strip_punctuation\",\n", + " split=\"whitespace\",\n", + " ngrams=None,\n", + " output_mode=\"int\",\n", + " output_sequence_length=max_sen_len,\n", + " input_shape=(1,)) # Only use in Sequential API\n", + "\n", + "text_vectorization.adapt(X_train)" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cNtotmm-Nz8O", + "outputId": "91f7731a-1528-4873-c666-0a62814a050c" + }, + "outputs": [ + { + "name": "stdout", + 
"output_type": "stream", + "text": [ + "Document example\n", + "look like someone spit food normal transaction everyone chill polite dont want eat im trying think milky white clear substance food sure coming back\n", + "\n", + "Result of Text Vectorization\n", + "tf.Tensor(\n", + "[[ 157 13 191 1898 2 447 1604 237 1946 376 224 95 60 523\n", + " 327 184 4301 776 952 3504 2 170 244 33 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0]], shape=(1, 254), dtype=int64)\n", + "Vector size : (1, 254)\n" + ] + } + ], + "source": [ + "# Example Result\n", + "\n", + "## Document example\n", + "print('Document example')\n", + "print(df_new.preprocessing_review[0])\n", + "print('')\n", + "\n", + "## Result of Text Vectorization\n", + "print('Result of Text Vectorization')\n", + "print(text_vectorization([df_new.preprocessing_review[0]]))\n", + "print('Vector size : ', text_vectorization([df_new.preprocessing_review[0]]).shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "m26X8SBmNz8O", + "outputId": "80d9a8e4-4bca-48fc-cf1c-eda3135736e9" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['',\n", + " '[UNK]',\n", + " 'food',\n", + " 'order',\n", + " 'service',\n", + " 'good',\n", + " 'mcdonald',\n", + " 'place',\n", + " 'get',\n", + " 'time',\n", + " 'drive',\n", + " 'one',\n", + " 'fast',\n", + " 'like',\n", + " 'excellent',\n", + " 
'staff',\n", + " 'customer',\n", + " 'go',\n", + " 'always',\n", + " 'great']" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# View the Top 20 Tokens (Sorted by the Highest Frequency of Appearance)\n", + "\n", + "text_vectorization.get_vocabulary()[:20]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BJq5y_joNz8P" + }, + "source": [ + "### Word Embedding" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "id": "bxmHjEfcNz8Q" + }, + "outputs": [], + "source": [ + "# Embedding\n", + "embedding = Embedding(input_dim=total_vocab,\n", + " output_dim=128,\n", + " embeddings_initializer=\"uniform\",\n", + " input_length=max_sen_len)" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "l5ICLTo-Nz8R", + "outputId": "6f7c03e7-36ec-4f3f-8d89-c57eea680f59" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document example\n", + "look like someone spit food normal transaction everyone chill polite dont want eat im trying think milky white clear substance food sure coming back\n", + "\n", + "Result of Text Vectorization\n", + "tf.Tensor(\n", + "[[ 157 13 191 1898 2 447 1604 237 1946 376 224 95 60 523\n", + " 327 184 4301 776 952 3504 2 170 244 33 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0]], 
shape=(1, 254), dtype=int64)\n", + "Vector size : (1, 254)\n", + "\n", + "Result of Embedding\n", + "tf.Tensor(\n", + "[[[ 0.00472081 -0.01408794 -0.00837948 ... 0.03061861 0.01136633\n", + " 0.03773705]\n", + " [-0.00755378 0.04548157 -0.01270884 ... 0.03076415 -0.00011492\n", + " 0.00162794]\n", + " [-0.04925472 0.01381064 -0.03738972 ... -0.00957316 -0.00946026\n", + " -0.01780033]\n", + " ...\n", + " [-0.02504488 0.01843544 0.00167371 ... -0.02957155 -0.02274432\n", + " -0.01664193]\n", + " [-0.02504488 0.01843544 0.00167371 ... -0.02957155 -0.02274432\n", + " -0.01664193]\n", + " [-0.02504488 0.01843544 0.00167371 ... -0.02957155 -0.02274432\n", + " -0.01664193]]], shape=(1, 254, 128), dtype=float32)\n", + "Vector size : (1, 254, 128)\n" + ] + } + ], + "source": [ + "# Example Result\n", + "\n", + "## Document example\n", + "print('Document example')\n", + "print(df_new.preprocessing_review[0])\n", + "print('')\n", + "\n", + "## Result of Text Vectorization\n", + "print('Result of Text Vectorization')\n", + "print(text_vectorization([df_new.preprocessing_review[0]]))\n", + "print('Vector size : ', text_vectorization([df_new.preprocessing_review[0]]).shape)\n", + "print('')\n", + "\n", + "## Result of Embedding\n", + "print('Result of Embedding')\n", + "print(embedding(text_vectorization([df_new.preprocessing_review[0]])))\n", + "print('Vector size : ', embedding(text_vectorization([df_new.preprocessing_review[0]])).shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "id": "6tvQ6GTTNz8R" + }, + "outputs": [], + "source": [ + "# # Model Training using LSTM\n", + "# ## Clear Session\n", + "# seed = 20\n", + "# tf.keras.backend.clear_session()\n", + "# np.random.seed(seed)\n", + "# tf.random.set_seed(seed)\n", + "\n", + "# ## Define the architecture\n", + "# model_lstm_1 = Sequential()\n", + "# model_lstm_1.add(text_vectorization)\n", + "# model_lstm_1.add(embedding)\n", + "# model_lstm_1.add(Bidirectional(LSTM(32, 
return_sequences=True, kernel_initializer=tf.keras.initializers.GlorotUniform(seed))))\n", + "# model_lstm_1.add(Dropout(0.1))\n", + "# model_lstm_1.add(Bidirectional(LSTM(16, kernel_initializer=tf.keras.initializers.GlorotUniform(seed))))\n", + "# model_lstm_1.add(Dropout(0.1))\n", + "# model_lstm_1.add(Dense(3, activation='softmax'))\n", + "\n", + "# model_lstm_1.compile(loss='categorical_crossentropy', optimizer='adam', metrics='accuracy')\n", + "\n", + "# model_lstm_1_hist = model_lstm_1.fit(X_train, y_train_ohe, epochs=50, callbacks= tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3), validation_data=(X_val, y_val_ohe))" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "id": "IEGlCv7MXwKf" + }, + "outputs": [], + "source": [ + "# # Plot Training Results\n", + "\n", + "# model_lstm_1_hist_df = pd.DataFrame(model_lstm_1_hist.history)\n", + "\n", + "# plt.figure(figsize=(15, 5))\n", + "# plt.subplot(1, 2, 1)\n", + "# sns.lineplot(data=model_lstm_1_hist_df[['accuracy', 'val_accuracy']])\n", + "# plt.grid()\n", + "# plt.title('Accuracy vs Val-Accuracy')\n", + "\n", + "# plt.subplot(1, 2, 2)\n", + "# sns.lineplot(data=model_lstm_1_hist_df[['loss', 'val_loss']])\n", + "# plt.grid()\n", + "# plt.title('Loss vs Val-Loss')\n", + "# plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": { + "id": "mwze8gZxYkUu" + }, + "outputs": [], + "source": [ + "# Download the Embedding Layer\n", + "\n", + "# url = 'https://www.kaggle.com/models/google/nnlm/frameworks/TensorFlow2/variations/tf2-preview-en-dim128-with-normalization/versions/1'\n", + "\n", + "# hub_layer = tf_hub.KerasLayer(url, output_shape=[128], input_shape=[], dtype=tf.string)" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": { + "id": "k8i02JnTZzzS" + }, + "outputs": [], + "source": [ + "# # Model Training using LSTM with Transfer Learning\n", + "# %%time\n", + "\n", + "# from tensorflow.keras.models import 
Sequential\n", + "# from tensorflow.keras.layers import Dense, LSTM, Bidirectional, GRU, Dropout, Reshape\n", + "\n", + "# ## Clear Session\n", + "# seed = 20\n", + "# tf.keras.backend.clear_session()\n", + "# np.random.seed(seed)\n", + "# tf.random.set_seed(seed)\n", + "\n", + "# ## Define the architecture\n", + "# model_lstm_2 = Sequential()\n", + "# model_lstm_2.add(hub_layer)\n", + "# model_lstm_2.add(Reshape((128, 1)))\n", + "# model_lstm_2.add(Bidirectional(LSTM(32, return_sequences=True, kernel_initializer=tf.keras.initializers.GlorotUniform(seed))))\n", + "# model_lstm_2.add(Dropout(0.1))\n", + "# model_lstm_2.add(Bidirectional(LSTM(16, kernel_initializer=tf.keras.initializers.GlorotUniform(seed))))\n", + "# model_lstm_2.add(Dropout(0.1))\n", + "# model_lstm_2.add(Dense(3, activation='softmax'))\n", + "\n", + "# model_lstm_2.compile(loss='categorical_crossentropy', optimizer='adam', metrics='accuracy')\n", + "\n", + "# model_lstm_2_hist = model_lstm_2.fit(X_train, y_train_ohe, epochs=50,callbacks= tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3), validation_data=(X_val, y_val_ohe))" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "id": "6qQMpDc5bHgn" + }, + "outputs": [], + "source": [ + "# # Plot Training Results\n", + "\n", + "# model_lstm_2_hist_df = pd.DataFrame(model_lstm_2_hist.history)\n", + "\n", + "# plt.figure(figsize=(15, 5))\n", + "# plt.subplot(1, 2, 1)\n", + "# sns.lineplot(data=model_lstm_2_hist_df[['accuracy', 'val_accuracy']])\n", + "# plt.grid()\n", + "# plt.title('Accuracy vs Val-Accuracy')\n", + "\n", + "# plt.subplot(1, 2, 2)\n", + "# sns.lineplot(data=model_lstm_2_hist_df[['loss', 'val_loss']])\n", + "# plt.grid()\n", + "# plt.title('Loss vs Val-Loss')\n", + "# plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VBfmw6e0myVM" + }, + "source": [ + "The commented-out cells above were not left in by accident; they are kept as evidence and history in this 
notebook. At first I built this project with an LSTM model, and even after trying to improve it with transfer learning the results were still low. I wanted a better result, so I tried a GRU model, which performed much better. I don't want to drop this evidence from the notebook because the runs took a lot of time. Since the GRU model is much better than the LSTM in this project, I use the GRU model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tg3eCRB9rKyF" + }, + "source": [ + "# Chapter 7: Model Definition & Model Training" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "879oRYDprQmS" + }, + "source": [ + "## GRU MODEL" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hYGDbcTQrUQv" + }, + "source": [ + "As noted above, the GRU model is used because it performed much better than the LSTM model in this project." + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bF6zDvfi9sWE", + "outputId": "a27af844-530d-4550-925f-5f6ced87083f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 1/50\n", + "783/783 [==============================] - 82s 92ms/step - loss: 0.5619 - accuracy: 0.7777 - val_loss: 0.5034 - val_accuracy: 0.8085\n", + "Epoch 2/50\n", + "783/783 [==============================] - 34s 43ms/step - loss: 0.4085 - accuracy: 0.8454 - val_loss: 0.4890 - val_accuracy: 0.8182\n", + "Epoch 3/50\n", + "783/783 [==============================] - 31s 40ms/step - loss: 0.3375 - accuracy: 0.8773 - val_loss: 0.5199 - val_accuracy: 0.8106\n", + "Epoch 4/50\n", + "783/783 [==============================] - 32s 41ms/step - loss: 0.2899 - accuracy: 0.8969 - val_loss: 0.5635 - val_accuracy: 0.8175\n", + "Epoch 5/50\n", + "783/783 [==============================] - 31s 39ms/step - loss: 0.2570 - accuracy: 0.9108 - val_loss: 0.5667 - 
val_accuracy: 0.8103\n", + "Epoch 6/50\n", + "783/783 [==============================] - 31s 39ms/step - loss: 0.2291 - accuracy: 0.9190 - val_loss: 0.5925 - val_accuracy: 0.8074\n", + "Epoch 7/50\n", + "783/783 [==============================] - 32s 41ms/step - loss: 0.2053 - accuracy: 0.9277 - val_loss: 0.6337 - val_accuracy: 0.8103\n", + "Epoch 8/50\n", + "783/783 [==============================] - 30s 39ms/step - loss: 0.1913 - accuracy: 0.9328 - val_loss: 0.6952 - val_accuracy: 0.8103\n", + "Epoch 9/50\n", + "783/783 [==============================] - 30s 38ms/step - loss: 0.1773 - accuracy: 0.9363 - val_loss: 0.7311 - val_accuracy: 0.8088\n", + "Epoch 10/50\n", + "783/783 [==============================] - 30s 39ms/step - loss: 0.1656 - accuracy: 0.9413 - val_loss: 0.7789 - val_accuracy: 0.7973\n", + "Epoch 11/50\n", + "783/783 [==============================] - 31s 39ms/step - loss: 0.1598 - accuracy: 0.9437 - val_loss: 0.8024 - val_accuracy: 0.8027\n", + "Epoch 12/50\n", + "783/783 [==============================] - 29s 37ms/step - loss: 0.1472 - accuracy: 0.9468 - val_loss: 0.8061 - val_accuracy: 0.8049\n", + "Epoch 13/50\n", + "783/783 [==============================] - 30s 39ms/step - loss: 0.1388 - accuracy: 0.9491 - val_loss: 0.8671 - val_accuracy: 0.8006\n", + "Epoch 14/50\n", + "783/783 [==============================] - 31s 39ms/step - loss: 0.1348 - accuracy: 0.9506 - val_loss: 0.8674 - val_accuracy: 0.8085\n", + "Epoch 15/50\n", + "783/783 [==============================] - 31s 40ms/step - loss: 0.1269 - accuracy: 0.9531 - val_loss: 0.9125 - val_accuracy: 0.8063\n", + "Epoch 16/50\n", + "783/783 [==============================] - 31s 40ms/step - loss: 0.1222 - accuracy: 0.9564 - val_loss: 0.9191 - val_accuracy: 0.8099\n", + "Epoch 17/50\n", + "783/783 [==============================] - 29s 37ms/step - loss: 0.1167 - accuracy: 0.9573 - val_loss: 0.9590 - val_accuracy: 0.7988\n", + "Epoch 18/50\n", + "783/783 [==============================] - 31s 
40ms/step - loss: 0.1144 - accuracy: 0.9570 - val_loss: 0.9848 - val_accuracy: 0.8013\n", + "Epoch 19/50\n", + "783/783 [==============================] - 30s 39ms/step - loss: 0.1119 - accuracy: 0.9593 - val_loss: 0.9496 - val_accuracy: 0.8049\n", + "Epoch 20/50\n", + "783/783 [==============================] - 29s 37ms/step - loss: 0.1047 - accuracy: 0.9616 - val_loss: 0.9911 - val_accuracy: 0.8081\n", + "Epoch 21/50\n", + "783/783 [==============================] - 31s 40ms/step - loss: 0.1046 - accuracy: 0.9626 - val_loss: 0.9940 - val_accuracy: 0.7923\n", + "Epoch 22/50\n", + "783/783 [==============================] - 29s 38ms/step - loss: 0.0985 - accuracy: 0.9638 - val_loss: 1.0093 - val_accuracy: 0.8085\n", + "Epoch 23/50\n", + "783/783 [==============================] - 31s 40ms/step - loss: 0.0971 - accuracy: 0.9635 - val_loss: 1.0593 - val_accuracy: 0.8070\n", + "Epoch 24/50\n", + "783/783 [==============================] - 30s 38ms/step - loss: 0.0975 - accuracy: 0.9645 - val_loss: 1.0365 - val_accuracy: 0.8034\n", + "Epoch 25/50\n", + "783/783 [==============================] - 30s 39ms/step - loss: 0.0952 - accuracy: 0.9652 - val_loss: 1.0468 - val_accuracy: 0.8081\n", + "Epoch 26/50\n", + "783/783 [==============================] - 32s 41ms/step - loss: 0.0913 - accuracy: 0.9655 - val_loss: 1.1019 - val_accuracy: 0.8063\n", + "Epoch 27/50\n", + "783/783 [==============================] - 29s 37ms/step - loss: 0.0856 - accuracy: 0.9677 - val_loss: 1.1246 - val_accuracy: 0.8088\n", + "Epoch 28/50\n", + "783/783 [==============================] - 30s 39ms/step - loss: 0.0877 - accuracy: 0.9678 - val_loss: 1.0922 - val_accuracy: 0.8038\n", + "Epoch 29/50\n", + "783/783 [==============================] - 30s 38ms/step - loss: 0.0906 - accuracy: 0.9655 - val_loss: 1.0878 - val_accuracy: 0.8063\n", + "Epoch 30/50\n", + "783/783 [==============================] - 29s 37ms/step - loss: 0.0881 - accuracy: 0.9672 - val_loss: 1.0949 - val_accuracy: 0.8052\n" + 
] + } + ], + "source": [ + "# Model Training using GRU\n", + "## Clear Session\n", + "seed = 20\n", + "tf.keras.backend.clear_session()\n", + "np.random.seed(seed)\n", + "tf.random.set_seed(seed)\n", + "\n", + "## Define the architecture\n", + "model_gru_1 = Sequential()\n", + "model_gru_1.add(text_vectorization)\n", + "model_gru_1.add(embedding)\n", + "model_gru_1.add(Bidirectional(GRU(32, return_sequences=True, kernel_initializer=tf.keras.initializers.GlorotUniform(seed))))\n", + "model_gru_1.add(Dropout(0.1))\n", + "model_gru_1.add(Bidirectional(GRU(16, kernel_initializer=tf.keras.initializers.GlorotUniform(seed))))\n", + "model_gru_1.add(Dropout(0.1))\n", + "model_gru_1.add(Dense(3, activation='softmax'))\n", + "\n", + "model_gru_1.compile(loss='categorical_crossentropy', optimizer='adam', metrics='accuracy')\n", + "\n", + "model_gru_1_hist = model_gru_1.fit(X_train, y_train_ohe, epochs=50, callbacks= tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3), validation_data=(X_val, y_val_ohe))" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Jc65KxxUrh0p", + "outputId": "06b80d61-a528-47ef-e73a-d348c2a74693" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model: \"sequential\"\n", + "_________________________________________________________________\n", + " Layer (type) Output Shape Param # \n", + "=================================================================\n", + " text_vectorization (TextVe (None, 254) 0 \n", + " ctorization) \n", + " \n", + " embedding (Embedding) (None, 254, 128) 1361792 \n", + " \n", + " bidirectional (Bidirection (None, 254, 64) 31104 \n", + " al) \n", + " \n", + " dropout (Dropout) (None, 254, 64) 0 \n", + " \n", + " bidirectional_1 (Bidirecti (None, 32) 7872 \n", + " onal) \n", + " \n", + " dropout_1 (Dropout) (None, 32) 0 \n", + " \n", + " dense (Dense) (None, 3) 99 \n", + " \n", + 
"=================================================================\n", + "Total params: 1400867 (5.34 MB)\n", + "Trainable params: 1400867 (5.34 MB)\n", + "Non-trainable params: 0 (0.00 Byte)\n", + "_________________________________________________________________\n" + ] + } + ], + "source": [ + "# Summary\n", + "model_gru_1.summary()" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 865 + }, + "id": "B9QQmfvsrqL0", + "outputId": "9fecadff-ba80-4104-82e4-10bc5f28daaf" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Plot Layers\n", + "tf.keras.utils.plot_model(model_gru_1, show_shapes=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "V9gdYNvCyCHy", + "outputId": "4ee1a3db-f329-4bb1-e068-546f20714e65" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " loss accuracy val_loss val_accuracy\n", + "25 0.091330 0.965538 1.101891 0.806324\n", + "26 0.085621 0.967654 1.124614 0.808839\n", + "27 0.087669 0.967774 1.092229 0.803809\n", + "28 0.090561 0.965458 1.087811 0.806324\n", + "29 0.088146 0.967175 1.094923 0.805246" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Create DataFrame\n", + "\n", + "model_gru_1_hist_df = pd.DataFrame(model_gru_1_hist.history)\n", + "model_gru_1_hist_df.tail()" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 330 + }, + "id": "oeJfqDK0PEbw", + "outputId": "b020499d-ddfc-4bea-c1a7-1171b5a11d55" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Plot Training Results\n", + "\n", + "model_gru_1_hist_df = pd.DataFrame(model_gru_1_hist.history)\n", + "\n", + "plt.figure(figsize=(15, 5))\n", + "plt.subplot(1, 2, 1)\n", + "sns.lineplot(data=model_gru_1_hist_df[['accuracy', 'val_accuracy']])\n", + "plt.grid()\n", + "plt.title('Accuracy vs Val-Accuracy')\n", + "\n", + "plt.subplot(1, 2, 2)\n", + "sns.lineplot(data=model_gru_1_hist_df[['loss', 'val_loss']])\n", + "plt.grid()\n", + "plt.title('Loss vs Val-Loss')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The model is overfitting: training accuracy reaches about 96% while validation accuracy stays around 80%, and validation loss keeps rising (from about 0.49 at epoch 2 to about 1.09 at epoch 30). Because early stopping monitors the training loss rather than the validation loss, training continued while the validation loss increased. Next, let's try to improve the fit with transfer learning." + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": { + "id": "DuYB8mmeRaib" + }, + "outputs": [], + "source": [ + "# Download the Embedding Layer\n", + "\n", + "url = 'https://www.kaggle.com/models/google/nnlm/frameworks/TensorFlow2/variations/tf2-preview-en-dim128-with-normalization/versions/1'\n", + "\n", + "hub_layer = tf_hub.KerasLayer(url, output_shape=[128], input_shape=[], dtype=tf.string)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DMwJ9zEAse6b" + }, + "source": [ + "### Model Improvement" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "94w3qqs6Rg5W", + "outputId": "80cc8816-e08a-41c1-eaf9-a3d4c928e747" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 1/50\n", + "783/783 [==============================] - 33s 28ms/step - loss: 0.8685 - accuracy: 0.6060 - val_loss: 0.7424 - val_accuracy: 0.7057\n", + "Epoch 2/50\n", + "783/783 [==============================] - 20s 26ms/step - loss: 0.7253 - accuracy: 0.7101 - val_loss: 0.7351 - val_accuracy: 0.6985\n", + "Epoch 3/50\n", + "783/783 
[==============================] - 21s 27ms/step - loss: 0.6897 - accuracy: 0.7236 - val_loss: 0.6661 - val_accuracy: 0.7373\n", + "Epoch 4/50\n", + "783/783 [==============================] - 20s 25ms/step - loss: 0.6804 - accuracy: 0.7279 - val_loss: 0.6817 - val_accuracy: 0.7291\n", + "Epoch 5/50\n", + "783/783 [==============================] - 20s 26ms/step - loss: 0.6684 - accuracy: 0.7338 - val_loss: 0.6464 - val_accuracy: 0.7431\n", + "Epoch 6/50\n", + "783/783 [==============================] - 21s 26ms/step - loss: 0.6622 - accuracy: 0.7369 - val_loss: 0.6396 - val_accuracy: 0.7424\n", + "Epoch 7/50\n", + "783/783 [==============================] - 21s 27ms/step - loss: 0.6550 - accuracy: 0.7373 - val_loss: 0.6343 - val_accuracy: 0.7496\n", + "Epoch 8/50\n", + "783/783 [==============================] - 21s 26ms/step - loss: 0.6502 - accuracy: 0.7422 - val_loss: 0.6326 - val_accuracy: 0.7503\n", + "Epoch 9/50\n", + "783/783 [==============================] - 21s 27ms/step - loss: 0.6454 - accuracy: 0.7416 - val_loss: 0.6303 - val_accuracy: 0.7445\n", + "Epoch 10/50\n", + "783/783 [==============================] - 20s 25ms/step - loss: 0.6386 - accuracy: 0.7461 - val_loss: 0.6341 - val_accuracy: 0.7481\n", + "Epoch 11/50\n", + "783/783 [==============================] - 21s 26ms/step - loss: 0.6366 - accuracy: 0.7457 - val_loss: 0.6303 - val_accuracy: 0.7524\n", + "Epoch 12/50\n", + "783/783 [==============================] - 20s 26ms/step - loss: 0.6334 - accuracy: 0.7477 - val_loss: 0.6217 - val_accuracy: 0.7575\n", + "Epoch 13/50\n", + "783/783 [==============================] - 21s 26ms/step - loss: 0.6297 - accuracy: 0.7477 - val_loss: 0.6217 - val_accuracy: 0.7521\n", + "Epoch 14/50\n", + "783/783 [==============================] - 21s 26ms/step - loss: 0.6261 - accuracy: 0.7495 - val_loss: 0.6259 - val_accuracy: 0.7503\n", + "Epoch 15/50\n", + "783/783 [==============================] - 20s 26ms/step - loss: 0.6219 - accuracy: 0.7512 - val_loss: 
0.6171 - val_accuracy: 0.7575\n", + "Epoch 16/50\n", + "783/783 [==============================] - 21s 26ms/step - loss: 0.6214 - accuracy: 0.7551 - val_loss: 0.6144 - val_accuracy: 0.7557\n", + "Epoch 17/50\n", + "783/783 [==============================] - 20s 26ms/step - loss: 0.6167 - accuracy: 0.7534 - val_loss: 0.6100 - val_accuracy: 0.7567\n", + "Epoch 18/50\n", + "783/783 [==============================] - 21s 26ms/step - loss: 0.6125 - accuracy: 0.7554 - val_loss: 0.6248 - val_accuracy: 0.7564\n", + "Epoch 19/50\n", + "783/783 [==============================] - 22s 28ms/step - loss: 0.6077 - accuracy: 0.7582 - val_loss: 0.6068 - val_accuracy: 0.7636\n", + "Epoch 20/50\n", + "783/783 [==============================] - 20s 25ms/step - loss: 0.6029 - accuracy: 0.7609 - val_loss: 0.6099 - val_accuracy: 0.7603\n", + "Epoch 21/50\n", + "783/783 [==============================] - 21s 27ms/step - loss: 0.6019 - accuracy: 0.7603 - val_loss: 0.6183 - val_accuracy: 0.7593\n", + "Epoch 22/50\n", + "783/783 [==============================] - 20s 26ms/step - loss: 0.5984 - accuracy: 0.7624 - val_loss: 0.6221 - val_accuracy: 0.7571\n", + "Epoch 23/50\n", + "783/783 [==============================] - 28s 36ms/step - loss: 0.5953 - accuracy: 0.7631 - val_loss: 0.6117 - val_accuracy: 0.7628\n", + "Epoch 24/50\n", + "783/783 [==============================] - 25s 32ms/step - loss: 0.5924 - accuracy: 0.7634 - val_loss: 0.6085 - val_accuracy: 0.7575\n", + "Epoch 25/50\n", + "783/783 [==============================] - 21s 27ms/step - loss: 0.5900 - accuracy: 0.7651 - val_loss: 0.6107 - val_accuracy: 0.7575\n", + "Epoch 26/50\n", + "783/783 [==============================] - 19s 24ms/step - loss: 0.5880 - accuracy: 0.7677 - val_loss: 0.5968 - val_accuracy: 0.7668\n", + "Epoch 27/50\n", + "783/783 [==============================] - 20s 25ms/step - loss: 0.5840 - accuracy: 0.7664 - val_loss: 0.6010 - val_accuracy: 0.7668\n", + "Epoch 28/50\n", + "783/783 
[==============================] - 19s 24ms/step - loss: 0.5813 - accuracy: 0.7687 - val_loss: 0.5964 - val_accuracy: 0.7697\n", + "Epoch 29/50\n", + "783/783 [==============================] - 20s 25ms/step - loss: 0.5787 - accuracy: 0.7690 - val_loss: 0.6037 - val_accuracy: 0.7643\n", + "Epoch 30/50\n", + "783/783 [==============================] - 19s 25ms/step - loss: 0.5747 - accuracy: 0.7731 - val_loss: 0.6061 - val_accuracy: 0.7664\n", + "Epoch 31/50\n", + "783/783 [==============================] - 20s 26ms/step - loss: 0.5765 - accuracy: 0.7703 - val_loss: 0.5920 - val_accuracy: 0.7729\n", + "Epoch 32/50\n", + "783/783 [==============================] - 19s 24ms/step - loss: 0.5715 - accuracy: 0.7722 - val_loss: 0.5881 - val_accuracy: 0.7715\n", + "Epoch 33/50\n", + "783/783 [==============================] - 19s 25ms/step - loss: 0.5692 - accuracy: 0.7738 - val_loss: 0.5976 - val_accuracy: 0.7711\n", + "Epoch 34/50\n", + "783/783 [==============================] - 19s 24ms/step - loss: 0.5676 - accuracy: 0.7761 - val_loss: 0.6034 - val_accuracy: 0.7668\n", + "Epoch 35/50\n", + "783/783 [==============================] - 20s 25ms/step - loss: 0.5667 - accuracy: 0.7756 - val_loss: 0.6004 - val_accuracy: 0.7661\n", + "Epoch 36/50\n", + "783/783 [==============================] - 19s 25ms/step - loss: 0.5642 - accuracy: 0.7767 - val_loss: 0.5946 - val_accuracy: 0.7722\n", + "Epoch 37/50\n", + "783/783 [==============================] - 20s 25ms/step - loss: 0.5600 - accuracy: 0.7785 - val_loss: 0.5882 - val_accuracy: 0.7754\n", + "Epoch 38/50\n", + "783/783 [==============================] - 20s 25ms/step - loss: 0.5595 - accuracy: 0.7769 - val_loss: 0.5872 - val_accuracy: 0.7743\n", + "Epoch 39/50\n", + "783/783 [==============================] - 19s 24ms/step - loss: 0.5574 - accuracy: 0.7782 - val_loss: 0.5870 - val_accuracy: 0.7697\n", + "Epoch 40/50\n", + "783/783 [==============================] - 19s 24ms/step - loss: 0.5553 - accuracy: 0.7795 - 
val_loss: 0.5883 - val_accuracy: 0.7718\n", + "Epoch 41/50\n", + "783/783 [==============================] - 19s 25ms/step - loss: 0.5526 - accuracy: 0.7789 - val_loss: 0.5903 - val_accuracy: 0.7758\n", + "Epoch 42/50\n", + "783/783 [==============================] - 19s 24ms/step - loss: 0.5508 - accuracy: 0.7808 - val_loss: 0.5937 - val_accuracy: 0.7708\n", + "Epoch 43/50\n", + "783/783 [==============================] - 19s 24ms/step - loss: 0.5486 - accuracy: 0.7819 - val_loss: 0.5984 - val_accuracy: 0.7718\n", + "Epoch 44/50\n", + "783/783 [==============================] - 18s 23ms/step - loss: 0.5438 - accuracy: 0.7820 - val_loss: 0.5987 - val_accuracy: 0.7729\n", + "Epoch 45/50\n", + "783/783 [==============================] - 19s 25ms/step - loss: 0.5432 - accuracy: 0.7841 - val_loss: 0.5878 - val_accuracy: 0.7751\n", + "Epoch 46/50\n", + "783/783 [==============================] - 20s 25ms/step - loss: 0.5421 - accuracy: 0.7851 - val_loss: 0.5955 - val_accuracy: 0.7679\n", + "Epoch 47/50\n", + "783/783 [==============================] - 18s 24ms/step - loss: 0.5397 - accuracy: 0.7845 - val_loss: 0.5948 - val_accuracy: 0.7672\n", + "Epoch 48/50\n", + "783/783 [==============================] - 20s 25ms/step - loss: 0.5371 - accuracy: 0.7859 - val_loss: 0.5981 - val_accuracy: 0.7711\n", + "Epoch 49/50\n", + "783/783 [==============================] - 19s 24ms/step - loss: 0.5350 - accuracy: 0.7864 - val_loss: 0.5938 - val_accuracy: 0.7700\n", + "Epoch 50/50\n", + "783/783 [==============================] - 19s 25ms/step - loss: 0.5328 - accuracy: 0.7900 - val_loss: 0.5976 - val_accuracy: 0.7700\n", + "CPU times: user 18min 9s, sys: 50.7 s, total: 19min\n", + "Wall time: 17min 1s\n" + ] + } + ], + "source": [ + "%%time\n", + "# Model Training using GRU with Transfer Learning\n", + "# (the %%time cell magic must be the first line of the cell)\n", + "\n", + "## Clear Session\n", + "seed = 20\n", + "tf.keras.backend.clear_session()\n", + "np.random.seed(seed)\n", + "tf.random.set_seed(seed)\n", + "\n", + "## Define the 
architecture\n", + "model_gru_2 = Sequential()\n", + "model_gru_2.add(hub_layer)\n", + "# Treat the 128-dim sentence embedding as a sequence of 128 steps with 1 feature each\n", + "model_gru_2.add(Reshape((128, 1)))\n", + "model_gru_2.add(Bidirectional(GRU(32, return_sequences=True, kernel_initializer=tf.keras.initializers.GlorotUniform(seed))))\n", + "model_gru_2.add(Dropout(0.1))\n", + "model_gru_2.add(Bidirectional(GRU(16, kernel_initializer=tf.keras.initializers.GlorotUniform(seed))))\n", + "model_gru_2.add(Dropout(0.1))\n", + "model_gru_2.add(Dense(3, activation='softmax'))\n", + "\n", + "model_gru_2.compile(loss='categorical_crossentropy', optimizer='adam', metrics='accuracy')\n", + "# NOTE: this stores a reference to the predict method; it does not run a prediction\n", + "rounded_prediction = model_gru_2.predict\n", + "\n", + "model_gru_2_hist = model_gru_2.fit(X_train, y_train_ohe, epochs=50,callbacks= tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3), validation_data=(X_val, y_val_ohe))" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6NGT0XGPp9y2", + "outputId": "7f54dbe4-6768-45ed-eefb-d071f57d1790" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model: \"sequential\"\n", + "_________________________________________________________________\n", + " Layer (type) Output Shape Param # \n", + "=================================================================\n", + " keras_layer (KerasLayer) (None, 128) 124642688 \n", + " \n", + " reshape (Reshape) (None, 128, 1) 0 \n", + " \n", + " bidirectional (Bidirection (None, 128, 64) 6720 \n", + " al) \n", + " \n", + " dropout (Dropout) (None, 128, 64) 0 \n", + " \n", + " bidirectional_1 (Bidirecti (None, 32) 7872 \n", + " onal) \n", + " \n", + " dropout_1 (Dropout) (None, 32) 0 \n", + " \n", + " dense (Dense) (None, 3) 99 \n", + " \n", + "=================================================================\n", + "Total params: 124657379 (475.53 MB)\n", + "Trainable params: 14691 (57.39 KB)\n", + "Non-trainable params: 124642688 (475.47 MB)\n", + 
"_________________________________________________________________\n" + ] + } + ], + "source": [ + "# Summary\n", + "model_gru_2.summary()" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 331 + }, + "id": "fM7O4Go5Tvnx", + "outputId": "9e641ac5-8e34-478d-a866-b9cfdf46a466" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Plot Training Results\n", + "\n", + "model_gru_2_hist_df = pd.DataFrame(model_gru_2_hist.history)\n", + "\n", + "plt.figure(figsize=(15, 5))\n", + "plt.subplot(1, 2, 1)\n", + "sns.lineplot(data=model_gru_2_hist_df[['accuracy', 'val_accuracy']])\n", + "plt.grid()\n", + "plt.title('Accuracy vs Val-Accuracy')\n", + "\n", + "plt.subplot(1, 2, 2)\n", + "sns.lineplot(data=model_gru_2_hist_df[['loss', 'val_loss']])\n", + "plt.grid()\n", + "plt.title('Loss vs Val-Loss')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The plot shows a good fit: training accuracy is about 78-79% and validation accuracy is about 77%, so the gap between the two is small and the model can be called `GOODFIT`." + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "WLumg7c_wjQA", + "outputId": "5687a7a9-0176-4ff9-d0fe-eddb196c6c01" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " loss accuracy val_loss val_accuracy\n", + "45 0.542102 0.785121 0.595538 0.767876\n", + "46 0.539652 0.784482 0.594788 0.767158\n", + "47 0.537105 0.785920 0.598064 0.771110\n", + "48 0.535023 0.786399 0.593828 0.770032\n", + "49 0.532805 0.789953 0.597571 0.770032" + ] + }, + "execution_count": 83, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Create DataFrame\n", + "\n", + "model_gru_2_hist_df = pd.DataFrame(model_gru_2_hist.history)\n", + "model_gru_2_hist_df.tail()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the information above we can conclude that accuracy decreases compared to the default model, but the model becomes a good fit because the gap between training accuracy and validation accuracy is small." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lnN0gTivqLsT" + }, + "source": [ + "# Chapter 8: Model Evaluation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bJA-FWuQqc-P" + }, + "source": [ + "### Evaluation Model GRU 1 without Transfer Learning" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "n6smnVhrqVD9", + "outputId": "ea910c41-b01c-4e8d-be3a-62102ad25640" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "154/154 [==============================] - 4s 24ms/step\n", + " precision recall f1-score support\n", + "\n", + " 0 0.82 0.85 0.84 1787\n", + " 1 0.49 0.60 0.54 582\n", + " 2 0.89 0.83 0.86 2542\n", + "\n", + " accuracy 0.81 4911\n", + " macro avg 0.74 0.76 0.75 4911\n", + "weighted avg 0.82 0.81 0.81 4911\n", + "\n" + ] + } + ], + "source": [ + "# Show Classification Report\n", + "y_pred_gru_test_1 = model_gru_1.predict(X_test)\n", + "y_pred_gru_test_1 = np.argmax((y_pred_gru_test_1), axis = -1)\n", + 
"# NOTE: scikit-learn's signature is classification_report(y_true, y_pred); with the arguments in this order,\n", + "# per-class precision and recall are swapped and support counts predicted labels\n", + "print(classification_report(y_pred_gru_test_1,y_test))" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "N93DM_63qWNi", + "outputId": "7560e574-d584-4261-ded1-84a707c9ed3f" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create function of confusion matrix\n", + "def plot_confusion_matrix(y_true, y_pred, labels):\n", + "    # Rows are true classes, columns are predicted classes\n", + "    cm = confusion_matrix(y_true, y_pred)\n", + "\n", + "    plt.figure(figsize=(8, 6))\n", + "    sns.heatmap(cm, annot=True, fmt=\"d\", cmap=\"Blues\", xticklabels=labels, yticklabels=labels)\n", + "    plt.title(\"Confusion Matrix\")\n", + "    plt.xlabel(\"Predicted\")\n", + "    plt.ylabel(\"True\")\n", + "    plt.show()\n", + "\n", + "# Calling Function\n", + "plot_confusion_matrix(y_test, y_pred_gru_test_1, labels=[\"label1\", \"label2\", \"label3\"])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-aErx07vqlr5" + }, + "source": [ + "### Evaluation Model Improvement GRU 2 with Transfer Learning" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Mv5JJNc0ssTO", + "outputId": "9d28417d-6b4a-49d9-84fc-1619f10441ee" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "154/154 [==============================] - 2s 10ms/step\n", + " precision recall f1-score support\n", + "\n", + " 0 0.86 0.75 0.80 2116\n", + " 1 0.28 0.70 0.40 280\n", + " 2 0.85 0.80 0.82 2515\n", + "\n", + " accuracy 0.77 4911\n", + " macro avg 0.66 0.75 0.67 4911\n", + "weighted avg 0.82 0.77 0.79 4911\n", + "\n" + ] + } + ], + "source": [ + "# Show Classification Report\n", + "y_pred_gru_test_2 = model_gru_2.predict(X_test)\n", + "y_pred_gru_test_2 = np.argmax((y_pred_gru_test_2), axis = -1)\n", + "# NOTE: scikit-learn's signature is classification_report(y_true, y_pred); with the arguments in this order,\n", + "# per-class precision and recall are swapped and support counts predicted labels\n", + "print(classification_report(y_pred_gru_test_2,y_test))" + ] + }, + { + "cell_type": "code", + "execution_count": 91, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "UQCcXx_NjK6J", + "outputId": "94c8c71c-037d-4ebb-bda6-1737cb0ff246" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Reuse plot_confusion_matrix defined in the GRU 1 evaluation above\n", + "plot_confusion_matrix(y_test, y_pred_gru_test_2, labels=[\"label1\", \"label2\", \"label3\"])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Comparing the two confusion matrices, the GRU 2 model with transfer learning performs better than the default model. The true positives are the largest counts for all three labels, and compared with the default model, the improved model increases the true positives for label 1 and reduces the false negatives for label 2 and label 3. 
Accuracy decreases, but the model becomes a good fit because the gap between accuracy and validation accuracy is small: `0.78` versus `0.77`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BtUYbyIKtphG" + }, + "source": [ + "## Model Weakness" + ] + }, + { + "cell_type": "code", + "execution_count": 98, + "metadata": { + "id": "KkqzEf1LtKd4" + }, + "outputs": [], + "source": [ + "# Create a DataFrame of actual vs. predicted labels\n", + "act_pred_imp = pd.DataFrame({\n", + " 'actual' : y_test,\n", + " 'prediction' : np.ndarray.flatten(y_pred_gru_test_2)\n", + "})\n", + "df_act_pred_imp = pd.concat([pd.DataFrame(X_test), act_pred_imp],axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "metadata": { + "id": "MdlvcjK2tQoh" + }, + "outputs": [], + "source": [ + "# Split into false positives (FP) and false negatives (FN), treating label 0 as the negative class\n", + "act_pred_imp_FP = df_act_pred_imp[(df_act_pred_imp['actual']==0) &(df_act_pred_imp['prediction']!=0)]\n", + "act_pred_imp_FN = df_act_pred_imp[(df_act_pred_imp['actual']!=0) &(df_act_pred_imp['prediction']==0)]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jj_orjC7tug4" + }, + "source": [ + "### Weakness Words in False Positive" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0n-uaWxntTMR", + "outputId": "f43a3012-47f3-48c8-dbf6-55e990a3f2a4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "mcdonald 60\n", + "food 50\n", + "order 40\n", + "good 30\n", + "like 28\n", + "service 27\n", + "go 26\n", + "place 23\n", + "people 22\n", + "always 22\n", + "time 22\n", + "nugget 20\n", + "need 20\n", + "mcdonalds 18\n", + "back 18\n", + "drive 17\n", + "get 17\n", + "fast 16\n", + "hot 15\n", + "staff 15\n" + ] + } + ], + "source": [ + "# Concatenate all the text data into a single string\n", + "all_text = ' '.join(act_pred_imp_FP['preprocessing_review'].values)\n", + "\n", + "# Tokenize the text into individual 
words\n", + "tokens = word_tokenize(all_text)\n", + "\n", + "# Count the frequency of each word\n", + "word_freq = FreqDist(tokens)\n", + "\n", + "# Get the top 20 most frequent words\n", + "most_common_words = word_freq.most_common(20)\n", + "\n", + "# Print the top 20 most frequent words\n", + "for word, freq in most_common_words:\n", + " print(word, freq)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qu3Q-JlEuBvH" + }, + "source": [ + "### Weakness Words in False Negative" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Si6Ikdt9t3B0", + "outputId": "5a99428f-9573-4b33-92a2-0bb453b46621" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "food 153\n", + "order 128\n", + "get 93\n", + "mcdonald 90\n", + "service 85\n", + "one 78\n", + "time 72\n", + "drive 66\n", + "good 53\n", + "like 51\n", + "thru 51\n", + "people 49\n", + "place 48\n", + "go 48\n", + "customer 41\n", + "got 39\n", + "wait 37\n", + "slow 36\n", + "long 36\n", + "fry 36\n" + ] + } + ], + "source": [ + "# Concatenate all the text data into a single string\n", + "all_text = ' '.join(act_pred_imp_FN['preprocessing_review'].values)\n", + "\n", + "# Tokenize the text into individual words\n", + "tokens = word_tokenize(all_text)\n", + "\n", + "# Count the frequency of each word\n", + "word_freq = FreqDist(tokens)\n", + "\n", + "# Get the top 20 most frequent words\n", + "most_common_words = word_freq.most_common(20)\n", + "\n", + "# Print the top 20 most frequent words\n", + "for word, freq in most_common_words:\n", + " print(word, freq)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tZd-pxjTuaBd" + }, + "source": [ + "Comparing the most frequent words in the false positives and false negatives, we can say:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "68644kCjumQl" + }, + "source": [ + "- Both the false positive and false 
negative lists are topped by frequently appearing words such as `food` and `order`. The model seems to have difficulty deciding which label a review containing `food` or `order` belongs to, because it relies on sentiment-bearing words such as `good` and `bad`, or on words closely related to service.\n", + "\n", + "- Both lists contain the word `good`, which indicates that the model is misled by context. For example, a review saying that the food, service, and ambience are really good belongs to label 2 (`positive`), but the model cannot recognize the same word in a negative context, such as \"The sandwich or the chicken is not `good` and very tough to eat.\"\n", + "\n", + "- These problems are among the factors behind the model's mispredictions. Future work is expected to improve performance by trying models other than the `LSTM` and `GRU` used in this project, and by adding more stopwords to reduce the vocabulary." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cJYzi7zOuSpp" + }, + "source": [ + "# Chapter 9: Saving Model" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "metadata": { + "id": "dqS8zRtCXrXV" + }, + "outputs": [], + "source": [ + "from google.colab import files" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vu0n8BrnWmeU" + }, + "source": [ + "### Freeze Model" + ] + }, + { + "cell_type": "code", + "execution_count": 105, + "metadata": { + "id": "hSGlG2JXWfrT" + }, + "outputs": [], + "source": [ + "model_gru_2.trainable = False" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9oa3Kgv9wNKT" + }, + "source": [ + "### Saving Model" + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "metadata": { + "id": "xL132JZYwO6D" + }, + "outputs": [], + "source": [ + "model_gru_2.save('model')" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "DedFg4AaWrYy" + }, + "source": [ + "### Directory Name of Model Name" + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "metadata": { + "id": "ONXaU2hcwmU9" + }, + "outputs": [], + "source": [ + "model_dir= 'model'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q90RyEVeWy2p" + }, + "source": [ + "### Save Model as TensorFlow Saved Model" + ] + }, + { + "cell_type": "code", + "execution_count": 108, + "metadata": { + "id": "YBS90f5Ywq4A" + }, + "outputs": [], + "source": [ + "model_gru_2.save(model_dir,save_format= 'tf')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4XaSUeg0XFWg" + }, + "source": [ + "### Compress directory model into ZIP file" + ] + }, + { + "cell_type": "code", + "execution_count": 109, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + }, + "id": "eQVLd5Fn3xBV", + "outputId": "d5505204-f496-4bc2-de9a-1e2028f19f02" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'/content/model.zip'" + ] + }, + "execution_count": 109, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import shutil\n", + "shutil.make_archive(model_dir,'zip', model_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Bab 10: Model Inference" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Model Inference in another notebook entitled model_inf_Allen_G7" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Bab 11: Conclusion" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Looking to EDA: \n", + "- The most customer of MC'D givin rating 5 stars which is satisfied with the 
restaurant, whether for the menu served or the service. Unfortunately, the second most common rating is 1 star, which is a very negative response. Overall, we can conclude that McDonald's customers are still satisfied, because the third most common rating is 4 stars, which indicates that positive ratings outweigh negative ones.\n", + "- From the information above, the most common class among customer reviews is label 2 (`Positive Response`); label 0 is `Negative Response` and label 1 is `Neutral Response`.\n", + "\n", + "### Best Model (GRU with Transfer Learning):\n", + "- After training both LSTM and GRU models, the conclusion is that the GRU model performs better than the LSTM model in this project.\n", + "- Adding transfer learning to the baseline GRU model improves its performance.\n", + "- This project chose the GRU model with transfer learning. Although its accuracy decreases, the model becomes a good fit, with 78% accuracy and 77% validation accuracy, which is good enough.\n", + "\n", + "### Confusion Matrix:\n", + "Comparing the two confusion matrices, the GRU model with transfer learning is clearly better than the baseline GRU model. The true positives are the largest count for all 3 labels, and compared with the default model, the improved model successfully increases the true positives for label 1 and reduces the false negatives for label 2 and label 3. 
Accuracy decreases, but the model becomes a good fit because the gap between accuracy and validation accuracy is small: `0.78` versus `0.77`.\n", + "\n", + "### Model Weakness (Comparing frequent words in false positives and false negatives)\n", + "- Both the false positive and false negative lists are topped by frequently appearing words such as `food` and `order`. The model seems to have difficulty deciding which label a review containing `food` or `order` belongs to, because it relies on sentiment-bearing words such as `good` and `bad`, or on words closely related to service.\n", + "\n", + "- Both lists contain the word `good`, which indicates that the model is misled by context. For example, a review saying that the food, service, and ambience are really good belongs to label 2 (`positive`), but the model cannot recognize the same word in a negative context, such as \"The sandwich or the chicken is not `good` and very tough to eat.\"\n", + "\n", + "- These problems are among the factors behind the model's mispredictions. Future work is expected to improve performance by trying models other than the `LSTM` and `GRU` used in this project, and by adding more stopwords to reduce the vocabulary.\n", + "\n", + "### Business Insights\n", + "- Evaluate customer reviews so that the same mistakes do not happen again.\n", + "- Improve the service for dine-in, takeaway, and drive-thru.\n", + "- Offer promos or discounts, which have proven effective in influencing how customers review and rate their experience.\n", + "\n", + "### Model Improvement:\n", + "- Future work is expected to improve performance by trying models other than the `LSTM` and `GRU` used in this project, and by adding more stopwords to reduce the vocabulary.\n", + "- Clean the data more thoroughly to get better results and a smaller set of tokens and vocabulary.\n", + "- Be wise in choosing to 
do lemmatization or stemming by trying both\n" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}