{"cells":[{"cell_type":"markdown","id":"c454c018-02b7-4c3d-a21f-411748963a3f","metadata":{"id":"c454c018-02b7-4c3d-a21f-411748963a3f"},"source":["# Workshop: Sentiment Analysis"]},{"cell_type":"markdown","id":"2eda2e01-dfc4-42a6-9b6a-5cdf39fbce78","metadata":{"id":"2eda2e01-dfc4-42a6-9b6a-5cdf39fbce78"},"source":["
\n","\n","
"]},{"cell_type":"code","source":["ls"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"eBO3zjLr0gY9","executionInfo":{"status":"ok","timestamp":1713839953212,"user_tz":-420,"elapsed":8,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"dc57b433-6f87-4904-99c2-0002d39a7c60"},"id":"eBO3zjLr0gY9","execution_count":1,"outputs":[{"output_type":"stream","name":"stdout","text":["\u001b[0m\u001b[01;34msample_data\u001b[0m/\n"]}]},{"cell_type":"code","source":["from google.colab import drive\n","drive.mount('/content/drive')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"dXQK7eEb0mGr","executionInfo":{"status":"ok","timestamp":1713840022646,"user_tz":-420,"elapsed":25112,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"93400453-f2a5-4d20-f3bc-f7c03c8abf41"},"id":"dXQK7eEb0mGr","execution_count":2,"outputs":[{"output_type":"stream","name":"stdout","text":["Mounted at /content/drive\n"]}]},{"cell_type":"code","source":["cd \"/content/drive/MyDrive/689-WorkShop/Ass13-SemtimentAna\""],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"eB1rpKAu04zB","executionInfo":{"status":"ok","timestamp":1713840067921,"user_tz":-420,"elapsed":329,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"28e5fa28-6dd0-4891-fea8-b363826e9602"},"id":"eB1rpKAu04zB","execution_count":3,"outputs":[{"output_type":"stream","name":"stdout","text":["/content/drive/MyDrive/689-WorkShop/Ass13-SemtimentAna\n"]}]},{"cell_type":"code","source":["ls"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"g8kKS4TJ1CkA","executionInfo":{"status":"ok","timestamp":1713840076790,"user_tz":-420,"elapsed":834,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"cbc24ac9-9c22-4af4-de0e-38d3a0edd796"},"id":"g8kKS4TJ1CkA","execution_count":4,"outputs":[{"output_type":"stream","name":"stdout","text":["imdb_reviews.csv 
WorkshopSentimentsAna-65130700309.ipynb WorkshopSentimentsAna-std.ipynb\n"]}]},{"cell_type":"code","execution_count":5,"id":"7ef9db65-1fda-4fc6-8bb9-bc52bdbb9529","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"7ef9db65-1fda-4fc6-8bb9-bc52bdbb9529","executionInfo":{"status":"ok","timestamp":1713840098623,"user_tz":-420,"elapsed":14252,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"85c975d0-c099-4fc5-e228-b726da3fca93"},"outputs":[{"output_type":"stream","name":"stdout","text":["Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)\n","Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)\n","Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)\n","Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)\n","Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)\n","Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.40.0)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.13.4)\n","Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.20.3)\n","Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.25.2)\n","Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.0)\n","Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.1)\n","Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2023.12.25)\n","Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages 
(from transformers) (2.31.0)\n","Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.19.1)\n","Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.3)\n","Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.2)\n","Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers) (2023.6.0)\n","Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers) (4.11.0)\n","Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.3.2)\n","Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.7)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.7)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.2.2)\n"]}],"source":["!pip install nltk\n","!pip install transformers"]},{"cell_type":"markdown","id":"1a0b8ed9-f240-47b4-aa62-0cf48bdd7868","metadata":{"jp-MarkdownHeadingCollapsed":true,"tags":[],"id":"1a0b8ed9-f240-47b4-aa62-0cf48bdd7868"},"source":["## Rule-Based Approaches\n","\n","- **Lexicon-Based Methods**: Use sentiment lexicons or dictionaries that contain words annotated with their sentiment polarity (positive, negative, neutral).\n","- **Pattern Matching**: Identify sentiment based on predefined patterns or rules in the 
text.\n"]},{"cell_type":"code","execution_count":8,"id":"9f7f14b4-60ba-4a92-a9d0-a124e62fe03b","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"9f7f14b4-60ba-4a92-a9d0-a124e62fe03b","executionInfo":{"status":"ok","timestamp":1713840585023,"user_tz":-420,"elapsed":1966,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"c8ce549c-78f7-47b4-88f2-149744da949d"},"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package stopwords to /root/nltk_data...\n","[nltk_data] Unzipping corpora/stopwords.zip.\n","[nltk_data] Downloading package punkt to /root/nltk_data...\n","[nltk_data] Unzipping tokenizers/punkt.zip.\n"]},{"output_type":"execute_result","data":{"text/plain":["True"]},"metadata":{},"execution_count":8}],"source":["import nltk\n","from nltk.tokenize import word_tokenize\n","from nltk.corpus import stopwords\n","\n","nltk.download('stopwords')\n","nltk.download('punkt')"]},{"cell_type":"code","execution_count":9,"id":"8a25f60f-f202-49cd-b965-e3ebb1676786","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"8a25f60f-f202-49cd-b965-e3ebb1676786","executionInfo":{"status":"ok","timestamp":1713840589093,"user_tz":-420,"elapsed":349,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"fa23eaf9-86ce-41c8-de1d-01946a330f2e"},"outputs":[{"output_type":"stream","name":"stdout","text":["['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 
'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n"]}],"source":["print(stopwords.words('english'))"]},{"cell_type":"code","execution_count":10,"id":"7652d6d2-ba4c-4d02-bfe3-313b6e0f24a5","metadata":{"tags":[],"id":"7652d6d2-ba4c-4d02-bfe3-313b6e0f24a5","executionInfo":{"status":"ok","timestamp":1713841458584,"user_tz":-420,"elapsed":344,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["text = \"I had a good experience with the product. 
Highly recommended!\""]},{"cell_type":"code","execution_count":11,"id":"53fc7d50-59fa-4bec-9ae4-b93a1a3847f1","metadata":{"tags":[],"id":"53fc7d50-59fa-4bec-9ae4-b93a1a3847f1","executionInfo":{"status":"ok","timestamp":1713841468235,"user_tz":-420,"elapsed":318,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["tokens = word_tokenize(text.lower())"]},{"cell_type":"code","execution_count":12,"id":"faac761f-912e-44f7-b7b0-626baaea6a56","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"faac761f-912e-44f7-b7b0-626baaea6a56","executionInfo":{"status":"ok","timestamp":1713841469507,"user_tz":-420,"elapsed":2,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"fe91fd19-88d2-4913-c926-fed1f59089a6"},"outputs":[{"output_type":"stream","name":"stdout","text":["['i', 'had', 'a', 'good', 'experience', 'with', 'the', 'product', '.', 'highly', 'recommended', '!']\n"]}],"source":["print(tokens)"]},{"cell_type":"code","execution_count":13,"id":"9f6543a2-76f4-4993-b535-f90e50bada72","metadata":{"tags":[],"id":"9f6543a2-76f4-4993-b535-f90e50bada72","executionInfo":{"status":"ok","timestamp":1713841471657,"user_tz":-420,"elapsed":1,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["stop_words = set(stopwords.words('english'))"]},{"cell_type":"code","execution_count":14,"id":"4d7f529d-f006-48db-a092-2262f17cb3cd","metadata":{"tags":[],"id":"4d7f529d-f006-48db-a092-2262f17cb3cd","executionInfo":{"status":"ok","timestamp":1713841473288,"user_tz":-420,"elapsed":1,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["tokens = [word for word in tokens if word.isalnum() and word not in stop_words] #alnum = 
alphanumeric"]},{"cell_type":"code","execution_count":15,"id":"4acfb41c-615d-4e8b-92dc-3f73a4188402","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"4acfb41c-615d-4e8b-92dc-3f73a4188402","executionInfo":{"status":"ok","timestamp":1713841476285,"user_tz":-420,"elapsed":352,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"1ae566fc-8dab-41b8-9c8b-db17ec037666"},"outputs":[{"output_type":"stream","name":"stdout","text":["['good', 'experience', 'product', 'highly', 'recommended']\n"]}],"source":["print(tokens)"]},{"cell_type":"code","execution_count":null,"id":"c3cfd1cc-3f30-43de-a469-dec0b3816313","metadata":{"id":"c3cfd1cc-3f30-43de-a469-dec0b3816313"},"outputs":[],"source":[]},{"cell_type":"code","execution_count":16,"id":"aed2ad01-27e5-45e3-a55c-63084966a482","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"aed2ad01-27e5-45e3-a55c-63084966a482","executionInfo":{"status":"ok","timestamp":1713841613192,"user_tz":-420,"elapsed":313,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"1156b9ea-186c-4282-8eb7-b7e6ce0328a3"},"outputs":[{"output_type":"stream","name":"stdout","text":["Sentiment: Positive\n"]}],"source":["# Sample positive and negative words\n","positive_words = set(['good', 'awesome', 'excellent', 'happy', 'positive'])\n","negative_words = set(['bad', 'terrible', 'poor', 'unhappy', 'negative'])\n","\n","def rule_based_sentiment_analysis(text):\n","    # Tokenize the text\n","    tokens = word_tokenize(text.lower())\n","\n","    # Remove stopwords\n","    stop_words = set(stopwords.words('english'))\n","    tokens = [word for word in tokens if word.isalnum() and word not in stop_words] #alnum = alphanumeric\n","\n","    # Calculate sentiment score\n","    sentiment_score = sum(1 for word in tokens if word in positive_words) - sum(1 for word in tokens if word in negative_words)\n","\n","    # Classify sentiment\n","    if sentiment_score > 0:\n","        return 'Positive'\n","    elif sentiment_score < 0:\n","        return 'Negative'\n","    else:\n","        return 'Neutral'\n","\n","# Example usage\n","text_to_analyze = \"I had a good experience with the product. Highly recommended!\"\n","sentiment_result = rule_based_sentiment_analysis(text_to_analyze)\n","print(f\"Sentiment: {sentiment_result}\")"]},{"cell_type":"markdown","id":"21764069-0b07-4b3e-8103-b2ab464a9182","metadata":{"tags":[],"id":"21764069-0b07-4b3e-8103-b2ab464a9182"},"source":["## Machine Learning Approaches"]},{"cell_type":"markdown","id":"dc739c8a-a453-43d1-bdc5-ad10d823d748","metadata":{"jp-MarkdownHeadingCollapsed":true,"tags":[],"id":"dc739c8a-a453-43d1-bdc5-ad10d823d748"},"source":["### Import packages"]},{"cell_type":"code","execution_count":17,"id":"7e030b97-e111-45ea-b00f-09a360f3400e","metadata":{"tags":[],"id":"7e030b97-e111-45ea-b00f-09a360f3400e","executionInfo":{"status":"ok","timestamp":1713841657541,"user_tz":-420,"elapsed":720,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["import pandas as pd\n","from sklearn.pipeline import Pipeline\n","from sklearn.utils import shuffle\n","from sklearn.model_selection import train_test_split\n","from sklearn.feature_extraction.text import TfidfVectorizer\n","# from sklearn.svm import SVC\n","from sklearn.naive_bayes import MultinomialNB\n","from sklearn.metrics import classification_report, confusion_matrix\n","\n"]},{"cell_type":"markdown","id":"54c4fe66-f52f-487f-bfd5-0ea6e05206ce","metadata":{"tags":[],"id":"54c4fe66-f52f-487f-bfd5-0ea6e05206ce"},"source":["### TF-IDF vectorizer"]},{"cell_type":"markdown","id":"3f5b7e92-5de4-4894-b2be-47dac1cf2482","metadata":{"id":"3f5b7e92-5de4-4894-b2be-47dac1cf2482"},"source":["\n","
\n","\n","
\n","\n","\n","Image sources: https://www.kdnuggets.com/2022/09/convert-text-documents-tfidf-matrix-tfidfvectorizer.html\n","\n","\n","\n","\n"]},{"cell_type":"markdown","id":"9bd125fc-11fd-414a-b8f0-ff7ef628fb94","metadata":{"jp-MarkdownHeadingCollapsed":true,"tags":[],"id":"9bd125fc-11fd-414a-b8f0-ff7ef628fb94"},"source":["##### Example on Small data"]},{"cell_type":"code","execution_count":18,"id":"8a61fdce-6544-4774-bc29-265bf4afaa90","metadata":{"tags":[],"id":"8a61fdce-6544-4774-bc29-265bf4afaa90","executionInfo":{"status":"ok","timestamp":1713841845540,"user_tz":-420,"elapsed":360,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["\n","\n","# Sample data\n","documents = [\n"," \"This is the first document.\",\n"," \"This document is the second document.\",\n"," \"And this is the third one.\",\n"," \"Is this the first document?\"\n","]"]},{"cell_type":"code","execution_count":19,"id":"5794027b-2bee-46d9-9b4d-9cbaa7c4120f","metadata":{"tags":[],"id":"5794027b-2bee-46d9-9b4d-9cbaa7c4120f","executionInfo":{"status":"ok","timestamp":1713841849693,"user_tz":-420,"elapsed":471,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["# Create a DataFrame for better visualization\n","df = pd.DataFrame({'Text': documents})"]},{"cell_type":"code","source":["df"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":175},"id":"Wje-T5kT712V","executionInfo":{"status":"ok","timestamp":1713841863619,"user_tz":-420,"elapsed":422,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"96738978-cc1b-4565-f336-173f2a348453"},"id":"Wje-T5kT712V","execution_count":20,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Text\n","0 This is the first document.\n","1 This document is the second document.\n","2 And this is the third one.\n","3 Is this the first document?"],"text/html":["\n","
<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr><th></th><th>Text</th></tr>\n","  </thead>\n","  <tbody>\n","    <tr><th>0</th><td>This is the first document.</td></tr>\n","    <tr><th>1</th><td>This document is the second document.</td></tr>\n","    <tr><th>2</th><td>And this is the third one.</td></tr>\n","    <tr><th>3</th><td>Is this the first document?</td></tr>\n","  </tbody>\n","</table>
\n"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"dataframe","variable_name":"df","summary":"{\n \"name\": \"df\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"This document is the second document.\",\n \"Is this the first document?\",\n \"This is the first document.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"}},"metadata":{},"execution_count":20}]},{"cell_type":"code","execution_count":21,"id":"b49d5272-0383-4e39-910b-87276c4ffca2","metadata":{"tags":[],"id":"b49d5272-0383-4e39-910b-87276c4ffca2","executionInfo":{"status":"ok","timestamp":1713841867905,"user_tz":-420,"elapsed":2,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["# TF-IDF vectorization\n","vectorizer = TfidfVectorizer()\n","tfidf_matrix = vectorizer.fit_transform(df['Text'].tolist())"]},{"cell_type":"code","execution_count":22,"id":"46c0b47d-80ab-498b-91a2-7202f1c429fd","metadata":{"tags":[],"id":"46c0b47d-80ab-498b-91a2-7202f1c429fd","executionInfo":{"status":"ok","timestamp":1713841872560,"user_tz":-420,"elapsed":320,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["# Convert the TF-IDF matrix to a DataFrame\n","tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())"]},{"cell_type":"code","execution_count":23,"id":"91c2bee0-5bb6-44b9-a609-1f3d0e891ad4","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"91c2bee0-5bb6-44b9-a609-1f3d0e891ad4","executionInfo":{"status":"ok","timestamp":1713841877895,"user_tz":-420,"elapsed":339,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"cdd8422d-bf3a-42c1-a476-2b15d7f37157"},"outputs":[{"output_type":"stream","name":"stdout","text":["Original Data:\n"," Text\n","0 This is the first document.\n","1 This document is 
the second document.\n","2 And this is the third one.\n","3 Is this the first document?\n"]}],"source":["# Print the original data\n","print(\"Original Data:\")\n","print(df)"]},{"cell_type":"code","execution_count":24,"id":"24c4a522-8ef4-4001-ada6-031a043b9a54","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"24c4a522-8ef4-4001-ada6-031a043b9a54","executionInfo":{"status":"ok","timestamp":1713841882847,"user_tz":-420,"elapsed":344,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"2606fbf1-fe41-4e03-899b-5d3f2776685f"},"outputs":[{"output_type":"stream","name":"stdout","text":[" (0, 1)\t0.46979138557992045\n"," (0, 2)\t0.5802858236844359\n"," (0, 6)\t0.38408524091481483\n"," (0, 3)\t0.38408524091481483\n"," (0, 8)\t0.38408524091481483\n"," (1, 5)\t0.5386476208856763\n"," (1, 1)\t0.6876235979836938\n"," (1, 6)\t0.281088674033753\n"," (1, 3)\t0.281088674033753\n"," (1, 8)\t0.281088674033753\n"," (2, 4)\t0.511848512707169\n"," (2, 7)\t0.511848512707169\n"," (2, 0)\t0.511848512707169\n"," (2, 6)\t0.267103787642168\n"," (2, 3)\t0.267103787642168\n"," (2, 8)\t0.267103787642168\n"," (3, 1)\t0.46979138557992045\n"," (3, 2)\t0.5802858236844359\n"," (3, 6)\t0.38408524091481483\n"," (3, 3)\t0.38408524091481483\n"," (3, 8)\t0.38408524091481483\n"]}],"source":["print(tfidf_matrix)"]},{"cell_type":"code","execution_count":25,"id":"6feb5892-284f-43d1-ab7b-5b13dbfadd0b","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"6feb5892-284f-43d1-ab7b-5b13dbfadd0b","executionInfo":{"status":"ok","timestamp":1713841924141,"user_tz":-420,"elapsed":341,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"0df619c6-eb05-43a0-da9d-7c0d3c0326b7"},"outputs":[{"output_type":"stream","name":"stdout","text":["\n","TF-IDF Matrix:\n"," and document first is one second the \\\n","0 0.000000 0.469791 0.580286 0.384085 0.000000 0.000000 0.384085 \n","1 0.000000 0.687624 0.000000 
0.281089 0.000000 0.538648 0.281089 \n","2 0.511849 0.000000 0.000000 0.267104 0.511849 0.000000 0.267104 \n","3 0.000000 0.469791 0.580286 0.384085 0.000000 0.000000 0.384085 \n","\n"," third this \n","0 0.000000 0.384085 \n","1 0.000000 0.281089 \n","2 0.511849 0.267104 \n","3 0.000000 0.384085 \n"]}],"source":["# Print the TF-IDF matrix\n","print(\"\\nTF-IDF Matrix:\")\n","print(tfidf_df)"]},{"cell_type":"markdown","id":"6802c239-edfa-462e-99ea-31386fd7aed4","metadata":{"tags":[],"id":"6802c239-edfa-462e-99ea-31386fd7aed4"},"source":["## Naive Bayes classifier trained on the TF-IDF features."]},{"cell_type":"markdown","id":"3accf6f8-6cae-4265-8d5d-fb5d40a07a2d","metadata":{"id":"3accf6f8-6cae-4265-8d5d-fb5d40a07a2d"},"source":["
\n","\n","
\n"]},{"cell_type":"markdown","id":"9062063a-557b-4971-ad84-e3601b1a520e","metadata":{"jp-MarkdownHeadingCollapsed":true,"tags":[],"id":"9062063a-557b-4971-ad84-e3601b1a520e"},"source":["### Read data/Preparation"]},{"cell_type":"code","execution_count":26,"id":"8d2eab09-03c7-441e-9c78-0c2e069f4d25","metadata":{"tags":[],"id":"8d2eab09-03c7-441e-9c78-0c2e069f4d25","executionInfo":{"status":"ok","timestamp":1713843412522,"user_tz":-420,"elapsed":3570,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["# df = pd.read_csv(\"Womens_Clothing_E_Commerce_Reviews.csv\")\n","df = pd.read_csv(\"imdb_reviews.csv\")"]},{"cell_type":"code","execution_count":27,"id":"aca597f3-c8da-4314-990e-253d5ed719da","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"aca597f3-c8da-4314-990e-253d5ed719da","executionInfo":{"status":"ok","timestamp":1713843417599,"user_tz":-420,"elapsed":381,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"1bc779d8-5cb6-4263-cfde-29f2a9ce6e82"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["(50000, 2)"]},"metadata":{},"execution_count":27}],"source":["df.shape"]},{"cell_type":"code","execution_count":28,"id":"7d8131e4-4a69-45af-aa12-335c926e308f","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/","height":143},"id":"7d8131e4-4a69-45af-aa12-335c926e308f","executionInfo":{"status":"ok","timestamp":1713843430104,"user_tz":-420,"elapsed":994,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"03ebe55c-6c56-4084-cf22-0d4c48626788"},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" text label\n","0 One of the other reviewers has mentioned that ... positive\n","1 A wonderful little production.

The... positive\n","2 I thought this was a wonderful way to spend ti... positive"],"text/html":["\n","
<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr><th></th><th>text</th><th>label</th></tr>\n","  </thead>\n","  <tbody>\n","    <tr><th>0</th><td>One of the other reviewers has mentioned that ...</td><td>positive</td></tr>\n","    <tr><th>1</th><td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td><td>positive</td></tr>\n","    <tr><th>2</th><td>I thought this was a wonderful way to spend ti...</td><td>positive</td></tr>\n","  </tbody>\n","</table>
\n"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"dataframe","variable_name":"df","summary":"{\n \"name\": \"df\",\n \"rows\": 50000,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 49582,\n \"samples\": [\n \"\\\"Soul Plane\\\" is a horrible attempt at comedy that only should appeal people with thick skulls, bloodshot eyes and furry pawns.

The plot is not only incoherent but also non-existent, acting is mostly sub sub-par with a gang of highly moronic and dreadful characters thrown in for bad measure, jokes are often spotted miles ahead and almost never even a bit amusing. This movie lacks any structure and is full of racial stereotypes that must have seemed old even in the fifties, the only thing it really has going for it is some pretty ladies, but really, if you want that you can rent something from the \\\"Adult\\\" section. OK?

I can hardly see anything here to recommend since you'll probably have a lot a better and productive time chasing rats with a sledgehammer or inventing waterproof teabags or whatever.

2/10\",\n \"Guest from the Future tells a fascinating story of time travel, friendship, battle of good and evil -- all with a small budget, child actors, and few special effects. Something for Spielberg and Lucas to learn from. ;) A sixth-grader Kolya \\\"Nick\\\" Gerasimov finds a time machine in the basement of a decrepit building and travels 100 years into the future. He discovers a near-perfect, utopian society where robots play guitars and write poetry, everyone is kind to each other and people enjoy everything technology has to offer. Alice is the daughter of a prominent scientist who invented a device called Mielophone that allows to read minds of humans and animals. The device can be put to both good and bad use, depending on whose hands it falls into. When two evil space pirates from Saturn who want to rule the universe attempt to steal Mielophone, it falls into the hands of 20th century school boy Nick. With the pirates hot on his tracks, he travels back to his time, followed by the pirates, and Alice. Chaos, confusion and funny situations follow as the luckless pirates try to blend in with the earthlings. Alice enrolls in the same school Nick goes to and demonstrates superhuman abilities in PE class. The catch is, Alice doesn't know what Nick looks like, while the pirates do. Also, the pirates are able to change their appearance and turn literally into anyone. (Hmm, I wonder if this is where James Cameron got the idea for Terminator...) Who gets to Nick -- and Mielophone -- first? Excellent plot, non-stop adventures, and great soundtrack. I wish Hollywood made kid movies like this one...\",\n \"\\\"National Treasure\\\" (2004) is a thoroughly misguided hodge-podge of plot entanglements that borrow from nearly every cloak and dagger government conspiracy clich\\u00e9 that has ever been written. 
The film stars Nicholas Cage as Benjamin Franklin Gates (how precious is that, I ask you?); a seemingly normal fellow who, for no other reason than being of a lineage of like-minded misguided fortune hunters, decides to steal a 'national treasure' that has been hidden by the United States founding fathers. After a bit of subtext and background that plays laughably (unintentionally) like Indiana Jones meets The Patriot, the film degenerates into one misguided whimsy after another \\u0096 attempting to create a 'Stanley Goodspeed' regurgitation of Nicholas Cage and launch the whole convoluted mess forward with a series of high octane, but disconnected misadventures.

The relevancy and logic to having George Washington and his motley crew of patriots burying a king's ransom someplace on native soil, and then, going through the meticulous plan of leaving clues scattered throughout U.S. currency art work, is something that director Jon Turteltaub never quite gets around to explaining. Couldn't Washington found better usage for such wealth during the start up of the country? Hence, we are left with a mystery built on top of an enigma that is already on shaky ground by the time Ben appoints himself the new custodian of this untold wealth. Ben's intentions are noble \\u0096 if confusing. He's set on protecting the treasure. For who and when?\\u0085your guess is as good as mine.

But there are a few problems with Ben's crusade. First up, his friend, Ian Holmes (Sean Bean) decides that he can't wait for Ben to make up his mind about stealing the Declaration of Independence from the National Archives (oh, yeah \\u0096 brilliant idea!). Presumably, the back of that famous document holds the secret answer to the ultimate fortune. So Ian tries to kill Ben. The assassination attempt is, of course, unsuccessful, if overly melodramatic. It also affords Ben the opportunity to pick up, and pick on, the very sultry curator of the archives, Abigail Chase (Diane Kruger). She thinks Ben is clearly a nut \\u0096 at least at the beginning. But true to action/romance form, Abby's resolve melts quicker than you can say, \\\"is that the Hope Diamond?\\\" The film moves into full X-File-ish mode, as the FBI, mistakenly believing that Ben is behind the theft, retaliate in various benign ways that lead to a multi-layering of action sequences reminiscent of Mission Impossible meets The Fugitive. Honestly, don't those guys ever get 'intelligence' information that is correct? In the final analysis, \\\"National Treasure\\\" isn't great film making, so much as it's a patchwork rehash of tired old bits from other movies, woven together from scraps, the likes of which would make IL' Betsy Ross blush.

The Buena Vista DVD delivers a far more generous treatment than this film is deserving of. The anamorphic widescreen picture exhibits a very smooth and finely detailed image with very rich colors, natural flesh tones, solid blacks and clean whites. The stylized image is also free of blemishes and digital enhancements. The audio is 5.1 and delivers a nice sonic boom to your side and rear speakers with intensity and realism. Extras include a host of promotional junket material that is rather deep and over the top in its explanation of how and why this film was made. If only, as an audience, we had had more clarification as to why Ben and co. were chasing after an illusive treasure, this might have been one good flick. Extras conclude with the theatrical trailer, audio commentary and deleted scenes. Not for the faint-hearted \\u0096 just the thick-headed.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"negative\",\n \"positive\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"}},"metadata":{},"execution_count":28}],"source":["df.head(3)"]},{"cell_type":"code","execution_count":29,"id":"43a27caf-779b-4bd1-a3cf-fa641021172e","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"43a27caf-779b-4bd1-a3cf-fa641021172e","executionInfo":{"status":"ok","timestamp":1713843524562,"user_tz":-420,"elapsed":334,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"fd1bd6c8-a340-49a2-deb8-e3a00e217a34"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["array(['positive', 'negative'], 
dtype=object)"]},"metadata":{},"execution_count":29}],"source":["df['label'].unique()"]},{"cell_type":"code","execution_count":30,"id":"ba556f9b-da1c-4d13-8d70-563e0bd528a1","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"ba556f9b-da1c-4d13-8d70-563e0bd528a1","executionInfo":{"status":"ok","timestamp":1713843636003,"user_tz":-420,"elapsed":322,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"85b8ee33-108a-44b9-d395-e372c1b99b79"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["text 0\n","label 0\n","dtype: int64"]},"metadata":{},"execution_count":30}],"source":["df.isna().sum()"]},{"cell_type":"markdown","id":"819c31c3-873d-4d31-a21a-759059bd4c6d","metadata":{"jp-MarkdownHeadingCollapsed":true,"tags":[],"id":"819c31c3-873d-4d31-a21a-759059bd4c6d"},"source":["### Split the dataset into training and testing sets"]},{"cell_type":"code","execution_count":31,"id":"6ca318a2-26d7-446e-8324-6660171f239d","metadata":{"tags":[],"id":"6ca318a2-26d7-446e-8324-6660171f239d","executionInfo":{"status":"ok","timestamp":1713843687000,"user_tz":-420,"elapsed":1205,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["train_data, test_data, train_labels, test_labels = train_test_split(df['text'], df['label'], test_size=0.3, random_state=42)"]},{"cell_type":"code","execution_count":32,"id":"f0cfc8fc-49e5-4c88-bb33-8084dcf00100","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"f0cfc8fc-49e5-4c88-bb33-8084dcf00100","executionInfo":{"status":"ok","timestamp":1713843694590,"user_tz":-420,"elapsed":329,"user":{"displayName":"wannisa 
paethong","userId":"05174644342145313126"}},"outputId":"767e8a04-ed3d-466f-a935-15eab7cdb83f"},"outputs":[{"output_type":"stream","name":"stdout","text":["38094 As much as I love trains, I couldn't stomach t...\n","40624 This was a very good PPV, but like Wrestlemani...\n","49425 Not finding the right words is everybody's pro...\n","35734 I'm really suprised this movie didn't get a hi...\n","41708 I'll start by confessing that I tend to really...\n"," ... \n","11284 `Shadow Magic' recaptures the joy and amazemen...\n","44732 I found this movie to be quite enjoyable and f...\n","38158 Avoid this one! It is a terrible movie. So wha...\n","860 This production was quite a surprise for me. I...\n","15795 This is a decent movie. Although little bit sh...\n","Name: text, Length: 35000, dtype: object\n"]}],"source":["print(train_data)"]},{"cell_type":"code","execution_count":33,"id":"51d0a415-4982-43dd-8864-c189ba6826f4","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"51d0a415-4982-43dd-8864-c189ba6826f4","executionInfo":{"status":"ok","timestamp":1713843697940,"user_tz":-420,"elapsed":311,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"8167d4df-179d-4cf9-bf14-9112184e98be"},"outputs":[{"output_type":"stream","name":"stdout","text":["38094 negative\n","40624 positive\n","49425 negative\n","35734 positive\n","41708 negative\n"," ... 
\n","11284 positive\n","44732 positive\n","38158 negative\n","860 positive\n","15795 positive\n","Name: label, Length: 35000, dtype: object\n"]}],"source":["print(train_labels)"]},{"cell_type":"markdown","id":"42987cdb-4cdf-46df-95d8-7c2b2824c1ee","metadata":{"jp-MarkdownHeadingCollapsed":true,"tags":[],"id":"42987cdb-4cdf-46df-95d8-7c2b2824c1ee"},"source":["### Create a pipeline"]},{"cell_type":"code","execution_count":34,"id":"06ffd548-c333-4c1a-87ce-9699ddd116ee","metadata":{"tags":[],"id":"06ffd548-c333-4c1a-87ce-9699ddd116ee","executionInfo":{"status":"ok","timestamp":1713843715911,"user_tz":-420,"elapsed":333,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["sentiment_pipeline = Pipeline([\n"," ('tfidf', TfidfVectorizer()),\n"," ('nb', MultinomialNB())\n","])"]},{"cell_type":"markdown","id":"6bafa7cd-8d0b-4725-bd40-4a3b04634fab","metadata":{"jp-MarkdownHeadingCollapsed":true,"tags":[],"id":"6bafa7cd-8d0b-4725-bd40-4a3b04634fab"},"source":["### Train the model using the pipeline"]},{"cell_type":"code","execution_count":35,"id":"712dea09-52c2-4a9f-8bf9-3cbb273fe4b5","metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/","height":126},"id":"712dea09-52c2-4a9f-8bf9-3cbb273fe4b5","executionInfo":{"status":"ok","timestamp":1713844075138,"user_tz":-420,"elapsed":9437,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"fda94af1-772d-4175-ee65-ab856819dea7"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["Pipeline(steps=[('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])"],"text/html":["
Pipeline(steps=[('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])
"]},"metadata":{},"execution_count":35}],"source":["sentiment_pipeline.fit(train_data, train_labels)\n"]},{"cell_type":"markdown","id":"4c95c599-ae0d-433f-9ed5-856fd9fa35e0","metadata":{"jp-MarkdownHeadingCollapsed":true,"tags":[],"id":"4c95c599-ae0d-433f-9ed5-856fd9fa35e0"},"source":["### Make predictions on the test set"]},{"cell_type":"code","execution_count":36,"id":"37ae9eda-4a02-4f40-bdeb-ecb8ea67f9d3","metadata":{"tags":[],"id":"37ae9eda-4a02-4f40-bdeb-ecb8ea67f9d3","executionInfo":{"status":"ok","timestamp":1713844081489,"user_tz":-420,"elapsed":3301,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}}},"outputs":[],"source":["predictions = sentiment_pipeline.predict(test_data)"]},{"cell_type":"code","source":["test_data[1]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":109},"id":"a620DunmGUpx","executionInfo":{"status":"ok","timestamp":1713844697926,"user_tz":-420,"elapsed":4,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"6109017f-a264-46b9-cd2a-db7d91158ba0"},"id":"a620DunmGUpx","execution_count":43,"outputs":[{"output_type":"execute_result","data":{"text/plain":["'A wonderful little production.

The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece.

The actors are extremely well chosen- Michael Sheen not only \"has got all the polari\" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\\'s of comedy and his life.

The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \\'dream\\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\\'s murals decorating every surface) are terribly well done.'"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":43}]},{"cell_type":"code","source":["test_labels[1]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":36},"id":"qmW5xhMgGae6","executionInfo":{"status":"ok","timestamp":1713844689037,"user_tz":-420,"elapsed":4,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"2f9176d6-62cc-454b-cf06-292c2682f59f"},"id":"qmW5xhMgGae6","execution_count":41,"outputs":[{"output_type":"execute_result","data":{"text/plain":["'positive'"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":41}]},{"cell_type":"code","source":["predictions"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"9p7uhKVvEXQn","executionInfo":{"status":"ok","timestamp":1713844095757,"user_tz":-420,"elapsed":312,"user":{"displayName":"wannisa paethong","userId":"05174644342145313126"}},"outputId":"310851b4-6a8e-4308-f477-a932a640c26c"},"id":"9p7uhKVvEXQn","execution_count":37,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array(['negative', 'positive', 'negative', ..., 'negative', 'positive',\n"," 'positive'], dtype='