{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e59ad5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import random\n",
    "\n",
    "# Path to the full Yelp review dataset (one JSON object per line)\n",
    "full_data_path = \"yelp_academic_dataset_review.json\"\n",
    "\n",
    "# Path where the sampled subset is written\n",
    "sampled_data_path = \"yelp_academic_dataset_review_sampled.json\"\n",
    "\n",
    "# Number of reviews to sample (adjust as needed)\n",
    "num_reviews_to_sample = 10000\n",
    "\n",
    "# Load all reviews from the full dataset into memory\n",
    "all_reviews = []\n",
    "with open(full_data_path, \"r\", encoding=\"utf-8\") as f:\n",
    "    for line in f:\n",
    "        review = json.loads(line)\n",
    "        all_reviews.append(review)\n",
    "\n",
    "# Randomly sample a subset; cap the size so random.sample cannot raise\n",
    "# ValueError when the file holds fewer reviews than requested\n",
    "sample_size = min(num_reviews_to_sample, len(all_reviews))\n",
    "sampled_reviews = random.sample(all_reviews, sample_size)\n",
    "\n",
    "# Save the sampled reviews as JSON Lines (one review per line)\n",
    "with open(sampled_data_path, \"w\", encoding=\"utf-8\") as f:\n",
    "    for review in sampled_reviews:\n",
    "        json.dump(review, f)\n",
    "        f.write(\"\\n\")\n",
    "\n",
    "print(f\"Sampled {sample_size} reviews and saved to {sampled_data_path}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f562ff04",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gzip\n",
    "\n",
    "# Path for the gzip-compressed copy of the sampled dataset\n",
    "compressed_data_path = \"yelp_academic_dataset_review_sampled.json.gz\"\n",
    "\n",
    "# Compress the sampled dataset file using gzip\n",
    "with open(sampled_data_path, \"rb\") as f_in:\n",
    "    with gzip.open(compressed_data_path, \"wb\") as f_out:\n",
    "        f_out.writelines(f_in)\n",
    "\n",
    "print(f\"Compressed file saved to {compressed_data_path}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "337f6649",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import classification_report, accuracy_score\n",
    "\n",
    "# Load the sampled Yelp reviews; pandas infers gzip compression from the .gz extension\n",
    "data_path = \"yelp_academic_dataset_review_sampled.json.gz\"  # Adjust the path\n",
    "data = pd.read_json(data_path, lines=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "e0936968",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n",
       "|  | review_id | user_id | business_id | stars | useful | funny | cool | text | date |\n",
       "|---|---|---|---|---|---|---|---|---|---|\n",
       "| 0 | f9khuhJxadQhg6CaI1cRdA | 4Qijwb2RDiUGc4SBjA2lJg | nTBStZYJfHGdSZJbpaBiPA | 4 | 1 | 0 | 1 | I had read about this place adding a second lo... | 2011-02-08 17:48:40 |\n",
       "| 1 | WH0c1wEMu4XRTIysI7uMig | 7JeW4Mlvqdp7R-FAUBB_vA | H3Tmgv94pbGvBIKZ4Rs9Cw | 5 | 1 | 0 | 1 | I had dinner at Tin Angel on Saturday and was ... | 2012-04-16 13:30:02 |\n",
       "| 2 | S1Lg07IGrupUDk7Uu9rnQQ | umUy5DTpVrvQDXLR4gywHA | H7BikysfQbS9bMULQsCU_Q | 2 | 4 | 1 | 0 | I was really excited to visit the store, havin... | 2019-10-05 00:17:15 |\n",
       "| 3 | AH4_Pua0yzK4oU9FoU8hXQ | uwYw0KKj16lC_nq_HsQGVQ | Xb6QfBbleg2aJT2cG807jQ | 1 | 1 | 0 | 0 | I hired Two Men and a Truck for my recent move... | 2016-06-02 13:27:24 |\n",
       "| 4 | 9_CIDS98p6ZsTRiCvmuIKA | l9bVKgzvjjcU8Iang3Tvtg | lqSJkyNSE1yPeux4PoR-pg | 1 | 0 | 0 | 0 | i was very disappointed to this company. They ... | 2020-06-05 22:28:47 |\n",
       "| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
       "| 9995 | 5MknizHCBH3jpj5DJd-6Uw | d2VrfngFJ1f1nvNAsojJzw | hy-E7DdXbdgTbwphKUYW1w | 1 | 1 | 0 | 0 | This was such a trash experience. We signed up... | 2021-07-29 16:10:10 |\n",
       "| 9996 | mXFlaWuiCnyCkZ_SIAGqew | cHWDGVf4LofBk9wZ2mnXQQ | AYWSFv6QxF5IjQSxITMUug | 5 | 0 | 0 | 0 | I have been going to Goshen Nail Salon for the... | 2018-03-16 00:30:50 |\n",
       "| 9997 | W1Ij-zC3ufRU5MTEgHLjmg | aN9nWudz5rfar7rHr9lHfA | oyJ3gXNkV0DO0YxcaTgtTg | 5 | 0 | 0 | 0 | Ok. This place surprised me. I always thought ... | 2018-06-01 23:56:44 |\n",
       "| 9998 | HNejB5H9iD1qe3MMKxg6sg | 6JejVLZl5M-IB3UkNTkXtQ | WJLKQTduGumxjlXelqiuKg | 3 | 0 | 0 | 0 | Meets expectations, but quirky.  The trucks re... | 2016-06-29 15:57:34 |\n",
       "| 9999 | LSJGzHJ7whqNn5uPxidMjQ | _Av1LaAAY0Y8YcPp7Ck7fg | M983OPfVRnwvG7zEOzykCA | 5 | 0 | 0 | 0 | Jordan was our waiter. He was very attentive a... | 2017-03-15 23:54:07 |\n",
       "\n",
       "10000 rows × 9 columns\n",
       "