{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Movie recommendation system using collaborative filtering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Project imports" ] }, { "cell_type": "code", "execution_count": 220, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from fuzzywuzzy import process\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "import math" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the dataset\n", "ratings: user ratings for each movie\n", "\n", "movies: information about the movies that were rated" ] }, { "cell_type": "code", "execution_count": 221, "metadata": {}, "outputs": [], "source": [ "ratings = pd.read_csv('../data/reduced/ratings_m10.csv')\n", "movies = pd.read_csv('../data/reduced/movies_m10_rich_pre.csv', index_col='movieId')\n", "movies_title = movies[['title']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging the two datasets\n", "The merge is done on the 'movieId' column, which is present in both." ] }, { "cell_type": "code", "execution_count": 222, "metadata": {}, "outputs": [], "source": [ "ratings_movies = ratings.merge(movies_title, on='movieId')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Splitting the dataset by timestamp\n", "For each user, the earliest 90% of their ratings go to the training set and the remainder to the test set." ] }, { "cell_type": "code", "execution_count": 223, "metadata": {}, "outputs": [], "source": [ "def train_test_column_split(df, group_column, split_column, y_label, train_size):\n", "    df = df.sort_values(by=split_column, ascending=True)\n", "    train = pd.DataFrame(columns=df.columns)\n", "    test = pd.DataFrame(columns=df.columns)\n", "\n", "    for idx in 
df[group_column].unique():\n", "        group = df.loc[df[group_column] == idx]\n", "\n", "        # Ratings up to the train_size quantile of the timestamp go to train;\n", "        # strictly later ones go to test, so no rating lands in both sets.\n", "        q_user = group[group[split_column].le(group[split_column].quantile(train_size))]\n", "        p_user = group[group[split_column].gt(group[split_column].quantile(train_size))]\n", "\n", "        train = pd.concat([train, q_user])\n", "        test = pd.concat([test, p_user])\n", "    train = train.sort_index(ascending=True)\n", "    test = test.sort_index(ascending=True)\n", "\n", "    X_labels = [c for c in df.columns]\n", "\n", "    X_train = train[X_labels]\n", "    X_test = test[X_labels]\n", "\n", "    return (X_train, X_test)" ] }, { "cell_type": "code", "execution_count": 224, "metadata": {}, "outputs": [], "source": [ "X_train, X_test = train_test_column_split(ratings_movies, 'userId', 'timestamp', 'rating', .9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building a pivot matrix\n", "Matrix: {movieId x userId}. Each cell holds a user's rating for a movie; missing ratings are filled with 0 (zero)." 
] }, { "cell_type": "code", "execution_count": 225, "metadata": {}, "outputs": [], "source": [ "#user_movie_mat = ratings_movies.pivot(index='movieId', columns='userId', values='rating').fillna(0)\n", "user_movie_train = X_train.pivot(index='movieId', columns='userId', values='rating').fillna(0)\n", "user_movie_test = X_test.pivot(index='movieId', columns='userId', values='rating').fillna(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building the user similarity matrix from the ratings" ] }, { "cell_type": "code", "execution_count": 226, "metadata": {}, "outputs": [], "source": [ "def find_correlation_between_two_users(ratings_df: pd.DataFrame, user1: str, user2: str):\n", "    \"\"\"Find the cosine similarity between two users based on their rating vectors\"\"\"\n", "    rated_movies_by_both = ratings_df[[user1, user2]].dropna(axis=0).values\n", "    user1_ratings = rated_movies_by_both[:, 0].reshape(1, -1)\n", "    user2_ratings = rated_movies_by_both[:, 1].reshape(1, -1)\n", "    return cosine_similarity(user1_ratings, user2_ratings)" ] }, { "cell_type": "code", "execution_count": 227, "metadata": {}, "outputs": [], "source": [ "users_list = list(user_movie_train.columns)\n", "movies_list = list(user_movie_train.index)\n", "\n", "#users_similarity_mat = np.array([[find_correlation_between_two_users(user_movie_train, user1, user2) for user1 in users_list] for user2 in users_list])\n", "##users_similarity_mat = users_similarity_mat.reshape(608, 608)\n", "#users_similarity_mat = pd.DataFrame(users_similarity_mat, index=users_list, columns=users_list)\n", "users_similarity_mat = pd.read_pickle('../data/preprocessed/users_similarity_mat_cosim.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions for predicting the rating a user would give each movie" ] }, { "cell_type": "code", "execution_count": 228, "metadata": {}, "outputs": [], "source": [ "def get_rated_user_for_a_movie(ratings_df: pd.DataFrame, 
movie: str):\n", "    # When ratings_df is zero-filled (as user_movie_train is), dropna() removes\n", "    # nothing, so every user is returned here.\n", "    return ratings_df.loc[movie, :].dropna().index.values\n", "\n", "\n", "def get_top_neighbors(\n", "    similarity_df: pd.DataFrame, user: str, rated_users: str, n_neighbors: int\n", "):\n", "    return similarity_df[user][rated_users].nlargest(n_neighbors).to_dict()\n", "\n", "\n", "def subtract_bias(rating: float, mean_rating: float):\n", "    return rating - mean_rating\n", "\n", "\n", "def get_neighbor_rating_without_bias_per_movie(\n", "    ratings_df: pd.DataFrame, user: str, movie: str\n", "):\n", "    \"\"\"Subtract the user's mean rating from their rating for the movie to remove bias\"\"\"\n", "    mean_rating = ratings_df[user].mean()\n", "    rating = ratings_df.loc[movie, user]\n", "    return subtract_bias(rating, mean_rating)\n", "\n", "\n", "def get_ratings_of_neighbors(ratings_df: pd.DataFrame, neighbors: list, movie: str):\n", "    \"\"\"Get the ratings of all neighbors after adjusting for biases\"\"\"\n", "    return [\n", "        get_neighbor_rating_without_bias_per_movie(ratings_df, neighbor, movie)\n", "        for neighbor in neighbors\n", "    ]\n", "\n", "def get_weighted_average_rating_of_neighbors(ratings: list, neighbor_distance: list):\n", "    weighted_sum = np.array(ratings).dot(np.array(neighbor_distance))\n", "    abs_neighbor_distance = np.abs(neighbor_distance)\n", "    return weighted_sum / np.sum(abs_neighbor_distance)\n", "\n", "\n", "def ger_user_rating(ratings_df: pd.DataFrame, user: str, avg_neighbor_rating: float):\n", "    user_avg_rating = ratings_df[user].mean()\n", "    return round(user_avg_rating + avg_neighbor_rating, 2)\n" ] }, { "cell_type": "code", "execution_count": 229, "metadata": {}, "outputs": [], "source": [ "def predict_rating(\n", "    df: pd.DataFrame,\n", "    similarity_df: pd.DataFrame,\n", "    user: str,\n", "    movie: str,\n", "    n_neighbors: int = 2,\n", "):\n", "    \"\"\"Predict the rating of a user for a movie based on the ratings of neighbors\"\"\"\n", "    ratings_df = df.copy()\n", "\n", "    rated_users = get_rated_user_for_a_movie(ratings_df, 
movie)\n", "\n", "    top_neighbors_distance = get_top_neighbors(\n", "        similarity_df, user, rated_users, n_neighbors\n", "    )\n", "    neighbors, distance = top_neighbors_distance.keys(), top_neighbors_distance.values()\n", "\n", "    #print(f\"Top {n_neighbors} neighbors of user {user}, {movie}: {list(neighbors)}, distance: {list(distance)}\")\n", "\n", "    ratings = get_ratings_of_neighbors(ratings_df, neighbors, movie)\n", "    avg_neighbor_rating = get_weighted_average_rating_of_neighbors(\n", "        ratings, list(distance)\n", "    )\n", "\n", "    return ger_user_rating(ratings_df, user, avg_neighbor_rating)" ] }, { "cell_type": "code", "execution_count": 230, "metadata": {}, "outputs": [], "source": [ "def adjust_rating(nota):\n", "    # Clamp the rating to the [0, 5] range\n", "    if nota < 0:\n", "        return 0\n", "    elif nota > 5:\n", "        return 5\n", "    else:\n", "        # Round to the nearest 0.5 increment\n", "        return round(nota * 2) / 2\n" ] }, { "cell_type": "code", "execution_count": 231, "metadata": {}, "outputs": [], "source": [ "def get_n_recommendations(user: int, n: int, user_movie_mat: pd.DataFrame, movies: pd.DataFrame, n_neighbors: int):\n", "    df = user_movie_mat.copy()\n", "    recommendations = pd.DataFrame(columns=['movieId', 'title', 'pred_rating'])\n", "\n", "    for movie, _ in df[user].items():\n", "        # Only predict for movies the user has not rated (0 marks a missing rating)\n", "        if df.loc[movie, user] == 0:\n", "            df.loc[movie, user] = predict_rating(user_movie_mat, users_similarity_mat, user, movie, n_neighbors)\n", "            new_row = {'movieId': movie, 'title': movies.loc[movie]['title'], 'pred_rating': adjust_rating(df.loc[movie, user])}\n", "            recommendations.loc[len(recommendations)] = new_row\n", "\n", "    recommendations = recommendations.sort_values(by='pred_rating', ascending=False)\n", "    return recommendations.head(n) if n > 0 else recommendations" ] }, { "cell_type": "code", "execution_count": 232, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.03" ] }, "execution_count": 232, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_name = 'White 
Squall'\n", "user1 = 1\n", "movie = process.extractOne(movie_name, movies['title'])[2]\n", "rating = predict_rating(user_movie_train, users_similarity_mat, user1, movie, 30)\n", "rating" ] }, { "cell_type": "code", "execution_count": 236, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>movieId</th><th>title</th><th>pred_rating</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>248</th><td>589</td><td>Fallen</td><td>3.5</td></tr>\n",
"    <tr><th>31</th><td>47</td><td>White Squall</td><td>3.0</td></tr>\n",
"    <tr><th>542</th><td>1527</td><td>Closer</td><td>3.0</td></tr>\n",
"    <tr><th>310</th><td>858</td><td>South Park: Bigger, Longer and Uncut</td><td>3.0</td></tr>\n",
"    <tr><th>364</th><td>1036</td><td>Great Muppet Caper, The</td><td>3.0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>385</th><td>1100</td><td>Starman</td><td>0.0</td></tr>\n",
"    <tr><th>384</th><td>1096</td><td>Flatliners</td><td>0.0</td></tr>\n",
"    <tr><th>383</th><td>1095</td><td>Blood Simple</td><td>0.0</td></tr>\n",
"    <tr><th>185</th><td>419</td><td>Henry V</td><td>0.0</td></tr>\n",
"    <tr><th>62</th><td>147</td><td>Shallow Grave</td><td>0.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>666 rows × 3 columns</p>
\n", "
" ], "text/plain": [ " movieId title pred_rating\n", "248 589 Fallen 3.5\n", "31 47 White Squall 3.0\n", "542 1527 Closer 3.0\n", "310 858 South Park: Bigger, Longer and Uncut 3.0\n", "364 1036 Great Muppet Caper, The 3.0\n", ".. ... ... ...\n", "385 1100 Starman 0.0\n", "384 1096 Flatliners 0.0\n", "383 1095 Blood Simple 0.0\n", "185 419 Henry V 0.0\n", "62 147 Shallow Grave 0.0\n", "\n", "[666 rows x 3 columns]" ] }, "execution_count": 236, "metadata": {}, "output_type": "execute_result" } ], "source": [ "user_id = 1\n", "n_top_neighbors = 30\n", "n_recommendations = -1\n", "\n", "n_recommendations = get_n_recommendations(user_id, n_recommendations, user_movie_train, movies, n_top_neighbors)\n", "n_recommendations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating the results" ] }, { "cell_type": "code", "execution_count": 234, "metadata": {}, "outputs": [], "source": [ "def eval_ratings():\n", "    test = user_movie_test.copy()\n", "\n", "    real = []\n", "    preds = []\n", "\n", "    # n_recommendations holds the predictions computed above for user_id only,\n", "    # so every user's test ratings are compared against that one user's predictions.\n", "    for user in test.columns:\n", "        for movie, _ in test[user].items():\n", "            if test.loc[movie, user] != 0 and len(n_recommendations[n_recommendations['movieId'] == movie]['pred_rating'].values) > 0:\n", "                title = movies.loc[movie]['title']\n", "                real_rating = test.loc[movie, user]\n", "                pred_rating = n_recommendations[n_recommendations['movieId'] == movie]['pred_rating'].values[0]\n", "\n", "                real.append(real_rating)\n", "                preds.append(pred_rating)\n", "\n", "                #print(f'{user:10} - {title:50} - true rating: {real_rating}, pred rating: {pred_rating}, DIFF:{abs(real_rating - pred_rating)}')\n", "\n", "    MSE = np.square(np.subtract(real, preds)).mean()\n", "\n", "    RMSE = math.sqrt(MSE)\n", "    print(\"Root Mean Square Error:\\n\")\n", "    print(RMSE)" ] }, { "cell_type": "code", "execution_count": 235, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Root Mean Square Error:\n", "\n", "3.0086445530929264\n" ] } ], "source": [ 
"eval_ratings()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 2 }