{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sistema de recomendação de filmes usando filtro colaborativo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparando conjunto de dados" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Importações do projeto" ] }, { "cell_type": "code", "execution_count": 220, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from fuzzywuzzy import process\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "import math" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Importando conjunto de dados\n", "ratings: Avaliações dos usuários para cada filme\n", "\n", "movies: informações dos filmes que foram avaliados" ] }, { "cell_type": "code", "execution_count": 221, "metadata": {}, "outputs": [], "source": [ "ratings = pd.read_csv('../data/reduced/ratings_m10.csv')\n", "ratings.reindex()\n", "movies = pd.read_csv('../data/reduced/movies_m10_rich_pre.csv', index_col='movieId')\n", "movies_title = movies[['title']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Junção dos dois conjuntos de dados\n", "união feita pela coluna 'movieId' presente em ambos" ] }, { "cell_type": "code", "execution_count": 222, "metadata": {}, "outputs": [], "source": [ "ratings_movies = ratings.merge(movies_title, on='movieId')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Separação do conjunto de dados baseado no timestamp\n", "Para cada usuário foram divididas 90% das suas avaliações para o conjunto de treino e o restante para o conjunto de teste" ] }, { "cell_type": "code", "execution_count": 223, "metadata": {}, "outputs": [], "source": [ "def train_test_column_split(df, group_column, split_column, y_label, train_size):\n", " df = df.sort_values(by=split_column, ascending=True) \n", " train = pd.DataFrame(columns=df.columns)\n", " test = pd.DataFrame(columns=df.columns)\n", "\n", " for idx in 
df[group_column].unique():\n", " group = df.loc[df[group_column] == idx]\n", "\n", " q_user = group[group[split_column].le(group[split_column].quantile(train_size))]\n", " p_user = group[group[split_column].ge(group[split_column].quantile(train_size))]\n", "\n", " train = pd.concat([train, q_user])\n", " test = pd.concat([test, p_user])\n", " train = train.sort_index(ascending=True)\n", " test = test.sort_index(ascending=True)\n", "\n", " X_labels = [c for c in df.columns]\n", "\n", " X_train = train[X_labels]\n", " X_test = test[X_labels]\n", "\n", " return (X_train, X_test)" ] }, { "cell_type": "code", "execution_count": 224, "metadata": {}, "outputs": [], "source": [ "X_train, X_test = train_test_column_split(ratings_movies, 'userId', 'timestamp', 'rating', .9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Criando uma Pivot Matrix\n", "Matriz: {userId x movieId}, cada célula corresponde à avaliação de cada usuário para cada filme, em que na ausência será preenchido com 0 (zero)." 
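, "\n", "\n", "As a rough sketch with toy data (hypothetical values, not the project's files), this is what the pivot step produces:\n", "\n", "```python\n", "import pandas as pd\n", "\n", "toy = pd.DataFrame({'userId': [1, 1, 2],\n", "                    'movieId': [10, 20, 10],\n", "                    'rating': [4.0, 3.0, 5.0]})\n", "mat = toy.pivot(index='movieId', columns='userId', values='rating').fillna(0)\n", "# user 2 never rated movie 20, so that cell becomes 0.0 after fillna\n", "```"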
] }, { "cell_type": "code", "execution_count": 225, "metadata": {}, "outputs": [], "source": [ "#user_movie_mat = ratings_movies.pivot(index='movieId', columns='userId', values='rating').fillna(0)\n", "user_movie_train = X_train.pivot(index='movieId', columns='userId', values='rating').fillna(0)\n", "user_movie_test = X_test.pivot(index='movieId', columns='userId', values='rating').fillna(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Criando Matriz de similaridade dos usuários baseado nas avaliações" ] }, { "cell_type": "code", "execution_count": 226, "metadata": {}, "outputs": [], "source": [ "def find_correlation_between_two_users(ratings_df: pd.DataFrame, user1: str, user2: str):\n", " \"\"\"Find correlation between two users based on their rated movies using Pearson correlation\"\"\"\n", " rated_movies_by_both = ratings_df[[user1, user2]].dropna(axis=0).values\n", " user1_ratings = rated_movies_by_both[:, 0].reshape(1, -1)\n", " user2_ratings = rated_movies_by_both[:, 1].reshape(1, -1)\n", " return cosine_similarity(user1_ratings, user2_ratings)" ] }, { "cell_type": "code", "execution_count": 227, "metadata": {}, "outputs": [], "source": [ "users_list = list(user_movie_train.columns)\n", "movies_list = list(user_movie_train.index)\n", "\n", "#users_similarity_mat = np.array([[find_correlation_between_two_users(user_movie_train, user1, user2) for user1 in users_list] for user2 in users_list])\n", "##users_similarity_mat = users_similarity_mat.reshape(608, 608)\n", "#users_similarity_mat = pd.DataFrame(users_similarity_mat, index=users_list, columns=users_list)\n", "users_similarity_mat = pd.read_pickle('../data/preprocessed/users_similarity_mat_cosim.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Métodos para prever as notas que o usuário dará para cada filme" ] }, { "cell_type": "code", "execution_count": 228, "metadata": {}, "outputs": [], "source": [ "def get_rated_user_for_a_movie(ratings_df: pd.DataFrame, 
movie: str):\n", "    # note: expects missing ratings to be NaN; in a zero-filled matrix every user counts as a rater\n", "    return ratings_df.loc[movie, :].dropna().index.values\n", "\n", "\n", "def get_top_neighbors(\n", "    similarity_df: pd.DataFrame, user: str, rated_users: np.ndarray, n_neighbors: int\n", "):\n", "    return similarity_df[user][rated_users].nlargest(n_neighbors).to_dict()\n", "\n", "\n", "def subtract_bias(rating: float, mean_rating: float):\n", "    return rating - mean_rating\n", "\n", "\n", "def get_neighbor_rating_without_bias_per_movie(\n", "    ratings_df: pd.DataFrame, user: str, movie: str\n", "):\n", "    \"\"\"Subtract the user's mean rating from their rating of the movie to eliminate bias\"\"\"\n", "    mean_rating = ratings_df[user].mean()\n", "    rating = ratings_df.loc[movie, user]\n", "    return subtract_bias(rating, mean_rating)\n", "\n", "\n", "def get_ratings_of_neighbors(ratings_df: pd.DataFrame, neighbors: list, movie: str):\n", "    \"\"\"Get the ratings of all neighbors after adjusting for biases\"\"\"\n", "    return [\n", "        get_neighbor_rating_without_bias_per_movie(ratings_df, neighbor, movie)\n", "        for neighbor in neighbors\n", "    ]\n", "\n", "\n", "def get_weighted_average_rating_of_neighbors(ratings: list, neighbor_distance: list):\n", "    weighted_sum = np.array(ratings).dot(np.array(neighbor_distance))\n", "    abs_neighbor_distance = np.abs(neighbor_distance)\n", "    return weighted_sum / np.sum(abs_neighbor_distance)\n", "\n", "\n", "def get_user_rating(ratings_df: pd.DataFrame, user: str, avg_neighbor_rating: float):\n", "    user_avg_rating = ratings_df[user].mean()\n", "    return round(user_avg_rating + avg_neighbor_rating, 2)\n" ] }, { "cell_type": "code", "execution_count": 229, "metadata": {}, "outputs": [], "source": [ "def predict_rating(\n", "    df: pd.DataFrame,\n", "    similarity_df: pd.DataFrame,\n", "    user: str,\n", "    movie: str,\n", "    n_neighbors: int = 2,\n", "):\n", "    \"\"\"Predict the rating of a user for a movie based on the ratings of neighbors\"\"\"\n", "    ratings_df = df.copy()\n", "\n", "    rated_users = get_rated_user_for_a_movie(ratings_df, movie)\n", "\n", "    top_neighbors_distance = get_top_neighbors(\n", "        similarity_df, user, rated_users, n_neighbors\n", "    )\n", "    neighbors, distance = top_neighbors_distance.keys(), top_neighbors_distance.values()\n", "\n", "    #print(f\"Top {n_neighbors} neighbors of user {user}, {movie}: {list(neighbors)}, distance: {list(distance)}\")\n", "\n", "    ratings = get_ratings_of_neighbors(ratings_df, neighbors, movie)\n", "    avg_neighbor_rating = get_weighted_average_rating_of_neighbors(\n", "        ratings, list(distance)\n", "    )\n", "\n", "    return get_user_rating(ratings_df, user, avg_neighbor_rating)" ] }, { "cell_type": "code", "execution_count": 230, "metadata": {}, "outputs": [], "source": [ "def adjust_rating(rating):\n", "    \"\"\"Clamp a predicted rating to [0, 5] and round it to the nearest 0.5\"\"\"\n", "    if rating < 0:\n", "        return 0\n", "    elif rating > 5:\n", "        return 5\n", "    else:\n", "        # round to the nearest value in increments of 0.5\n", "        return round(rating * 2) / 2\n" ] }, { "cell_type": "code", "execution_count": 231, "metadata": {}, "outputs": [], "source": [ "def get_n_recommendations(user: int, n: int, user_movie_mat: pd.DataFrame, movies: pd.DataFrame, n_neighbors: int):\n", "    df = user_movie_mat.copy()\n", "    recommendations = pd.DataFrame(columns=['movieId', 'title', 'pred_rating'])\n", "\n", "    # predict a rating for every movie the user has not rated yet\n", "    # (users_similarity_mat is read from the enclosing scope)\n", "    for movie, _ in df[user].items():\n", "        if df.loc[movie, user] == 0:\n", "            df.loc[movie, user] = predict_rating(user_movie_mat, users_similarity_mat, user, movie, n_neighbors)\n", "            new_row = {'movieId': movie, 'title': movies.loc[movie]['title'], 'pred_rating': adjust_rating(df.loc[movie, user])}\n", "            recommendations.loc[len(recommendations)] = new_row\n", "\n", "    recommendations = recommendations.sort_values(by='pred_rating', ascending=False)\n", "    return recommendations.head(n) if n > 0 else recommendations" ] }, { "cell_type": "code", "execution_count": 232, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.03" ] }, "execution_count": 232, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_name = 'White 
Squall'\n", "user1 = 1\n", "movie = process.extractOne(movie_name, movies['title'])[2]\n", "rating = predict_rating(user_movie_train, users_similarity_mat, user1, movie, 30)\n", "rating" ] }, { "cell_type": "code", "execution_count": 236, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | movieId | \n", "title | \n", "pred_rating | \n", "
---|---|---|---|
248 | \n", "589 | \n", "Fallen | \n", "3.5 | \n", "
31 | \n", "47 | \n", "White Squall | \n", "3.0 | \n", "
542 | \n", "1527 | \n", "Closer | \n", "3.0 | \n", "
310 | \n", "858 | \n", "South Park: Bigger, Longer and Uncut | \n", "3.0 | \n", "
364 | \n", "1036 | \n", "Great Muppet Caper, The | \n", "3.0 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
385 | \n", "1100 | \n", "Starman | \n", "0.0 | \n", "
384 | \n", "1096 | \n", "Flatliners | \n", "0.0 | \n", "
383 | \n", "1095 | \n", "Blood Simple | \n", "0.0 | \n", "
185 | \n", "419 | \n", "Henry V | \n", "0.0 | \n", "
62 | \n", "147 | \n", "Shallow Grave | \n", "0.0 | \n", "
666 rows × 3 columns
\n", "