# Overview

This notebook trains a matrix factorization model for recommendation.<br>
We'll start from the implicit feedback in the MovieLens100k dataset (which movies each user rated).<br>
We'll use TensorFlow Recommenders (TFRS) to build it.

## Import TFRS

First, install and import TFRS and the other required packages.

In [1]:
!pip install -q tensorflow_recommenders

In [2]:
from typing import Dict, Text
import tensorflow as tf
import tensorflow_recommenders as tfrs
# import urllib.request
# import zipfile
import pandas as pd

In [3]:
# python version: 3.10.11
tf.__version__, tfrs.__version__

('2.15.0', 'v0.7.3')

## Load, prepare and split data

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

In [7]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [8]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [9]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [10]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [11]:
ratings = ratings.merge(movies, on='movieId', how='inner')[['userId', 'title', 'rating', 'timestamp']].rename(columns={'title': 'movieTitle'})
ratings.head()

Unnamed: 0,userId,movieTitle,rating,timestamp
0,1,Toy Story (1995),4.0,964982703
1,5,Toy Story (1995),4.0,847434962
2,7,Toy Story (1995),4.5,1106635946
3,15,Toy Story (1995),2.5,1510577970
4,17,Toy Story (1995),4.5,1305696483


In [12]:
movies = movies.rename(columns={'title':'movieTitle'})
movies

Unnamed: 0,movieId,movieTitle,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [13]:
ratings['userId'] = ratings['userId'].map(lambda id_int: str(id_int))
movies['movieId'] = movies['movieId'].map(lambda id_int: str(id_int))

In [14]:
train_valid, test = train_test_split(ratings, test_size=0.2, stratify=ratings['userId'], random_state=42)

In [15]:
train, valid = train_test_split(train_valid, test_size=0.1, stratify=train_valid['userId'], random_state=42)
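As a quick check on the two splits above, the sketch below works out the approximate share of the ratings that lands in each set. It only uses the two `test_size` values from the cells above; the variable names are illustrative.

```python
# Fractions taken from the two train_test_split calls above.
test_fraction = 0.2                 # first split: 20% held out for testing
valid_fraction_of_remainder = 0.1   # second split: 10% of what is left

remainder = 1.0 - test_fraction
valid_fraction = remainder * valid_fraction_of_remainder
train_fraction = remainder - valid_fraction

print(f"train={train_fraction:.2f}, valid={valid_fraction:.2f}, test={test_fraction:.2f}")
```

So the model is fitted on roughly 72% of the ratings, tuned on roughly 8%, and tested on 20%. Because both splits are stratified on `userId`, each user should appear in all three sets in about these proportions.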

In [16]:
train.head()

Unnamed: 0,userId,movieTitle,rating,timestamp
65913,298,I Am Sam (2001),0.5,1447598721
82997,68,Volcano (1997),3.0,1269123535
61517,477,"Waterboy, The (1998)",3.0,1200943122
81164,448,Green Lantern (2011),1.5,1308418333
72010,57,Howards End (1992),3.0,972174279


### Cold Start Problem

For the cold start problem (new users with no history, or guests without accounts), we'll fall back on aggregate statistics about the movies: the highest-rated movies and the most-viewed movies. Since we don't have view counts, we'll use the number of ratings as a proxy.


We'll create a custom class to handle this.<br>
We'll use thresholds to filter out movies with few ratings and movies with low ratings.

In [17]:
class MovieData:
  def __init__(self, data, rating_threshold, count_threshold):
    self.rating_threshold = rating_threshold
    self.count_threshold = count_threshold
    self.data = data

  def get_highest_rated(self, n=20):
    # Return the n movies with the highest average rating,
    # among movies rated more than self.count_threshold times.
    ratings_count = self.data.groupby('movieTitle')['rating'].count()
    popular_movies = ratings_count[ratings_count > self.count_threshold].index
    popular_ratings = self.data[self.data['movieTitle'].isin(popular_movies)]
    highest_rated_movies = popular_ratings.groupby('movieTitle')['rating'].mean().sort_values(ascending=False)[:n]
    return highest_rated_movies

  def get_most_rated(self, n=20):
    # Return the n movies with the most ratings,
    # among movies whose average rating is above self.rating_threshold.
    average_rating = self.data.groupby('movieTitle')['rating'].mean()
    well_rated_movies = average_rating[average_rating > self.rating_threshold].index
    well_rated_ratings = self.data[self.data['movieTitle'].isin(well_rated_movies)]
    most_rated_movies = well_rated_ratings.groupby('movieTitle')['rating'].count().sort_values(ascending=False)[:n]
    return most_rated_movies
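To make the effect of the two thresholds concrete, here is a minimal sketch of the same filtering logic in plain Python on a handful of made-up ratings. The toy data and the helper names (`movie_stats`, `highest_rated`, `most_rated`) are hypothetical and only mirror the idea of the class above.

```python
from collections import defaultdict

# Toy (movieTitle, rating) pairs; hypothetical data for illustration only.
toy_ratings = [
    ("A", 4.0), ("A", 5.0), ("A", 3.0),
    ("B", 2.0), ("B", 1.0),
    ("C", 5.0),
]

def movie_stats(rows):
    """Return {title: (rating_count, mean_rating)}."""
    grouped = defaultdict(list)
    for title, rating in rows:
        grouped[title].append(rating)
    return {title: (len(r), sum(r) / len(r)) for title, r in grouped.items()}

def highest_rated(rows, count_threshold, n=20):
    """Titles with more than count_threshold ratings, best mean rating first."""
    stats = movie_stats(rows)
    kept = [(t, mean) for t, (count, mean) in stats.items() if count > count_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:n]

def most_rated(rows, rating_threshold, n=20):
    """Titles with a mean rating above rating_threshold, most ratings first."""
    stats = movie_stats(rows)
    kept = [(t, count) for t, (count, mean) in stats.items() if mean > rating_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:n]

print(highest_rated(toy_ratings, count_threshold=1))  # "C" is dropped: only one rating
print(most_rated(toy_ratings, rating_threshold=3.0))  # "B" is dropped: mean rating 1.5
```

Without the count threshold, movie "C" would top the highest-rated list on the strength of a single 5-star rating, which is exactly the case the thresholds are meant to filter out.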

## Data Preparation

We'll create `tf.data.Dataset` objects for our train, validation, and test sets.

In [18]:
train_interaction_dataset = tf.data.Dataset.from_tensor_slices({'userId':train['userId'].values, 'movieTitle': train['movieTitle'].values})
valid_interaction_dataset = tf.data.Dataset.from_tensor_slices({'userId':valid['userId'].values, 'movieTitle': valid['movieTitle'].values})
test_interaction_dataset = tf.data.Dataset.from_tensor_slices({'userId':test['userId'].values, 'movieTitle': test['movieTitle'].values})
train_interaction_dataset

<_TensorSliceDataset element_spec={'userId': TensorSpec(shape=(), dtype=tf.string, name=None), 'movieTitle': TensorSpec(shape=(), dtype=tf.string, name=None)}>

In [19]:
movie_dataset = tf.data.Dataset.from_tensor_slices(movies['movieTitle'].values)
movie_dataset

<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

In [20]:
user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None, name='users_lookup')
movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None, name='movies_lookup')

In [21]:
user_ids_vocabulary.adapt(train_interaction_dataset.map(lambda x: x['userId']))

In [22]:
movie_titles_vocabulary.adapt(movie_dataset.map(lambda x: x))

In [23]:
n_users = user_ids_vocabulary.vocabulary_size()
n_movies = movie_titles_vocabulary.vocabulary_size()
n_users, n_movies

(611, 9738)
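The vocabulary sizes above are one larger than the number of distinct strings (for example 9738 against the 9737 unique titles used later in this notebook). With `mask_token=None`, `StringLookup` reserves one extra index by default for out-of-vocabulary strings. The sketch below imitates that behaviour in plain Python; `build_vocabulary` and `lookup` are hypothetical helpers, not the Keras implementation.

```python
from collections import Counter

OOV_INDEX = 0  # in this sketch, index 0 is reserved for strings never seen during adapt

def build_vocabulary(values):
    """Return a token list with the OOV slot first, then tokens by frequency."""
    counts = Counter(values)
    return ["[UNK]"] + [token for token, _ in counts.most_common()]

def lookup(vocabulary, token):
    """Map a token to its integer index, falling back to the OOV index."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return index.get(token, OOV_INDEX)

user_ids = ["1", "5", "7", "5", "1", "5"]
vocabulary = build_vocabulary(user_ids)

print(len(vocabulary))            # 3 distinct ids + 1 OOV slot
print(lookup(vocabulary, "5"))    # a known id gets its own index
print(lookup(vocabulary, "999"))  # an unseen id falls back to the OOV index
```

This is why the embedding layers below are sized with `vocabulary_size()` rather than with the raw number of users or movies: the extra row gives unseen users or titles a shared embedding instead of an error.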

## Define a model
We will use a matrix factorization model without context features.
We can define a TFRS model by inheriting from `tfrs.Model` and implementing the `compute_loss` method.

The task is a convenient object that wraps both the loss and the metrics.

In [24]:
class MovieLensModel(tfrs.Model):
  def __init__(self, user_model: tf.keras.Model, movie_model: tf.keras.Model, task: tfrs.tasks.Retrieval):
    super().__init__()

    # Set up user and movie representations.
    self.user_model = user_model
    self.movie_model = movie_model

    # Set up a retrieval task.
    self.task = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # Define how the loss is computed.
    user_embeddings = self.user_model(features["userId"])
    movie_embeddings = self.movie_model(features["movieTitle"])
    return self.task(user_embeddings, movie_embeddings)
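To give a feel for what the retrieval task computes from the two embedding batches, here is a plain-Python sketch. It assumes the task's default loss, a softmax cross-entropy over the dot products within a batch, summed over the batch: each user's own movie is the positive and the other movies in the batch act as negatives. The function names are hypothetical and the real task has more options (such as sampling corrections) that are ignored here.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_softmax_loss(user_embeddings, movie_embeddings):
    """Sum over the batch of -log softmax(score of the true movie)."""
    total = 0.0
    for i, user in enumerate(user_embeddings):
        logits = [dot(user, movie) for movie in movie_embeddings]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]  # row i's positive is movie i
    return total

users = [[1.0, 0.0], [0.0, 1.0]]
aligned_movies = [[5.0, 0.0], [0.0, 5.0]]     # each user points at their own movie
misaligned_movies = [[0.0, 5.0], [5.0, 0.0]]  # each user points at the other movie

print(in_batch_softmax_loss(users, aligned_movies))
print(in_batch_softmax_loss(users, misaligned_movies))
```

The loss is small when each user embedding scores its own movie highest and large otherwise. Because the loss is summed rather than averaged, its magnitude grows with the batch size.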

Define the two models and the retrieval task.

In [25]:
movie_model = tf.keras.Sequential([
    movie_titles_vocabulary,
    tf.keras.layers.Embedding(n_movies, 64, name='movie_embedding')
], name='movie_model')

In [26]:
user_model = tf.keras.Sequential([
    user_ids_vocabulary,
    tf.keras.layers.Embedding(n_users, 64, name='user_embedding')
], name='user_model')

`ks` sets the values of k for the top-k metrics. We use several values of k (1, 5, and 10).

In [27]:
task = tfrs.tasks.Retrieval(metrics=tfrs.metrics.FactorizedTopK(
    candidates=movie_dataset.batch(128).map(movie_model),
    ks = (1, 5, 10)
  )
)
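The top-k metric asks, for each (user, movie) pair, whether the true movie is among the k highest-scoring candidates for that user. A minimal plain-Python sketch of that idea, on a hypothetical score matrix, could look like this (`top_k_accuracy` is an illustrative helper, not the TFRS implementation):

```python
def top_k_accuracy(scores, true_indices, k):
    """Fraction of rows whose true candidate is among the k highest scores."""
    hits = 0
    for row, true_index in zip(scores, true_indices):
        # Rank of the true candidate = number of candidates scoring strictly higher.
        better = sum(1 for s in row if s > row[true_index])
        hits += better < k
    return hits / len(scores)

# Hypothetical scores: 3 queries (rows) against 4 candidates (columns).
scores = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.8, 0.7, 0.1],
    [0.1, 0.2, 0.3, 0.9],
]
true_indices = [0, 2, 1]  # the candidate each query actually interacted with

for k in (1, 2, 3):
    print(k, top_k_accuracy(scores, true_indices, k))
```

Accuracy can only go up as k grows, which is why it is useful to track several values at once.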

## Fit and evaluate

Create the model, train it, and evaluate it on the test set:



In [28]:
model = MovieLensModel(user_model, movie_model, task)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01))

In [29]:
model.fit(train_interaction_dataset.batch(4096), epochs=15, validation_data=valid_interaction_dataset.batch(1024))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x7bc3963f4160>

In [30]:
model.evaluate(test_interaction_dataset.batch(1024))



[0.0009420864516869187,
 0.007982943207025528,
 0.017502974718809128,
 4515.97607421875,
 0,
 4515.97607421875]
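The list returned by `evaluate` is hard to read without names. The sketch below pairs the values above with metric names. The names are an assumption based on the three values of k in the task followed by the loss terms; they were not printed by this notebook. Calling `model.evaluate(..., return_dict=True)` would return the actual names alongside the values.

```python
# Assumed metric names, in the order evaluate() appears to return them.
metric_names = [
    "factorized_top_k/top_1_categorical_accuracy",
    "factorized_top_k/top_5_categorical_accuracy",
    "factorized_top_k/top_10_categorical_accuracy",
    "loss",
    "regularization_loss",
    "total_loss",
]

# Values copied from the evaluate() output above.
values = [
    0.0009420864516869187,
    0.007982943207025528,
    0.017502974718809128,
    4515.97607421875,
    0,
    4515.97607421875,
]

results = dict(zip(metric_names, values))
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the true movie is in the top 10 of the 9737 candidates for roughly 1.75% of the test interactions.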

## Indexers

An indexer stores the embeddings of the possible candidates as keys. When it receives a query, it embeds the query and retrieves the closest keys.

For our recommendation task, the index stores the movie embeddings and uses the user model to embed the query. When we want to recommend for a user, it returns the movies whose embeddings are most similar (by dot product) to the user's embedding.
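The idea behind a brute-force index can be sketched in a few lines of plain Python: score every stored candidate against the query embedding and keep the k best. The embeddings and titles below are made up, and `brute_force_top_k` is a hypothetical helper rather than the TFRS layer.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def brute_force_top_k(query_embedding, candidate_embeddings, identifiers, k):
    """Score every candidate against the query and return the k best."""
    scored = [
        (dot(query_embedding, embedding), identifier)
        for embedding, identifier in zip(candidate_embeddings, identifiers)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return [score for score, _ in top], [identifier for _, identifier in top]

# Hypothetical 2-dimensional embeddings for three movies and one user.
movie_titles = ["Movie A", "Movie B", "Movie C"]
movie_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
user_embedding = [1.0, 0.2]

top_scores, top_titles = brute_force_top_k(user_embedding, movie_embeddings, movie_titles, k=2)
print(top_titles)
```

Like the layer used below, the helper returns both the scores and the identifiers. Scoring every candidate is exact but costs time proportional to the catalogue size, which is fine for a few thousand movies.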

In [32]:
# Use brute-force search to set up retrieval using the trained representations.
user_recommender = tfrs.layers.factorized_top_k.BruteForce(model.user_model, k=100)

In [33]:
user_recommender.index_from_dataset(
    movie_dataset.batch(100).map(lambda title: (title, model.movie_model(title))))

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x7bc3846bc2b0>

In [34]:
# Get some recommendations.
_, titles = user_recommender(tf.constant(["42"]))
print(f"Shape of the recommended titles for user 42: {titles.shape}")

Shape of the recommended titles for user 42: (1, 100)


### Item-item recommendation

For item-to-item similarity, we can use the movie embeddings as both queries and keys.

In [35]:
movie_recommender = tfrs.layers.factorized_top_k.BruteForce(model.movie_model, k=100)

In [36]:
movie_recommender.index_from_dataset(
    movie_dataset.batch(100).map(lambda title: (title, model.movie_model(title))))

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x7bc3846bcb80>

In [37]:
# Get some recommendations.
_, titles2 = movie_recommender(tf.constant(["Freaky Friday (2003)"]))
print(f"Shape of the titles similar to 'Freaky Friday (2003)': {titles2.shape}")

Shape of the titles similar to 'Freaky Friday (2003)': (1, 100)


In [38]:
# Override k at call time to retrieve only 25 titles.
_, titles2 = movie_recommender(tf.constant(["Freaky Friday (2003)"]), k=25)
print(f"Shape of the titles similar to 'Freaky Friday (2003)': {titles2.shape}")

Shape of the titles similar to 'Freaky Friday (2003)': (1, 25)


## Saving the models

In [39]:
user_recommender.save('user_model')
movie_recommender.save('movie_model')



In [40]:
tmp1 = tf.keras.models.load_model('user_model')
tmp2 = tf.keras.models.load_model('movie_model')



In [41]:
# Get some recommendations from the reloaded user index.
_, titles = tmp1(tf.constant(["42"]))
print(f"Shape of the recommended titles for user 42: {titles.shape}")

Shape of the recommended titles for user 42: (1, 100)


In [42]:
# Get some recommendations from the reloaded movie index.
_, titles2 = tmp2(tf.constant(["Freaky Friday (2003)"]))
print(f"Top 10 titles similar to 'Freaky Friday (2003)': {titles2[0, :10]}")

Top 10 titles similar to 'Freaky Friday (2003)': [b'Freaky Friday (2003)' b'What Women Want (2000)'
 b'Wedding Crashers (2005)' b'Shrek the Third (2007)'
 b'Atlantis: The Lost Empire (2001)' b'Along Came Polly (2004)'
 b'Princess Diaries, The (2001)'
 b"Hitchhiker's Guide to the Galaxy, The (2005)" b'Holes (2003)'
 b'Mean Girls (2004)']
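In the output above, the query movie itself comes back as the first result, which is not useful as a recommendation. A small post-processing step can drop it and decode the byte strings. The helper below is a hypothetical sketch that filters by title rather than by position, since a dot-product search does not guarantee that a movie is always its own best match.

```python
def drop_query_title(retrieved_titles, query_title, k):
    """Decode byte-string titles, remove the query itself, keep the first k."""
    query = query_title.encode("utf-8")
    return [t.decode("utf-8") for t in retrieved_titles if t != query][:k]

# The first few titles from the output above.
retrieved = [
    b"Freaky Friday (2003)",
    b"What Women Want (2000)",
    b"Wedding Crashers (2005)",
    b"Shrek the Third (2007)",
]

print(drop_query_title(retrieved, "Freaky Friday (2003)", k=2))
```

With a real result, `retrieved` would be something like `titles2[0].numpy()`, and it helps to request k + 1 titles from the index so that k remain after filtering.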


# Explicit rating

## Data Preparation

In [43]:
train_rating_dataset = tf.data.Dataset.from_tensor_slices({'userId':train['userId'].values, 'movieTitle': train['movieTitle'].values, 'rating': train['rating'].values})
valid_rating_dataset = tf.data.Dataset.from_tensor_slices({'userId':valid['userId'].values, 'movieTitle': valid['movieTitle'].values, 'rating': valid['rating'].values})
test_rating_dataset = tf.data.Dataset.from_tensor_slices({'userId':test['userId'].values, 'movieTitle': test['movieTitle'].values, 'rating': test['rating'].values})
train_rating_dataset

<_TensorSliceDataset element_spec={'userId': TensorSpec(shape=(), dtype=tf.string, name=None), 'movieTitle': TensorSpec(shape=(), dtype=tf.string, name=None), 'rating': TensorSpec(shape=(), dtype=tf.float64, name=None)}>

In [44]:
# user_ids_vocabulary.adapt(train_rating_dataset.map(lambda x: x['userId']))

In [45]:
# movie_titles_vocabulary.adapt(movie_dataset.map(lambda x: x))

In [46]:
# n_users = user_ids_vocabulary.vocabulary_size()
# n_movies = movie_titles_vocabulary.vocabulary_size()
# n_users, n_movies

## Define model

In [47]:
ranking_task = tfrs.tasks.Ranking(
      loss = tf.keras.losses.MeanSquaredError(),
      metrics=[tf.keras.metrics.RootMeanSquaredError()]
    )

In [48]:
rating_model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(64,  activation='relu'),
    tf.keras.layers.Dense(1)
], name='rating_model')

In [49]:
explicit_movie_model = tf.keras.Sequential([
    movie_titles_vocabulary,
    tf.keras.layers.Embedding(n_movies, 64, name='movie_embedding')
], name='movie_model')

In [50]:
explicit_user_model = tf.keras.Sequential([
    user_ids_vocabulary,
    tf.keras.layers.Embedding(n_users, 64, name='user_embedding')
], name='user_model')

In [51]:
class ExplicitMovieLensModel(tfrs.Model):
  def __init__(self, user_model: tf.keras.Model, movie_model: tf.keras.Model, rating_model: tf.keras.Model, task: tfrs.tasks.Ranking):
    super().__init__()

    # Set up user and movie representations.
    self.user_model = user_model
    self.movie_model = movie_model

    # Set up the network that predicts a rating from the concatenated embeddings.
    self.rating_model = rating_model

    # Set up a ranking task.
    self.task = task

  def call(self, features: Dict[str, tf.Tensor]) -> tf.Tensor:
    user_embeddings = self.user_model(features["userId"])
    movie_embeddings = self.movie_model(features["movieTitle"])
    return self.rating_model(tf.concat([user_embeddings, movie_embeddings], axis=1))

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    labels = features.pop("rating")

    rating_predictions = self(features)

    # The task computes the loss and the metrics.
    return self.task(labels=labels, predictions=rating_predictions)
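The ranking task defined earlier trains on mean squared error and reports root mean squared error. As a reminder of how the two relate, here is a plain-Python sketch on a few made-up ratings and predictions (the numbers are hypothetical).

```python
import math

def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

def root_mean_squared_error(labels, predictions):
    """Square root of the MSE, expressed in the same units as the ratings."""
    return math.sqrt(mean_squared_error(labels, predictions))

labels = [4.0, 3.0, 5.0, 2.0]       # hypothetical true ratings
predictions = [3.5, 3.0, 4.0, 3.0]  # hypothetical model outputs

print(mean_squared_error(labels, predictions))
print(root_mean_squared_error(labels, predictions))
```

RMSE is the easier number to interpret here because it is on the rating scale: an RMSE of 0.75 means predictions are off by about three quarters of a star in the root-mean-square sense.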

In [52]:
explicit_model = ExplicitMovieLensModel(user_model=explicit_user_model, movie_model=explicit_movie_model, rating_model=rating_model, task=ranking_task)
explicit_model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.05))

In [53]:
explicit_model.fit(train_rating_dataset.batch(4096), epochs=40, validation_data=valid_rating_dataset.batch(1024))

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.src.callbacks.History at 0x7bc38640dcc0>

In [54]:
explicit_model.evaluate(test_rating_dataset.batch(1024))



[0.8827441334724426, 0.8436796069145203, 0, 0.8436796069145203]

In [55]:
all_movies = movies['movieTitle'].unique().reshape(-1,1)

In [56]:
# Predict the rating user 42 would give to every movie
preds = explicit_model({"userId": tf.tile([['42']], [len(all_movies), 1]), "movieTitle": all_movies})
preds

<tf.Tensor: shape=(9737, 1), dtype=float32, numpy=
array([[4.150449 ],
       [3.704074 ],
       [3.5481296],
       ...,
       [4.242151 ],
       [3.8372483],
       [3.9443884]], dtype=float32)>

In [57]:
# Sort movie titles from highest rated to lowest
tf.gather(all_movies, tf.squeeze(tf.argsort(preds, axis=0, direction='DESCENDING')))

<tf.Tensor: shape=(9737, 1), dtype=string, numpy=
array([[b'Wallace & Gromit: The Best of Aardman Animation (1996)'],
       [b'Andalusian Dog, An (Chien andalou, Un) (1929)'],
       [b'Come and See (Idi i smotri) (1985)'],
       ...,
       [b'Speed 2: Cruise Control (1997)'],
       [b'Battlefield Earth (2000)'],
       [b'Jason X (2002)']], dtype=object)>
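In practice we usually want only the first few titles together with their predicted ratings, rather than the full sorted array. The sketch below shows that step in plain Python on made-up titles and predictions; `top_n_titles` is a hypothetical helper.

```python
def top_n_titles(titles, predicted_ratings, n):
    """Return the n (title, predicted rating) pairs with the highest predictions."""
    order = sorted(range(len(titles)), key=lambda i: predicted_ratings[i], reverse=True)
    return [(titles[i], predicted_ratings[i]) for i in order[:n]]

# Hypothetical titles and predicted ratings for one user.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
predicted_ratings = [3.7, 4.4, 2.1, 4.0]

for title, rating in top_n_titles(titles, predicted_ratings, n=2):
    print(f"{title}: {rating:.1f}")
```

With the tensors above, `titles` would correspond to `all_movies.ravel()` and `predicted_ratings` to `preds.numpy().ravel()`.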

In [58]:
# save model
explicit_model.save('explicit_model')



In [59]:
explicit_loaded = tf.saved_model.load("explicit_model")

In [60]:
# Note: the saved model takes 1-D inputs
preds = explicit_loaded({"userId": tf.tile(['42'], [len(all_movies)]), "movieTitle": movies['movieTitle'].unique()})
tf.gather(all_movies, tf.squeeze(tf.argsort(preds, axis=0, direction='DESCENDING'))).numpy()

array([[b'Wallace & Gromit: The Best of Aardman Animation (1996)'],
       [b'Andalusian Dog, An (Chien andalou, Un) (1929)'],
       [b'Come and See (Idi i smotri) (1985)'],
       ...,
       [b'Speed 2: Cruise Control (1997)'],
       [b'Battlefield Earth (2000)'],
       [b'Jason X (2002)']], dtype=object)