Building a Recommender System From Scratch with Matrix Factorization in Python
Introduction
In this article, we will build a movie recommender system in Python, step by step, based on matrix factorization. Among the many approaches for building recommender systems that suggest products, services, or content to users based on their preferences and past interactions, matrix factorization stands out as a powerful technique for collaborative filtering, efficiently capturing hidden patterns in user-item interactions from large-scale user and item databases.
Concretely, the tutorial will use a Python library called surprise, which contains handy implementations of matrix factorization algorithms for building recommender systems. We will also use the MovieLens 100K dataset: a popular dataset for movie recommendations, ideal for getting familiar with recommender systems from a practical standpoint.
Note: It is recommended that you have some basic knowledge and familiarity with recommender systems concepts and fundamentals prior to beginning this tutorial.
Step-by-Step Process
The first step is to import the necessary libraries and packages. You may need to manually install the surprise library before being able to import it.
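If the import fails, note that the library is published on PyPI under the name scikit-surprise, so it is typically installed with:

pip install scikit-surprise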
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, cross_validate
from surprise import accuracy
import requests
import zipfile
import io
import os
We will begin our coding by defining a function that downloads the MovieLens 100K dataset from the official GroupLens website and unpacks the downloaded .zip file.
def download_and_extract_movielens():
    if not os.path.exists('ml-100k'):
        print("Downloading MovieLens 100K dataset...")
        url = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"
        r = requests.get(url)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall()
        print("MovieLens 100K dataset downloaded and extracted successfully.")
    else:
        print("The dataset already exists. Download skipped.")
Next, we proceed to load the data by calling the newly defined function, putting the data into a Pandas DataFrame, and obtaining some basic information about it.
download_and_extract_movielens()
ratings_df = pd.read_csv('ml-100k/u.data', sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])
print(f"Dataset shape: {ratings_df.shape}")
print(f"Number of unique users: {ratings_df['user_id'].nunique()}")
print(f"Number of unique movies: {ratings_df['item_id'].nunique()}")
print(f"Range of ratings: {ratings_df['rating'].min()} to {ratings_df['rating'].max()}")
The printed output describes important aspects of the dataset:
Dataset shape: (100000, 4)
Number of unique users: 943
Number of unique movies: 1682
Range of ratings: 1 to 5
As we can observe, the size of this dataset is pretty manageable for illustrative purposes in this tutorial, although real-world applications of matrix factorization would usually entail much larger user and item (e.g. movie) sets.
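To put these numbers in perspective, we can quickly estimate how sparse the user-item ratings matrix is. The following short snippet (illustrative, reusing the ratings_df loaded above) computes the fraction of missing user-movie pairs:

# Fraction of all possible user-item pairs that have no rating
n_users = ratings_df['user_id'].nunique()
n_items = ratings_df['item_id'].nunique()
sparsity = 1 - len(ratings_df) / (n_users * n_items)
print(f"Sparsity of the user-item matrix: {sparsity:.2%}")

With 100,000 ratings spread over 943 × 1682 possible user-movie pairs, roughly 93.7% of the matrix entries are missing, which is precisely the kind of setting where matrix factorization shines.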
Now, with the aid of two classes imported from the surprise library, namely Dataset and Reader, we will pack the dataset into a format that is easily manageable by the library's implementations of matrix factorization techniques. We do so as follows, also splitting the data into training and test sets for model evaluation. Notice the importance of specifying the correct range of numerical ratings in the dataset when initializing the Reader object:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df[['user_id', 'item_id', 'rating']], reader)

trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
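Optionally, we can sanity-check the split. The Trainset object produced by surprise exposes a few basic counters, while the test set is simply a list of (user, item, rating) tuples:

# Basic statistics of the training split
print(f"Training ratings: {trainset.n_ratings}")
print(f"Users / items in trainset: {trainset.n_users} / {trainset.n_items}")
print(f"Test ratings: {len(testset)}")  # testset is a plain list of tuples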
Now we get into the real action by initializing, training, and evaluating the matrix factorization model. Concretely, we will use singular value decomposition (SVD), a popular matrix factorization approach whose implementation is provided via surprise's SVD class. If you are familiar with training machine learning models with scikit-learn, you'll find the process fairly similar:
model = SVD(n_factors=20, lr_all=0.01, reg_all=0.01, n_epochs=20, random_state=42)
model.fit(trainset)

predictions = model.test(testset)
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)

print(f"Test RMSE: {rmse:.4f}")
print(f"Test MAE: {mae:.4f}")
In the above SVD model instantiation, n_factors is an important hyperparameter that defines the desired dimension (in our example, 20) of the latent feature space used to build compact user and item vector representations from the original data, which comes in the form of an ample yet sparse user-item ratings matrix. Under the hood, the model estimates a rating as the dot product of a user's and an item's latent vectors, plus user, item, and global bias terms. For a better understanding of this key process in matrix factorization, be sure to check out this article. The other arguments are the learning rate (lr_all, 0.01), a regularization parameter (reg_all, 0.01) to prevent overfitting, and the number of training epochs (n_epochs), set to 20.
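As a quick peek at what the model has learned, surprise's SVD exposes the fitted latent factor matrices as the attributes pu (users) and qi (items), so we can verify their dimensions after training:

# The learned latent factor matrices, available after model.fit(trainset)
print(model.pu.shape)  # (number of users in the trainset, 20)
print(model.qi.shape)  # (number of items in the trainset, 20)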
Changing any of the described arguments’ values may impact the resulting model performance on the test data, measured by prediction error metrics like RMSE and MAE. In our specific setting, we get:
Test RMSE: 0.9576
Test MAE: 0.7455
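Rather than tuning these hyperparameters by hand, we could also search over them: surprise ships a GridSearchCV utility analogous to scikit-learn's. Below is a minimal sketch; the parameter grid is just an example, not a recommended search space:

from surprise.model_selection import GridSearchCV

param_grid = {
    'n_factors': [20, 50, 100],
    'lr_all': [0.005, 0.01],
    'reg_all': [0.01, 0.1],
}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])   # best cross-validated RMSE found
print(gs.best_params['rmse'])  # hyperparameter combination that achieved it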
For a more robust evaluation, we can optionally apply cross-validation:
cv_results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

print(f"Average RMSE: {cv_results['test_rmse'].mean():.4f}")
print(f"Average MAE: {cv_results['test_mae'].mean():.4f}")
Trying It Out
Now let's see our recommender system in action by making some example recommendations. For this, we will first define two more custom functions: one that loads the set of movie titles, and another that, given a user ID and a number N of desired recommendations, uses the trained model to obtain the list of top-N recommended movies for that user, based on their preferences as reflected in the original ratings data. The latter function is perhaps the most insightful part of the entire code, so we have added some inline comments for a better understanding of the process involved.
def get_movie_names():
    movies_df = pd.read_csv('ml-100k/u.item', sep='|', encoding='latin-1',
                            header=None, usecols=[0, 1], names=['item_id', 'title'])
    return movies_df

movies_df = get_movie_names()

def recommend_movies(user_id, n=10):
    # List of all movies
    all_movies = movies_df['item_id'].unique()

    # Movies already rated by the user
    rated_movies = ratings_df[ratings_df['user_id'] == user_id]['item_id'].values

    # Movies not yet rated by the user
    unrated_movies = np.setdiff1d(all_movies, rated_movies)

    # Predicting ratings on unseen movies, by using the trained SVD model
    predictions = []
    for item_id in unrated_movies:
        predicted_rating = model.predict(user_id, item_id).est
        predictions.append((item_id, predicted_rating))

    # Rank predictions by estimated rating
    predictions.sort(key=lambda x: x[1], reverse=True)

    # Get top N recommendations
    top_recommendations = predictions[:n]

    # Fetch movie titles associated with top N recommendations
    recommendations = pd.DataFrame(top_recommendations, columns=['item_id', 'predicted_rating'])
    recommendations = recommendations.merge(movies_df, on='item_id')

    return recommendations
All that remains is trying out these functions to get real recommendations!
user_id = 42
recommendations = recommend_movies(user_id, n=10)

print(f"\nTop 10 recommended movies for user {user_id}:")
print(recommendations[['title', 'predicted_rating']])
Output:
Top 10 recommended movies for user 42:
                        title  predicted_rating
0           Braveheart (1995)          4.946602
1  Singin' in the Rain (1952)          4.835148
2              Henry V (1989)          4.811671
3    Great Escape, The (1963)          4.754385
4                 Babe (1995)          4.702876
5  Wrong Trousers, The (1993)          4.646727
6         My Fair Lady (1964)          4.631982
7        Air Force One (1997)          4.617786
8              Sabrina (1954)          4.541566
9               Patton (1970)          4.530220
Wrapping Up
And that's it! With these steps, we have built our first matrix factorization-based movie recommender and seen it in action. Next steps for exploring the intricacies of recommender system models like this one could include visualizing interesting data patterns such as rating distributions per user or movie, finding movies that are similar to each other based on their latent factor representations (sketched below), or visualizing the latent factors themselves.
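As a taste of that last idea, here is a minimal, illustrative sketch of finding similar movies via cosine similarity between item latent vectors. The similar_movies helper is hypothetical (not part of the tutorial's code) and assumes the model, trainset, and movies_df objects from the steps above; surprise's to_inner_iid/to_raw_iid methods map between raw MovieLens ids and the model's internal indices:

def similar_movies(item_id, n=5):
    # Map the raw MovieLens item id to surprise's internal index
    inner_id = trainset.to_inner_iid(item_id)
    target = model.qi[inner_id]

    # Cosine similarity between the target item's latent vector and all items'
    sims = model.qi @ target / (np.linalg.norm(model.qi, axis=1) * np.linalg.norm(target))

    # Take the n most similar items, skipping the movie itself at position 0
    best_inner_ids = np.argsort(sims)[::-1][1:n + 1]
    raw_ids = [trainset.to_raw_iid(int(i)) for i in best_inner_ids]
    return movies_df[movies_df['item_id'].isin(raw_ids)]['title'].tolist()

print(similar_movies(50))  # item 50 is "Star Wars (1977)" in MovieLens 100K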