Recommender systems are now ubiquitous in our daily lives. From Amazon suggesting similar products, to Netflix recommending TV shows, even down to which version of a given advertisement you get in the mail, every business seems to be using recommender systems to improve its service. While recommender systems may seem too complex to implement, machine learning libraries such as Spark’s MLlib and Mahout can make the development of such systems easier than you might think. In this post, I’ll first define some common terms in the area, including the common approaches for implementing recommender systems. Then I’ll go through the process of designing a recommender system at a high level.
Background Terminology
Before we get started, there are some important concepts you’ll need to understand in order to properly design a recommender system. First, a recommender system is defined as a system which attempts to predict a user’s preference or rating for a particular item without explicitly asking the user about it. Instead, it uses the user’s previous ratings for other products, as well as other customers’ ratings, to guess what the customer’s rating or preference will be.
Many machine learning algorithms, including some used in recommender engines, require two phases in order to be effective: a training phase and a query phase. In the training phase, all available data about customers’ preferences is combined, and the algorithm creates a model of how it views the world. This may include groupings of which customers are similar to each other, or just internal data structures it can use to quickly generate a prediction later. The training phase often requires a lot of time and memory, whereas the query phase is generally pretty fast.
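To make the two phases concrete, here is a minimal sketch using Spark MLlib’s ALS implementation of collaborative filtering. It assumes an existing SparkContext `sc` and a hypothetical ratings.csv file; the parameter values are illustrative, not recommendations.

```python
from pyspark.mllib.recommendation import ALS, Rating

# Assumes a SparkContext `sc` and a hypothetical "ratings.csv"
# containing user,product,rating lines.
ratings = sc.textFile("ratings.csv") \
            .map(lambda line: line.split(",")) \
            .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

# Training phase: slow; combines all known ratings into a model.
model = ALS.train(ratings, rank=10, iterations=10)

# Query phase: fast; asks the trained model for predictions.
print(model.predict(1, 42))           # predicted rating of product 42 by user 1
print(model.recommendProducts(1, 5))  # top 5 recommendations for user 1
```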
These systems are often non-deterministic, so training multiple models may give different recommendations for a single customer. You can make the algorithm deterministic by using a seed, a long integer which pins down the randomness so that repeated training runs produce the same model. The query phase is rarely non-deterministic, so seeds are not used in that phase.
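As an illustration of seeding, here is a small sketch with MLlib’s KMeans, which accepts a seed directly; the toy points are made up.

```python
from pyspark.mllib.clustering import KMeans

# Toy data: an RDD of two-dimensional points (assumes a SparkContext `sc`).
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Same data, same seed: repeated training runs produce the same model.
model_a = KMeans.train(points, 2, maxIterations=20, seed=42)
model_b = KMeans.train(points, 2, maxIterations=20, seed=42)
print(model_a.clusterCenters)
print(model_b.clusterCenters)  # identical to model_a's centers
```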
Many systems also include hyperparameters: values you can tweak to change how the system works as a whole. These values can be used to make the system run faster or to get more accuracy. They can also be used to avoid a scenario called overfitting, where the model ends up matching the noise in the training data instead of the general trend, causing bad predictions.
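In the ALS sketch above, `rank` and `lambda_` are two such hyperparameters: `rank` controls the capacity of the model, and `lambda_` is the regularization strength, one lever for pushing back against overfitting. The values here are illustrative only.

```python
# Reusing the `ratings` RDD from the earlier sketch. Raising lambda_
# (regularization) is one common way to combat overfitting.
model = ALS.train(ratings, rank=20, iterations=10, lambda_=0.1)
```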
Finally, there are three techniques generally used in recommender systems: Market Basket Analysis (MBA), Collaborative Filtering (CF) and K-Means Clustering (KMC). Understanding these three approaches at a high level goes a long way toward creating your own recommender engine.
Common Approaches
Market Basket Analysis (MBA)
This approach is similar to the approach used by Amazon in their “Customers who bought this item also bought” section. Given a single item and the purchase history of all other customers, this approach determines what a customer who bought that item is likely to buy as well. This approach was well covered by my colleague Andy Kruth in this blog post, so I won’t go into it in too much detail here.
MBA is the simplest of the three approaches and does not require any use of machine learning libraries. Instead, you just need the percentage of customers who bought item X who also bought item Y, for every pair of items that has been purchased, as sketched below. This avoids the need to know anything more about the customer, such as demographics or even how much they spent on the items. It also allows you to create recommendations for anonymous users who may not have created an account with your service yet. The downside is that, while MBA is well suited to situations where you have limited information, additional information (such as the customer’s buying history and demographics) cannot be leveraged easily. This approach is also not well suited to predicting the behavior of a customer in general; it focuses on predicting the behavior of a customer who bought a specific product.
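The core MBA computation needs nothing more than counting, as this rough sketch shows; the baskets are made-up toy data.

```python
from collections import defaultdict
from itertools import permutations

# Toy purchase history: one set of item ids per customer.
baskets = [
    {"bread", "peanut_butter", "jelly"},
    {"bread", "jelly"},
    {"peanut_butter", "eggs"},
]

bought_x = defaultdict(int)     # how many customers bought X
bought_both = defaultdict(int)  # how many customers bought both X and Y

for basket in baskets:
    for x in basket:
        bought_x[x] += 1
    for x, y in permutations(basket, 2):
        bought_both[(x, y)] += 1

def also_bought(x, y):
    """Fraction of customers who bought x that also bought y."""
    return bought_both[(x, y)] / bought_x[x]

print(also_bought("peanut_butter", "jelly"))  # 0.5
```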
Collaborative Filtering (CF)
Collaborative filtering was made popular by Netflix and the Netflix Prize. This approach tries to predict a user’s rating of a product based on the ratings that all customers have given to all items. A rating may be explicit, set by the user (such as Netflix’s stars, or ratings on Amazon), or implicit (such as how much they have spent on a product, or whether they have bought the product at all). You can think of CF as creating a matrix of users to products, with ratings filling in the cells. The system then tries to predict the value of cells that haven’t been filled in by users yet.
CF does a better job with more information, since it makes use of a customer’s purchase history beyond one item to determine a prediction. If you do not have an explicit rating available, however, it can be challenging to choose the right implicit rating; experimentation is often required to find the best way to derive a customer’s implicit rating of a product. CF also does not use any demographic information. Additionally, this system does not lend itself well to explaining the rationale behind its recommendations. The CF algorithm itself, while public, is not easily explainable to a lay-person, so you end up waving your hands a lot when explaining the system.
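When all you have is implicit feedback, MLlib’s ALS has a variant built for it. A hedged sketch, using hypothetical purchase counts as the implicit rating and assuming a SparkContext `sc`:

```python
from pyspark.mllib.recommendation import ALS, Rating

# Hypothetical implicit feedback: purchase counts instead of star ratings.
purchases = sc.parallelize([
    Rating(1, 10, 3.0),  # user 1 bought product 10 three times
    Rating(1, 20, 1.0),
    Rating(2, 10, 5.0),
    Rating(2, 30, 2.0),
])

# trainImplicit treats values as confidence in an implicit preference,
# not as explicit ratings; alpha scales that confidence.
model = ALS.trainImplicit(purchases, rank=10, iterations=10, alpha=0.01)
print(model.recommendProducts(1, 3))
```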
K-Means Clustering (KMC)
K-Means Clustering is somewhat of an outlier in this list, since it is not strictly a recommendation engine algorithm. Instead, it groups objects into K clusters of similar objects, based on numeric dimensions describing each object. A simple example would use two dimensions and group objects together based on the values of those two dimensions. KMC can work with thousands of dimensions, however, making it flexible enough to take as much data as you have. Using thousands of dimensions does not cause a huge issue with KMC, especially when working in Hadoop, where large amounts of memory are easily accessible. KMC is used in many different fields, such as image recognition, natural language processing (NLP), and even astronomy and agriculture.
Its ability to determine similar objects (or customers, in our case) makes it handy as a first step in a recommendation engine. The second step is then to determine recommendations for each cluster, likely using something like a simple most-purchased algorithm, as sketched below.
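A hedged sketch of that two-step flow, with made-up feature vectors and purchase lists and an assumed SparkContext `sc`:

```python
from collections import Counter, defaultdict
from pyspark.mllib.clustering import KMeans

# Hypothetical data: one feature vector and one purchase list per customer.
customer_vectors = [[120.0, 0.3], [115.0, 0.4], [30.0, 0.9], [28.0, 0.8]]
customer_purchases = [["bread", "jelly"], ["bread"], ["eggs"], ["eggs", "milk"]]

# Step 1: cluster the customers on their numeric dimensions.
model = KMeans.train(sc.parallelize(customer_vectors), 2,
                     maxIterations=20, seed=42)

# Step 2: recommend each cluster's most-purchased items.
cluster_purchases = defaultdict(Counter)
for vector, purchases in zip(customer_vectors, customer_purchases):
    cluster_purchases[model.predict(vector)].update(purchases)

recommendations = {cluster: [item for item, _ in counts.most_common(3)]
                   for cluster, counts in cluster_purchases.items()}
print(recommendations)
```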
In the realm of recommender systems, K-Means Clustering might use dimensions like the customer’s total expenditure on your products, percentage of purchases falling into a specific category, or the average income for people in the customer’s area. Systems will commonly include hundreds or thousands of dimensions, often with large groups of related dimensions. For example, it is common to take a value that can have a finite number of values (such as state), and have a dimension for each possible value, with a 1 indicating the customer has that value, and 0 indicating the customer does not.
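That expansion is simple to implement; here is a hypothetical sketch for a “state” value:

```python
# Hypothetical one-hot expansion of a categorical "state" value into
# one dimension per possible value.
STATES = ["CA", "NY", "TX"]  # assumed finite set of values

def one_hot_state(state):
    return [1.0 if s == state else 0.0 for s in STATES]

print(one_hot_state("NY"))  # [0.0, 1.0, 0.0]
```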
KMC is the most complex of the three approaches, requiring tuning for which dimensions to use (some dimensions can actually reduce the accuracy of the system), as well as some model hyperparameters (such as the number of clusters to use, and the number of iterations to run). Additionally, the two-step process means that it can run slower if you’re not careful. Regardless, this approach uses the most data and is easily explainable to customers (since you know what data it is using to determine its recommendations), so it is often a good choice of approach.
Designing a Recommender System
Now that we have the terminology down, let’s design a recommender system! Much of the design time is closer to the process a scientist would go through, running experiments in order to determine accuracy and speed metrics. The design of such a system generally goes through three steps:
1. Create an algorithm to determine the accuracy of your system
2. Determine which approach to use in your system
3. Set the parameters (e.g. hyperparameters, dimensions) for the system
1. Determining The Accuracy Of Your System
Scoring your system is an important part of design and needs to be given a lot of thought. It will shape which approach you use, which parameters and dimensions you choose, and your success in securing buy-in from others on the system. Because of this, determine how you want to score first, and be consistent about the algorithm.
If your approach uses a random seed to train, be aware of this and use the same seed across runs in order to ensure a change in accuracy is not also due to a change in seed. It is preferable to compute a score between 0 and 1 for a given recommendation; doing so allows you to express your score as a weighted percentage. You should also select a set of test customers which are a cross-section of your customers in order to represent them well.
As an example of what an item score might be for a retailer, you might rate a recommendation based on whether it matches a product category already purchased, whether the price is in the range of what the customer has purchased before, or whether the user has bought that brand before. You can combine criteria in order to get a more multi-faceted score for each recommendation.
Once you have a rating for all recommendations, take the top N for each test customer and average the scores, as sketched below. N should be chosen to be the number of quality recommendations you expect the scoring algorithm to reliably generate, while still being large enough to represent what users will see from the system. Doing it this way gives you a weighted percentage that is easy to interpret and explain.
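Here is a sketch of that scoring scheme. The record types, criteria, and weights are all hypothetical; the point is the shape of the computation: a 0-to-1 score per recommendation, then an average of the top N per test customer.

```python
from collections import namedtuple

# Hypothetical record types; a real system would pull these from your data.
Customer = namedtuple("Customer",
                      "purchased_categories min_price max_price purchased_brands")
Item = namedtuple("Item", "category price brand")

def score_recommendation(customer, item):
    """Score one recommendation between 0 and 1 from weighted criteria."""
    score = 0.0
    if item.category in customer.purchased_categories:
        score += 0.5  # matches a category they already buy
    if customer.min_price <= item.price <= customer.max_price:
        score += 0.3  # price is in their usual range
    if item.brand in customer.purchased_brands:
        score += 0.2  # brand they have bought before
    return score

def system_score(test_customers, recommend, n=10):
    """Average the top-N recommendation scores across all test customers."""
    per_customer = []
    for customer in test_customers:
        scores = sorted((score_recommendation(customer, item)
                         for item in recommend(customer)), reverse=True)
        per_customer.append(sum(scores[:n]) / n)
    return sum(per_customer) / len(per_customer)
```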
2. Which Recommender System Approach Should You Use?
Once you know how you will score different systems, determining which approach to use for your system is the next hurdle. Each approach’s implementation will look very different, so choosing this early is wise. Which one of the three approaches to use depends on a few factors.
First of all, it is best to try to reduce the number of approaches based on the available data. For example, if users may include anonymous users, MBA is likely going to be your only option. On the other hand, if you are working with invoice data with only known users, but not any demographic or other information about users, all three might work, but CF is likely your best starting place. If you have full purchase history for all customers, and don’t have to worry about anonymous users, oftentimes CF or KMC will be the better options. In general, the more you know about the customers, the more effective KMC tends to be, whereas the less you know, the more effective MBA will be, with CF somewhere in the middle.
If you are only able to narrow it down to two approaches, it is best to write a quick prototype of both systems and see what the accuracy of the results looks like. When you’re using a tool like MLlib or Mahout, these prototypes can generally be created quickly and to a surprisingly high level of quality. This allows you to get a feel for performance, accuracy, and what types of recommendations will be returned.
Some systems might return a lot of very similar recommendations to what the customer has bought in the past, while others might return a wide variety of recommendations that still make sense, but aren’t as obvious. For example, a user who buys a lot of peanut butter might be recommended different brands of peanut butter, or they might be recommended jelly, bread, and eggs. Which to use is up to you and what you want to achieve with the system. Working with a data scientist or a consultant, such as me, can help you determine which systems are going to meet your objectives, and sort through the myriad of options available in this area.
3. Determining The Parameters For Your System
Once you have chosen which approach to use, you may need to tweak the parameters to get the best accuracy and runtime performance. Some approaches, such as K-Means Clustering and Collaborative Filtering, have hyperparameters: parameters which modify how the algorithm works or how accurate it is. K-Means Clustering also uses dimensions, which can be chosen to change the system’s behavior. All of these can be decided in the same way, making it straightforward to choose these parameters.
To determine the best parameters and dimensions for your approach, create a job that tests the accuracy of each candidate set using the scoring algorithm you created in step one. All results should be stored somewhere or written to the command line, and eventually put into a spreadsheet or some other tool that makes visualization and calculations simple. Note that this sort of analysis may run for hours, or even days, so make the necessary accommodations and run it as few times as possible to get the information you need.
For numeric values, such as the number of clusters or alpha/lambda values, choose a set number of values to try; you can adjust these later if you see a definite trend you want to investigate further. The resulting scores can be plotted in order to determine the best hyperparameter value to use. Hyperparameters are often independent of each other, so looking at them in isolation helps. For example, plotting a set of accuracy scores for each value of K in a KMC algorithm might show that once K reaches around 40-50, the accuracy doesn’t get much better. In that case, you can safely set K to 50, reducing runtime as much as possible while keeping the accuracy up. A sketch of such a sweep follows.
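This hedged sketch sweeps over candidate values of K and writes results out for later plotting. The `features` RDD, `test_customers` list, and `score_for` helper (which applies the step-one scoring algorithm to a trained model) are all assumed names:

```python
import csv
from pyspark.mllib.clustering import KMeans

candidate_ks = [10, 20, 30, 40, 50, 75, 100]

with open("k_sweep.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["k", "score"])
    for k in candidate_ks:
        # Fixed seed so the only variable between runs is K.
        model = KMeans.train(features, k, maxIterations=20, seed=42)
        writer.writerow([k, score_for(model, test_customers)])
```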
For dimensions or other enumerated parameter values, it can be difficult to determine which is best, since different seeds will produce different top-scoring sets. Additionally, dimensions are generally not independent, so you can’t look at each one in isolation as easily.
Despite this, it is possible to get some idea of which ones are more effective than others. To do this, run many jobs with different sets of dimensions or values. Once you have all of the scores, take the average score with and without a given dimension, and subtract the two; this produces a rough value we call the “impact” of that dimension, computed in the sketch below. The more positive the value, the more likely that dimension is to help; the more negative the value, the more likely it hurts. This can be used to reduce the number of dimensions under consideration, making analysis faster. Take care not to treat the impact as an absolute number, and never compare the impact of one dimension to that of another unless both are based on the same dataset. In the end, choosing the right set of dimensions often comes down to a gut feeling based on the scores of the various options. Luckily, oftentimes there will be a number of dimension sets which give you similar accuracy, leaving you a selection of valid options to choose from.
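A rough sketch of the impact computation, over made-up run results:

```python
def impact(runs, dimension):
    """runs: list of (set_of_dimensions, score) pairs from past jobs."""
    with_dim = [s for dims, s in runs if dimension in dims]
    without_dim = [s for dims, s in runs if dimension not in dims]
    return (sum(with_dim) / len(with_dim)
            - sum(without_dim) / len(without_dim))

runs = [({"spend", "state"}, 0.62), ({"spend"}, 0.60), ({"state"}, 0.51)]
print(impact(runs, "spend"))  # (0.62 + 0.60) / 2 - 0.51 ≈ 0.10
```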
When working with KMC, the question often comes up of what to tune first: hyperparameters or dimensions? In my experience, starting with dimensions and then progressing to hyperparameters once you have settled on a set of dimensions works best. This is because dimensions cause greater changes to the results and the system as a whole than hyperparameters do. It’s never a bad idea to recheck the dimensional analysis after the hyperparameters have been determined, but you will rarely see a huge difference at that stage.
Conclusion
Creating a recommender engine can look like a daunting task. With the help of tools like Spark’s MLlib or Mahout and a bit of knowledge, however, it is something that many companies have done and you can too. By focusing on the technique, the parameters and/or dimensions, and the scoring algorithm, you can easily create a system that will give good recommendations quickly.