Forecasting future purchases at the customer level offers businesses a great opportunity to personalize marketing efforts and to focus their resources efficiently. In many cases, historical customer data is readily available, and Machine Learning methods can be used to make various valuable predictions, for example the time of a customer’s next order.
However, these methods usually predict only a single variable for each customer. This is called a univariate prediction and can already be quite useful. But predicting several variables in a multivariate setting provides much greater detail and therefore more possibilities for taking profitable actions. Multivariate predictions also offer a more precise picture of customer behavior.
In this blog post we explain how to use Machine Learning to perform multivariate purchase predictions on a real-world dataset. After describing the problem setup, our first approach will be to combine multiple univariate models. We will then make use of Deep Learning and Neural Networks with a particular architecture to develop a real multivariate model.
Let’s see how we can set up a multivariate purchase prediction problem on a customer level.
We will use the ‘Acquire Valued Shoppers Challenge’ dataset which is provided by Kaggle and can be freely downloaded here. This dataset contains detailed information about almost 350 million transactions of over 300,000 unique customers over 14 months, making it the ideal basis for constructing a multivariate purchase prediction problem.
For all customers we use one year of their transactions as information about their past buying behavior and predict in which of the following four weeks they will make a purchase. To be more precise, we predict four boolean target variables for each customer:
- Target 1: ‘Customer makes a purchase in the first week.’
- Target 2: ‘Customer makes a purchase in the first two weeks.’
- Target 3: ‘Customer makes a purchase in the first three weeks.’
- Target 4: ‘Customer makes a purchase in the first four weeks.’
You can take a look at the picture below to see how we use four weeks of purchase history to construct the target labels for three different customers.
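The construction of the target labels can be sketched in a few lines. The following toy helper (an illustration, not the original pipeline) derives the four cumulative boolean targets from the set of future weeks in which a customer makes a purchase:

```python
def make_targets(purchase_weeks):
    """Derive the four cumulative boolean targets from the set of
    future weeks (1-4) in which a customer makes a purchase."""
    first = min(purchase_weeks) if purchase_weeks else None
    # Target k is True iff the customer purchases within the first k weeks.
    return [first is not None and first <= k for k in range(1, 5)]

# A customer whose first purchase falls in week 2:
print(make_targets({2, 4}))  # [False, True, True, True]

# A customer with no purchase in the four-week window:
print(make_targets(set()))   # [False, False, False, False]
```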
As features we use the available demographic data and derive additional features from the past transactions. These include simple statistics like the total number of purchases and the customer’s preferred product category, but also more complex features like ratios and trends. In total we use a combination of about 30 numerical and 80 categorical and boolean features. If you need some inspiration for your feature engineering, definitely consider reading this great paper.
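As a minimal sketch of the simpler statistics, the following toy example aggregates per-customer features from a transaction log. The column layout and values are illustrative assumptions, not the actual Kaggle schema:

```python
from collections import Counter

# Toy transaction log: (customer_id, category, amount). The real dataset
# has many more fields; these columns are illustrative assumptions.
transactions = [
    ("c1", "dairy", 3.5), ("c1", "dairy", 3.0), ("c1", "snacks", 1.5),
    ("c2", "beverages", 5.0),
]

def customer_features(customer_id, transactions):
    rows = [t for t in transactions if t[0] == customer_id]
    categories = Counter(cat for _, cat, _ in rows)
    return {
        "n_purchases": len(rows),                       # simple count feature
        "total_spend": sum(amt for _, _, amt in rows),  # simple sum feature
        "preferred_category": categories.most_common(1)[0][0] if rows else None,
    }

print(customer_features("c1", transactions))
# {'n_purchases': 3, 'total_spend': 8.0, 'preferred_category': 'dairy'}
```

Ratio and trend features would be built the same way, for example by comparing aggregates over recent versus older time windows.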
After setting up the problem, we now want to find out how hard it actually is to solve. The following baseline runs almost out of the box.
A simple approach for solving a multivariate prediction problem is to treat the different target variables independently and to use a univariate model for each of them. In our case we train four models, one for each of the four target variables.
Here, the main advantage is that we reduce the problem to the much easier and more common univariate case. This enables us to make use of the open-source implementations provided by libraries such as sklearn and XGBoost which are highly optimized, making this approach a strong baseline.
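The baseline then boils down to a simple loop: fit one univariate classifier per target column. A minimal sketch with sklearn on toy data (feature values, target construction, and model settings here are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the engineered feature matrix X and the (n, 4)
# boolean target matrix Y described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
first_week = rng.integers(1, 6, size=200)  # 5 plays the role of "no purchase"
Y = np.stack([first_week <= k for k in range(1, 5)], axis=1)

# The baseline: one independent univariate classifier per target variable.
models = [RandomForestClassifier(n_estimators=50, random_state=0).fit(X, Y[:, i])
          for i in range(4)]
preds = np.column_stack([m.predict(X) for m in models])
print(preds.shape)  # (200, 4)
```

Note that nothing ties the four models together: each column of `preds` is produced in isolation.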
Take a second to think about how the target variables interact with each other.
Imagine for example a customer who makes a purchase in the first week.
Well, then their variable for target 1 has to be True, since they made a purchase in the first week. But the other variables have to be True as well, because a purchase in the first week is in particular a purchase in the first two, three and four weeks.
Humans understand these connections naturally, but the baseline does not.
The models work independently and cannot communicate with each other; as a result, they might output predictions that are self-contradictory.
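This constraint can be stated precisely: because the targets are cumulative, a valid prediction must be monotone — once a target is True, every later target must be True as well. A small helper (an illustrative sketch) detects violations:

```python
def is_consistent(prediction):
    """Cumulative targets must be monotone: a True may never be
    followed by a False further down the list."""
    return all(later or not earlier
               for earlier, later in zip(prediction, prediction[1:]))

print(is_consistent([True, True, True, True]))    # True
print(is_consistent([False, False, True, True]))  # True
print(is_consistent([True, False, True, True]))   # False: a week-1 purchase
                                                  # contradicts target 2
```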
So how can we incorporate this ‘common sense’ into our predictions?
This is the moment where we switch gears to Artificial Neural Networks and start to build a single multivariate classifier.
Why use Neural Networks?
In recent years, Deep Learning and in particular Neural Networks (NN) have had enormous success with a variety of Machine Learning problems, arguably even surpassing human-level performance in a few of them.
Part of that success stems from the ability to grasp more complex dependencies within data than traditional Machine Learning models. This makes them very well suited for our multivariate prediction problem, since capturing these dependencies – for example between the target variables – is exactly what we are looking for. Moreover, Neural Network architectures are incredibly versatile and flexible, and we will now see how to make use of that.
Neural Network Architecture
Let’s take one step back for a moment and assume that we want to predict just a single boolean variable with our network. How would we do that?
Well, given an input vector, we would pass it through some hidden layers until it arrives at the output layer, where we have only two neurons – one indicating True and one indicating False. Intuitively we want the network to fire at the True output neuron, whenever we pass a positive sample and at the False output neuron for a negative sample.
Now think about the multivariate case where we want to predict four different boolean variables.
Again, we pass the input vector through the hidden layers, up to the last hidden layer, one step before the output layer. At this point, the Neural Network has already applied most of its knowledge. Using the trained parameters in the hidden layers, it has transformed the input vector into an internal representation which encodes the conclusions it has drawn so far.
Instead of giving this representation now to a single output layer of two neurons, we pass it to four output layers (two neurons each), one for each target variable. By sharing the hidden layers – and thus learning a single encoding – our predictions will be strongly linked together, while the respective output layers are still able to compute separate variables.
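The shared-trunk, multi-head architecture described above can be sketched in the Keras functional API. Layer sizes, the dropout rate, and the feature count are illustrative assumptions, not the tuned values:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 110  # roughly the feature count described above (assumption)

inputs = keras.Input(shape=(n_features,))
# Shared hidden layers: one internal representation for all four targets.
x = layers.Dense(64, activation="selu", kernel_initializer="lecun_normal")(inputs)
x = layers.AlphaDropout(0.05)(x)
x = layers.Dense(64, activation="selu", kernel_initializer="lecun_normal")(x)

# Four separate output heads, one per target variable, with two softmax
# neurons each (True / False).
outputs = [layers.Dense(2, activation="softmax", name=f"target_{k}")(x)
           for k in range(1, 5)]

model = keras.Model(inputs=inputs, outputs=outputs)
# With equal (default) loss weights, the per-head cross-entropy gradients
# are effectively averaged when back-propagated into the shared layers.
model.compile(optimizer="adam", loss="categorical_crossentropy")
print(len(model.outputs))  # 4
```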
Training the Neural Network
Training Neural Networks is hard. There is a vast number of hyperparameters to tune and new techniques to test, and using a custom architecture with multiple outputs surely doesn’t make life easier. Let’s quickly see how the training can be done.
For the output layers we used a softmax activation function and a binary cross entropy loss function. To back-propagate the gradients to the hidden layers, we simply averaged the gradients of the output layers.
For the hidden layers we found that a SELU activation function combined with Alpha-Dropout was superior to other approaches. To learn more about self-normalizing Neural Networks, check out this amazing paper.
For other parameters, including learning rate, optimizer, training epochs, and batch size, we implemented a grid search with heuristic speedups. The Neural Network itself was implemented using the Keras Model API. If you have trouble training your own net, this might help you big time.
Following the problem description above, we use a subset of the data to construct a train set (~13,000 samples) and a test set (~5,500 samples). We have to keep in mind that considering multiple weeks for one target introduces an imbalance into the distribution of target variables: in the first week, already about 80% of the customers make a purchase, but within the four weeks for target 4, it is over 98%. Since we have very few samples of the underrepresented target class to learn from, it will be challenging to predict them correctly.
For the evaluation we will therefore compare the F1-scores for the target labels separately. As a second metric, we analyze the number of self-contradictory predictions, as described in the section Common Sense.
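For reference, the per-class F1-score we compare can be computed in a few lines. This is a plain-Python sketch with made-up labels; in practice `sklearn.metrics.f1_score` does the same job:

```python
def f1_score(y_true, y_pred, positive=False):
    """F1 for one class of one target variable; here the minority False class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [True, True, False, False, True]   # toy labels for one target
y_pred = [True, False, False, True, True]
print(round(f1_score(y_true, y_pred, positive=False), 2))  # 0.5
```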
As a baseline we trained Decision Trees, Random Forests and Gradient Boosted Trees (XGBoost implementation) and used hyperparameter optimization techniques from sklearn. These algorithms are not only widely used and well studied, but have also proven to be successful in winning many Kaggle challenges.
As you can see, the Neural Network is not quite able to surpass the baseline approach in terms of F1-score, especially not the XGBClassifier. Nevertheless, the Network delivers competitive results, and for the difficult False class it is never worse than the best-performing model by more than 0.02.
On the other hand the Neural Network proves to be very strong in making consistent predictions.
Out of the roughly 5,500 predictions made on the test data, several hundred are self-contradictory for the Decision Tree and the Random Forest, and still 10 for the XGBClassifier, while the Neural Network does not make a single one of these mistakes.
Generally speaking, we can conclude that Neural Networks are capable of delivering comparably good results while making more consistent and interpretable predictions than the baseline approach. When dealing with real-world problems, self-contradictory predictions are hard to justify, so the Neural Network might be particularly useful in that setting.
In this blog post we have seen how to use a Neural Network to deal with a multivariate purchase prediction problem. While it does not yet prove to be superior in accuracy, only the Deep Learning approach is able to produce predictions without self-contradictions.
We do believe that further progress will be made in the future using improved network architectures or more sophisticated training and inference techniques, and we will keep working on that for a follow-up post.
If you think about doing purchase predictions as well, then you should definitely consider doing multivariate predictions. We would love to hear about your experiences and insights.
In contrast to purchase prediction, it is also possible to consider churn prediction. To learn more about this interesting topic, check out this blog post, which shows how to precisely forecast future churners in the online game Blade & Soul.