Who Will See the Reward Program Offer? Let's Create a Prediction Model With Python, Scikit-Learn, and XGBoost

김희규
Published in The Startup · 11 min read · Oct 24, 2020

Using the “Targeted Marketing Strategy for Starbucks” dataset, let's build a model that predicts who will view the rewards program offer.

Photo by TR on Unsplash

Business Understanding

A rewards program is a common marketing method for encouraging users to make purchases. But if users receive programs they are not interested in, they may unsubscribe from the company's marketing channels or form a negative image of the brand. That's why it's important to send reward-program SMS and emails only to interested users. In this case, a user's interest is determined by whether or not the user viewed the program. By creating a model that predicts whether a user will view the program, you can deliver the right program to the right users.

Metrics

In this article, I will explain how I built a model to predict whether users will view the program or not. To evaluate the model, I first need to set criteria and metrics.

Accuracy is a common metric for almost all classification models. But 75% of the users who received an offer viewed it. That means a model that simply predicts every user will view the offer gets 75% accuracy 😱.

So we should check not only accuracy but also precision, recall, and F1 score. The F1 score is a widely used metric for imbalanced data. It is the harmonic mean of precision and recall, and it is a better indicator than accuracy on imbalanced data because the harmonic mean penalizes extreme values. You can see the definitions in the images below.

Precision and recall from https://en.wikipedia.org/wiki/F-score
Definition of F1 score from https://en.wikipedia.org/wiki/F-score
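As a quick sanity check, the F1 computation can be sketched with Scikit-Learn. The labels below are toy values chosen to resemble this dataset's imbalance (6 of 8 users viewed the offer), not actual data:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 6 of 8 users viewed the offer, and a lazy model
# almost always predicts "viewed".
y_true = [1, 1, 1, 0, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # 6 true positives / 7 predicted positives
recall = recall_score(y_true, y_pred)        # 6 true positives / 6 actual positives

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
```

Even though recall is perfect, the single false positive drags precision, and therefore F1, below 1 — which is exactly the penalty accuracy alone would hide.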

Data Exploration & Visualization

First, we can download the dataset from the Kaggle link below.

Let's look at the files in the dataset.

Portfolio.json

Portfolio.json file contents and statistics
Box plot of rewards, difficulties, and durations grouped by offer type
Channels of offers by offer type
  • Contains information on 10 offers that users can receive.
  • There are 3 types of offers: BOGO (Buy-One, Get-One), discount, and informational.
  • Informational offers don't have a reward.
  • Offers are delivered through multiple channels.
View probability by offer type and by individual offer

On average, the view probability was slightly higher for BOGO offers, but when broken down by individual offer, there were big differences even within the BOGO offer type.

Profile.json

profile.json file rows and statistics
  • Profile is the information of users.
  • Users with unknown age are recorded as 118 years old.
  • A user's gender, age, and income are either all missing or all present.
  • 2,175 users (12%) have no age, gender, or income data (NaN); they only have an event history.
Distribution of age (left), scatter plot of age vs. income (right)

Most users are between 50 and 70 years old. The distribution of income by age was similar regardless of gender.

Histogram of ages by gender (left), bar chart of gender counts (right)

Among users under 40, there were noticeably more males.

Transcript.json

Transcript contains events in which a user received, viewed, or completed an offer, or made a transaction. The person column holds the user's ID from profile. Completed-offer events record the reward gained, and informational offers have no completed events.

The value column is a dictionary whose keys differ by event type, so I need to split the transcript file in the preprocessing phase.

time distribution by offer types

The left side of the plot above is a histogram of offer-received times. As you can see, users received offers at specific times (0, 168, 336, 408, 504, 576; the unit is hours, starting from zero). Offer views happened mostly right after an offer was received. So I decided that events with a time value between 0 and 503 form the train set, and events with a time value greater than 503 form the test set. This means we predict future user actions from past history.

Distributions of transaction amounts

A lot of 0-amount transactions occurred, and I need to check the exact reason. Users may have made free purchases using a coupon or event.

Data Preprocessing

Handle Missing Values

If a user's age or income is a missing value, it is replaced with the average value. A not_na column is also added to flag whether age, income, and gender were missing: 0 if they were NaN, 1 otherwise.
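A minimal sketch of this step, assuming a pandas DataFrame with the dataset's age/income/gender columns (the three rows here are made up for illustration):

```python
import pandas as pd

# Hypothetical slice of profile.json; the column names follow the
# dataset, but the values are invented.
profile = pd.DataFrame({
    "age": [55.0, 118.0, 40.0],      # 118 encodes "unknown" in this dataset
    "income": [70000.0, None, 50000.0],
    "gender": ["F", None, "M"],
})

# Flag rows where demographic data is present (age 118 means unknown).
profile["not_na"] = (profile["age"] != 118).astype(int)

# Replace unknown ages and incomes with the mean of the known values.
known = profile[profile["not_na"] == 1]
profile.loc[profile["not_na"] == 0, "age"] = known["age"].mean()
profile["income"] = profile["income"].fillna(known["income"].mean())
```

Computing the means only over known rows avoids the 118 placeholder dragging the average age upward.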

Categorical to numeric

For each categorical value, a column was added that is 1 when the value occurs and 0 otherwise (one-hot encoding). The channels column in the portfolio can hold multiple values at once.

encoded profile data

Object to numeric

became_member is a date-format string (YYYYMMDD). I extracted the year and converted it to a numeric value.
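That conversion is a one-liner in pandas; the column name and values below are illustrative:

```python
import pandas as pd

# The membership date is stored as an integer like 20180726 (YYYYMMDD).
profile = pd.DataFrame({"became_member": [20180726, 20151104]})

# Integer division by 10000 drops the month and day, leaving the year.
profile["member_since"] = profile["became_member"] // 10000
```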

Create input and output data for prediction

Combine each user's information with the offers that user received, and add an offer viewed column, which is 1 if the user viewed the offer and 0 otherwise. Some users receive the same offer multiple times or view it multiple times; to simplify the model, I collapsed those multiple events into a single 1.

The model takes a user's information together with the information of an offer that user received, and outputs the probability that the user will view that offer.

Train/Test Split

Offer received/viewed events with a time value between 0 and 503 form the train set; those with a time value greater than 503 form the test set.
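A sketch of the split, assuming a time column in hours as in transcript.json (the five events are hypothetical):

```python
import pandas as pd

# Hypothetical offer events keyed by the hour they occurred at.
events = pd.DataFrame({"time": [0, 168, 336, 504, 576],
                       "offer_viewed": [1, 0, 1, 1, 0]})

# Events up to hour 503 train the model; later events test it,
# so we always predict future behaviour from past history.
train = events[events["time"] <= 503]
test = events[events["time"] > 503]
```

Splitting by time rather than randomly keeps test-set information from leaking into training, which matters because each user appears in many events.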

Normalization

Since the feature values have very different ranges, Scikit-Learn's StandardScaler is used to unify their distributions.
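A minimal sketch of the scaling step; the two columns stand in for age and income, and the scaler fitted on the train set is reused on the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns like age (~tens) and income (~tens of thousands) live on
# very different scales; StandardScaler rescales each column to
# mean 0 and variance 1.
X_train = np.array([[25.0, 30000.0], [45.0, 60000.0], [65.0, 90000.0]])
X_test = np.array([[35.0, 45000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# The test set reuses the statistics fitted on the training set.
X_test_scaled = scaler.transform(X_test)
```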

Benchmark

I couldn't find a similar project on the internet. Instead, I implemented a baseline model using RandomForestClassifier, choosing the variant that maximized accuracy and F1 score on the test dataset. Here's the performance of the baseline model. You can see that it is strongly overfitted.

Report for baseline
Train Report>
              precision    recall  f1-score   support

         0.0       0.96      0.89      0.92     10469
         1.0       0.97      0.99      0.98     34906

    accuracy                           0.97     45375
   macro avg       0.96      0.94      0.95     45375
weighted avg       0.97      0.97      0.96     45375

Test Report>
              precision    recall  f1-score   support

         0.0       0.62      0.50      0.56      6102
         1.0       0.85      0.90      0.87     18419

    accuracy                           0.80     24521
   macro avg       0.73      0.70      0.71     24521
weighted avg       0.79      0.80      0.79     24521
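The baseline above can be sketched roughly as follows. The feature matrix here is a synthetic stand-in for the preprocessed user/offer table, so the numbers won't match the report:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for the preprocessed feature matrix; the real
# features come from the merged profile and portfolio tables.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

baseline = RandomForestClassifier(n_estimators=100, random_state=42)
baseline.fit(X_train, y_train)

# With unlimited tree depth, the forest memorizes the training data;
# that memorization is the train/test gap visible in the report above.
print(classification_report(y_train, baseline.predict(X_train)))
```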

Model Training

I used Scikit-Learn's StandardScaler for normalization and XGBoost as the classifier.

XGBoost model from https://xgboost.readthedocs.io/en/latest/tutorials/model.html

XGBoost is a representative ensemble boosting model. Several decision trees conduct classification in sequence: first, sampled values are fed to tree 1, then incorrectly predicted samples are weighted more heavily when training tree 2, and this is repeated over several trees.

XGBoost Regularization from https://xgboost.readthedocs.io/en/latest/tutorials/model.html

Regularization is also used in XGBoost. It controls the complexity of the model and prevents it from depending too heavily on specific features, which greatly reduces overfitting. XGBoost also parallelizes tree construction, which speeds up learning.

Improving model

Inserting the user's total transaction amount

I added each user's total transaction amount to the user information, because the model couldn't estimate probabilities for the 12% of users with NaN age, income, and gender values. Their input was just average values and the program's information!

total transaction amount by person

This actually slightly increased both the train and test accuracy (1–2%p). I also took the log of the amount column to bring it closer to a normal distribution.
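The log transform can be sketched like this; the per-user totals are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical per-user totals; transaction amounts are heavily
# right-skewed, with a few very large spenders.
totals = pd.Series([0.0, 12.5, 30.0, 2500.0])

# log1p handles zero amounts safely and compresses the long right
# tail, bringing the column closer to a normal distribution.
log_totals = np.log1p(totals)
```

np.log1p is preferred over np.log here because the many 0-amount users would otherwise map to negative infinity.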

Removing unnecessary columns

Because all offers are delivered by email, the email column is always 1, so it is removed. The member_since column is also removed because correlation analysis and feature-importance analysis showed it had very little influence. Reducing the number of features helps avoid overfitting.

Hyperparameter Search

The optimal hyperparameters were searched for with GridSearchCV.

Cross Validation

10-fold cross validation was used to verify that the model predicts well on data it has not been trained on.

Evaluation

Train Report>
              precision    recall  f1-score   support

         0.0       0.73      0.58      0.65      9886
         1.0       0.88      0.94      0.91     33067

    accuracy                           0.85     42953
   macro avg       0.81      0.76      0.78     42953
weighted avg       0.85      0.85      0.85     42953

Test Report>
              precision    recall  f1-score   support

         0.0       0.67      0.61      0.64      5362
         1.0       0.88      0.90      0.89     16808

    accuracy                           0.83     22170
   macro avg       0.77      0.76      0.77     22170
weighted avg       0.83      0.83      0.83     22170
  • The gap between train and test performance is small: the train/test accuracy difference is 0.17 for the baseline but only 0.02 for XGBoost.
  • Compared to the baseline model, both accuracy and F1 score improved (test F1 score: 0.87 → 0.89, test accuracy: 0.80 → 0.83).

Error Analysis

I compared the distributions of the data the model predicts correctly and the data it predicts incorrectly. However, no significant difference was found between the distributions.

Confusion Matrix

My model is good at predicting who will view an offer, but relatively poor at predicting who will not. Because of the imbalanced data, predicting “not viewed” is harder. In the figure on the right you can see that most predicted probabilities are near 1.0 and few are near 0.0, which means the model struggles to predict “not viewed”.

Although the overall accuracy for users without profile information was not much different from that for users with information, the accuracy of predicting that a user without information will not view an offer was significantly low (F1 score 0.32).

The model mispredicted about 500 (18%) of the users with no age, income, or gender values, which was far fewer than I expected. Still, this is a high error rate considering those users account for only 12% of the total.

When building a recommendation system that sends actual offers, it would help to handle users without profile information separately (e.g., with a dedicated model) in order to avoid sending the wrong offers.

Feature Importance from XGBClassifier

  • Total transaction amount had high feature importance. Whether an offer was delivered through social media had an overwhelmingly large influence.

How improvements were made

Optimal Hyperparameter

{
'xgb__colsample_bylevel': 0.9,
'xgb__colsample_bytree': 0.8,
'xgb__gamma': 1,
'xgb__max_depth': 6,
'xgb__min_child_weight': 3,
'xgb__n_estimators': 50,
}

The XGBClassifier with these hyperparameters showed the best F1 score, 0.8978. To explain each parameter:

  • colsample_bylevel: the subsample ratio of columns for each split, at each level.
  • colsample_bytree: the fraction of columns randomly sampled for each tree.
  • gamma: the minimum loss reduction required to make a split; a node is split only when the split gives a positive reduction in the loss function.
  • max_depth: the maximum depth of each decision tree.
  • min_child_weight: the minimum sum of instance weights required in a child node.
  • n_estimators: the number of decision trees.

Changing prediction threshold

Accuracy and F1 score by threshold

Changing the prediction threshold to 0.4 (predict 1 if the probability is at least 0.4, and 0 otherwise) gives the best accuracy and F1 score, although the difference from the default threshold is not very large.
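Thresholding predicted probabilities is a one-liner; the probabilities below are hypothetical model outputs:

```python
import numpy as np

# Hypothetical predicted view probabilities from the classifier.
probabilities = np.array([0.35, 0.42, 0.55, 0.90])

# The default 0.5 cutoff misses borderline viewers; lowering it to
# 0.4 flips those borderline cases to "viewed", trading a little
# precision for recall on this imbalanced data.
default_predictions = (probabilities >= 0.5).astype(int)
tuned_predictions = (probabilities >= 0.4).astype(int)
```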

Feature Engineering Results

  • Inserting the user's total transaction amount improved test accuracy by 1.2%p. It can help predict users without profile information.
  • Taking the log of the total transaction amount to bring it closer to a normal distribution gave a very small additional increase (F1 score +0.3%p).
  • Removing useless or barely influential columns affected prediction very little.

Conclusion

  • By combining each user's information with the information of the offers that user received, a model was created to predict whether a received offer was viewed.
  • It was difficult to improve accuracy because of the imbalanced data. However, using the F1 score as the metric, 0.89 was achieved.
  • While preprocessing the data, I handled missing values, converted categorical columns to numeric types, normalized the columns, and split the data into train and test sets by time.
  • An XGBoost classifier was used for the model. It is a state-of-the-art model that is robust to overfitting, fast to train, and accurate.
  • It performed somewhat better than the benchmark, with little overfitting and high accuracy and F1 scores. But accuracy at predicting that an offer will not be viewed was low, and even lower when there is no user information.
  • I improved the model's performance through hyperparameter optimization with GridSearchCV, cross validation, removing unnecessary columns, and adding users' total transaction amounts.
  • Even without knowing any user information other than the total transaction amount, the model predicted more accurately than I expected. That's interesting.
  • Error analysis didn't yield good ideas for improving model performance. Instead, I'll need other approaches, such as a special model for handling users with missing values.

Other approaches for improvement?

  • If there were more features and data about the users, accuracy would be higher.
  • With more data, I could apply the YouTube recommendation system algorithm. When that method is applied to the current dataset, the dataset shrinks greatly, so the test accuracy is only 85%.

For the detailed implementation, see my GitHub repository. If you have any good ideas, please leave a comment. Thank you for reading!

