Description of the Kaggle competition

This is a private binary classification competition for the “Machine Learning” course, part of CEU BA, spring semester of 2020.

In this competition, your task is to predict which articles are shared the most on social media. The data comes from the website mashable.com and dates from the beginning of 2015. The dataset used in the competition can be found at the UCI repository (https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity#) - of course, you should not cheat by checking out the whole dataset found there.

Make sure you submit at least the following 4 types of solutions:

  • linear model prediction after parameter tuning
  • random forest prediction after parameter tuning
  • gradient boosting prediction after parameter tuning
  • neural network prediction after parameter tuning.

Your best model will need to have at least 0.65 AUC to consider this part of the exam complete.

For extra points, build a stacked model, explain how it works and evaluate its results.

Setup

The training data table consists of 27752 observations and 60 variables. The business question is to predict which articles become popular (widely shared) on social networks, so the y variable in this case is is_popular. The target observations are the test data on the private leaderboard of the Kaggle competition.

Data exploration and cleaning

First I am going to get an overview of the attributes in this dataset.

Data summary
Name train_data
Number of rows 27752
Number of columns 60
_______________________
Column type frequency:
numeric 60
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
n_tokens_title 0 1 10.38 2.11 3.00 9.00 10.00 12.00 20.00 ▁▅▇▁▁
n_tokens_content 0 1 546.56 471.66 0.00 246.00 409.00 717.00 7764.00 ▇▁▁▁▁
n_unique_tokens 0 1 0.56 4.21 0.00 0.47 0.54 0.61 701.00 ▇▁▁▁▁
n_non_stop_words 0 1 1.01 6.25 0.00 1.00 1.00 1.00 1042.00 ▇▁▁▁▁
n_non_stop_unique_tokens 0 1 0.70 3.90 0.00 0.63 0.69 0.75 650.00 ▇▁▁▁▁
num_hrefs 0 1 10.92 11.40 0.00 4.00 8.00 14.00 304.00 ▇▁▁▁▁
num_self_hrefs 0 1 3.30 3.85 0.00 1.00 3.00 4.00 116.00 ▇▁▁▁▁
num_imgs 0 1 4.61 8.36 0.00 1.00 1.00 4.00 111.00 ▇▁▁▁▁
num_videos 0 1 1.23 4.07 0.00 0.00 0.00 1.00 91.00 ▇▁▁▁▁
average_token_length 0 1 4.55 0.84 0.00 4.48 4.66 4.86 8.04 ▁▁▇▃▁
num_keywords 0 1 7.23 1.91 1.00 6.00 7.00 9.00 10.00 ▁▂▇▇▇
data_channel_is_lifestyle 0 1 0.05 0.23 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
data_channel_is_entertainment 0 1 0.18 0.38 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
data_channel_is_bus 0 1 0.16 0.36 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
data_channel_is_socmed 0 1 0.06 0.23 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
data_channel_is_tech 0 1 0.19 0.39 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
data_channel_is_world 0 1 0.21 0.41 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
kw_min_min 0 1 26.26 69.82 -1.00 -1.00 -1.00 4.00 377.00 ▇▁▁▁▁
kw_max_min 0 1 1165.99 4147.64 0.00 448.00 662.00 1000.00 298400.00 ▇▁▁▁▁
kw_avg_min 0 1 314.70 670.78 -1.00 142.50 235.94 358.22 42827.86 ▇▁▁▁▁
kw_min_max 0 1 13529.57 57297.42 0.00 0.00 1400.00 7900.00 843300.00 ▇▁▁▁▁
kw_max_max 0 1 751996.26 214875.42 0.00 843300.00 843300.00 843300.00 843300.00 ▁▁▁▁▇
kw_avg_max 0 1 258825.28 134555.01 0.00 172477.50 244297.86 330700.00 843300.00 ▃▇▃▁▁
kw_min_avg 0 1 1118.68 1138.40 -1.00 0.00 1024.77 2060.03 3613.04 ▇▃▃▂▂
kw_max_avg 0 1 5676.57 6346.89 0.00 3566.30 4359.61 6018.30 298400.00 ▇▁▁▁▁
kw_avg_avg 0 1 3140.74 1343.01 0.00 2387.63 2870.55 3608.71 43567.66 ▇▁▁▁▁
self_reference_min_shares 0 1 4123.98 20926.43 0.00 638.00 1200.00 2700.00 843300.00 ▇▁▁▁▁
self_reference_max_shares 0 1 10401.31 41383.32 0.00 1100.00 2800.00 7900.00 843300.00 ▇▁▁▁▁
self_reference_avg_sharess 0 1 6479.49 24979.25 0.00 978.79 2200.00 5199.40 843300.00 ▇▁▁▁▁
weekday_is_monday 0 1 0.17 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
weekday_is_tuesday 0 1 0.19 0.39 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
weekday_is_wednesday 0 1 0.19 0.39 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
weekday_is_thursday 0 1 0.18 0.39 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
weekday_is_friday 0 1 0.14 0.35 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
weekday_is_saturday 0 1 0.06 0.24 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
weekday_is_sunday 0 1 0.07 0.25 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
is_weekend 0 1 0.13 0.34 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
LDA_00 0 1 0.18 0.26 0.00 0.03 0.03 0.24 0.92 ▇▁▁▁▁
LDA_01 0 1 0.14 0.22 0.00 0.03 0.03 0.15 0.93 ▇▁▁▁▁
LDA_02 0 1 0.21 0.28 0.00 0.03 0.04 0.33 0.92 ▇▁▁▁▁
LDA_03 0 1 0.22 0.30 0.00 0.03 0.04 0.38 0.93 ▇▁▁▁▂
LDA_04 0 1 0.24 0.29 0.00 0.03 0.04 0.40 0.93 ▇▂▁▁▂
global_subjectivity 0 1 0.44 0.12 0.00 0.40 0.45 0.51 1.00 ▁▃▇▁▁
global_sentiment_polarity 0 1 0.12 0.10 -0.39 0.06 0.12 0.18 0.73 ▁▂▇▁▁
global_rate_positive_words 0 1 0.04 0.02 0.00 0.03 0.04 0.05 0.16 ▅▇▁▁▁
global_rate_negative_words 0 1 0.02 0.01 0.00 0.01 0.02 0.02 0.18 ▇▁▁▁▁
rate_positive_words 0 1 0.68 0.19 0.00 0.60 0.71 0.80 1.00 ▁▁▃▇▃
rate_negative_words 0 1 0.29 0.16 0.00 0.19 0.28 0.38 1.00 ▅▇▃▁▁
avg_positive_polarity 0 1 0.35 0.10 0.00 0.31 0.36 0.41 1.00 ▁▇▃▁▁
min_positive_polarity 0 1 0.10 0.07 0.00 0.05 0.10 0.10 1.00 ▇▁▁▁▁
max_positive_polarity 0 1 0.76 0.25 0.00 0.60 0.80 1.00 1.00 ▁▁▅▅▇
avg_negative_polarity 0 1 -0.26 0.13 -1.00 -0.33 -0.25 -0.19 0.00 ▁▁▂▇▃
min_negative_polarity 0 1 -0.52 0.29 -1.00 -0.70 -0.50 -0.30 0.00 ▆▆▇▅▅
max_negative_polarity 0 1 -0.11 0.10 -1.00 -0.12 -0.10 -0.05 0.00 ▁▁▁▁▇
title_subjectivity 0 1 0.28 0.32 0.00 0.00 0.15 0.50 1.00 ▇▂▂▁▂
title_sentiment_polarity 0 1 0.07 0.27 -1.00 0.00 0.00 0.15 1.00 ▁▁▇▂▁
abs_title_subjectivity 0 1 0.34 0.19 0.00 0.17 0.50 0.50 0.50 ▃▂▁▁▇
abs_title_sentiment_polarity 0 1 0.16 0.23 0.00 0.00 0.00 0.25 1.00 ▇▂▁▁▁
is_popular 0 1 0.20 0.40 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
article_id 0 1 19794.97 11441.80 2.00 9883.50 19765.50 29699.50 39644.00 ▇▇▇▇▇

Attribute description:
The features fall into two broad categories: quantitative information about the article (for example the number of images or videos) and qualitative information (for example the day of publication and the topic channel the article falls under).

Missing values:
There are no missing values in any of the observations.

Factorization:
For binary prediction with caret, the target variable must be a factor, so I am going to coerce it to one. I also tried converting the independent variables to factors, but ran into problems when building the h2o stacked ensemble model, so I will keep them as integer values.
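
A minimal sketch of this step, assuming the column name is_popular_f used later in the report and level labels of my own choosing (caret requires factor levels that are valid R names when class probabilities are computed):

# coerce the 0/1 target to a factor; the labels "not_popular"/"popular" are assumptions
train_data$is_popular_f <- factor(train_data$is_popular,
                                  levels = c(0, 1),
                                  labels = c("not_popular", "popular"))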

Now I look at the outcome variable is_popular:

The majority of articles (roughly 80%, given the mean of 0.20 for is_popular) were labeled as unpopular.

In addition I created a data profiling report with the ‘DataExplorer’ package, in order to have a better understanding of the dataset.

From the exploratory data analysis functions that were run in this report I learned that

  • most articles were published during the week rather than on the weekend (mainly on Tuesdays, Wednesdays and Thursdays).
  • the most common channel is “world”, followed by “tech” and “entertainment”.
  • the number of keywords is mostly between 5 and 10.
  • most articles use zero or very few images and videos.

The report is available as a separate html document.

Model Training

I am going to train five models and compare their performance using AUC. AUC is a widely used measure of discrimination: it tells us how well a classifier separates observations into the two groups of interest, those with and those without the outcome.

Linear Model

First I am training a logit LASSO model as my benchmark: a logistic regression with an L1 penalty that shrinks coefficients and thereby reduces model complexity.

I am going to try different lambdas, which define how much weight is put on the penalty (the larger lambda, the stronger the penalty). Cross-validation is used to pick the best lambda.

The data is pre-processed before training, which includes centering and scaling. I am going to set the metric to “ROC” to choose models based on AUC.
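
A hedged sketch of this setup, assuming 5-fold cross-validation and a log-spaced lambda grid (neither is spelled out in the report):

library(caret)

# 5-fold CV optimising ROC (AUC); class probabilities are needed for twoClassSummary
train_control <- trainControl(method = "cv",
                              number = 5,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

# alpha = 1 makes glmnet a pure LASSO; the lambda grid is an assumption
lasso_model <- train(is_popular_f ~ . - is_popular - article_id,
                     data = train_data,
                     method = "glmnet",
                     preProcess = c("center", "scale"),
                     metric = "ROC",
                     trControl = train_control,
                     tuneGrid = expand.grid(alpha = 1,
                                            lambda = 10^seq(-4, -1, length.out = 10)))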

The LASSO algorithm picked a regression with 49 predictor variables/interactions and lambda = 0.0004641589. I will use the trained model to predict, on the holdout dataset, the probability of an article being popular. The default threshold for prediction is 50%. If we were to increase the threshold for labeling an observation as positive, we would label fewer cases as positive and therefore decrease both the false positives and the true positives.

The ROC curve summarizes how a binary classifier performs “overall”, taking into account all possible thresholds. It shows the trade-off between the true positive rate (a.k.a. sensitivity, # true positives / # all positives) and the false positive rate (a.k.a. 1 - specificity, # false positives / # all negatives).

AUC is the “area under the (ROC) curve”. This is a number between 0 and 1. Higher AUC generally means better classification.

The bigger the area under the ROC curve, the better our prediction. In this case we have an AUC of 0.7111.
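
As a short sketch, the AUC and ROC curve can be obtained from the predicted class probabilities with the pROC package (the holdout object names are assumptions):

library(pROC)

# predicted probability of the positive class on the holdout set
holdout_probs <- predict(lasso_model, newdata = holdout_data, type = "prob")$popular

roc_obj <- roc(response = holdout_data$is_popular_f, predictor = holdout_probs)
auc(roc_obj)    # area under the curve
plot(roc_obj)   # ROC curve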

Random Forest

Next, I am going to build a Probability Random Forest to predict y.

Random forest combines multiple decision trees, each fit to a random sample of the original data. It builds a predictive model without manual variable selection or a pre-specified functional form. By combining many trees it reduces variance with only a minimal increase in bias.

I am going to use the default option of growing 500 trees and tune several parameters, such as the number of variables randomly chosen for any split in the tree and the minimum number of observations in the terminal nodes. To find the optimal parameters I use 5-fold cross-validation. Data scaling was not applied during training here.
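
A sketch of this setup, assuming the ranger implementation via caret (the report does not name the backend, and the grid values are assumptions):

# grid over mtry (variables tried at each split) and minimum terminal node size
rf_grid <- expand.grid(mtry = c(5, 7, 9),
                       splitrule = "gini",
                       min.node.size = c(5, 10, 15))

rf_model <- train(is_popular_f ~ . - is_popular - article_id,
                  data = train_data,
                  method = "ranger",
                  num.trees = 500,
                  metric = "ROC",
                  trControl = trainControl(method = "cv",
                                           number = 5,
                                           classProbs = TRUE,
                                           summaryFunction = twoClassSummary),
                  tuneGrid = rf_grid)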

The performance of the random forest model changed only marginally with different tuning parameters. The best results were produced with the following parameters: 5 features considered for each split and a minimum of 10 observations in each terminal node.

The random forest model produced an AUC of 0.7213, which is an improvement over the previous model (Logit Lasso AUC: 0.7111).

Gradient Boosting

Gradient boosting machines are also an ensemble of trees; however, the trees are built differently. The idea is to improve the model gradually (step by step), fitting each new tree to the residuals of the previous ones. To avoid overfitting the data, each tree's contribution is shrunk by a learning rate (shrinkage).

I am going to tune several hyperparameters: the number of trees (the number of gradient boosting iterations), the tree depth, the learning rate and the minimum number of observations in the trees' terminal nodes.
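
A sketch of the corresponding caret grid; the grid values are assumptions chosen around the final values reported below, and the PCA preprocessing reflects the variant discussed two paragraphs down:

gbm_grid <- expand.grid(n.trees = c(300, 500, 1000),
                        interaction.depth = c(2, 4, 6),
                        shrinkage = c(0.01, 0.05, 0.1),
                        n.minobsinnode = c(10, 20))

gbm_model <- train(is_popular_f ~ . - is_popular - article_id,
                   data = train_data,
                   method = "gbm",
                   metric = "ROC",
                   verbose = FALSE,
                   preProcess = c("center", "scale", "pca"),
                   trControl = trainControl(method = "cv",
                                            number = 5,
                                            classProbs = TRUE,
                                            summaryFunction = twoClassSummary),
                   tuneGrid = gbm_grid)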

The final values used for the model were n.trees = 1000, interaction.depth = 6, shrinkage = 0.01 and n.minobsinnode = 10.

The gradient boosting model produced an AUC of 0.7067, which shows no improvement compared to the previous models (Logit Lasso AUC: 0.7111, Random Forest AUC: 0.7213).

I also tried removing the PCA dimension reduction, which improved the AUC to 0.7338. This might be an indicator that the variables are not strongly correlated, which I had already assumed from the correlation table in the DataExplorer report.

Neural Network

I am going to train my neural network model with the caret package and tune the hyperparameters using cross-validation. The hyperparameters of my neural net model are size and decay. Size is the number of units in the hidden layer (nnet fits a single-hidden-layer neural network) and decay is the regularization parameter that helps avoid over-fitting. The data is preprocessed, meaning centered, scaled and reduced in dimension using PCA.
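
A sketch of this setup (the grid values are assumptions that include the final values reported below):

nnet_grid <- expand.grid(size = c(3, 5, 7),
                         decay = c(0.5, 1, 5))

nnet_model <- train(is_popular_f ~ . - is_popular - article_id,
                    data = train_data,
                    method = "nnet",
                    metric = "ROC",
                    preProcess = c("center", "scale", "pca"),
                    trControl = trainControl(method = "cv",
                                             number = 5,
                                             classProbs = TRUE,
                                             summaryFunction = twoClassSummary),
                    tuneGrid = nnet_grid,
                    trace = FALSE,
                    maxit = 500)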

The final values used for the model were size = 7 and decay = 5.

The neural network model’s AUC of 0.7137 is similar to the lasso model (Logit Lasso AUC: 0.7111, Random Forest AUC: 0.7213, Gradient Boosting AUC: 0.7067).

Again I tried to train the same model without PCA preprocessing, but it only marginally improved the result, with an AUC of 0.7177.

For comparison I am going to train a neural network model using H2O. H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform.

First I am going to connect to a local H2O instance.
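
A minimal sketch of this step (the memory setting is an assumption); the cluster information printed below results from this call:

library(h2o)

# start, or connect to, a local H2O cluster; nthreads = -1 uses all available cores
h2o.init(nthreads = -1, max_mem_size = "6g")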

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         22 hours 16 minutes 
##     H2O cluster timezone:       Europe/Vienna 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.28.0.2 
##     H2O cluster version age:    2 months and 7 days  
##     H2O cluster name:           H2O_started_from_R_lisahlmsch_htr830 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   5.82 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.1 (2019-07-05)

For this model I have chosen 3 hidden layers of 32 neurons each. This is a rather small network, but it runs faster and should hopefully still produce good results, since in the previous example the algorithm also chose a small number of neurons. I have set 10000 epochs but expect the misclassification rate to converge earlier, which will stop the training due to the early-stopping rule I will implement.

The h2o neural network model was not able to produce a better result than the neural net model trained with caret: 0.7031 (h2o) versus 0.7137 (caret).

Stacked model

Finally I am going to build a stacked model using h2o.

I am going to combine four models from different families, using cross-validation for hyperparameter tuning. These are my base models, and with them I approximate the target variable. I then need a mechanism to combine their outputs: the predictions of the base models become features for training a new model, the metalearner, which produces the final prediction. The metalearner is supposed to give each base learner more weight where it performs better and less where it performs worse. In short: stacking is a model ensembling technique used to combine information from multiple predictive models to produce a new one.
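
A hedged sketch of this mechanism in h2o, shown with only two base learners for brevity; the object names, fold count and fold_assignment choice are assumptions. Every base learner must share the same folds and keep its cross-validation predictions so the metalearner can be trained on the out-of-fold scores:

nfolds <- 5
predictors <- setdiff(names(train_data), c("is_popular", "is_popular_f", "article_id"))
train_h2o  <- as.h2o(train_data)

h2o_glm <- h2o.glm(y = "is_popular_f", x = predictors,
                   training_frame = train_h2o,
                   family = "binomial",
                   nfolds = nfolds,
                   fold_assignment = "Modulo",
                   keep_cross_validation_predictions = TRUE,
                   seed = 700)

h2o_rf <- h2o.randomForest(y = "is_popular_f", x = predictors,
                           training_frame = train_h2o,
                           ntrees = 300,
                           nfolds = nfolds,
                           fold_assignment = "Modulo",
                           keep_cross_validation_predictions = TRUE,
                           seed = 700)

# the metalearner (a GLM by default) is fit on the stacked out-of-fold predictions
ensemble <- h2o.stackedEnsemble(y = "is_popular_f", x = predictors,
                                training_frame = train_h2o,
                                base_models = list(h2o_glm, h2o_rf))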

Model 1 - Elastic net:
In this model I am again using grid search with cross-validation to tune the hyperparameter alpha and then extract the best model in terms of cross-validated AUC.
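
A sketch of this grid search, reusing the predictor and frame objects from the sketch above plus a hypothetical valid_h2o validation frame (the alpha grid is an assumption):

glm_grid <- h2o.grid(algorithm = "glm",
                     y = "is_popular_f", x = predictors,
                     training_frame = train_h2o,
                     validation_frame = valid_h2o,
                     family = "binomial",
                     lambda_search = TRUE,
                     hyper_params = list(alpha = c(0, 0.25, 0.5, 0.75, 1)),
                     seed = 700)

# best model by AUC
glm_grid_sorted <- h2o.getGrid(glm_grid@grid_id, sort_by = "auc", decreasing = TRUE)
best_glm <- h2o.getModel(glm_grid_sorted@model_ids[[1]])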

The best model in terms of AUC was an elastic net model (alpha = 0.25) and produced an AUC of 0.7108 on the validation data set.

Model 2 - Random Forest: As in the previous random forest example I will be growing 500 trees. The default in h2o is 50, but I have increased it because I will let the early-stopping criterion decide when the random forest is sufficiently accurate. The algorithm will stop fitting new trees when the 2-tree average is within 0.001 (the default) of the prior two 2-tree averages. As part of hyperparameter tuning I am going to try different numbers of columns to randomly select at each level.
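
A sketch of this grid, with assumed mtries values and the placeholder frames from above:

rf_grid_h2o <- h2o.grid(algorithm = "randomForest",
                        y = "is_popular_f", x = predictors,
                        training_frame = train_h2o,
                        validation_frame = valid_h2o,
                        ntrees = 500,
                        stopping_rounds = 2,   # early stopping as described above
                        hyper_params = list(mtries = c(5, 7, 9)),
                        seed = 700)
best_rf <- h2o.getModel(h2o.getGrid(rf_grid_h2o@grid_id,
                                    sort_by = "auc", decreasing = TRUE)@model_ids[[1]])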

The final model produced an AUC of 0.7278 on the validation set. It used five randomly selected columns at each level (mtries = 5) and 259 trees.

Model 3 - Gradient Boosting:

For this model I am going to adjust some of the default parameters and use grid search to find the optimal hyperparameters. I will increase the number of trees from the default of 50 to 300 and try different learning rates: increasing the learning rate means each tree contributes more strongly, so the model moves further away from the overall mean. I will also try different values for the depth, which adjusts the “weakness” of each learner; adding depth makes each tree fit the data more closely.
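
A sketch of this grid search; the grid values are assumptions that include the best values reported below:

gbm_grid_h2o <- h2o.grid(algorithm = "gbm",
                         y = "is_popular_f", x = predictors,
                         training_frame = train_h2o,
                         validation_frame = valid_h2o,
                         ntrees = 300,
                         stopping_rounds = 2,
                         hyper_params = list(learn_rate = c(0.01, 0.05, 0.1),
                                             max_depth = c(2, 5, 10),
                                             sample_rate = c(0.5, 1.0),
                                             col_sample_rate = c(0.5, 1.0)),
                         seed = 700)
best_gbm <- h2o.getModel(h2o.getGrid(gbm_grid_h2o@grid_id,
                                     sort_by = "auc", decreasing = TRUE)@model_ids[[1]])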

The best GBM model produced an AUC of 0.7307 on the validation set. It used a learning rate of 0.01, max_depth = 10, col_sample_rate of 0.5 and 295 trees.

Model 4 - Deep learning:
I am going to include the previously trained neural net in my stacked model.

Validation performances:
Here is a summary of the performance (evaluated on the validation set) of each model with its optimal hyperparameters.

h2o_glm h2o_rf h2o_gbm h2o_nnet
0.7108 0.728 0.7307 0.7031

Stacked ensemble model:
Recall that stacking is a combination of predictive models. The idea is that the combined model should be more stable and have better predictive performance (lower variance).

To produce a stacked ensemble model I define several independent models and use the same folds to train each of them. For each model I then make out-of-fold predictions (predictions for the held-out folds) and record the predicted values as new features. Another predictive model uses these new features to estimate the outcome from the baseline predictive scores: the out-of-fold predictions are stacked and combined into one single score.

First I start with the baseline metalearner - a glm model with non-negative weights (the coefficients are constrained to be non-negative).

Then I use “gbm” as the metalearner.
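
A sketch of this variant, shown with the two cross-validated base learners from the earlier sketch (in the report all four base models are combined); metalearner_algorithm switches the combiner from the default GLM to a GBM:

ensemble_gbm <- h2o.stackedEnsemble(y = "is_popular_f", x = predictors,
                                    training_frame = train_h2o,
                                    metalearner_algorithm = "gbm",
                                    # all base models must share the same folds and
                                    # keep their cross-validation predictions
                                    base_models = list(h2o_glm, h2o_rf))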

Validation Performances:

ensemble_glm ensemble_gbm
0.735 0.713

The first ensemble, using glm as the metalearner, performed better than the elastic net, random forest, gbm and deep learning base models.

For my final model I therefore choose a stacked ensemble model with glm as the metalearner and train it on the entire training dataset to predict my final results:

FINAL MODEL

library(h2o)         # models and the stacked ensemble
library(data.table)  # as.data.table() is used on the prediction frame below

# elastic net
# The best model in terms of AUC was an elastic net model (alpha = 0.25)
h2o_glm_x <- h2o.glm(y = "is_popular_f",
                     x = names(train_data[,-c(59,61)]),
                     training_frame = as.h2o(train_data),
                     lambda = 0.0004935,
                     family = "binomial",
                     alpha = c(.25),
                   seed = 700)

# random forest
# The final model used five randomly selected columns at each level (mtries = 5) and 259 trees. 
h2o_rf_x <- h2o.randomForest(y = "is_popular_f",
                    x = names(train_data[,-c(59,61)]),
                    training_frame = as.h2o(train_data),
                    stopping_rounds = 2,
                    ntrees = 300,
                    mtries = 5,
                   seed = 700)

# GBM
# The best GBM model used a learning rate of 0.01, max_depth = 10, col_sample_rate of 0.5 and 295 trees.
h2o_gbm_x <- h2o.gbm(y = "is_popular_f",
                   x = names(train_data[,-c(59,61)]),
                   training_frame = as.h2o(train_data),
                   ntrees = 300,
                   learn_rate = 0.01,
                   max_depth = 10,
                   sample_rate = 0.5,
                   col_sample_rate = 0.5,
                   stopping_rounds = 2,
                   seed = 700)

# neural net
h2o_nnet_x <- h2o.deeplearning(y = "is_popular_f",
                             x = names(train_data[,-c(59,61)]),
                             training_frame = as.h2o(train_data),
                             activation = "Rectifier",
                             hidden=c(32,32,32), 
                             epochs = 10000,
                             stopping_rounds=2,
                             stopping_metric="misclassification", ## could be "MSE","logloss","r2"
                             stopping_tolerance=0.01,
                             reproducible = TRUE,
                             seed = 700)

ensemble_model_glm_x <- h2o.stackedEnsemble(
                              y = "is_popular_f",
                              x = names(train_data[,-c(59,61)]),
                              blending_frame = as.h2o(train_data),
                              base_models = list(h2o_nnet_x, 
                                                 h2o_gbm_x, 
                                                 h2o_rf_x, 
                                                 h2o_glm_x))

test_prediction_probs <- h2o.predict(ensemble_model_glm_x, newdata = as.h2o(test_data))

filename <- "predictions/final_stacking_x.csv"
df <- data.frame("article_id" = test_data$article_id,
                 "score" = as.data.table(test_prediction_probs[,3]))
names(df) <- c("article_id", "score")
write.csv(df,file=filename,row.names=FALSE)

Summary

The dataset provided was clean and complete, so no further data cleaning steps were necessary prior to model building. The data was often preprocessed, meaning centered, scaled and reduced in dimension using PCA, since correlated variables can be problematic for gradient-based optimization. PCA, however, worsened model performance.

Using the caret package, I trained a LASSO model, a random forest model, a gradient boosting machine model and a neural net. Among these, the random forest had the best results in terms of AUC on the validation set. Using h2o I trained an elastic net model, a random forest, a gradient boosting machine model and a neural net, and stacked them using both glm and gbm as metalearners. The stacked model using glm as the metalearner produced the best results in terms of AUC on the validation set. Among the base models, gbm showed the best results, while the gbm trained with caret was rather weak at predicting the target variable.

In the Kaggle competition my gbm model trained with h2o scored the best: the public score was 0.70587.

Acknowledgments

The dataset used in the competition can be found at the UCI repository. We thank Kelwin Fernandes, Pedro Vinagre, Paulo Cortez and Pedro Sernadela for making it publicly available. Check also their publication below.

K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.
