Task 1: Fashion MNIST data

Description:

Take the “Fashion MNIST dataset”, where images of fashion items are to be classified in a similar manner to what we saw with handwritten digits (see more here: https://github.com/zalandoresearch/fashion-mnist). The images are in exactly the same format as the digits: 28x28 pixel grayscale images. The task is to build deep neural net models to predict image classes. The goal is to have as accurate a classifier as possible: we are using accuracy as a measure of predictive power.

Label Description:

  • 0 T-shirt/top
  • 1 Trouser
  • 2 Pullover
  • 3 Dress
  • 4 Coat
  • 5 Sandal
  • 6 Shirt
  • 7 Sneaker
  • 8 Bag
  • 9 Ankle boot

b. Train a fully connected deep network to predict items.

  • Normalize the data similarly to what we saw with MNIST.
  • Experiment with network architectures and settings (number of hidden layers, number of nodes, activation functions, dropout, etc.)
  • Explain what you have tried, what worked and what did not. Present a final model.
  • Make sure that you use enough epochs so that the validation error starts flattening out - provide a plot about the training history (plot(history))

Data Normalization:

The y data is an integer vector with values ranging from 0 to 9. We use one-hot encoding for the labels, transforming the integer vectors into binary class matrices using the Keras `to_categorical()` function. The x data (pixel intensities between 0 and 255) is rescaled to the range 0 to 1:
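
A minimal sketch of this preprocessing, assuming the data are loaded via `dataset_fashion_mnist()` and the variable names used here:

```r
library(keras)

fashion <- dataset_fashion_mnist()

# Flatten each 28 x 28 image into a vector of 784 features
x_train <- array_reshape(fashion$train$x, c(nrow(fashion$train$x), 784))
x_test  <- array_reshape(fashion$test$x, c(nrow(fashion$test$x), 784))

# Rescale pixel intensities from [0, 255] to [0, 1]
x_train <- x_train / 255
x_test  <- x_test / 255

# One-hot encode the integer labels (0-9) into binary class matrices
y_train <- to_categorical(fashion$train$y, 10)
y_test  <- to_categorical(fashion$test$y, 10)
```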

Define model:

First we initialize a sequential model with the help of the keras_model_sequential() function. Then we add layers to the model. We use the rectifier function ‘relu’ as the activation function in the hidden layer. For the output layer we use the ‘softmax’ activation function, which maps the outputs to values between 0 and 1 that sum to one, so they can be interpreted as class probabilities. The output layer produces 10 output values, one for each clothing class. The first layer, which contains 128 hidden nodes, has an input_shape of 784, because each image has 28 x 28 = 784 features.
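
A sketch of this architecture; the dropout rate is an assumption, as it is not stated in the text:

```r
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%  # dropout rate assumed; not stated in the text
  layer_dense(units = 10, activation = "softmax")

summary(model)
```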

Here is a summary of the model:

## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 128)                     100480      
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 128)                     0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 10)                      1290        
## ================================================================================
## Total params: 101,770
## Trainable params: 101,770
## Non-trainable params: 0
## ________________________________________________________________________________

The 100,480 parameters of the first layer are the 784 input features times the 128 first-layer nodes, plus 128 biases: 784 x 128 + 128 = 100,480. Analogously, the output layer has 128 x 10 + 10 = 1,290 parameters.

Now that we have set up the architecture of the model, we are going to compile and fit the model to the data. To do so we are configuring the model with the ‘categorical_crossentropy’ loss function and the ‘optimizer_rmsprop()’ optimizer. We are going to monitor the accuracy of the model during training.
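
In code, this compilation step looks as follows (a sketch mirroring the configuration described above):

```r
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = c("accuracy")
)
```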

Next, we fit the model to our data. We are going to train the model for 30 epochs, i.e. iterations over all the samples in the training data, in batches of 128 samples. An epoch is a single pass through the entire training set, followed by an evaluation on the validation set. The validation_split argument is set to 0.2, meaning the validation data used will be the last 20% of the training data. The batch size defines the number of samples that are propagated through the network before the weights are updated.
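
A sketch of the fitting call, using the variable names from the preprocessing step:

```r
history <- model %>% fit(
  x_train, y_train,
  epochs = 30,
  batch_size = 128,
  validation_split = 0.2  # the last 20% of the training data is held out
)

plot(history)
```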

## Trained on 48,000 samples (batch_size=128, epochs=30)
## Final epoch (plot to see history):
##         loss: 0.2472
##     accuracy: 0.9096
##     val_loss: 0.3463
## val_accuracy: 0.8919

As the model trains, the loss and accuracy metrics are displayed. This model reaches an accuracy of about 0.9096 (or 91%) on the training data and 0.8919 (or 89%) on the validation data.

For a better understanding of the training history, we are going to visualize the fitting:

The plots show the loss and accuracy of the model for the training data, as well as for the validation data.

The training accuracy keeps improving, while the validation accuracy looks rather stable towards the end of the training. As long as the validation accuracy does not get worse, we assume that the model has not yet overfit the training dataset.

Next we are going to fine-tune a few parameters and see how the output changes.

(a) Adding layers:

The following model is trained with two additional hidden layers.
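
A sketch of this deeper architecture; the sizes of the extra layers are assumptions, since the text does not state them:

```r
model_a <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 128, activation = "relu") %>%  # additional hidden layer, size assumed
  layer_dense(units = 128, activation = "relu") %>%  # additional hidden layer, size assumed
  layer_dense(units = 10, activation = "softmax")
```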

## Trained on 48,000 samples (batch_size=128, epochs=30)
## Final epoch (plot to see history):
##         loss: 0.2167
##     accuracy: 0.9235
##     val_loss: 0.5112
## val_accuracy: 0.8911

This model reaches an accuracy of 0.9235 (or 92%) on the training data and 0.8911 (or 89%) on the validation data, so there is no significant improvement over the previous model; the markedly higher validation loss (0.5112 vs. 0.3463) even hints at overfitting.

(b) Number of nodes:

Next, we are going to try out the effect of adding more hidden units (nodes) to our model’s architecture and study the impact on the evaluation:
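
A sketch with a wider hidden layer; the exact number of units is an assumption:

```r
model_b <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%  # 256 units assumed
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")
```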

## Trained on 48,000 samples (batch_size=128, epochs=30)
## Final epoch (plot to see history):
##         loss: 0.2077
##     accuracy: 0.9263
##     val_loss: 0.4045
## val_accuracy: 0.8962

In this case the additional nodes have improved the validation accuracy slightly: close to 90% accuracy was reached.

We will continue with this model to try to tweak a few parameters in the optimization algorithm.

(c) Optimization parameters:
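
The task summary notes that a stochastic gradient descent optimizer was tried here; a sketch of that change, with the architecture of model (b) and library-default SGD settings assumed:

```r
# Same architecture as model (b), compiled with an SGD optimizer
model_c <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

model_c %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_sgd(),  # hyperparameters assumed: library defaults
  metrics = c("accuracy")
)
```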

## Trained on 48,000 samples (batch_size=128, epochs=30)
## Final epoch (plot to see history):
##         loss: 0.3918
##     accuracy: 0.8643
##     val_loss: 0.3959
## val_accuracy: 0.8594

Model (c) reaches an accuracy of 0.8643 (or 86%) on the training data and 0.8594 (or 86%) on the validation data, so the performance has worsened compared to the previous models.

We will continue with model (b).

c. Evaluate the model on the test set. How does test error compare to validation error?

In order to evaluate our model we use the evaluate() function, which returns the loss value and the metric value (in this case ‘accuracy’).
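
A sketch of this evaluation, assuming model (b) is the final model:

```r
model_b %>% evaluate(x_test, y_test)
```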

##  Test loss: 0.4345472 
##  Test accuracy: 0.8891

The test accuracy of 0.8891 is relatively close, but slightly lower than the validation accuracy of 0.8962.

Compare predictions to reality:
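
To compare predictions with the true labels, we can cross-tabulate them; a sketch that derives the predicted class from the probability matrix (so it does not rely on the deprecated predict_classes() helper):

```r
# Predicted class = index of the highest softmax probability (labels are 0-based)
pred_prob  <- model_b %>% predict(x_test)
pred_class <- apply(pred_prob, 1, which.max) - 1

# Cross-tabulate predictions against the true labels
table(predicted = pred_class, real = fashion$test$y)
```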

Most mistakes happened in category 6 (Shirt), which was misclassified as category 0 (T-shirt/top) 200 times and as category 2 (Pullover) 95 times. Category 4 (Coat) and category 2 (Pullover) were also often confused.

We are going to show some examples:

##    predicted real row_number
## 1:         5    7         13

This sneaker (category 7) was predicted to be a sandal (category 5).

##    predicted real row_number
## 1:         0    6         41

This shirt (category 6) was predicted to be a t-shirt/top (category 0).

d. Try building a convolutional neural network and see if you can improve test set performance.

  • Just like before, experiment with different network architectures, regularization techniques and present your findings

A convolutional neural network is a class of deep neural networks that makes use of the 2-D structure of the original input data, applying filters that exploit the spatial layout of the images.

Step 1: Convolution. To an input image we apply multiple different feature detectors (aka “kernels” or “filters”) to create feature maps. These filters move along the input image; the step size at which they move is called the stride. This comprises the convolutional layer. The process makes the image smaller; some information is lost, but the integral patterns are kept. On top of that we apply the ReLU (rectified linear unit) activation to increase non-linearity.

Step 2: Max Pooling. Then we apply a pooling layer to our convolutional layer, using max pooling (aka “downsampling”). The main purpose of the pooling layer is to make sure that the algorithm recognizes features regardless of whether they are tilted or distorted. This level of flexibility is called spatial invariance. Pooling significantly reduces the size of our feature maps and helps avoid overfitting, since it discards a lot of data while preserving the main features that we are after.

Step 3: Flattening. The pooled feature map is flattened into a vector so that it can be fed into an artificial neural network.

Step 4: Full Connection. All of these features are processed through a network, and the final fully connected layer performs the voting towards the classes. All of this is trained through the forward- and backpropagation process. Importantly, not only the weights of the network are trained: the feature detectors themselves are adjusted in the same gradient descent process, which allows the network to come up with the best feature maps.
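
A sketch of the architecture summarized below; the dropout rate is an assumption, and the flat input data must first be reshaped back into 28 x 28 x 1 arrays for the convolutional layers:

```r
# Reshape the flat (already rescaled) data to 28 x 28 x 1 arrays
x_train_cnn <- array_reshape(x_train, c(nrow(x_train), 28, 28, 1))
x_test_cnn  <- array_reshape(x_test, c(nrow(x_test), 28, 28, 1))

cnn_model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_dropout(rate = 0.3) %>%  # rate assumed
  layer_flatten() %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")
```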

## Model: "sequential_4"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## conv2d (Conv2D)                     (None, 26, 26, 32)              320         
## ________________________________________________________________________________
## max_pooling2d (MaxPooling2D)        (None, 13, 13, 32)              0           
## ________________________________________________________________________________
## dropout_4 (Dropout)                 (None, 13, 13, 32)              0           
## ________________________________________________________________________________
## flatten (Flatten)                   (None, 5408)                    0           
## ________________________________________________________________________________
## dense_10 (Dense)                    (None, 16)                      86544       
## ________________________________________________________________________________
## dense_11 (Dense)                    (None, 10)                      170         
## ================================================================================
## Total params: 87,034
## Trainable params: 87,034
## Non-trainable params: 0
## ________________________________________________________________________________

Number of parameters:

  • layer_conv_2d turns the 28 x 28 image to 26 x 26, using 9 weights for each filter (3 x 3) plus one bias per filter: (9 + 1) x 32 = 320 parameters
  • max_pooling2d takes each disjoint 2 x 2 square and collapses it to 1 value, turning a 26 x 26 “image” into a 13 x 13 one. No parameters are associated with this step.
  • flatten: turns each “pixel” in each node to one separate node: 13 x 13 x 32 = 5408
  • dense: fully connected layer: 5408 nodes x 16 new nodes + 16 biases = 86544
  • final fully connected layer: 16 x 10 + 10 = 170
## Trained on 48,000 samples (batch_size=128, epochs=30)
## Final epoch (plot to see history):
##         loss: 0.4598
##     accuracy: 0.8343
##     val_loss: 0.4488
## val_accuracy: 0.8374

The CNN model reaches an accuracy of about 0.8343 (or 83%) on the training data and 0.8374 (or 84%) on the validation data.

##  Test loss: 0.464764 
##  Test accuracy: 0.8289

The test accuracy of 0.8289 is lower than the test accuracy of the ANN (0.8891).

Summary Task 1

In order to optimize our model and improve its predictive power as much as possible, we normalized the data and experimented with the network architectures and settings. Adding hidden layers seemed to improve the model slightly, while including a stochastic gradient descent optimizer did not. The final model reached an accuracy of 89% and showed weaknesses in correctly predicting shirts. The CNN model was not able to outperform the ANN model.

Task 2: Hot dog or not hot dog?

Description:

In this problem you are going to predict whether a certain image containing food shows a hot dog or something else. Motivation for this comes from the comedy show Silicon Valley (see here: https://www.youtube.com/watch?v=FNyi3nAuLb0).

The data can be found in the course repo (https://github.com/pappzoltan/machine-learning-course/tree/master/data/hot-dog-not-hot-dog) and is originally downloaded from here: https://www.kaggle.com/dansbecker/hot-dog-not-hot-dog.

a. Pre-process data so that it is acceptable by Keras (set folder structure, bring images to the same size, etc).

The data set only consists of a train and a test folder with images. In order to be able to validate the models, we move 20% of the images from the train folder to a new folder called “validation”.
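
A sketch of this split in base R, assuming the folder layout from the repo (train/&lt;class&gt;, test/&lt;class&gt;) and the paths used here:

```r
set.seed(42)  # assumed seed, for a reproducible split
for (class in c("hot_dog", "not_hot_dog")) {
  train_dir <- file.path("data/hot-dog-not-hot-dog/train", class)       # path assumed
  valid_dir <- file.path("data/hot-dog-not-hot-dog/validation", class)  # path assumed
  dir.create(valid_dir, recursive = TRUE, showWarnings = FALSE)

  files <- list.files(train_dir)
  moved <- sample(files, size = floor(0.2 * length(files)))
  file.rename(file.path(train_dir, moved), file.path(valid_dir, moved))
}
```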

Without augmentation, there is an easier way to do this split: when creating the image_data_generator(), we could add the validation_split argument, just like we did in task 1. But as the validation data should not be augmented, we need to separate the data prior to generating the batches of image data.

As we quite frequently had issues with the here() function being masked, we call detach() and then run library(here) once again. This makes sure that the function indeed finds the current project directory:

Setting up the generators without augmentation:

We are now generating batches of data from the images in our directory.

Image width and height are set to 224 x 224 pixels. The default batch_size is 32, which we changed to 50. This means that 50 randomly selected images from across the classes in the dataset will be returned in each batch during training. As we have a binary classification, “Hot Dog” or “Not Hot Dog”, we set the class_mode to “binary”.
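
A sketch of these generators, with the directory paths as assumptions:

```r
datagen <- image_data_generator(rescale = 1/255)

train_generator <- flow_images_from_directory(
  "data/hot-dog-not-hot-dog/train",  # path assumed
  generator   = datagen,
  target_size = c(224, 224),
  batch_size  = 50,
  class_mode  = "binary"
)

validation_generator <- flow_images_from_directory(
  "data/hot-dog-not-hot-dog/validation",  # path assumed
  generator   = datagen,
  target_size = c(224, 224),
  batch_size  = 50,
  class_mode  = "binary"
)

test_generator <- flow_images_from_directory(
  "data/hot-dog-not-hot-dog/test",  # path assumed
  generator   = datagen,
  target_size = c(224, 224),
  batch_size  = 50,
  class_mode  = "binary"
)
```

The per-class counts shown below can be reproduced with table(train_generator$classes), and likewise for the other two generators.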

First we would like to see how many images are actually loaded.

## 
##   0   1 
## 199 199
## 
##  0  1 
## 50 50
## 
##   0   1 
## 250 250

We have 199 images per class in the training set, 50 per class in the validation set and 250 per class in the test set.

## $hot_dog
## [1] 0
## 
## $not_hot_dog
## [1] 1

The binary classification distinguishes between two classes: 0 for hot dog, 1 for not hot dog.

As an example, we plot one of the images with a hot dog on it:

b. Estimate a convolutional neural network to predict if an image contains a hot dog or not. Evaluate your model on the test set.

The CNN in this task is quite similar to what we did in task 1. The difference is the number of layers we are going to add: we repeat the process of filtering and max pooling three times, which reduces the size of the image even more while the most important features are kept. We use the rectifier function ‘relu’ as the activation function in the hidden layers and ‘sigmoid’ for the output layer, because its range is (0, 1) and it can represent the probability of the positive class.
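
A sketch of such a network; the filter counts, dense-layer size and optimizer are assumptions, since the text does not state them:

```r
hotdog_cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(224, 224, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 64, activation = "relu") %>%   # size assumed
  layer_dense(units = 1, activation = "sigmoid")

hotdog_cnn %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(),  # optimizer assumed
  metrics = c("accuracy")
)
```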

We are going to fit the model using the fit_generator function, where the generator (or batches of data) we provide runs in parallel to the model, for efficiency.
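
A sketch of the fitting call; the step counts are derived from the sample counts and the batch size of 50:

```r
history <- hotdog_cnn %>% fit_generator(
  train_generator,
  steps_per_epoch  = ceiling(train_generator$n / 50),
  epochs           = 20,
  validation_data  = validation_generator,
  validation_steps = ceiling(validation_generator$n / 50)
)
```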

## Trained on 7 samples (batch_size=NULL, epochs=20)
## Final epoch (plot to see history):
##         loss: 0.6654
##     accuracy: 0.6724
##     val_loss: 0.6822
## val_accuracy: 0.61

The model reached an accuracy of 0.6724 (or 67%) on the training data and 0.61 (or 61%) on the validation data. The validation accuracy worsened around the ninth epoch but picked up again, so we assume the model is not overfitting the data yet.

##  Test loss: 0.6871667 
##  Test accuracy: 0.564

The CNN model reaches an accuracy of 0.564 on the test data.

c. Could data augmentation techniques help with achieving higher predictive accuracy? Try some augmentations that you think make sense and compare

Image data augmentation is a technique that can be used to artificially expand the size of a training dataset by creating modified versions of images in the dataset. This should result in an improvement of performance and the ability of the model to generalize.

The Keras deep learning neural network library provides the capability to fit models using image data augmentation via the ImageDataGenerator class.

First we are going to show an example of an augmented image:

Now we are going to generate two different batches of data from the images in our directory. Within this function we pass a list of parameters to the image_data_generator() function describing the alterations that we want it to perform on the images:

  • rotation_range defines how many degrees (0 to 180) the image is rotated.
  • width_shift_range defines the fraction of total width.
  • height_shift_range defines the fraction of total height.
  • shear_range defines the shear intensity (shear angle in radians).
  • zoom_range defines amount of zoom.
  • horizontal_flip defines whether to randomly flip images horizontally.

By default, the modifications will be applied randomly, so not every image will be changed every time.
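
A sketch of such an augmenting generator; the specific parameter values are assumptions chosen within the ranges documented above:

```r
augmented_datagen <- image_data_generator(
  rescale            = 1/255,
  rotation_range     = 40,   # degrees; value assumed
  width_shift_range  = 0.2,  # fraction of total width; value assumed
  height_shift_range = 0.2,  # fraction of total height; value assumed
  shear_range        = 0.2,  # shear angle in radians; value assumed
  zoom_range         = 0.2,  # value assumed
  horizontal_flip    = TRUE
)

augmented_train_generator <- flow_images_from_directory(
  "data/hot-dog-not-hot-dog/train",  # path assumed
  generator   = augmented_datagen,
  target_size = c(224, 224),
  batch_size  = 50,
  class_mode  = "binary"
)
```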

The final step is to use the fit_generator() function in order to train and validate our neural network on the augmented images.

## Trained on 7 samples (batch_size=NULL, epochs=20)
## Final epoch (plot to see history):
##         loss: 0.6839
##     accuracy: 0.5345
##     val_loss: 0.683
## val_accuracy: 0.57

This model reaches an accuracy of 0.5345 (or 53%) on the training data and 0.57 (or 57%) on the validation data.

The plot shows an interesting development, as the training loss increases again towards the end while the validation loss continues to go down. The accuracy curve also has an unusual shape: after the 15th epoch the validation accuracy goes down, which implies overfitting.

##  Test loss: 0.6911595 
##  Test accuracy: 0.51

The first CNN model with augmentation reaches an accuracy of 0.51 on the test data. This is not an improvement over the model trained on images without augmentation (CNN: 0.564 accuracy).

Now we train a second model:

## Trained on 7 samples (batch_size=NULL, epochs=20)
## Final epoch (plot to see history):
##         loss: 0.6805
##     accuracy: 0.5747
##     val_loss: 0.6821
## val_accuracy: 0.56

This model reaches an accuracy of 0.5747 (or 57%) on the training data and 0.56 (or 56%) on the validation data.

Here the validation accuracy keeps declining while the validation loss is rather stable. Since the training accuracy keeps improving until around epoch 13, we suspect overfitting.

##  Test loss: 0.6911627 
##  Test accuracy: 0.516

The second CNN model with augmentation reaches an accuracy of 0.516 on the test data. This is still not an improvement over the base model trained on images without augmentation, and is similar to the first augmented model (CNN without augmentation: 0.564 accuracy; CNN with augmentation (1): 0.51 accuracy).

d. Try to rely on some pre-built neural networks to aid prediction. Can you achieve a better performance using transfer learning for this problem?

To come up with better results than those of our manually created models, we can also use pre-trained models. As we also did in class, we will use the MobileNetV2 model with weights pre-trained on ImageNet.

To start, we will classify one single picture, again using our example image.
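
A sketch of this single-image classification; the image path is a placeholder:

```r
model_imagenet <- application_mobilenet_v2(weights = "imagenet")

# Load and preprocess the example image (path is a placeholder)
img <- image_load("path/to/example_hot_dog.jpg", target_size = c(224, 224))
x <- image_to_array(img)
x <- array_reshape(x, c(1, dim(x)))
x <- mobilenet_v2_preprocess_input(x)

# Predict and decode the top 3 ImageNet classes
preds <- model_imagenet %>% predict(x)
imagenet_decode_predictions(preds, top = 3)[[1]]
```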

##   class_name class_description      score
## 1  n07697537            hotdog 0.84387833
## 2  n07697313      cheeseburger 0.02479257
## 3  n07873807             pizza 0.01339789

We can see that the pretrained model already works quite well: the hot dog was classified as a hot dog with approx. 84% probability.

We will continue using the pre-trained model, adding it as the base layer of a keras_model_sequential(). For the binary classification we use the sigmoid activation function. The weights of the pre-trained base model are frozen and therefore not trainable.
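
A sketch of this setup; the pooling layer between the frozen base and the output, and the optimizer, are assumptions:

```r
base_model <- application_mobilenet_v2(
  weights     = "imagenet",
  include_top = FALSE,
  input_shape = c(224, 224, 3)
)
freeze_weights(base_model)  # the pre-trained weights stay fixed during training

transfer_model <- keras_model_sequential() %>%
  base_model %>%
  layer_global_average_pooling_2d() %>%  # pooling layer assumed; flattening also works
  layer_dense(units = 1, activation = "sigmoid")

transfer_model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(),  # optimizer assumed
  metrics = c("accuracy")
)
```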

Because training this model takes a really long time, we decided to go with only 5 epochs.

## Trained on 7 samples (batch_size=NULL, epochs=5)
## Final epoch (plot to see history):
##         loss: 0.5027
##     accuracy: 0.7701
##     val_loss: 0.6403
## val_accuracy: 0.72

This pre-trained model reaches an accuracy of 0.7701 (or 77%) on the training data and 0.72 (or 72%) on the validation data.

## Test loss: 0.4124853 
##  Test accuracy: 0.802

The pretrained model with augmentation reaches an accuracy of 0.802 on the test data. This is already a big improvement compared to the previous models (CNN without augmentation: 0.564 accuracy; CNN with augmentation (1): 0.51 accuracy; CNN with augmentation (2): 0.516 accuracy).

Summary Task 2

The clear conclusion of this task is that a pretrained model is already a really good starting point for many problems we might want to solve. Coming up with a good model manually takes a lot of time and trial and error in setting up the different layers and hyper-parameters. For this sort of task it is much more efficient to rely on a pretrained model, as it clearly outperformed the manually set-up models. In both of our models using augmentation we assume we were overfitting.