Task 1: Fashion MNIST data

Description:

Take the “Fashion MNIST dataset”, where images of fashion items are to be classified in a similar manner to what we saw with handwritten digits (see more here: https://github.com/zalandoresearch/fashion-mnist). The images are in exactly the same format as the digits: 28x28 pixel grayscale images. The task is to build deep neural net models to predict image classes. The goal is to have as accurate a classifier as possible: we are using accuracy as a measure of predictive power.

Label Description:

  • 0 T-shirt/top
  • 1 Trouser
  • 2 Pullover
  • 3 Dress
  • 4 Coat
  • 5 Sandal
  • 6 Shirt
  • 7 Sneaker
  • 8 Bag
  • 9 Ankle boot

b. Train a fully connected deep network to predict items.

  • Normalize the data similarly to what we saw with MNIST.
  • Experiment with network architectures and settings (number of hidden layers, number of nodes, activation functions, dropout, etc.)
  • Explain what you have tried, what worked and what did not. Present a final model.
  • Make sure that you use enough epochs so that the validation error starts flattening out - provide a plot about the training history (plot(history))

Data Normalization:

The y data is an integer vector with values ranging from 0 to 9. We use one-hot encoding for the labels, transforming the integer vectors into binary class matrices using the Keras `to_categorical()` function. The x data (pixel intensities between 0 and 255) is rescaled to the range 0 to 1:
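
A minimal sketch of this preprocessing, assuming the data are loaded via `dataset_fashion_mnist()` and the variable names used here:

```r
library(keras)

fashion <- dataset_fashion_mnist()

# Flatten each 28 x 28 image into a vector of 784 features
x_train <- array_reshape(fashion$train$x, c(nrow(fashion$train$x), 784))
x_test  <- array_reshape(fashion$test$x, c(nrow(fashion$test$x), 784))

# Rescale pixel intensities from [0, 255] to [0, 1]
x_train <- x_train / 255
x_test  <- x_test / 255

# One-hot encode the integer labels (0-9) into binary class matrices
y_train <- to_categorical(fashion$train$y, 10)
y_test  <- to_categorical(fashion$test$y, 10)
```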

Define model:

First we initialize a sequential model with the help of the keras_model_sequential() function. Then we add layers to the model. We use the rectifier function ‘relu’ as the activation function in the hidden layer. For the output layer we use the ‘softmax’ activation function, which maps the outputs to values between 0 and 1 that sum to one, so they can be interpreted as class probabilities. The output layer produces 10 output values, one for each clothing class. The first layer, which contains 128 hidden nodes, has an input_shape of 784, because each image has 28 x 28 = 784 features.
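
A sketch of this architecture; the dropout rate is an assumption, as it is not stated in the text:

```r
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%  # dropout rate assumed; not stated in the text
  layer_dense(units = 10, activation = "softmax")

summary(model)
```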

Here is a summary of the model:

## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 128)                     100480      
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 128)                     0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 10)                      1290        
## ================================================================================
## Total params: 101,770
## Trainable params: 101,770
## Non-trainable params: 0
## ________________________________________________________________________________

The 100,480 parameters of the first layer are the 784 input features times the 128 first-layer nodes, plus 128 biases: 784 x 128 + 128 = 100,480. Analogously, the output layer has 128 x 10 + 10 = 1,290 parameters.

Now that we have set up the architecture of the model, we are going to compile and fit the model to the data. To do so we are configuring the model with the ‘categorical_crossentropy’ loss function and the ‘optimizer_rmsprop()’ optimizer. We are going to monitor the accuracy of the model during training.
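
In code, this compilation step looks as follows (a sketch mirroring the configuration described above):

```r
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = c("accuracy")
)
```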

Next, we fit the model to our data. We are going to train the model for 30 epochs, i.e. iterations over all the samples in the training data, in batches of 128 samples. An epoch is a single pass through the entire training set, followed by an evaluation on the validation set. The validation_split argument is set to 0.2, meaning the validation data used will be the last 20% of the training data. The batch size defines the number of samples that are propagated through the network before the weights are updated.
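
A sketch of the fitting call, using the variable names from the preprocessing step:

```r
history <- model %>% fit(
  x_train, y_train,
  epochs = 30,
  batch_size = 128,
  validation_split = 0.2  # the last 20% of the training data is held out
)

plot(history)
```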

## Trained on 48,000 samples (batch_size=128, epochs=30)
## Final epoch (plot to see history):
##         loss: 0.2472
##     accuracy: 0.9096
##     val_loss: 0.3463
## val_accuracy: 0.8919

As the model trains, the loss and accuracy metrics are displayed. This model reaches an accuracy of about 0.9096 (or 91%) on the training data and 0.8919 (or 89%) on the validation data.

For a better understanding of the training history, we are going to visualize the fitting:

The plots show the loss and accuracy of the model for the training data, as well as for the validation data.

The training accuracy keeps improving, while the validation accuracy looks rather stable towards the end of the training. As long as the validation accuracy does not get worse, we assume that the model has not yet overfit the training dataset.

Next we are going to fine-tune a few parameters and see how the output changes.

(a) Adding layers:

The following model is trained with two additional hidden layers.
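
A sketch of this deeper architecture; the sizes of the extra layers are assumptions, since the text does not state them:

```r
model_a <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 128, activation = "relu") %>%  # additional hidden layer, size assumed
  layer_dense(units = 128, activation = "relu") %>%  # additional hidden layer, size assumed
  layer_dense(units = 10, activation = "softmax")
```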

## Trained on 48,000 samples (batch_size=128, epochs=30)
## Final epoch (plot to see history):
##         loss: 0.2167
##     accuracy: 0.9235
##     val_loss: 0.5112
## val_accuracy: 0.8911

This model reaches an accuracy of 0.9235 (or 92%) on the training data and 0.8911 (or 89%) on the validation data, so there is no significant improvement over the previous model; the markedly higher validation loss (0.5112 vs. 0.3463) even hints at overfitting.

(b) Number of nodes:

Next, we are going to try out the effect of adding more hidden units (nodes) to our model’s architecture and study the impact on the evaluation:
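
A sketch with a wider hidden layer; the exact number of units is an assumption:

```r
model_b <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%  # 256 units assumed
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")
```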

## Trained on 48,000 samples (batch_size=128, epochs=30)
## Final epoch (plot to see history):
##         loss: 0.2077
##     accuracy: 0.9263
##     val_loss: 0.4045
## val_accuracy: 0.8962

In this case the additional nodes have improved the validation accuracy slightly: close to 90% accuracy was reached.

We will continue with this model to try to tweak a few parameters in the optimization algorithm.

(c) Optimization parameters:
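
The task summary notes that a stochastic gradient descent optimizer was tried here; a sketch of that change, with the architecture of model (b) and library-default SGD settings assumed:

```r
# Same architecture as model (b), compiled with an SGD optimizer
model_c <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

model_c %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_sgd(),  # hyperparameters assumed: library defaults
  metrics = c("accuracy")
)
```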

## Trained on 48,000 samples (batch_size=128, epochs=30)
## Final epoch (plot to see history):
##         loss: 0.3918
##     accuracy: 0.8643
##     val_loss: 0.3959
## val_accuracy: 0.8594

Model (c) reaches an accuracy of 0.8643 (or 86%) on the training data and 0.8594 (or 86%) on the validation data, so the performance has worsened compared to the previous models.

We will continue with model (b).

c. Evaluate the model on the test set. How does test error compare to validation error?

In order to evaluate our model we use the evaluate() function, which returns the loss value and the metric value (in this case ‘accuracy’).
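
A sketch of this evaluation, assuming model (b) is the final model:

```r
model_b %>% evaluate(x_test, y_test)
```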

##  Test loss: 0.4345472 
##  Test accuracy: 0.8891

The test accuracy of 0.8891 is relatively close, but slightly lower than the validation accuracy of 0.8962.

Compare predictions to reality:
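
To compare predictions with the true labels, we can cross-tabulate them; a sketch that derives the predicted class from the probability matrix (so it does not rely on the deprecated predict_classes() helper):

```r
# Predicted class = index of the highest softmax probability (labels are 0-based)
pred_prob  <- model_b %>% predict(x_test)
pred_class <- apply(pred_prob, 1, which.max) - 1

# Cross-tabulate predictions against the true labels
table(predicted = pred_class, real = fashion$test$y)
```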

Most mistakes happened in category 6 (Shirt), which was misclassified as category 0 (T-shirt/top) 200 times and as category 2 (Pullover) 95 times. Category 4 (Coat) and category 2 (Pullover) were also often confused.

We are going to show some examples:

##    predicted real row_number
## 1:         5    7         13

This sneaker (category 7) was predicted to be a sandal (category 5).

##    predicted real row_number
## 1:         0    6         41

This shirt (category 6) was predicted to be a t-shirt/top (category 0).

d. Try building a convolutional neural network and see if you can improve test set performance.

  • Just like before, experiment with different network architectures, regularization techniques and present your findings

A convolutional neural network is a class of deep neural networks that makes use of the 2-D structure of the original input data, applying filters that exploit the spatial layout of the images.

Step 1: Convolution. To an input image we apply multiple different feature detectors (aka “kernels” or “filters”) to create feature maps. These filters move along the input image; the step size at which they move is called the stride. This comprises the convolutional layer. The process makes the image smaller; some information is lost, but the integral patterns are kept. On top of that we apply the ReLU (rectified linear unit) activation to increase non-linearity.

Step 2: Max Pooling. Then we apply a pooling layer to our convolutional layer, using max pooling (aka “downsampling”). The main purpose of the pooling layer is to make sure that the algorithm recognizes features regardless of whether they are tilted or distorted. This level of flexibility is called spatial invariance. Pooling significantly reduces the size of our feature maps and helps avoid overfitting, since it discards a lot of data while preserving the main features that we are after.

Step 3: Flattening. The pooled feature map is flattened into a vector so that it can be fed into an artificial neural network.

Step 4: Full Connection. All of these features are processed through a network, and the final fully connected layer performs the voting towards the classes. All of this is trained through the forward- and backpropagation process. Importantly, not only the weights of the network are trained: the feature detectors themselves are adjusted in the same gradient descent process, which allows the network to come up with the best feature maps.
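
A sketch of the architecture summarized below; the dropout rate is an assumption, and the flat input data must first be reshaped back into 28 x 28 x 1 arrays for the convolutional layers:

```r
# Reshape the flat (already rescaled) data to 28 x 28 x 1 arrays
x_train_cnn <- array_reshape(x_train, c(nrow(x_train), 28, 28, 1))
x_test_cnn  <- array_reshape(x_test, c(nrow(x_test), 28, 28, 1))

cnn_model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_dropout(rate = 0.3) %>%  # rate assumed
  layer_flatten() %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")
```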

## Model: "sequential_4"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## conv2d (Conv2D)                     (None, 26, 26, 32)              320         
## ________________________________________________________________________________
## max_pooling2d (MaxPooling2D)        (None, 13, 13, 32)              0           
## ________________________________________________________________________________
## dropout_4 (Dropout)                 (None, 13, 13, 32)              0           
## ________________________________________________________________________________
## flatten (Flatten)                   (None, 5408)                    0           
## ________________________________________________________________________________
## dense_10 (Dense)                    (None, 16)                      86544       
## ________________________________________________________________________________
## dense_11 (Dense)                    (None, 10)                      170         
## ================================================================================
## Total params: 87,034
## Trainable params: 87,034
## Non-trainable params: 0
## ________________________________________________________________________________

Number of parameters:

  • layer_conv_2d turns the 28 x 28 image to 26 x 26, using 9 weights for each filter (3 x 3) plus one bias per filter: (9 + 1) x 32 = 320 parameters
  • max_pooling2d takes each disjoint 2 x 2 square and collapses it to 1 value, turning a 26 x 26 “image” into a 13 x 13 one. No parameters are associated with this step.
  • flatten: turns each “pixel” in each node to one separate node: 13 x 13 x 32 = 5408
  • dense: fully connected layer: 5408 nodes x 16 new nodes + 16 biases = 86544
  • final fully connected layer: 16 x 10 + 10 = 170
## Trained on 48,000 samples (batch_size=128, epochs=30)
## Final epoch (plot to see history):
##         loss: 0.4598
##     accuracy: 0.8343
##     val_loss: 0.4488
## val_accuracy: 0.8374

The CNN model reaches an accuracy of about 0.8343 (or 83%) on the training data and 0.8374 (or 84%) on the validation data.

##  Test loss: 0.464764 
##  Test accuracy: 0.8289

The test accuracy of 0.8289 is lower than the test accuracy of the ANN (0.8891).

Summary Task 1

In order to optimize our model and improve its predictive power as much as possible, we normalized the data and experimented with the network architectures and settings. Adding hidden layers seemed to improve the model slightly, while including a stochastic gradient descent optimizer did not. The final model reached an accuracy of 89% and showed weaknesses in correctly predicting shirts. The CNN model was not able to outperform the ANN model.

Task 2: Hot dog or not hot dog?

Description:

In this problem you are going to predict whether a certain image containing food shows a hot dog or something else. Motivation for this comes from the comedy show Silicon Valley (see here: https://www.youtube.com/watch?v=FNyi3nAuLb0).

The data can be found in the course repo (https://github.com/pappzoltan/machine-learning-course/tree/master/data/hot-dog-not-hot-dog) and is originally downloaded from here: https://www.kaggle.com/dansbecker/hot-dog-not-hot-dog.

a. Pre-process data so that it is acceptable by Keras (set folder structure, bring images to the same size, etc).

The data set only consists of a train and a test folder with images. In order to be able to validate the models, we move 20% of the images from the train folder to a new folder called “validation”.
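
A sketch of this split in base R, assuming the folder layout from the repo (train/&lt;class&gt;, test/&lt;class&gt;) and the paths used here:

```r
set.seed(42)  # assumed seed, for a reproducible split
for (class in c("hot_dog", "not_hot_dog")) {
  train_dir <- file.path("data/hot-dog-not-hot-dog/train", class)       # path assumed
  valid_dir <- file.path("data/hot-dog-not-hot-dog/validation", class)  # path assumed
  dir.create(valid_dir, recursive = TRUE, showWarnings = FALSE)

  files <- list.files(train_dir)
  moved <- sample(files, size = floor(0.2 * length(files)))
  file.rename(file.path(train_dir, moved), file.path(valid_dir, moved))
}
```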

Without augmentation, there is an easier way to do this split: when creating the image_data_generator(), we could add the validation_split argument, just like we did in task 1. But as the validation data should not be augmented, we need to separate the data prior to generating the batches of image data.

As we quite frequently had issues with the here() function being masked, we call detach() and then run library(here) once again. This makes sure that the function indeed finds the current project directory:

Setting up the generators without augmentation:

We are now generating batches of data from the images in our directory.

Image width and height are set to 224 x 224 pixels. The default batch_size is 32, which we changed to 50. This means that 50 randomly selected images from across the classes in the dataset will be returned in each batch during training. As we have a binary classification, “Hot Dog” or “Not Hot Dog”, we set the class_mode to “binary”.
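
A sketch of these generators, with the directory paths as assumptions:

```r
datagen <- image_data_generator(rescale = 1/255)

train_generator <- flow_images_from_directory(
  "data/hot-dog-not-hot-dog/train",  # path assumed
  generator   = datagen,
  target_size = c(224, 224),
  batch_size  = 50,
  class_mode  = "binary"
)

validation_generator <- flow_images_from_directory(
  "data/hot-dog-not-hot-dog/validation",  # path assumed
  generator   = datagen,
  target_size = c(224, 224),
  batch_size  = 50,
  class_mode  = "binary"
)

test_generator <- flow_images_from_directory(
  "data/hot-dog-not-hot-dog/test",  # path assumed
  generator   = datagen,
  target_size = c(224, 224),
  batch_size  = 50,
  class_mode  = "binary"
)
```

The per-class counts shown below can be reproduced with table(train_generator$classes), and likewise for the other two generators.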

First we would like to see how many images are actually loaded.

## 
##   0   1 
## 199 199
## 
##  0  1 
## 50 50
## 
##   0   1 
## 250 250

We have 199 images per class in the training set, 50 per class in the validation set and 250 per class in the test set.

## $hot_dog
## [1] 0
## 
## $not_hot_dog
## [1] 1

The binary classification distinguishes between two classes: 0 for hot dog, 1 for not hot dog.

As an example, we plot one of the images with a hot dog on it:

b. Estimate a convolutional neural network to predict if an image contains a hot dog or not. Evaluate your model on the test set.

The CNN in this task is quite similar to what we did in task 1. The difference is the number of layers we are going to add: we repeat the process of filtering and max pooling three times, which reduces the size of the image even more while the most important features are kept. We use the rectifier function ‘relu’ as the activation function in the hidden layers and ‘sigmoid’ for the output layer, because its range is (0, 1) and it can represent the probability of the positive class.
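
A sketch of such a network; the filter counts, dense-layer size and optimizer are assumptions, since the text does not state them:

```r
hotdog_cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(224, 224, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 64, activation = "relu") %>%   # size assumed
  layer_dense(units = 1, activation = "sigmoid")

hotdog_cnn %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(),  # optimizer assumed
  metrics = c("accuracy")
)
```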

We are going to fit the model using the fit_generator function, where the generator (or batches of data) we provide runs in parallel to the model, for efficiency.
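
A sketch of the fitting call; the step counts are derived from the sample counts and the batch size of 50:

```r
history <- hotdog_cnn %>% fit_generator(
  train_generator,
  steps_per_epoch  = ceiling(train_generator$n / 50),
  epochs           = 20,
  validation_data  = validation_generator,
  validation_steps = ceiling(validation_generator$n / 50)
)
```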

## Trained on 7 samples (batch_size=NULL, epochs=20)
## Final epoch (plot to see history):
##         loss: 0.6654
##     accuracy: 0.6724
##     val_loss: 0.6822
## val_accuracy: 0.61

The model reached an accuracy of 0.6724 (or 67%) on the training data and 0.61 (or 61%) on the validation data. The validation accuracy worsened around the ninth epoch but picked up again, so we assume the model is not overfitting the data yet.

##  Test loss: 0.6871667 
##  Test accuracy: 0.564

The CNN model reaches an accuracy of 0.564 on the test data.

c. Could data augmentation techniques help with achieving higher predictive accuracy? Try some augmentations that you think make sense and compare

Image data augmentation is a technique that can be used to artificially expand the size of a training dataset by creating modified versions of images in the dataset. This should result in an improvement of performance and the ability of the model to generalize.

The Keras deep learning neural network library provides the capability to fit models using image data augmentation via the ImageDataGenerator class.

First we are going to show an example of an augmented image:

Now we are going to generate two different batches of data from the images in our directory. Within this function we pass a list of parameters to the image_data_generator() function describing the alterations that we want it to perform on the images:

  • rotation_range defines how many degrees (0 to 180) the image is rotated.
  • width_shift_range defines the fraction of total width.
  • height_shift_range defines the fraction of total height.
  • shear_range defines the shear intensity (shear angle in radians).
  • zoom_range defines amount of zoom.
  • horizontal_flip defines whether to randomly flip images horizontally.

By default, the modifications will be applied randomly, so not every image will be changed every time.
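
A sketch of such an augmenting generator; the specific parameter values are assumptions chosen within the ranges documented above:

```r
augmented_datagen <- image_data_generator(
  rescale            = 1/255,
  rotation_range     = 40,   # degrees; value assumed
  width_shift_range  = 0.2,  # fraction of total width; value assumed
  height_shift_range = 0.2,  # fraction of total height; value assumed
  shear_range        = 0.2,  # shear angle in radians; value assumed
  zoom_range         = 0.2,  # value assumed
  horizontal_flip    = TRUE
)

augmented_train_generator <- flow_images_from_directory(
  "data/hot-dog-not-hot-dog/train",  # path assumed
  generator   = augmented_datagen,
  target_size = c(224, 224),
  batch_size  = 50,
  class_mode  = "binary"
)
```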

The final step is to use the fit_generator() function in order to train and validate our neural network on the augmented images.

## Trained on 7 samples (batch_size=NULL, epochs=20)
## Final epoch (plot to see history):
##         loss: 0.6839
##     accuracy: 0.5345
##     val_loss: 0.683
## val_accuracy: 0.57

This model reaches an accuracy of 0.5345 (or 53%) on the training data and 0.57 (or 57%) on the validation data.

The plot shows an interesting development, as the training loss increases again towards the end while the validation loss continues to go down. The accuracy curve also has an unusual shape: after the 15th epoch the validation accuracy goes down, which implies overfitting.

##  Test loss: 0.6911595 
##  Test accuracy: 0.51

The first CNN model with augmentation reaches an accuracy of 0.51 on the test data. This is not an improvement over the model trained on images without augmentation (CNN: 0.564 accuracy).

Now we train a second model:

## Trained on 7 samples (batch_size=NULL, epochs=20)
## Final epoch (plot to see history):
##         loss: 0.6805
##     accuracy: 0.5747
##     val_loss: 0.6821
## val_accuracy: 0.56

This model reaches an accuracy of 0.5747 (or 57%) on the training data and 0.56 (or 56%) on the validation data.

Here the validation accuracy keeps declining while the validation loss is rather stable. Since the training accuracy keeps improving until around epoch 13, we suspect overfitting.

##  Test loss: 0.6911627 
##  Test accuracy: 0.516

The second CNN model with augmentation reaches an accuracy of 0.516 on the test data. This is still not an improvement over the base model trained on images without augmentation, and is similar to the first augmented model (CNN without augmentation: 0.564 accuracy; CNN with augmentation (1): 0.51 accuracy).

d. Try to rely on some pre-built neural networks to aid prediction. Can you achieve a better performance using transfer learning for this problem?

To come up with better results than those of our manually created models, we can also use pre-trained models. As we also did in class, we will use the MobileNetV2 model with weights pre-trained on ImageNet.

To start, we will classify one single picture, again using our example image.
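
A sketch of this single-image classification; the image path is a placeholder:

```r
model_imagenet <- application_mobilenet_v2(weights = "imagenet")

# Load and preprocess the example image (path is a placeholder)
img <- image_load("path/to/example_hot_dog.jpg", target_size = c(224, 224))
x <- image_to_array(img)
x <- array_reshape(x, c(1, dim(x)))
x <- mobilenet_v2_preprocess_input(x)

# Predict and decode the top 3 ImageNet classes
preds <- model_imagenet %>% predict(x)
imagenet_decode_predictions(preds, top = 3)[[1]]
```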

##   class_name class_description      score
## 1  n07697537            hotdog 0.84387833
## 2  n07697313      cheeseburger 0.02479257
## 3  n07873807             pizza 0.01339789

We can see that the pretrained model already works quite well: the hot dog was classified as a hot dog with approx. 84% probability.

We will continue using the pre-trained model, adding it as the base layer of a keras_model_sequential(). For the binary classification we use the sigmoid activation function. The weights of the pre-trained base model are frozen and therefore not trainable.
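
A sketch of this setup; the pooling layer between the frozen base and the output, and the optimizer, are assumptions:

```r
base_model <- application_mobilenet_v2(
  weights     = "imagenet",
  include_top = FALSE,
  input_shape = c(224, 224, 3)
)
freeze_weights(base_model)  # the pre-trained weights stay fixed during training

transfer_model <- keras_model_sequential() %>%
  base_model %>%
  layer_global_average_pooling_2d() %>%  # pooling layer assumed; flattening also works
  layer_dense(units = 1, activation = "sigmoid")

transfer_model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(),  # optimizer assumed
  metrics = c("accuracy")
)
```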

Because training this model takes a really long time, we decided to go with only 5 epochs.

## Trained on 7 samples (batch_size=NULL, epochs=5)
## Final epoch (plot to see history):
##         loss: 0.5027
##     accuracy: 0.7701
##     val_loss: 0.6403
## val_accuracy: 0.72

This pre-trained model reaches an accuracy of 0.7701 (or 77%) on the training data and 0.72 (or 72%) on the validation data.

## Test loss: 0.4124853 
##  Test accuracy: 0.802

The pretrained model with augmentation reaches an accuracy of 0.802 on the test data. This is already a big improvement compared to the previous models (CNN without augmentation: 0.564 accuracy; CNN with augmentation (1): 0.51 accuracy; CNN with augmentation (2): 0.516 accuracy).

Summary Task 2

The clear conclusion of this task is that a pretrained model is already a really good starting point for many problems we might want to solve. Coming up with a good model manually takes a lot of time and trial and error in setting up the different layers and hyper-parameters. For this sort of task it is much more efficient to rely on a pretrained model, as it clearly outperformed the manually set-up models. In both of our models using augmentation we assume we were overfitting.