
Batch Normalization, Overfitting, Dropout, and Optimization

by JK from Korea 2022. 12. 16.


 

Date : 2022.10.16

 

*The contents of this post are heavily based on Stanford University’s CS231n course.

 

[Batch Normalization]

In the previous post, we explored various methods for weight initialization. The purpose of weight initialization was to evenly spread the activation outputs among all nodes.

 

Batch normalization is a method that spreads the activation outputs evenly without relying on a particular weight initialization. The benefits of batch normalization are the following.

 

  1. Speeds up learning (a higher learning rate can be used)
  2. Reduces reliance on weight initialization
  3. Suppresses overfitting

 

A CNN with batch normalization looks like the following.

 

[Batch Normalization between Layers]
[Batch Normalization Steps]

The following process transforms each mini-batch so that it has mean = 0 and variance = 1, which is the core of batch normalization. Each batch normalization layer then scales and shifts the normalized data.

 

β is the shift parameter and γ is the scale parameter; their initial values are 0 and 1, respectively. As learning proceeds, both values are adjusted accordingly.
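For reference, here is a minimal NumPy sketch of the forward pass described above. The names `gamma`, `beta`, and the small constant `eps` (added for numerical stability) are my own labels for illustration.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-7):
    # x: mini-batch of activations, shape (N, D)
    mu = x.mean(axis=0)                     # per-feature mean
    var = x.var(axis=0)                     # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize: mean 0, variance 1
    return gamma * x_hat + beta             # scale and shift

# initial values: gamma = 1 (scale), beta = 0 (shift)
x = np.random.randn(100, 50)
out = batchnorm_forward(x, gamma=np.ones(50), beta=np.zeros(50))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # roughly 0 and 1 before learning
```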

 

[Computational Graph for Batch Normalization]

 

[Batch Normalization Test]

Using MNIST, let’s compare how quickly the network learns with and without batch normalization.

 

[Batch Normalization Increases the Learning Rate]
[Comparison for Different Weights]

The W above each graph is the standard deviation of the initial weights. The outcomes of the various scenarios show that the initial weight distribution strongly affects the learning curve. However, despite the different weight initializations, batch normalization still yields stable learning curves. You can find the code on my GitHub page.

 

[Overfitting]

Overfitting refers to a network that has adapted too closely to a specific data set, which hurts its performance on unseen data. To prevent this, we are going to implement weight decay and dropout.

 

[When does overfitting occur?]

 

  1. Complex model (many layers and parameters)
  2. Insufficient data
  3. Uneven distribution of, or outliers in, the weight values

 

[Weight Decay]

If a weight grows noticeably large, it may cause the model to overfit to certain training data, since the weights adjust their values to specific characteristics of the input data.

 

Thus, weight decay, in simple terms, penalizes weights that stand out by adding their magnitude to the loss.
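A common way to do this is L2 weight decay: add 0.5 × λ × (sum of squared weights) to the loss so that large weights are pushed down during gradient descent. A rough sketch, with an illustrative λ of 0.1 and toy weight matrices:

```python
import numpy as np

def loss_with_weight_decay(data_loss, weight_list, lam=0.1):
    # L2 penalty: 0.5 * lambda * (sum of squared weights over all layers)
    l2_penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in weight_list)
    return data_loss + l2_penalty

# toy example: two weight matrices and a fictitious data loss of 2.3
weights = [np.random.randn(784, 100), np.random.randn(100, 10)]
print(loss_with_weight_decay(data_loss=2.3, weight_list=weights, lam=0.1))

# during backpropagation, each weight gradient gains an extra "lam * W" term:
# dW = dW_from_data + lam * W
```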

 

[Simply Set the Weight Decay Variable]
[Accuracy Comparison]

[Dropout]

Dropout randomly selects nodes within the network and temporarily deletes them. I followed the book to create a Trainer class that automates the training process as a separate object. The outcome with dropout is as follows.

 

[Dropout = True]

The gap between the train and test accuracy has decreased.
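For reference, a minimal dropout layer along the lines of the book’s implementation looks roughly like this (the class layout and the 0.5 default ratio are illustrative):

```python
import numpy as np

class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # randomly drop nodes during training
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        # at test time, keep all nodes but scale the outputs instead
        return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        # gradients flow only through the nodes that were kept
        return dout * self.mask

# quick usage check
drop = Dropout(0.5)
print(drop.forward(np.random.randn(3, 5), train_flg=True))
```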

 

[Finding the Appropriate Hyperparameter]

Hyperparameters are values we have to set ourselves. They include, but are not limited to, the number of nodes per layer, the batch size, the learning rate, and the weight decay constant. These values have no fixed answer; we must find optimal values through trial and error, although intuition and previous studies narrow down the options for us.

 

Before testing hyperparameters, we need to separate the data into three groups so that the hyperparameters are not tuned to, and thus overfit, the test data.

 

  1. Validation Data: used to experiment with and evaluate hyperparameters
  2. Training Data: used to train the weights
  3. Test Data: used to measure the final accuracy and generalization of the network

 

The shuffle function randomly mixes the data before we split off a validation set of the size we want.

 

[Data Separation]
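A minimal sketch of that split, using dummy data in place of MNIST and carving off 20% of the shuffled training set as validation data (the 20% ratio and the `shuffle_dataset` name are my own choices for illustration):

```python
import numpy as np

def shuffle_dataset(x, t):
    # shuffle samples and labels together with the same random permutation
    permutation = np.random.permutation(x.shape[0])
    return x[permutation], t[permutation]

# stand-in for the MNIST training set (60,000 images of 784 pixels)
x_train, t_train = np.random.rand(60000, 784), np.random.randint(0, 10, 60000)

x_train, t_train = shuffle_dataset(x_train, t_train)

validation_num = int(x_train.shape[0] * 0.2)   # carve off 20% as validation data
x_val, t_val = x_train[:validation_num], t_train[:validation_num]
x_train, t_train = x_train[validation_num:], t_train[validation_num:]
```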

Hyperparameter optimization takes a long time due to the number of cases. To expedite the process, we can reduce the number of epochs and the data size to a certain degree. The code is on GitHub, and the results are presented below.
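Before the results, here is roughly what that search loop looks like: sample the learning rate and the weight decay constant randomly on a log scale, train for a few epochs, and record the validation accuracy. The sampling ranges and the `train_and_evaluate` placeholder below are illustrative, not the exact values from my code.

```python
import numpy as np

def train_and_evaluate(lr, weight_decay):
    # placeholder: in the real experiment this trains the network for a few
    # epochs with the given hyperparameters and returns validation accuracy
    return np.random.rand()

results = {}
for trial in range(100):
    # sample hyperparameters randomly on a log scale
    weight_decay = 10 ** np.random.uniform(-8, -4)
    lr = 10 ** np.random.uniform(-6, -2)
    results[(lr, weight_decay)] = train_and_evaluate(lr, weight_decay)

# list the trials by validation accuracy, best first
for (lr, wd), acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"val acc {acc:.3f} | lr {lr:.5f} | weight decay {wd:.2e}")
```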

 

[Testing]
[Results]

 

It is reasonable to say that the top 5~6 trials perform better than the others. From the results we can narrow down the learning rate to roughly 0.005~0.009 and the weight decay constant to roughly 10⁻⁸~10⁻⁵. From there we keep narrowing the ranges until we decide to stop and choose values that give a reasonably high accuracy.

 
