Weight Initialization, Xavier Weights, Dropout, and Setting Hyperparameters
Date : 2022.10.14
*The contents of this post are heavily based on Stanford University’s CS231n course.
[Weight Initialization]
We’ve explored the gradient descent optimizers that update the weights. Now let’s focus on how the weights are initialized in the first place.
“What value shall we start with?”
So far we have used weight decay to prevent overfitting: it penalizes large weights because, when the weights dominate the inputs, the network ends up adapting to the weights themselves rather than to the input data.
*Overfitting: a state in which the network fits only the training data and performs poorly on other data sets.
If weight decay pushes the weights toward small values anyway, why not start with small weights? We must make sure, however, not to use identical weights (e.g., all zeros): every node in a layer would then compute the same output and receive the same gradient during backpropagation, so the nodes could never become different from one another, which defeats the purpose of training with backpropagation.
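As a concrete illustration (a minimal sketch, not code from the book; the layer sizes are arbitrary), compare identical weights with small Gaussian weights:

```python
import numpy as np

n_in, n_out = 100, 100  # arbitrary layer sizes for illustration

# Bad: identical weights -> every node in the layer computes the same output,
# so every node also receives the same gradient and they stay identical forever.
W_identical = np.zeros((n_in, n_out))

# Common starting point: small Gaussian noise (std = 0.01) breaks the symmetry
# between nodes while still keeping the weights small.
W_small_random = 0.01 * np.random.randn(n_in, n_out)
```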
[Trial & Error]
We’re going to experiment with different initial weight values and activation functions to see how weight initialization affects network performance.
The following code tracks the activation values (output per layer).
My version of the code differs from the book’s: I automated all the cases so that the output graphs are produced in a single run instead of running the code separately for each case. Go check it out.
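For reference, here is a minimal sketch of this kind of experiment (not my actual automated script): a 5-layer network of 100 nodes per layer is fed 1000 Gaussian samples, and the activation values of each layer are plotted as histograms. The specific list of cases, layer sizes, and plotting layout are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

x = np.random.randn(1000, 100)   # 1000 random input samples
node_num = 100                   # nodes per layer
hidden_layer_size = 5            # number of layers to track

# each case: (label, weight std, activation function)
cases = [
    ("sigmoid, std=1",    1.0,                     sigmoid),
    ("sigmoid, std=0.01", 0.01,                    sigmoid),
    ("sigmoid, Xavier",   1.0 / np.sqrt(node_num), sigmoid),
    ("ReLU, He",          np.sqrt(2.0 / node_num), relu),
]

fig, axes = plt.subplots(len(cases), hidden_layer_size, figsize=(15, 8))

for row, (label, std, act) in enumerate(cases):
    a = x
    for layer in range(hidden_layer_size):
        w = np.random.randn(node_num, node_num) * std
        a = act(np.dot(a, w))    # activation values (output) of this layer
        axes[row, layer].hist(a.flatten(), 30, range=(0, 1))
        axes[row, layer].set_title(f"{label}\nlayer {layer + 1}", fontsize=7)

plt.tight_layout()
plt.show()
```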
Conclusion first: the best case was ReLU as the activation function combined with He initialization.
Sigmoid does not produce good activation distributions, and this can be understood fairly intuitively. The sigmoid function only responds meaningfully to inputs roughly between -5 and 5; anything outside that range is squashed to nearly 0 or 1. With large initial weights the activation values therefore skew toward the two ends of the output range, and when most activations sit at 0 or 1 the sigmoid’s gradient is nearly zero, so the gradients vanish during backpropagation. With very small initial weights the opposite problem appears: the activations all concentrate around the same value, which defeats the purpose of having multiple nodes. Each node should produce a distinct value in order to represent different characteristics of the input data (MNIST).
We can conclude that the ideal scenario is for each layer’s activation outputs to be evenly spread; the network trains most effectively under that condition. Just remember that the initial weight values significantly affect how quickly, and how well, the network learns.
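For reference, here is a minimal sketch of the two initialization scales discussed above, assuming a layer with n_in inputs. The sizes are illustrative, and the Xavier formula shown is the simplified 1/sqrt(n) form commonly used with this experiment rather than Glorot’s original 2/(n_in + n_out) variance.

```python
import numpy as np

n_in, n_out = 784, 100  # illustrative sizes (e.g. MNIST input -> hidden layer)

# Xavier initialization (suits sigmoid/tanh): std = 1 / sqrt(n_in)
W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# He initialization (suits ReLU): std = sqrt(2 / n_in)
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```

He initialization uses a variance twice as large as Xavier’s because ReLU zeroes out roughly half of its inputs, so the extra scale keeps the spread of activations from shrinking layer by layer.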