
SGD, Momentum, AdaGrad, and Adam

by JK from Korea 2022. 12. 16.

Stochastic Gradient Descent, Momentum, AdaGrad, and Adam

 

Date : 2022.10.11

 

*The contents of this book are heavily based on Stanford University’s CS231n course.

 

Optimization is the process of finding the parameter values that minimize the loss function. We will explore different optimization methods for updating the network’s weights and the hyperparameters (such as the learning rate) that control those updates.

 

The purpose of these “methods” is to improve both the efficiency and the accuracy of the CNN.

 

[Optimization]

There is no shortcut or ‘easy’ way of optimizing the parameters. Complex networks demand more time and memory to train. We have already had a taste of one optimization method: SGD, which differentiates the loss function with respect to each weight and uses that gradient to move towards better weight values.

 

[SGD]

 

[SGD Equation]
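For reference, the standard SGD update rule is W ← W − η · ∂L/∂W, where η is the learning rate and ∂L/∂W is the gradient of the loss L with respect to the weights W.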

 

[SGD in Python]

By separating the different optimization methods into independent classes (objects), we can use each method like a module. SGD is not always the best optimization method; for example, it is relatively inefficient on anisotropic functions, where the gradient is much steeper in some directions than in others.
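As a reference, here is a minimal sketch of such an SGD class. It assumes the dictionary-based update(params, grads) interface used in Deep Learning from Scratch (parameters and gradients keyed by name as NumPy arrays); the actual code in the repository linked at the end may differ in details.

class SGD:
    """Vanilla stochastic gradient descent: W ← W − η · ∂L/∂W."""
    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate η

    def update(self, params, grads):
        # params and grads are dicts keyed by parameter name (e.g. 'W1', 'b1')
        for key in params.keys():
            params[key] -= self.lr * grads[key]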

 

[Example of an Anisotropy Function]
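A typical example, used in Deep Learning from Scratch, is f(x, y) = (1/20)x² + y²: the gradient is much steeper along the y-axis than along the x-axis, so plain SGD zig-zags up and down in the y direction while making only slow progress along x.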

We shall explore some other options.

 

[Momentum]

 

[SGD w/ Momentum Equation]
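Written out, the standard momentum update is 𝑣 ← 𝑝 · 𝑣 − η · ∂L/∂W, followed by W ← W + 𝑣.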

𝑣 is the velocity and 𝑝 is the resistance.

 

[Momentum in Python]
[Momentum Curvature]

Compared to SGD, the momentum method traces a smoother curve towards the minimum point. The velocity variable pulls the x variable along faster: like a ball rolling downhill, the gradient along the x-axis is small but always points in the same direction, so the velocity keeps accumulating and adds speed (greater descent) along x, whereas the alternating gradients along y largely cancel each other out.
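A sketch of a Momentum class in the same modular style (again assuming the dictionary-based update(params, grads) interface; the momentum argument plays the role of the resistance term 𝑝 above, and the names are illustrative):

import numpy as np

class Momentum:
    """SGD with momentum: v ← p·v − η·∂L/∂W, then W ← W + v."""
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr              # learning rate η
        self.momentum = momentum  # resistance term p
        self.v = None             # velocities, created lazily

    def update(self, params, grads):
        if self.v is None:
            # one zero-initialized velocity array per parameter
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]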

 

[AdaGrad]

The learning rate is an essential part of optimizing the weight variables. ‘Learning rate decay’ is an efficient way of selecting and updating the learning rate: it generally starts with a high learning rate and gradually decays it as the network reaches higher accuracy. The AdaGrad method takes this idea further and adapts the learning rate for each weight individually, by accumulating the squares of the past gradients and dividing the learning rate by the square root of that sum.

 

[AdaGrad Equation]
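In this notation, the AdaGrad update is

h ← h + ∂L/∂W ⊙ ∂L/∂W
W ← W − η · (1/√h) · ∂L/∂W

where ⊙ denotes element-wise multiplication.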

It’s interesting how the adaptive factor 1/√h appears as a multiplier on the learning rate. The h variable from line 1 accumulates the squared gradients for each parameter: the greater a parameter’s past gradients, the more its learning rate decays.

 

“Gradient Magnitude Increases → Accumulated h (Decay Factor) Increases → Effective Learning Rate Decreases → Next Update Step Decreases”

 

[AdaGrad in Python]
[AdaGrad Optimization Route]

The optimization route shows a drastic decrease in the zig-zag pattern along the y-axis. After the first large gradient step in the y direction, the accumulated h significantly decreases the effective learning rate for y, so subsequent updates along that axis become much smaller.
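A sketch of an AdaGrad class in the same style (the small constant 1e-7 added before the division is an assumption to avoid dividing by zero on the first update):

import numpy as np

class AdaGrad:
    """AdaGrad: h ← h + (∂L/∂W)², then W ← W − η·(1/√h)·∂L/∂W."""
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None  # accumulated squared gradients, created lazily

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            # 1e-7 prevents division by zero while h is still 0
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)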

 

[Adam]

Adam is a combination of the two ideas above: the velocity term of ‘momentum’ and the per-parameter adaptive learning rate of ‘adagrad.’
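As a rough sketch, an Adam class in the same style might look like the following. This is the standard Adam update with bias correction; the implementation in the linked repository may use a simplified variant, and the parameter names here are illustrative.

import numpy as np

class Adam:
    """Adam: momentum-style first moment + AdaGrad/RMSProp-style second moment."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1  # decay rate for the first moment (velocity)
        self.beta2 = beta2  # decay rate for the second moment (squared gradients)
        self.eps = eps      # small constant to avoid division by zero
        self.m = None
        self.v = None
        self.t = 0

    def update(self, params, grads):
        if self.m is None:
            self.m = {key: np.zeros_like(val) for key, val in params.items()}
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        self.t += 1
        for key in params.keys():
            # running averages of the gradient and of its square
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key] ** 2
            # bias-corrected estimates
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)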

 

[Adam Learning Route]

 

 

I am skipping the comparison between the different optimization methods because it is not part of my key interest, which is building a CNN. The comparison code is in the GitHub link below; it is fairly easy to follow.

 

https://github.com/jk-junhokim/deep-learning-from-scratch/tree/master/ch06

 
