Optimization Algorithms
Usage of effective optimization algorithms
Fast, well-chosen optimization algorithms can speed up the whole training process.
Batch Gradient Descent
In batch gradient descent, we use the entire dataset to compute the gradient of the cost function in each iteration, and then update the weights.
Because every single update requires a full pass over the dataset, each iteration is expensive and convergence is slow on large datasets.
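A minimal sketch of the idea, assuming a one-feature linear model y = w * x + b trained with mean squared error. The model, the function name `batch_gradient_descent`, and the toy data are illustrative assumptions, not part of the original notes.

```python
def batch_gradient_descent(xs, ys, lr=0.05, epochs=1000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # The gradient is averaged over the ENTIRE dataset,
        # and only then is a single weight update applied.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / m * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / m * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = batch_gradient_descent(xs, ys)
```

Note that the cost of one update grows linearly with the number of examples, which is what makes this variant slow on large datasets.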
Stochastic Gradient Descent (SGD)
In stochastic gradient descent, we use a single datapoint (example) to calculate the gradient and update the weights in every iteration. We first need to shuffle the dataset so that the examples are visited in random order.
The noise introduced by random sampling can help the optimizer escape shallow local minima, although it does not guarantee reaching the global minimum.
Each update is cheap, so learning progresses much faster on a very large dataset, at the price of a noisier path toward the minimum.
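The same kind of toy problem can illustrate the per-example update. This is a sketch under assumptions: the linear model, the function name `sgd`, and the fixed seed are illustrative choices, not something the notes prescribe.

```python
import random


def sgd(xs, ys, lr=0.02, epochs=500, seed=0):
    """Fit y = w * x + b, updating the weights after every single example."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of the squared error of this ONE example.
            w -= lr * 2.0 * error * x
            b -= lr * 2.0 * error
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd(xs, ys)
```

On noisy real data the weights would keep jittering around the minimum instead of settling exactly, which is why the learning rate is often decayed over time.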
Mini Batch Gradient Descent
Mini-batch gradient descent is a variation of stochastic gradient descent in which a mini-batch of samples is used instead of a single training example.
Mini-batch gradient descent is widely used: it typically converges faster than batch gradient descent and is more stable than SGD.
Batch size can vary depending on the dataset.
1 ≤ batch-size ≤ m, where m is the number of training examples; batch-size is a hyperparameter.
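As a sketch of how the dataset might be split and consumed, the helper `iterate_minibatches` and the function `minibatch_gd` below are hypothetical names, and the linear model is again an assumption made only for illustration.

```python
import random


def iterate_minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches; the last one may be smaller."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


def minibatch_gd(xs, ys, batch_size=2, lr=0.02, epochs=500, seed=0):
    """Fit y = w * x + b, updating the weights once per mini-batch."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in iterate_minibatches(data, batch_size, rng):
            n = len(batch)
            errors = [(w * x + b) - y for x, y in batch]
            # Gradient averaged over the current mini-batch only.
            grad_w = 2.0 / n * sum(e * x for e, (x, _y) in zip(errors, batch))
            grad_b = 2.0 / n * sum(errors)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_gd(xs, ys)
```

Setting `batch_size=1` recovers SGD and `batch_size=len(xs)` recovers batch gradient descent, which matches the range 1 ≤ batch-size ≤ m.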
Comparison
Very large batch-size (m or close to m):
  Each iteration takes too long
Very small batch-size (1 or close to 1):
  Loses the speed-up from vectorization
Batch-size neither too large nor too small:
  Vectorization still applies
  Good speed per iteration
  Usually the fastest learning in practice
Guidelines for Choosing Batch-Size
For a small dataset (m ≤ 2000), use batch gradient descent
Typical mini batch-size: 64, 128, 256, 512, up to 1024
Make sure mini batch-size fits in your CPU/GPU memory
Choosing the mini batch-size as a power of 2 is often faster, because of the way computer memory is laid out and accessed
Gradient Descent with Momentum
Gradient descent with momentum almost always converges faster than the standard gradient descent algorithm. On an elongated cost surface, standard gradient descent oscillates: it takes large steps along the steep direction and only small steps along the direction that actually leads to the minimum, which slows the algorithm down.
Momentum improves this by damping the oscillation in the steep direction, so the algorithm can converge faster. Also, since the movement in the y-direction (the oscillating one) is restricted, we can set a higher learning rate.
The following image illustrates this:
Formula:
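A commonly used form of the momentum update is given here as an assumption, since it follows the usual textbook presentation. It keeps an exponentially weighted average of the gradients, where dW and db denote the gradients of the cost with respect to the weights W and the bias b, α is the learning rate, and β is the momentum coefficient:

```latex
\begin{aligned}
v_{dW} &= \beta \, v_{dW} + (1 - \beta) \, dW \\
v_{db} &= \beta \, v_{db} + (1 - \beta) \, db \\
W &= W - \alpha \, v_{dW} \\
b &= b - \alpha \, v_{db}
\end{aligned}
```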
For better understanding:
When momentum is used to speed up gradient descent, the terms can be read as follows:
The derivatives are the acceleration
The v's are the velocity
β is the friction
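The analogy can be made concrete with a short sketch. It assumes the exponentially weighted average form of momentum (v = β·v + (1 − β)·gradient), and the function `gd_with_momentum` and the elongated test function `bowl_grad` are hypothetical, chosen only to mimic a cost surface that is much steeper in one direction than in the other.

```python
def gd_with_momentum(grad, w0, lr=0.1, beta=0.9, steps=500):
    """Minimize a function, given its gradient, with momentum."""
    w = list(w0)
    v = [0.0] * len(w)  # velocity, one entry per parameter
    for _ in range(steps):
        g = grad(w)  # the derivatives push the velocity (acceleration)
        # beta damps the old velocity (friction) before the push is added.
        v = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w


def bowl_grad(w):
    """Gradient of the hypothetical bowl f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


w_final = gd_with_momentum(bowl_grad, [10.0, 1.0])
```

With β = 0 the velocity equals the current gradient and the loop reduces to standard gradient descent.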
RMSprop Optimizer
The RMSprop optimizer is similar to the gradient descent algorithm with momentum. It restricts the oscillations in the vertical direction, so we can increase the learning rate and the algorithm can take larger steps in the horizontal direction, converging faster.
The difference between RMSprop and gradient descent with momentum lies in how the update is computed: RMSprop scales each gradient by a moving average of its squared values, using the following formula:
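A commonly used form of the RMSprop update is given here as an assumption, since it follows the usual textbook presentation. It keeps an exponentially weighted average of the squared gradients and divides the step by its square root. Here dW is the gradient of the cost with respect to the weights W, the square is element-wise, α is the learning rate, β is the decay rate, and ε is a small constant that prevents division by zero (the bias b is updated in the same way):

```latex
\begin{aligned}
s_{dW} &= \beta \, s_{dW} + (1 - \beta) \, dW^{2} \\
W &= W - \alpha \, \frac{dW}{\sqrt{s_{dW}} + \varepsilon}
\end{aligned}
```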
Adam Optimizer
Adam stands for: ADAptive Moment estimation
Adam is one of the most commonly used optimization algorithms today. It can be seen as a combination of RMSprop and stochastic gradient descent with momentum: like RMSprop, it uses the squared gradients to scale the learning rate, and like SGD with momentum, it uses a moving average of the gradient instead of the gradient itself.
To summarize: Adam = RMSProp + GD with momentum + bias correction
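The summary can be sketched in a few lines. This is an illustrative implementation under assumptions, not a reference one: the function name `adam`, the learning rate of 0.1, and the test function `bowl_grad` are hypothetical, while the update itself follows the usual presentation of Adam (first moment, second moment, bias correction).

```python
import math


def adam(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a function, given its gradient, with an Adam-style update."""
    w = list(w0)
    v = [0.0] * len(w)  # first moment: moving average of gradients (momentum)
    s = [0.0] * len(w)  # second moment: moving average of squared gradients (RMSprop)
    for t in range(1, steps + 1):
        g = grad(w)
        v = [beta1 * vi + (1.0 - beta1) * gi for vi, gi in zip(v, g)]
        s = [beta2 * si + (1.0 - beta2) * gi * gi for si, gi in zip(s, g)]
        # Bias correction: both averages start at zero and are too small early on.
        v_hat = [vi / (1.0 - beta1 ** t) for vi in v]
        s_hat = [si / (1.0 - beta2 ** t) for si in s]
        w = [wi - lr * vh / (math.sqrt(sh) + eps)
             for wi, vh, sh in zip(w, v_hat, s_hat)]
    return w


def bowl_grad(w):
    """Gradient of the hypothetical bowl f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


w_adam = adam(bowl_grad, [10.0, 1.0])
```

Because the step is divided by the root of the second moment, the very first update has magnitude close to `lr` in every coordinate, regardless of how large the raw gradient is.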
Hyperparameter choices (recommended values)
α: needs to be tuned
β1: 0.9
β2: 0.999
ε: 10^-8