ANN Series - 3 - Gradient Descent with Linear Activation

What is Gradient Descent?

Gradient Descent is a fundamental optimization algorithm used in machine learning and artificial intelligence to minimize the cost function, which is a measure of how far off a model's predictions are from the actual values. The goal of gradient descent is to find the model parameters (typically weights in neural networks) that minimize the cost function, thereby improving the model's predictions.

With a numeric example, let us see how to find the model parameters using Gradient Descent when the activation function is the identity (linear) function f(x) = x.
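
To make this concrete, here is a minimal numeric sketch for a single weight with identity activation and squared-error loss; the example values (x = 2, y = 4, initial w = 1, learning rate 0.1) are assumed for illustration and are not taken from the linked notebook:

    # Values assumed for illustration, not taken from the linked notebook.
    x, y = 2.0, 4.0               # one training example and its target
    w = 1.0                       # initial weight
    eta = 0.1                     # learning rate

    y_hat = w * x                 # forward pass with identity activation: y_hat = 2
    loss = (y - y_hat) ** 2       # squared error: (4 - 2)^2 = 4
    grad = -2 * (y - y_hat) * x   # dLoss/dw = -2(y - y_hat)x = -8
    w = w - eta * grad            # update: 1 - 0.1 * (-8) = 1.8
    print(w)                      # 1.8, moving toward the optimum w = 2

A single update moves w from 1 to 1.8, already closer to the optimal value w = 2; repeating the update drives the squared error toward zero.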


Python Code for Calculating the Gradient and Updating the Weights:

ml-course/gd.ipynb at main · lovelynrose/ml-course (github.com)

How Gradient Descent Works: Summary

  1. Initialization: Start with initial values for the parameters to be optimized (weights in the case of neural networks).

  2. Compute Gradient: Calculate the gradient of the cost function with respect to each parameter. The gradient is a vector that represents the direction and rate of the fastest increase of the function. By computing the gradient, we find out how the cost function changes with changes in the parameters.

  3. Update Parameters: Adjust the parameters in the opposite direction of the gradient (the direction that reduces the cost function) by a small step. The size of the step is determined by the learning rate, a hyperparameter that controls how much we adjust the parameters with respect to the gradient. The formula for updating each parameter is:

    w = w − η × Gradient

    where
    w represents the parameters (the weights), and
    η is the learning rate.
  4. Iterate: Repeat the process of calculating the gradient and updating the parameters until the cost function converges to a minimum value or until a specified number of iterations is reached (a minimal Python sketch of the full loop follows this list).
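
Putting the four steps together, the following is a minimal sketch of batch gradient descent for a single-feature linear model y_hat = w*x + b with mean squared error; the function name, data, and hyperparameters are illustrative assumptions rather than code from the linked notebook:

    import numpy as np

    def gradient_descent(x, y, eta=0.05, n_iters=1000):
        # Batch gradient descent for y_hat = w*x + b with mean squared error.
        w, b = 0.0, 0.0                            # 1. Initialization
        n = len(x)
        for _ in range(n_iters):                   # 4. Iterate
            error = (w * x + b) - y                #    identity (linear) activation
            grad_w = (2.0 / n) * np.dot(error, x)  # 2. Compute gradient dL/dw
            grad_b = (2.0 / n) * error.sum()       #    and dL/db
            w -= eta * grad_w                      # 3. Update parameters
            b -= eta * grad_b
        return w, b

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 2.0 * x + 1.0                              # data generated by y = 2x + 1
    print(gradient_descent(x, y))                  # approaches (2.0, 1.0)

Because the activation is linear, the cost surface is convex, so the loop settles near the generating values w = 2 and b = 1.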

Types of Gradient Descent:

  • Batch Gradient Descent: Uses the entire training dataset to calculate the gradient of the cost function for each iteration. While accurate, it can be very slow and computationally expensive for large datasets.

  • Stochastic Gradient Descent (SGD): Updates the parameters for each training example one by one. It is much faster and can be used for online learning, but the frequent updates result in a more fluctuating convergence path.

  • Mini-batch Gradient Descent: Strikes a balance between batch and stochastic gradient descent by updating parameters for a small subset of the training data at each iteration. This approach offers a compromise between the computational efficiency of SGD and the stability of batch gradient descent (all three variants are sketched in the code below).
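
All three variants apply the same update rule and differ only in how many training examples contribute to each gradient estimate. Here is a minimal sketch that reuses the linear model above; the helper names, data, and hyperparameters are assumptions made for illustration:

    import numpy as np

    def make_batches(x, y, batch_size):
        # Yield shuffled batches; batch_size = len(x) is batch GD, batch_size = 1 is SGD.
        idx = np.random.permutation(len(x))
        for start in range(0, len(x), batch_size):
            sel = idx[start:start + batch_size]
            yield x[sel], y[sel]

    def train(x, y, batch_size, eta=0.05, n_epochs=200):
        w, b = 0.0, 0.0
        for _ in range(n_epochs):
            for xb, yb in make_batches(x, y, batch_size):
                error = (w * xb + b) - yb
                w -= eta * (2.0 / len(xb)) * np.dot(error, xb)
                b -= eta * (2.0 / len(xb)) * error.sum()
        return w, b

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 2.0 * x + 1.0
    print(train(x, y, batch_size=len(x)))   # batch gradient descent
    print(train(x, y, batch_size=1))        # stochastic gradient descent
    print(train(x, y, batch_size=2))        # mini-batch gradient descent

Choosing batch_size equal to the dataset size recovers batch gradient descent, batch_size = 1 recovers SGD, and intermediate values give mini-batch gradient descent.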

Key Features:

  • Convergence: Gradient descent aims to reach the global minimum of the cost function for convex problems or a local minimum for non-convex problems.

  • Learning Rate: Choosing the right learning rate is crucial. Too small a learning rate leads to slow convergence, while too large a learning rate can cause the algorithm to oscillate around the minimum or even diverge (a small numeric illustration follows this list).

  • Application: Gradient descent is widely used in training a variety of machine learning models, including linear regression, logistic regression, and neural networks.
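
To make the learning-rate trade-off concrete, here is a small assumed experiment (not from the linked notebook) on the one-dimensional loss L(w) = w**2, whose minimum is at w = 0:

    # Effect of the learning rate on the 1-D loss L(w) = w**2 (minimum at w = 0).
    def run(eta, n_steps=20, w=5.0):
        for _ in range(n_steps):
            w -= eta * 2 * w      # gradient of w**2 is 2w
        return w

    print(run(eta=0.01))   # ~3.3   : too small, progress is slow
    print(run(eta=0.1))    # ~0.06  : reasonable, close to the minimum
    print(run(eta=1.1))    # ~192   : too large, the iterates oscillate and grow

After 20 steps the small rate has made little progress, the moderate rate is near the minimum, and the large rate makes the iterates oscillate with growing magnitude.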

Gradient descent is powerful because it is simple yet effective at optimizing complex functions, making it a cornerstone of many machine learning algorithms.
