Gradient descent is one of the most important optimization algorithms in machine learning. It is used to minimize a function by repeatedly stepping in the direction of steepest descent. In machine learning it is used to update the parameters of a model. Here we will discuss how gradient descent minimizes the cost function.
Look at the picture above. Theta 0 and theta 1 lie on the two horizontal axes, and the cost function J(theta 0, theta 1) is on the vertical axis. Here we assume that J is a function of theta 0 and theta 1 only, but in general J can depend on theta 0, theta 1, and up to some theta n. For simplicity I am assuming we have only theta 0 and theta 1. In this example we are trying to minimize the cost function; gradient descent can minimize other functions too, but here we focus on the cost function. So, how do we go about that? First, we choose initial values for theta 0 and theta 1. In this picture they are initialized to zero, but in other projects you might want to initialize them with different values.
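To make J(theta 0, theta 1) concrete, here is a minimal sketch of one common choice of cost function: the mean squared error of a linear hypothesis h(x) = theta 0 + theta 1 * x. The data points are invented purely for illustration; the original text does not specify a particular cost function.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1 / 2m) * sum of (h(x_i) - y_i)^2."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # made-up data, perfectly fit by theta0 = 0, theta1 = 2

print(cost(0.0, 2.0, xs, ys))  # a perfect fit gives zero cost
print(cost(0.0, 0.0, xs, ys))  # initializing with zeros gives a higher cost
```

Plotting this cost over a grid of (theta 0, theta 1) values produces exactly the kind of bowl-shaped surface shown in the picture.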
So, our job is to find the fastest route to the local optimum, as shown in the picture. One property of gradient descent is that you might reach a different local optimum depending on where you start, as shown in the picture above. Let's look at the math. This is the gradient descent algorithm:
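In symbols, the update rule referenced above is conventionally written as:

```latex
\text{repeat until convergence:} \quad
\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
\qquad \text{for } j = 0, 1
```

The partial derivative is the slope of the cost surface along the theta_j direction, and alpha scales how big a step we take down that slope.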
Here, j is the feature index; in our case j = 0, 1. Repeat the update until convergence. In each iteration, theta 0 and theta 1 must be updated simultaneously.
The correct way to update theta 0 and theta 1 is:
The wrong way is to do it like this:
This is wrong because theta 0 is assigned before temp1 is computed, so the new theta 0 is used to compute temp1. That is not a simultaneous update.
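The difference is easy to see in code. This sketch uses an invented cost J = t0² + t0·t1 + t1², chosen because its partial derivatives (2·t0 + t1 and t0 + 2·t1) couple the two parameters, so a leaked update actually changes the result.

```python
def d_theta0(t0, t1):
    return 2 * t0 + t1   # dJ/dtheta0 for J = t0**2 + t0*t1 + t1**2

def d_theta1(t0, t1):
    return t0 + 2 * t1   # dJ/dtheta1

alpha = 0.1
theta0 = theta1 = 1.0

# Correct: both temporaries are computed from the OLD values first,
# then both parameters are assigned.
temp0 = theta0 - alpha * d_theta0(theta0, theta1)
temp1 = theta1 - alpha * d_theta1(theta0, theta1)
correct = (temp0, temp1)   # both components come out near 0.7

# Wrong: theta0 is overwritten before the second derivative is taken,
# so the NEW theta0 leaks into the update of theta1.
theta0 = theta0 - alpha * d_theta0(theta0, theta1)
wrong_theta1 = theta1 - alpha * d_theta1(theta0, theta1)  # near 0.73, not 0.7

print(correct, (theta0, wrong_theta1))
```

The two versions land on different parameter values after a single step, which is exactly why the simultaneous form is required.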
Alpha in this equation is the learning rate. Choosing a suitable learning rate can be tricky. Because alpha is multiplied by the slope, theta increases when the slope is negative and decreases when the slope is positive. The value of alpha can stay fixed: as the slope gets smaller near a minimum, the steps shrink automatically, so there is no need to decrease alpha over time. It is still important to watch how long convergence takes; if it takes too long, the value of alpha should be changed. One good way to choose alpha is to tabulate the time required to converge for several values of alpha and pick an optimum from there.
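Both points above, that a fixed alpha still takes shrinking steps, and that the iteration count depends strongly on alpha, can be sketched on a toy one-parameter cost J(theta) = theta². The cost function, starting point, and alpha values here are all illustrative choices, not from the original text.

```python
def minimize(alpha, theta=5.0, tol=1e-6, max_iters=10_000):
    """Gradient descent on J(theta) = theta**2; returns (theta, iterations)."""
    for i in range(max_iters):
        step = alpha * 2 * theta          # slope of theta**2 is 2*theta
        if abs(step) < tol:               # steps shrink as the slope does,
            return theta, i               # even though alpha is fixed
        theta -= step
    return theta, max_iters

# Tabulating iterations against alpha, as suggested above:
for alpha in (0.01, 0.1, 0.5):
    theta, iters = minimize(alpha)
    print(f"alpha={alpha}: converged to {theta:.6f} in {iters} iterations")
```

A small alpha converges slowly, a moderate alpha converges quickly, and (though not shown here) an alpha that is too large can overshoot the minimum and diverge.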