Gradient Descent

Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. The algorithm iteratively updates the parameters of the function by computing the gradient of the function. Consider a differentiable function, f(x)=x^2-10x as shown in Figure 1. We know that the function is minimum when x=5. Assuming that x=12, to get to the minimum, we need to move to the “left”. Specifically, we need to update x by subtracting a value from it. To determine the value that we need to subtract from x, we compute the gradient at x=12 which is the slope of the tangent line at the given point. In other words, the gradient is the rate of change of the function at a given point.

Figure 1. The gradient is the slope of the tangent line at a given point.

Given a function, the gradient is computed by taking the derivative of the function with respect to its parameter.

\frac{\partial f}{\partial x} = 2x - 10

Then, the parameter is updated iteratively until optimal solution is obtained as follows:

Repeat until converge: x := x - \alpha \frac{\partial f}{\partial x}

where 0 \leq \alpha \leq 1 is the step size or learning rate. \alpha controls the amount of the update to be applied to the parameter x. A small \alpha corresponds to a little step which means the optimization will take longer to complete while a large \alpha may cause the gradient descent to overshoot (miss the minimum) and fail to converge.

Let’s analyze the parameter update equation. Consider x=12 as shown in Figure 1. For now, assuming \alpha = 1. We know that \frac{\partial f}{\partial x}>0 (positive value). Based on the update equation, x will be reduced by the value of \frac{\partial f}{\partial x}.

If x is to the left of the minimum as shown in Figure 2. We know \frac{\partial f}{\partial x}<0 (negative value). Based on the update equation, x will be increased by the value of \frac{\partial f}{\partial x}.

Figure 2.