Backpropagation and Gradient Descent

A neural network needs to be trained to solve a prediction problem. Training (or learning) is the process of finding the weight and bias values that produce the desired output at the output layer when a certain input is given to the network. How does a neural network learn? When a human learns to do something, we are told whether we are doing it right or wrong. For example, when learning to play football, we watch how the ball flies as we kick it. The next time we kick the ball, we remember the mistakes we made before and adjust our movement to kick the ball a bit better. Similarly, a neural network learns to solve a problem by comparing the output produced by the network with the output it was meant to produce. The difference between the network's output and the true output is the amount of error that the network needs to reduce in order to improve its output. This is formulated as follows.

    \[ J(w)=\frac{1}{2}\sum_{i=1}^{N} (y_i-\hat{y}_i)^2 \]

J(w) is a measure of how badly the network is performing; specifically, it quantifies the error between the true outputs and the predicted outputs. This measure is known as the cost function, and it is the function that needs to be minimized in order for the network to learn to produce the correct output. y is the true output, \hat{y} is the predicted output produced by the network, and N is the total number of training examples.
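As a concrete illustration, the cost can be computed directly from the true and predicted outputs. The sketch below uses NumPy; the function name and example values are chosen for illustration only.

    import numpy as np

    def mse_cost(y_true, y_pred):
        # J(w) = 1/2 * sum((y_i - y_hat_i)^2) over all training examples
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return 0.5 * np.sum((y_true - y_pred) ** 2)

    # Three training examples: J = 0.5 * (0.01 + 0.01 + 0.09) = 0.055
    print(mse_cost([1.0, 2.0, 3.0], [0.9, 2.1, 2.7]))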

The above cost function is for a regression problem. For a binary classification problem, the cost function is given as follows.

    \[ J(w)=-\sum_{i=1}^{N} \left( y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right) \]
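A minimal sketch of this cost in NumPy (again with illustrative names and values) is shown below; the small eps term is added here only to avoid taking the logarithm of zero.

    import numpy as np

    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        # J(w) = -sum(y_i*log(y_hat_i) + (1 - y_i)*log(1 - y_hat_i))
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
        return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    # Good predictions give a small cost
    print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.8]))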

For multi-class classification, the cost function is given as follows.

    \[ J(w)=-\sum_{c=1}^{C} y_c\log(\hat{y}_c) \]
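For a single example with C classes, where y is a one-hot vector encoding the true class and \hat{y} holds the predicted class probabilities, the cost could be computed as in the sketch below (names and values are illustrative).

    import numpy as np

    def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
        # J(w) = -sum_c y_c * log(y_hat_c) for one example with C classes
        y_true = np.asarray(y_true, dtype=float)                      # one-hot true label
        y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)   # predicted probabilities
        return -np.sum(y_true * np.log(y_pred))

    # Three classes, true class is the second one: cost = -log(0.7) ≈ 0.357
    print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))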

Minimizing the cost function involves adjusting the weights of the connections between the neurons in the network. The weights are adjusted in such a way that the output produced by the network is close to the actual output. This involves propagating the error from the output layer back to the hidden layer(s) and the input layer; hence the process is called backpropagation. The optimization algorithm commonly used together with backpropagation is Gradient Descent.

Gradient Descent is an iterative optimization algorithm for finding the parameters (weights) that minimize the cost function. It works by calculating the partial derivative of the cost function with respect to each weight (the gradient) and using the calculated gradients to update the weights. The weight update is defined as follows.

    \[ w_{ij}^{(l)} := w_{ij}^{(l)} - \alpha \frac{\partial J}{\partial w_{ij}^{(l)}} \]

where \alpha is the step size or learning rate and \frac{\partial J}{\partial w_{ij}^{(l)}} is the gradient.
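The update rule itself is a single subtraction per weight. A minimal sketch, assuming the gradients have already been computed, is shown below (names and values are illustrative).

    import numpy as np

    def gradient_descent_step(weights, gradients, learning_rate=0.1):
        # w := w - alpha * dJ/dw, applied element-wise to every weight
        return weights - learning_rate * gradients

    w = np.array([0.5, -0.3])
    grad = np.array([0.2, -0.1])           # partial derivatives of J w.r.t. each weight
    print(gradient_descent_step(w, grad))  # [0.48, -0.29]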

The learning rate defines how large a step is taken, or how much the weights are updated. A small learning rate causes small changes to the weights, so Gradient Descent can be slow; in other words, it takes a longer time to converge. In contrast, a large learning rate causes large updates and results in a faster convergence rate. However, it may cause Gradient Descent to overshoot the minimum and fail to converge.
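The effect of the learning rate can be seen on a simple one-dimensional cost such as J(w) = w^2, whose gradient is 2w. The sketch below (with learning-rate values chosen for illustration) shows slow convergence, fast convergence, and overshooting.

    def minimize(alpha, steps=20, w=1.0):
        # Gradient Descent on J(w) = w^2, whose gradient is dJ/dw = 2w
        for _ in range(steps):
            w = w - alpha * 2 * w
        return w

    print(minimize(alpha=0.01))   # small steps: w is still far from the minimum at 0
    print(minimize(alpha=0.45))   # larger steps: w is very close to 0
    print(minimize(alpha=1.10))   # too large: the updates overshoot and diverge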

Figure 1: A single hidden layer neural network

The gradient is calculated using the chain rule. Using the network shown in Figure 1 as an example, the calculation of the gradient is given as follows.

    \[ \frac{\partial J}{\partial w_{ij}^{(2)}} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(3)}} \cdot \frac{\partial z^{(3)}}{\partial w_{ij}^{(2)}} \]

    \[ \frac{\partial J}{\partial w_{ij}^{(1)}} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(3)}} \cdot \frac{\partial z^{(3)}}{\partial a_{i}^{(2)}} \cdot \frac{\partial a_{i}^{(2)}}{\partial z_{i}^{(2)}} \cdot \frac{\partial z_{i}^{(2)}}{\partial w_{ij}^{(1)}} \]
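Putting the chain-rule expressions together, a minimal sketch of one forward and backward pass for a network like the one in Figure 1 is shown below. It assumes sigmoid activations and the squared-error cost from above; the layer sizes and variable names (x, W1, W2, z2, a2, z3) are illustrative and not fixed by the text.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 1))            # one training example with 3 input features
    y = np.array([[1.0]])                  # its true output
    W1 = rng.normal(size=(4, 3))           # input -> hidden weights, w^(1)
    W2 = rng.normal(size=(1, 4))           # hidden -> output weights, w^(2)

    # Forward pass
    z2 = W1 @ x                            # hidden pre-activation, z^(2)
    a2 = sigmoid(z2)                       # hidden activation, a^(2)
    z3 = W2 @ a2                           # output pre-activation, z^(3)
    y_hat = sigmoid(z3)                    # network output

    # Backward pass (chain rule, matching the two equations above)
    dJ_dyhat = y_hat - y                   # dJ/dy_hat for J = 1/2 (y - y_hat)^2
    dyhat_dz3 = y_hat * (1 - y_hat)        # dy_hat/dz^(3), sigmoid derivative
    delta3 = dJ_dyhat * dyhat_dz3          # dJ/dz^(3)
    dJ_dW2 = delta3 @ a2.T                 # gradient w.r.t. w^(2)

    delta2 = (W2.T @ delta3) * a2 * (1 - a2)   # dJ/dz^(2), error propagated to the hidden layer
    dJ_dW1 = delta2 @ x.T                  # gradient w.r.t. w^(1)

    # Gradient Descent update with learning rate alpha
    alpha = 0.5
    W2 -= alpha * dJ_dW2
    W1 -= alpha * dJ_dW1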
