Artificial Neural Network is a machine learning algorithm that is modeled loosely after the human brain. A neural network is a regression or classification model, represented by a network as shown in Fig. 1. The network composes of layers of artificial neurons or nodes. Each layer’s neurons are connected to adjacent or next layer’s neurons. Each connection (or edge) is responsible to carry a signal from one neuron to another neuron i.e. from the input layer (blue) to the output layer (yellow). The layer(s) in between the input and output layers is called hidden layer. The edges have a weight that refers to the strength of the connection between two neurons.

Feed-forward or forward propagation is the process of propagating the signal from the input layer to the output layer. Let the inputs are denoted by , the propagation of the inputs to the neuron is given as follows.

where is the weights of the edges that are connected to neuron . The propagation of the signal to the following layers follow the same computation i.e. the outputs of a neuron in a hidden layer to the neuron of the next layer

Each neuron is also characterized by an activation function. The activation function represents the rate at which one neuron is activated or not (firing its output or not). Notice that the output of a neuron is just a *summation of the input values* (identity function) which essentially a linear model. A linear model is simple to solve but limited in its ability to solve complex problems. To make a neural network that is capable of learning and solving complex tasks, a non-linear activation function is used to transform the output values. The commonly used activation functions are sigmoid, tangent or rectified linear unit (relu). For example, the sigmoid function is defined as

where is the output of neuron after applying the activation function.

We can get different models of prediction (classification or regression) by varying the weights of the connections. In other words, the weights determine the outputs that will be produced by the network. Now, how do we adjust the weights such that the network produces the expected or desired outputs? To manually find and set the acceptable weights would be a laborious task especially when it involves a large network. Fortunately, we have an algorithm to automatically adjust the weights called gradient descent.

Gradient descent is an optimization algorithm that finds the weights that will minimize the error in prediction. Suppose denotes the weight vector and is the prediction error, gradient descent can be summarized as follow