# Introduction to Artificial Neural Network

An Artificial Neural Network, or simply Neural Network (ANN), is a machine learning technique modeled loosely after the human brain. It is a powerful, scalable and versatile technique that has been used to solve highly complex problems such as pattern recognition, machine translation and e-mail spam filtering. A neural network is composed of layers of artificial neurons, or nodes. Each layer's neurons are connected to the neurons of the adjacent (next) layer. Each connection (or edge) is responsible for carrying a signal from one neuron to another, i.e. from the input layer to the output layer. The layer(s) in between the input and output layers are called hidden layers. Each connection has a weight that refers to the strength of the connection between two neurons.

## Artificial Neuron (Perceptron)
An artificial neuron (or perceptron) is the fundamental unit of a neural network. Figure 1 shows a neuron that accepts three inputs, which may come from the external environment or from the outputs of other neurons. Each input is associated with a connection weight. The neuron computes the summation of the weighted inputs, then applies an activation function to that sum to produce an output. The output can be defined as follows.

$$y = f\left(\sum_{i=1}^{3} w_i x_i\right)$$

where $x_i$ are the inputs, $w_i$ are the connection weights and $f$ is the activation function.

In addition to the inputs, a neuron typically has a bias unit, which always has the value of 1. We will see the function of the bias with an example later. A neuron is characterized by its activation function. The activation function determines the rate at which a neuron is activated or not (firing its output or not). Notice that if the activation function is simply the identity function, the output of the neuron is just the summation of the weighted input values, which is essentially a linear model. A linear model is simple to solve but limited in its ability to solve complex problems. To make a neural network capable of learning and solving complex tasks, a non-linear activation function is used to transform the output value. The commonly used activation functions are the sigmoid, the hyperbolic tangent and the rectified linear unit (ReLU).
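As a concrete sketch of the neuron described above (the function and variable names are illustrative, not from the text), a single neuron with three inputs, a bias and a sigmoid activation can be written as:

```python
import math

def sigmoid(z):
    # Sigmoid activation: squashes any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias unit (whose input is fixed at 1)
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Non-linear activation applied to the weighted sum
    return sigmoid(z)

# Three example inputs with their connection weights
y = neuron([0.5, -1.0, 2.0], weights=[0.1, 0.4, 0.2], bias=0.0)
```

With all-zero inputs and zero bias the weighted sum is 0, so the sigmoid output is exactly 0.5, the midpoint of its range.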

For example, applying a sigmoid function to the output is defined as

$$y = \sigma(z) = \frac{1}{1 + e^{-z}}$$

The summation of the weighted inputs can be defined as follows

$$z = \sum_{i=1}^{n} w_i x_i + b$$

where $w_i$ is the weight of the $i$-th input $x_i$ and $b$ is the bias.

Let’s plot the output of the neuron, $y = \sigma(x)$, for input values $x$ in the range of $-10$ to $10$, as can be seen in Figure 2.

As shown in Figure 2, the sigmoid function has positive values for all $x$. Notice that the plot looks like a step function with smooth or soft edges. That means there is a non-zero derivative (gradient) of the function for every value of $x$. This is important for training the neural network, which will be discussed later.

Let’s see the outputs of the sigmoid function with different weight values. We define three weight values: 0.5, 1.0 and 1.5.

As shown in Figure 3, changing the weights changes the slope of the output of the sigmoid function: the slope gets steeper as the weight increases.
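The effect of the weight on the slope can be checked numerically. The sketch below (names are illustrative) approximates the slope of $\sigma(wx)$ at $x = 0$ with a finite difference; analytically that slope is $w/4$, so larger weights give steeper curves:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def slope_at_zero(w, h=1e-5):
    # Central finite difference of sigmoid(w*x) at x = 0;
    # the exact value is w * sigma'(0) = w / 4
    return (sigmoid(w * h) - sigmoid(-w * h)) / (2 * h)

# The three weight values used for Figure 3
slopes = {w: slope_at_zero(w) for w in (0.5, 1.0, 1.5)}
```

The slopes come out strictly increasing in the weight, matching what Figure 3 shows.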

Now, to see the function of the bias, we plot the output of the sigmoid function with different values of the bias. The weight equals 1.

As can be seen in Figure 4, changing the value of the bias shifts the activation function left or right along the input axis. That means the bias allows the network to model a conditional relationship such as: if $x > a$ then 1 (the neuron is activated), else 0 (the neuron is not activated).
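A small numerical sketch of that shifting behavior (names are illustrative): with weight 1 the neuron outputs $\sigma(x + b)$, so the point where the output crosses 0.5 moves to $x = -b$, which is exactly the "if $x > a$" threshold above with $a = -b$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, b):
    # Weight fixed at 1, as in Figure 4; only the bias b varies
    return sigmoid(1.0 * x + b)

# With b = -5 the neuron only "activates" (output > 0.5) once x > 5
left = neuron(4.0, -5.0)
mid = neuron(5.0, -5.0)
right = neuron(6.0, -5.0)
```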

## Activation Functions
Different types of activation functions can be used to introduce non-linear properties into a neural network. Various activation functions have been proposed, but the most commonly used are the sigmoid (or logistic) function, the hyperbolic tangent (tanh) and the rectified linear unit (ReLU).

The sigmoid function maps the input to a value in the range $(0, 1)$. It has a nice property where it tends to push activations towards 0 or 1 due to the steep slope in the middle of the function. However, the gradient at either tail (near 0 or 1) is very small or almost zero. During backpropagation, the gradients are backpropagated (multiplied with the gradients of the neurons) through the hidden layers to adjust the weights. If the gradient of a neuron is very small, the backpropagation signal will not get through the neuron and, as a result, the network will barely learn.
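The near-zero tail gradient can be seen directly from the sigmoid's derivative, $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, sketched below (names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z)), maximal at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

grad_center = sigmoid_grad(0.0)   # 0.25, the steep middle of the curve
grad_tail = sigmoid_grad(10.0)    # tiny: almost no signal to backpropagate
```

At $z = 10$ the gradient is on the order of $10^{-5}$, several thousand times smaller than at the center, which is the vanishing-gradient effect described above.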

Tanh is similar to the sigmoid function. Instead of mapping the input to a value in the range of $(0, 1)$, the output of the function is in the range of $(-1, 1)$. It is closely related to the sigmoid because the following holds.

$$\tanh(z) = 2\sigma(2z) - 1$$

where $\sigma$ is the sigmoid function. Although they have similar properties, tanh is preferred because it often converges faster than the sigmoid [1].
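The identity $\tanh(z) = 2\sigma(2z) - 1$ relating the two functions can be verified numerically (a small sketch, with illustrative names):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(z):
    # tanh expressed in terms of the sigmoid: a rescaled, shifted sigma(2z)
    return 2.0 * sigmoid(2.0 * z) - 1.0

# Should agree with math.tanh for any z, to floating-point precision
samples = [-3.0, -0.5, 0.0, 1.7]
```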

ReLU is a non-linear activation function that is defined as follows.

$$\mathrm{ReLU}(z) = \max(0, z)$$

In other words, ReLU is linear for all positive values and zero for all negative values, as shown in Figure 5. Compared to sigmoid and tanh, ReLU is simpler to compute and therefore takes less time to train and execute. Unlike sigmoid and tanh, ReLU does not saturate for positive inputs, so it does not suffer from the vanishing gradient problem there. A variant of ReLU called Leaky ReLU was proposed to fix dying ReLU [2]. Dying ReLU refers to a neuron that never activates because its gradient is zero for all negative inputs. Another variant of ReLU is the exponential linear unit (ELU). It has been shown that ELU can speed up training and achieve higher classification accuracy [3].
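Both variants are one-liners. In this sketch (names and the default slope are illustrative), Leaky ReLU keeps a small negative slope so the gradient never becomes exactly zero, which is the fix for dying ReLU:

```python
def relu(z):
    # Linear for positive inputs, zero otherwise
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # A small slope alpha for z < 0 keeps a non-zero gradient,
    # avoiding the "dying ReLU" problem
    return z if z > 0 else alpha * z
```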

## Artificial Neural Network Model
An ANN is an interconnection of artificial neurons organized in layers: the input, hidden and output layers. Feed-forward, or forward propagation, is the process of propagating the signal from the input layer to the output layer through the hidden layers. The hidden layers are responsible for learning non-linear combinations (hidden representations or features) of the input data. Each connection between a neuron of one layer and a neuron of the adjacent layer has an associated weight. Figure 6 illustrates an ANN consisting of input and output layers with two hidden layers.

Let $L$ be the number of layers in a neural network. Hence, in this example $L = 4$, where layer $1$ is the input layer and layers $2$ and $3$ are the hidden layers. The weights including the bias are denoted by $W^{(l)}$, where $w_{ij}^{(l)}$ denotes the weight associated with the connection between neuron $j$ in layer $l$ and neuron $i$ in layer $l+1$. The $b_i^{(l)}$ is the bias associated with neuron $i$ in layer $l+1$. We denote the $i$-th input as $x_i$. The propagation of the input to the neurons of the next layer is given as follows.

$$a_i^{(2)} = f\left(\sum_{j} w_{ij}^{(1)} x_j + b_i^{(1)}\right)$$

where $f$ is the activation function. The propagation of the signals to the output layer follows the same computation.

The formulation of forward propagation can be generalized as follows.

$$a_i^{(l+1)} = f\left(\sum_{j=1}^{n_l} w_{ij}^{(l)} a_j^{(l)} + b_i^{(l)}\right)$$

where $n_l$ is the number of neurons in layer $l$ and $a_j^{(1)} = x_j$.

A neural network needs to be trained to solve a prediction problem. Training (or learning) is the process of finding the weight and bias values that produce the desired output at the output layer when a certain input is given to the network. The algorithm used to train the network is known as the backpropagation algorithm.
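Forward propagation through a network like the one in Figure 6 can be sketched in a few lines. This is an illustrative implementation, not the text's own code: the layer sizes and random weights are arbitrary, and each layer computes $a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})$ with a sigmoid activation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Propagate the signal layer by layer: a <- f(W a + b)
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# A 3-4-4-2 network: input layer, two hidden layers, output layer (L = 4)
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]
# W^(l) has shape (neurons in layer l+1, neurons in layer l)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

output = forward(np.array([0.5, -1.0, 2.0]), weights, biases)
```

Each weight matrix maps the activations of one layer to the pre-activations of the next, so the output has one value per output-layer neuron, each squashed into $(0, 1)$ by the sigmoid.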