Introduction to Artificial Neural Network

Artificial Neural Network, or simply Neural Network (ANN), is a machine learning technique that is loosely modeled after the human brain. It is a powerful, scalable and versatile technique that has been used to solve highly complex problems such as pattern recognition, machine translation and e-mail spam filtering. A neural network is composed of layers of artificial neurons, or nodes. The neurons in each layer are connected to the neurons in the adjacent layer, and each connection (or edge) carries a signal from one neuron to another, i.e. from the input layer towards the output layer. The layers in between the input and output layers are called hidden layers. Each connection has a weight that represents the strength of the connection between the two neurons.

Artificial Neuron (Perceptron)
An artificial neuron (or perceptron) is the fundamental unit of a neural network. Figure 1 shows a neuron that accepts three inputs, which may come from the external environment or from the outputs of other neurons. Each input is associated with a connection weight. The neuron computes the summation of the weighted inputs, then applies an activation function to that sum to produce an output. The weighted sum, z, is defined as follows.

    \[z=\left[\sum_{i=1}^{N}x_iw_i\right]+b \]

In addition to the inputs, a neuron typically has a bias unit, which always has the value of 1. We will see the function of the bias with an example later. A neuron is characterized by its activation function. The activation function determines whether a neuron is activated or not (whether it fires its output or not). Notice that so far the output of the neuron is just the weighted sum of the input values (an identity activation), which is essentially a linear model. A linear model is simple to solve but limited in its ability to solve complex problems. To build a neural network that is capable of learning and solving complex tasks, a non-linear activation function is used to transform the output value. The most commonly used activation functions are the sigmoid, the hyperbolic tangent and the rectified linear unit (ReLU).

Figure 1. An artificial neuron.

For example, applying the sigmoid function to the weighted sum is defined as

    \[a=g(z)=\frac{1}{1+e^{-z}} \]

With the bias folded into the summation, the weighted sum of the inputs can be written more compactly as follows

    \[z=\sum_{i=0}^{N}x_iw_i \]

where x_0=1 and w_0=b.
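To make this concrete, here is a minimal Python sketch of a single sigmoid neuron, assuming numpy is available; the input values, weights and bias are arbitrary numbers chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Three example inputs and their connection weights (arbitrary values)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.25  # bias

# Weighted sum of the inputs plus the bias: z = sum(x_i * w_i) + b
z = np.dot(x, w) + b

# Equivalently, fold the bias in as x_0 = 1 and w_0 = b
x0 = np.concatenate(([1.0], x))
w0 = np.concatenate(([b], w))
assert np.isclose(z, np.dot(x0, w0))

# Apply the activation function to obtain the neuron's output
a = sigmoid(z)
print(z, a)
```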

Let’s plot the output of the neuron, a = g(z), for input values in the range of -10 to 10, as shown in Figure 2.

Figure 2. The shape of sigmoid function.

As shown in Figure 2, the sigmoid function has positive values for all x. Notice that the plot looks like a step function with smooth or soft edges. That means there is a non-zero derivative (gradient) of the function for every value of x. This is important for training the neural network, which will be discussed later.

Let’s see the outputs of the sigmoid function with different weight values. We define three weight values: 0.5, 1.0 and 1.5.

Figure 3. The output of sigmoid function with three different weight values.

As shown in Figure 3, changing the weight changes the slope of the output of the sigmoid function: the slope gets steeper as the weight is increased.

Now, to see the function of the bias, we plot the output of the sigmoid function with different values of bias. The weight is set to 1.

Figure 4. The output of sigmoid function with different values of bias.

As can be seen in Figure 4, changing the value of the bias shifts the activation function forward or backward. That means the bias allows the network to model a conditional relationship such as: if (x > a) then 1 (the neuron is activated) else 0 (the neuron is not activated).
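The effects shown in Figures 3 and 4 are easy to reproduce. Below is a short sketch, assuming numpy and matplotlib are installed; the specific weight and bias values are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-10, 10, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Effect of the weight: the slope gets steeper as w increases (bias fixed at 0)
for w in (0.5, 1.0, 1.5):
    ax1.plot(x, sigmoid(w * x), label=f"w = {w}")
ax1.set_title("Varying the weight")
ax1.legend()

# Effect of the bias: the curve shifts left or right (weight fixed at 1)
for b in (-5, 0, 5):
    ax2.plot(x, sigmoid(x + b), label=f"b = {b}")
ax2.set_title("Varying the bias")
ax2.legend()

plt.show()
```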

Activation Functions
Different types of activation functions can be used to introduce non-linear properties to a neural network. The most commonly used are the sigmoid (or logistic) function, the hyperbolic tangent (tanh), and the rectified linear unit (ReLU).

The sigmoid function maps the input to a value in the range (0, 1). It has a nice property in that it tends to push the activations towards 0 or 1 due to the steep slope in the middle of the function. However, the gradient at either tail (near 0 or 1) is very small or almost zero. During backpropagation, the gradients are backpropagated (multiplied with the gradients of the neurons) through the hidden layers to adjust the weights. If the gradient of a neuron is very small, the backpropagation signal will not get through the neuron and, as a result, the network will barely learn.
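This saturation effect is easy to check numerically: the derivative of the sigmoid is \sigma(z)(1-\sigma(z)), which is largest at z = 0 and practically zero in the tails. A small sketch (assuming numpy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  gradient = {sigmoid_grad(z):.6f}")

# z =   0.0  gradient = 0.250000
# z =   2.0  gradient = 0.104994
# z =   5.0  gradient = 0.006648
# z =  10.0  gradient = 0.000045
```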

Tanh is similar to the sigmoid function. Instead of mapping the input to a value in the range (0, 1), its output is in the range (-1, 1). It is closely related to the sigmoid because the following holds.

    \[ \tanh(z)=2\sigma(2z)-1 \]

where \sigma(z) is the sigmoid function. Although they have similar properties, tanh is often preferred because it tends to converge faster than the sigmoid [1].
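The relationship can be verified numerically with a quick sketch (assuming numpy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# tanh(z) and 2*sigmoid(2z) - 1 agree up to floating-point error
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)
```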

ReLU is a non-linear activation function that is defined as follows.

    \[g(z)=\max(0,z) \]

In other words, ReLU is linear for all positive values and zero for all negative values, as shown in Figure 5. Compared to sigmoid and tanh, ReLU is simpler to compute and therefore takes less time to train and execute. Unlike sigmoid and tanh, ReLU does not saturate for positive inputs and so does not suffer from the vanishing gradient problem there. A variant of ReLU called Leaky ReLU has been proposed to fix the dying ReLU problem [2]; a dying ReLU refers to a neuron that never activates because its gradient is always zero. Another variant of ReLU is the exponential linear unit (ELU). It has been shown that ELU can speed up training and achieve higher classification accuracy [3].

Figure 5. Rectified linear unit activation function.
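Each of the activation functions discussed above is only a few lines of code. Below is a minimal sketch (assuming numpy); the 0.01 leak factor for Leaky ReLU and alpha = 1.0 for ELU are common defaults chosen here only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    # Linear for positive values, zero otherwise
    return np.maximum(0.0, z)

def leaky_relu(z, leak=0.01):
    # Small non-zero slope for negative values, mitigating the dying ReLU problem
    return np.where(z > 0, z, leak * z)

def elu(z, alpha=1.0):
    # Exponential linear unit: smooth, saturates at -alpha for large negative z
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # -> 0, 0, 0, 0.5, 2
print(leaky_relu(z))  # -> -0.02, -0.005, 0, 0.5, 2
```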

Artificial Neural Network Model
An ANN is an interconnection of artificial neurons organized in layers. The layers are called the input, hidden and output layers. Feed-forward or forward propagation is the process of propagating the signal from the input layer to the output layer through the hidden layers. The hidden layers are responsible for learning non-linear combinations (hidden representations or features) of the input data. Each connection between a neuron in one layer and a neuron in the adjacent layer has an associated weight. Figure 6 illustrates an ANN consisting of an input layer, a single hidden layer and an output layer.

Figure 6. A single hidden layer neural network.

Let L be the number of layers in the neural network. In this example, L=3, where l=1 is the input layer, l=2 is the hidden layer and l=3 is the output layer. The weights, including the biases, are denoted by w, where w_{ij}^{(l)} denotes the weight associated with the connection between neuron j in layer l and neuron i in layer l+1. The weight w_{i0}^{(l)} is the bias associated with neuron i in layer l+1. We denote the j-th input as x_j. The propagation of the inputs x_j to the neurons of the next layer is given as follows.

    \[ z_1^{(2)}=w_{10}^{(1)} + w_{11}^{(1)}x_1 + w_{12}^{(1)}x_2 \]

    \[ z_2^{(2)}=w_{20}^{(1)} + w_{21}^{(1)}x_1 + w_{22}^{(1)}x_2 \]

    \[ a_1^{(2)}= g(z_1^{(2)})\]

    \[ a_2^{(2)}= g(z_2^{(2)})\]

where g(\cdot) is the activation function. The propagation of the signals to the output layer follows the same computation.

    \[ z^{(3)}=w_{0}^{(2)} + w_{1}^{(2)}a_1^{(2)} + w_{2}^{(2)}a_2^{(2)} \]

    \[ \hat{y} = a^{(3)} = g(z^{(3)}) \]
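These equations translate directly into code. Here is a minimal sketch of the forward pass for this 2-2-1 network, assuming numpy; the input and weight values are arbitrary numbers chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Example inputs (arbitrary values)
x1, x2 = 0.5, -1.0

# Layer 1 -> 2 weights: row i holds [bias, w_i1, w_i2] for hidden neuron i
w1 = np.array([[0.1, 0.4, -0.2],
               [-0.3, 0.6, 0.8]])

# Layer 2 -> 3 weights: [bias, w_1, w_2]
w2 = np.array([0.2, -0.5, 0.7])

# Hidden layer: z_i^(2) = w_i0 + w_i1 * x1 + w_i2 * x2, then a_i^(2) = g(z_i^(2))
z2 = w1 @ np.array([1.0, x1, x2])
a2 = sigmoid(z2)

# Output layer: z^(3) = w_0 + w_1 * a_1^(2) + w_2 * a_2^(2)
z3 = w2 @ np.concatenate(([1.0], a2))
y_hat = sigmoid(z3)
print(a2, y_hat)
```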

The formulation of forward propagation can be generalized as follows.

    \[ z_i^{(l+1)} = \sum_{j=0}^{s_l} w_{ij}^{(l)} a_j^{(l)} \]

    \[ a_i^{(l+1)} = g(z_i^{(l+1)}) \]

where s_l is the number of neurons in layer l and a_0^{(l)}=1 is the bias unit.

A neural network needs to be trained to solve a prediction problem. Training (or learning) is the process of finding the weight and bias values that produce the desired output at the output layer when a certain input is given to the network. The algorithm used to train the network is known as the backpropagation algorithm.
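Before turning to training, here is a minimal sketch of the generalized forward propagation, assuming numpy; the weight matrices store the bias in column 0 (so that a_0^{(l)}=1), and the layer sizes and random weights below are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, weights, g=sigmoid):
    """Propagate input x through the network.

    weights is a list of matrices; weights[l] has shape (s_{l+1}, s_l + 1),
    with the bias of each neuron stored in column 0.
    """
    a = np.asarray(x, dtype=float)
    for W in weights:
        a_with_bias = np.concatenate(([1.0], a))  # a_0^(l) = 1
        z = W @ a_with_bias                       # z^(l+1) = W^(l) a^(l)
        a = g(z)                                  # a^(l+1) = g(z^(l+1))
    return a

# Example: 2 inputs -> 2 hidden neurons -> 1 output, with random weights
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 3))]
y_hat = forward_propagate([0.5, -1.0], weights)
print(y_hat)
```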
