###### Introduction

Perceptron seeks to find a hyperplane that can linearly separate the data points into their classes. But perceptron does not guarantee the hyperplane is the optimal solution. Figure 1 illustrates three possible solutions among an almost infinite number of possible hyperplanes. Although the hyperplanes perfectly separate the data points, misclassification could easily occur. For example, it is very likely for test points belonging to NEGATIVE class to be above hyperplane . Similarly, test points belonging to POSITIVE class could appear below hyperplane . As for , although the hyperplane appears to be a “better” hyperplane, the decision boundary is closer to POSITIVE class. Thus, the misclassification rate would increase especially for POSITIVE class.

Support vector machine (SVM) finds an optimal hyperplane that best separates the data points. This optimal solution is the hyperplane that is far away from the data points of both classes. To this end, SVM introduces the notion of margin which is illustrated in Figure 2. The margin is defined by the distance from the parallel dotted lines that are crossing the closest data points of both classes. The hyperplane is in the middle indicated by the blue line. The optimal hyperplane is the one that maximizes the margin of the training instances. The idea is if the margin is maximized, the hyperplane will be far away from the data points of both sides and reduce the chance of misclassification.

How do we find the margin? Given hyperplane that separates the two classes, we know that is equidistant from hyperplane and . is defined by

where is the feature vector and and are the parameters of the hyperplane. For each instance , we know that

for all belong to class

for all belong to class

These are the constraints that need to be respected in order to select the optimal hyperplane. The first constraint states that all instances belonging to class must be above and the second constraint states that all instances belonging to class must be below . By not violating these constraints, we are selecting two hyperplanes (that define the margin) with no data points in between them. Let’s simplify the constraints. We multiply the first constraint on both sides with .

We know that for all belonging to class , is . Thus, the constraint becomes

We do the same for the second constraint.

We know that for all belonging to class , is . Thus, the constraint becomes

Therefore, we can combine the two constraints as follows

for all instances .

is the margin as shown in Figure 3. We know that is a vector that is perpendicular to the hyperplanes. If the unit vector of is multiplied with , we get a vector , where it has the same direction as and with a distance equal to . Let the unit vector be , thus .

We know that and are

Let is a data point on . If we add with , the resultant is a point on .

We can write as follows:

We know that . We can rewrite the equation and solve for .

Therefore, margin is maximized when is minimized. For mathematical convenience, we change from minimizing to minimizing .

subject to for

You may skip the derivation of the solution and go to parameter solving.

We can solve the constrained optimization problem using quadratic programming. Before that, we restate the optimization problem by applying Lagrange multiplier. Lagrange multiplier states that to find the maximum or minimum of a function subjected to an equality constraint , form the following expression , set the gradient of to zero and find the Lagrange multiplier (parameter) .

Substitute and .

for all .

.

Differentiate with respect to and , and set it to zero

,

,

Substitute and into and simplify it.

Thus, the optimization problem becomes as follows.

subject to and .

As we can see above, the objective function now depends only on the Lagrange multipliers which is easier to be solved. Upon solving the optimization problem, most Lagrange multipliers () will be zero, and only a few of them are non-zero. The data points with are called the **support vectors**. These support vectors lie on the edge of the margin which define the margin and the optimal hyperplane.

From the above, we see the solution for has the form

We can use any data point on the edge of the margin to solve . However, typically we use an average of all the solutions for numerical stability.

where is the set of support vector indices. Given a test point , the predicted value is

###### Soft Margin

Real-world data are typically noisy and not linearly separable. When the data are not separable, the algorithm above will not be able to produce the solution to the problem. To handle non-separable data, we relax the constraints by allowing data points within the margin. Consider the margin in Figure 4. Three data points are allowed to be within the margin. These “errors” need to be quantified as indicated by and included in the constraints. The variable is known as **slack variable**. represents the amount of which the data points on the wrong side of the margin. If , the data point is on the right side. If , the data point is in the margin. If , the data point is on the wrong side of the hyperplane.

The combined constraint is now defined as follows:

where and

The equation (constraint) implies that we define an error if the prediction is on the wrong side of margin boundary. This is called the **hinge loss**. Given the prediction , the loss function is defined as

By introducing the slack variables in the optimization problem, it is still possible to satisfy the constraint, albeit with some data points within or on the wrong side of the hyperplane. We add a term to the objective function which represents the total amount of which the data points on the wrong side of their margin.

subject to and for

where is the regularization parameter to control the trade-off between maximizing the margin and minimizing the error. Small gives less importance to the error term, hence more mistakes are allowed, resulting in large margin. Conversely, large allows less mistakes resulting in small margin. You may skip the derivation of the solution and go to parameter solving.

Using Lagrange multiplier to formulate the optimization problem with respect to the Lagrange parameters.

where is the Lagrange multiplier for constraint .

Differentiate with respect to , and , and set it to zero.

,

,

,

Since , this implies .

Substitute the derivatives into and simplify it.

subject to and .

Upon solving the optimization problem using quadratic programming, data points that lie on the correct side of the hyperplane, their corresponding , and the others have their . Of these data points, the ones that lie on the edge of the margin have their , and the rest lie within the margin with .

is computed as follows:

As before, we can use any data point on the edge of the margin to solve . However, typically we use an average of all the solutions for numerical stability.

where is the set of support vector indices. Given a test point , the predicted value is

Numerical Example

Consider the following dataset. Suppose , a soft margin SVM is built by solving the optimization problem using quadratic programming. The lagrange multipliers are given as follows. Calculate parameter and .

where

Given a test point , predict the test point using the SVM model.

###### Non-linear SVM with Kernel Trick

The algorithm we described above finds the linear model that divides the feature space. However, some real-world data is non-linear and a linear model will not be able to effectively classify the data as shown in Figure 5. As shown in the figure, no matter how we try to fit the linear model, the decision boundary will give a bad classification result. One way to deal with this kind of data is to extend the feature dimension by performing data transformation using basis functions such as polynomial and spline. An example of non-linear transformation is given in Figure 5(bottom) whereby the polynomial basis function is used to map the data points from 2-dimension to 3-dimension. After the transformation, we can easily fit a linear model to separate the data points.

We select a basis function, to transform each instance from -dimension to -dimension where is generally much larger than . Then, the transformed dataset is used to fit the linear model

The optimization problem of finding the optimal hyperplane has the same form except the constraints are defined in the new feature space.

subject to and for

Reformulate the problem using Lagrange multiplier.

subject to and .

Although the mapping of the features to a high dimensional feature space allows the data to be linearly separable, computing the transformation is expensive especially when the dimension is very high. This is because we have to apply the transformation on each sample followed by the inner product. The idea of the **kernel trick** is that the inner product between the transformed data is not needed. Instead, we require only the kernel function that allows us to operate in the high-dimensional space without computing the data transformation. Therefore, we just need to apply the kernel function, directly in the original feature space to achieve the same effect.

Therefore, is

subject to and . Given the set of support vector indices , can be solved

We use an average of all the solutions for numerical stability. Given a test point , the predicted value is calculate as follows:

Kernels

We discuss some of the commonly used kernel functions in the literature. Polynomial kernel function is defined as

where is degree parameter. Consider a dataset with two input features and and a polynomial kernel with a degree of 2.

We can see from the above that the dimension of the dataset is extended from 2 to 6.

Radial basis function (RBF) kernel function is similar to the pdf of the normal distribution. The kernel function is defined as

where is the euclidean distance between the two data points and defines the spread of the kernel or the decision region of the support vectors. In other words, it determines how far the influence of a support vector in the prediction. A small value of means large decision region area while a large value of means small decision region area. Figure 6 shows the decision boundaries of non-linear SVM with polynomial and RBF kernels. The effect of is illustrated in Figure 7.

Numerical Example

Consider the following dataset. Suppose , a SVM with RBF kernel is built by solving the optimization problem using quadratic programming. The lagrange multipliers are given as follows.

All data points are support vectors since all . Thus, support vector indices . Thus, can be solved as follows:

, , , , ,

Given a test point , predict the test point using the SVM model.

###### References

[1] J. Nocedal and S. J. Wright, *Numerical optimization*. Springer, 1999.

[2] C. Cortes and V. Vapnik, “Support-vector networks,” *Mach Learn*, vol. 20, no. 3, pp. 273–297, Sep. 1995, doi: 10.1007/BF00994018.

[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in *Proceedings of the fifth annual workshop on Computational learning theory*, New York, NY, USA, Jul. 1992, pp. 144–152. doi: 10.1145/130385.130401.