Linear Regression

Introduction

A regression problem is one where the response variable is continuous, y \in \mathbb{R}, such as salary, height, or price. Consider a regression problem of predicting the price of a house given its built-up area. The data are shown in Figure 1. The built-up area is the input feature, x, and the price is the response variable, y, which takes values ranging from RM100,000 to RM1,300,000.

Figure 1. House prices against their built-up area in square feet.

Linear regression is one of the most widely used algorithms for solving regression problems. The algorithm assumes the response has a linear relationship with the input feature. Thus, the linear regression model is expressed as 

\hat{f}(\mathbf{x}) = w_0 + \sum_{j=1}^d w_jx_j

where w_0 is the intercept and \sum_{j=1}^d w_jx_j is the inner product between the model’s parameters (or weights) and the input features.

Suppose the input is one-dimensional (there is only one feature, as above). Then the linear regression model is represented as follows:

\hat{f}(\mathbf{x}) = w_0 + w_1x_1

where w_0 and w_1 are the intercept and the slope, respectively. This is known as univariate (simple) linear regression. A univariate linear regression model is fitted to the data points in Figure 1; the fitted line is indicated by the red line.

Univariate linear regression can be generalized to multivariate linear regression, where the input is a d-dimensional data point. In the house price problem, for example, the inputs can include, in addition to the built-up area, the number of rooms, the number of floors, the location, etc. A common notational trick is to introduce a constant input, x_0, whose value is always 1. The intercept term can then be combined with the other terms as follows:

\hat{f}(\mathbf{x}) = \sum_{j=0}^d w_jx_j=\mathbf{w}^T\mathbf{x}
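With this trick, predicting for a whole set of instances is a single matrix-vector product. Below is a minimal sketch (assuming NumPy; the feature values and weights are made up for illustration):

```python
import numpy as np

def predict(X, w):
    """Linear regression prediction f_hat(x) = w^T x for each row of X.

    X is assumed to already contain the constant column x_0 = 1,
    so the intercept w_0 is handled by the same inner product.
    """
    return X @ w

# Hypothetical example: two houses described by [x_0 = 1, built-up area in sq ft].
X = np.array([[1.0, 1200.0],
              [1.0, 2000.0]])
w = np.array([50_000.0, 300.0])  # assumed weights: intercept and price per sq ft
print(predict(X, w))             # [410000. 650000.]
```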

Model Parameters

The weights define the position and direction of the fitted line. Let’s review the roles of the intercept and slope in univariate linear regression. Figure 2 illustrates three linear models with different parameters.

For linear model 1, on the left of the figure, the weights are w_0=2 and w_1=0. Since the slope is 0 and the intercept is 2, the linear model is a horizontal line crossing the y-axis at 2.

For linear model 2, in the middle of the figure, the weights are w_0=0 and w_1=1. The slope of the model is 1 and the intercept is 0, hence the line has an upward trend and passes through the origin.

Finally, for linear model 3, on the right of the figure, the weights are w_0=1 and w_1=1. The line has the same slope as linear model 2, but it crosses the y-axis at 1 because its intercept is 1.

Figure 2. The parameters (intercept and slope) define the position and direction of the fitted lines.

Now, how do we find the parameters that give us an optimal linear regression model? Essentially, we want to choose weights such that the model’s predictions \hat{f}(\mathbf{x}) are close to the responses, y. This is formulated as choosing the set of weights \mathbf{w} that minimizes the squared loss (the squared difference between the response and the predicted value) summed over all the training instances. We ignore the fraction \frac{1}{N} here to simplify the explanation; since it is a constant, dropping it does not affect the parameter estimates.

J(\mathbf{w})= \sum_{i=1}^{N}(y^i-\hat{f}(\mathbf{x}^i))^2
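As a small sketch of this objective (assuming NumPy and a design matrix whose first column is the constant input x_0 = 1; the toy numbers are made up):

```python
import numpy as np

def sum_squared_loss(X, y, w):
    """J(w): the sum of squared errors over all N training instances."""
    residuals = y - X @ w                 # y^i - f_hat(x^i) for every instance
    return float(np.sum(residuals ** 2))

# Hypothetical toy data: three instances with a constant column and one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])
print(sum_squared_loss(X, y, np.array([0.0, 1.0])))  # 1.0
```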

Least Squares

A common method for estimating the weights that minimize the loss function is least squares. The method is illustrated in Figure 3. The data points are indicated by the blue dots and the red line is the linear regression model. The predicted value for each training instance is shown as a blue cross. The blue line from a data point to its blue cross is the prediction error, i.e. the difference between the response and the predicted value. The aim is to find the set of weights that minimizes the sum of squared errors.

Figure 3. Least squares finds the weights that minimize the sum of squared errors (the lengths of the blue lines).

Let’s express the loss function using matrix notation, where \mathbf{x} now denotes the N \times (d+1) matrix whose rows are the training inputs (including the constant x_0=1) and \mathbf{y} denotes the vector of responses.

J(\mathbf{w})= \sum_{i=1}^{N}(y^i-\hat{f}(\mathbf{x}^i))^2= ( \mathbf{y} - \mathbf{x}\mathbf{w} )^T( \mathbf{y} - \mathbf{x}\mathbf{w} )

Differentiate J with respect to \mathbf{w}.

\frac{\partial J}{\partial \mathbf{w}} = -2 \mathbf{x}^T (\mathbf{y} - \mathbf{x}\mathbf{w})

Set the derivative to zero to solve for \mathbf{w}.

-2 \mathbf{x}^T (\mathbf{y} - \mathbf{x}\mathbf{w}) = 0

-\mathbf{x}^T\mathbf{y} + \mathbf{x}^T\mathbf{x}\mathbf{w} = 0

\mathbf{x}^T\mathbf{x}\mathbf{w} = \mathbf{x}^T\mathbf{y}

The optimal weights of the linear regression model are given by

\mathbf{w} = (\mathbf{x}^T\mathbf{x})^{-1} \mathbf{x}^T\mathbf{y}

It might happen that least squares fails to provide the optimal weights. This occurs when the input features are (perfectly) correlated, e.g. x_1 = 2x_2. Then \mathbf{x}^T\mathbf{x} is singular (its determinant is zero) and its inverse does not exist.
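In practice, a common workaround in that situation (a sketch, not part of the derivation above) is to rely on np.linalg.lstsq or the pseudoinverse, which still return a least-squares solution when \mathbf{x}^T\mathbf{x} cannot be inverted:

```python
import numpy as np

# Perfectly correlated features: the third column is exactly twice the second,
# so X.T @ X is singular and np.linalg.inv(X.T @ X) would fail.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

# lstsq returns a minimum-norm least-squares solution even when the
# normal equations have no unique solution.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w, rank)   # rank < number of columns signals the collinearity
```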

Numerical Example

Given the following dataset, find \mathbf{w} using least squares.

\mathbf{x}=\begin{bmatrix} 1 & 3.1 \\ 1 & 2.9 \\1 & 3.5 \\1 & 4.3\\1 & 3.6 \end{bmatrix}

\mathbf{y}=\begin{bmatrix} 2.0 \\2.6\\2.2\\3.2\\3.0 \end{bmatrix}

Calculate \mathbf{w}=(\mathbf{x}^T\mathbf{x})^{-1} \mathbf{x}^T\mathbf{y}

(\mathbf{x}^T\mathbf{x})^{-1}=\begin{bmatrix} 10.5685 & -2.9795 \\ -2.9795 & 0.8562 \end{bmatrix}

\mathbf{x}^T\mathbf{y}=\begin{bmatrix} 13\\46 \end{bmatrix}

\mathbf{w}=\begin{bmatrix} 0.336 \\0.651 \end{bmatrix}

(Full precision is carried through the calculation; multiplying the rounded inverse shown above by \mathbf{x}^T\mathbf{y} gives approximately 0.334 and 0.652.)
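The hand calculation can be checked with a few lines of NumPy (a sketch; in practice np.linalg.lstsq is the numerically safer choice):

```python
import numpy as np

# Dataset from the example above (the first column is the constant x_0 = 1).
X = np.array([[1, 3.1], [1, 2.9], [1, 3.5], [1, 4.3], [1, 3.6]])
y = np.array([2.0, 2.6, 2.2, 3.2, 3.0])

# Normal equation: w = (x^T x)^{-1} x^T y
w = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(np.round(w, 3))   # [0.336 0.651], matching the result above
```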

Gradient Descent

The weights of the linear regression model can also be estimated using an optimization algorithm known as gradient descent. Gradient descent is an algorithm for finding a local minimum of a differentiable function. Note that gradient descent is a generic algorithm: it is used not only in linear regression but in many other places in machine learning. In the context of linear regression, the squared loss is a differentiable function, so gradient descent can be used to find the weights that minimize the loss function. Unlike least squares, gradient descent solves the optimization problem iteratively; the weights are repeatedly updated based on the derivative of the loss function. At iteration k, the weights are updated as follows:

\mathbf{w}_{k} = \mathbf{w}_{k-1} - \alpha \mathbf{g}_{k}

where \alpha is the learning rate, which controls the size of the update applied to the weights, and \mathbf{g}_{k} is the derivative of the loss function with respect to each weight, w_j, which can be written as follows:

J(\mathbf{w})=\frac{1}{2N} \sum_{i=1}^{N}(y^i-\hat{f}(\mathbf{x}^i))^2

g_{k,j} = \frac{\partial J}{\partial w_j} = - \frac{1}{N} \sum_{i=1}^N x_j^i (y^i - \hat{f}(\mathbf{x}^i))

The factor \frac{1}{2} is a mathematical convenience; it cancels the 2 that appears when the loss function is differentiated.
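Putting the update rule and the gradient together, a minimal batch gradient descent sketch might look as follows (assuming NumPy and a design matrix whose first column is the constant x_0 = 1; the zero initialization and iteration count are arbitrary choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Estimate linear regression weights with batch gradient descent.

    Uses J(w) = 1/(2N) * sum_i (y^i - w^T x^i)^2, so the gradient component
    for weight j is -1/N * sum_i x_j^i * (y^i - w^T x^i).
    """
    N, d = X.shape
    w = np.zeros(d)                  # could also be initialized randomly
    for _ in range(n_iters):
        errors = y - X @ w           # y^i - f_hat(x^i) for all instances
        g = -(X.T @ errors) / N      # gradient vector g_k
        w = w - alpha * g            # update: w_k = w_{k-1} - alpha * g_k
    return w
```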

Numerical Example

Given the following dataset, find \mathbf{w} using gradient descent.

\mathbf{x}=\begin{bmatrix} 1 & 3.1 \\ 1 & 2.9 \\1 & 3.5 \\1 & 4.3\\1 & 3.6 \end{bmatrix}

\mathbf{y}=\begin{bmatrix} 2.0 \\2.6\\2.2\\3.2\\3.0 \end{bmatrix}

Randomly initialize \mathbf{w}.

\mathbf{w}=\begin{bmatrix} 0.45\\0.75\end{bmatrix}

Calculate \hat{\mathbf{y}}

\hat{\mathbf{y}}=\begin{bmatrix} 2.775\\2.625\\3.075\\3.675\\3.150\end{bmatrix}

Calculate the loss using mean squared error.

J_{mse} = \frac{1}{2N}\sum_{i=1}^N (y^i-\hat{y}^i)^2 = 0.1615

Calculate the gradient with respect to \mathbf{w}

\begin{aligned}\frac{\partial J}{\partial w_0} &= - \frac{1}{N}\sum_{i=1}^N x_0^i (y^i - \hat{y}^i) \\ &=\frac{0.775+0.025+0.875+0.475+0.150}{5} \\&=0.46 \end{aligned}

\begin{aligned}\frac{\partial J}{\partial w_1} &= - \frac{1}{N}\sum_{i=1}^N x_1^i (y^i - \hat{y}^i) \\ &=\frac{2.4025+0.0725+3.0625+2.0425+0.5400}{5}\\&=1.624\end{aligned}

Suppose \alpha=0.01. Update \mathbf{w}.

\begin{aligned}w_0 &:= w_0 - \alpha \frac{\partial J}{\partial w_0}\\ &:=0.45-0.01(0.46) \\ &:=0.445 \end{aligned}

\begin{aligned}w_1 &:= w_1 - \alpha \frac{\partial J}{\partial w_1} \\ &:=0.75 - 0.01(1.624)\\&:=0.734 \end{aligned}

Using the updated \mathbf{w}, calculate the new predictions and loss.

J_{mse} = \frac{1}{2N}\sum_{i=1}^N (y^i-\hat{y}^i)^2 = 0.135

Notice that the loss has decreased after the weight update. These steps are repeated until the loss converges.
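The whole procedure can be reproduced with a short script (a sketch assuming NumPy; the 50,000 iterations are an arbitrary choice that is enough for this toy dataset to converge):

```python
import numpy as np

# Toy dataset and initial weights from the worked example above.
X = np.array([[1, 3.1], [1, 2.9], [1, 3.5], [1, 4.3], [1, 3.6]])
y = np.array([2.0, 2.6, 2.2, 3.2, 3.0])
w = np.array([0.45, 0.75])
alpha, N = 0.01, len(y)

def loss(w):
    return np.sum((y - X @ w) ** 2) / (2 * N)

print(round(loss(w), 4))              # 0.1615, the initial loss computed above

for k in range(50_000):
    grad = -(X.T @ (y - X @ w)) / N   # [0.46, 1.624] on the first iteration
    w = w - alpha * grad              # the first update gives roughly [0.445, 0.734]

print(np.round(w, 3))                 # approaches the least squares solution [0.336 0.651]
```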