Subset Selection

Introduction

Subset selection is an approach that selects the set of features most relevant to predicting the target variable. Specifically, given a dataset with d dimensions (features), the aim is to find the k features that provide the most information for the prediction and discard the other (d-k) features. There are 2^d possible subsets of d features, and ideally we would evaluate all of them to determine the optimal one. However, this is computationally infeasible, especially when d is large, so heuristic methods are employed to obtain a satisfactory solution in an acceptable time. Based on the choice of evaluation metric, these methods fall into two categories: filter methods and wrapper methods.

Wrapper Method

Wrapper methods use a machine learning model to evaluate the feature subsets [1]. Each candidate subset is used to train the model, which is then evaluated on a hold-out set; the misclassification rate on the hold-out set is the score of that subset. Wrapper methods are computationally intensive because a new model is trained for every subset, but the selected subset contains the features most relevant to that particular model and problem. Wrapper methods come in two forms: sequential forward selection and sequential backward selection.

In sequential forward selection, we start with an empty feature set and add one feature at a time. The feature added at each step is the one whose inclusion yields the minimum validation error. The process stops when the error rate no longer decreases, or when the decrease is negligible. The algorithm for sequential forward selection is given below.

Training set with d dimensions, \mathbf{x} = \{x_1, x_2, \ldots, x_d\}
F = \emptyset
k is the number of features to select

Split the training set into a training subset D_{tr} and a validation subset D_{vd}
for 1 to k:
   for each x_j \in \mathbf{x} \setminus F:
      Train a predictive model \hat{f} on D_{tr} using the features F \cup \{x_j\}
      Compute the error rate \epsilon_{\hat{f}}(F \cup \{x_j\}) on D_{vd}
   Choose the x_j with minimum \epsilon_{\hat{f}}(F \cup \{x_j\})
   Add x_j to F
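
A minimal Python sketch of this procedure, assuming scikit-learn is available, is given below. The logistic regression classifier, the 70/30 validation split, and the variable names X and y are illustrative assumptions; any estimator with fit and score methods could be used in their place.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def forward_selection(X, y, k):
    """Greedy sequential forward selection of k feature indices (a sketch)."""
    # Split the training data into training and validation subsets
    X_tr, X_vd, y_tr, y_vd = train_test_split(X, y, test_size=0.3, random_state=0)
    selected = []                            # F, the chosen feature indices
    remaining = list(range(X.shape[1]))      # candidate features not yet in F
    for _ in range(k):
        errors = {}
        for j in remaining:
            cols = selected + [j]
            model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
            # Error rate on the validation subset when feature j is added
            errors[j] = 1.0 - model.score(X_vd[:, cols], y_vd)
        best = min(errors, key=errors.get)   # addition with the lowest error
        selected.append(best)
        remaining.remove(best)
    return selected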

In sequential backward selection, we start with all features and remove one feature at a time. The feature removed at each step is the one whose removal yields the minimum validation error. The process stops when the error rate no longer decreases, or when the decrease is negligible. The algorithm for sequential backward selection is given below.

Training set with d dimensions, \mathbf{x} = \{x_1, x_2, \ldots, x_d\}
F = \mathbf{x}
k is the number of features to select

Split the training set into a training subset D_{tr} and a validation subset D_{vd}
for 1 to (d - k):
   for each x_j \in F:
      Train a predictive model \hat{f} on D_{tr} using the features F \setminus \{x_j\}
      Compute the error rate \epsilon_{\hat{f}}(F \setminus \{x_j\}) on D_{vd}
   Choose the x_j with minimum \epsilon_{\hat{f}}(F \setminus \{x_j\})
   Remove x_j from F
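
A corresponding sketch of sequential backward selection under the same assumptions (scikit-learn, a logistic regression classifier, numeric arrays X and y) is shown below; it removes one feature per iteration until only k remain.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def backward_selection(X, y, k):
    """Greedy sequential backward selection down to k feature indices (a sketch)."""
    # Split the training data into training and validation subsets
    X_tr, X_vd, y_tr, y_vd = train_test_split(X, y, test_size=0.3, random_state=0)
    selected = list(range(X.shape[1]))       # F starts with all d features
    while len(selected) > k:
        errors = {}
        for j in selected:
            cols = [c for c in selected if c != j]
            model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
            # Error rate on the validation subset after removing feature j
            errors[j] = 1.0 - model.score(X_vd[:, cols], y_vd)
        worst = min(errors, key=errors.get)  # removal that yields the lowest error
        selected.remove(worst)
    return selected
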
Filter Method

Filter methods use various statistical tests, rather than the error rate of a machine learning model, to determine the relevance of the features [2]. Common statistical measures include analysis of variance (ANOVA), mutual information, and the chi-square statistic. Filter methods rank the features by their relevance, and the top k features are chosen, with k typically tuned via cross-validation. Because they do not rely on a machine learning model, filter methods are less computationally intensive; however, the selected set of features is generally less optimal and yields lower prediction performance. Filter methods are therefore preferable for uncovering relationships between the features and the target variable, and can be used as a preprocessing step for wrapper methods.
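
As a brief illustration, the sketch below uses scikit-learn's SelectKBest with mutual information as the scoring function; the breast cancer dataset and k = 10 are arbitrary choices for demonstration, and in practice k would be tuned, for example via cross-validation.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature by its mutual information with the target and keep the top 10
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the selected features
print(X_reduced.shape)                     # (n_samples, 10)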

References

[1] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, vol. 97, no. 1, pp. 273–324, Dec. 1997, doi: 10.1016/S0004-3702(97)00043-X.

[2] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.