{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Selection\n", "\n", "In this lab, we will implement Univariate Feature Selection and Sequential Feature Selection using Scikit-learn and Mlxtend.\n", "\n", "## Part 1: Feature Selection using Univariate Feature Selection\n", "Univariate feature selection selects the best features based on univariate statistical tests. Scikit-learn provides several feature selection routines (see the documentation [here](http://scikit-learn.org/stable/modules/feature_selection.html)). In this lab, we will look at how to use SelectKBest to select the best features. SelectKBest works by removing all but the *k* highest-scoring features. It takes as input a scoring function that returns univariate scores. The following scoring functions are provided.\n", "\n", "for classification: chi2, f_classif and mutual_info_classif
\n", "for regression: f_regression, mutual_info_regression\n", "\n", "- f_classif: ANOVA F-value between label/feature for classification tasks.\n", "- mutual_info_classif: Mutual information for a discrete target.\n", "- chi2: Chi-squared stats of non-negative features for classification tasks.\n", "- f_regression: F-value between label/feature for regression tasks.\n", "- mutual_info_regression: Mutual information for a continuous target." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import the standard modules to be used in this lab\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1
" ], "text/plain": [ " age sex cp trestbps chol fbs restecg thalach exang oldpeak slope \\\n", "0 63 1 3 145 233 1 0 150 0 2.3 0 \n", "1 37 1 2 130 250 0 1 187 0 3.5 0 \n", "2 41 0 1 130 204 0 0 172 0 1.4 2 \n", "3 56 1 1 120 236 0 1 178 0 0.8 2 \n", "4 57 0 0 120 354 0 1 163 1 0.6 2 \n", "\n", " ca thal target \n", "0 0 1 1 \n", "1 0 2 1 \n", "2 0 2 1 \n", "3 0 2 1 \n", "4 0 2 1 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_pd = pd.read_csv('heart.csv')\n", "data_pd.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Feature Selection using Sequential Feature Selection\n", "The implementation of Sequential Feature Selector is available in Mlxtend (machine learning extensions) library. We will be using a high dimensional dataset to demonstrate how RFE works.

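To make the idea of Sequential Forward Selection concrete, here is a minimal sketch of the greedy loop in plain Python. It is not Mlxtend's implementation: the helper names `centroid_accuracy` and `forward_select`, the nearest-centroid scorer, and the toy data are all assumptions made for illustration. As in the cv=0 setting, the score is computed on the training data only.

```python
# Illustrative sketch only: a hand-rolled greedy forward selection,
# not Mlxtend's SequentialFeatureSelector. All names here are hypothetical.

def centroid_accuracy(X, y, features):
    # Training accuracy of a nearest-centroid classifier that only sees
    # the columns listed in `features` (no cross-validation).
    labels = sorted(set(y))
    centroids = {}
    for label in labels:
        rows = [row for row, target in zip(X, y) if target == label]
        centroids[label] = [sum(row[f] for row in rows) / len(rows) for f in features]

    def predict(row):
        # Pick the class whose centroid is closest in squared distance.
        return min(
            labels,
            key=lambda label: sum(
                (row[f] - c) ** 2 for f, c in zip(features, centroids[label])
            ),
        )

    hits = sum(predict(row) == target for row, target in zip(X, y))
    return hits / len(y)


def forward_select(X, y, k, score=centroid_accuracy):
    # Start from the empty set and add, one at a time, the feature
    # that gives the best score together with those already selected.
    selected = []
    remaining = list(range(len(X[0])))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: score(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: column 1 separates the classes cleanly, columns 0 and 2 do not.
X = [
    [1.0, 0.0, 5.0],
    [2.0, 0.1, 1.0],
    [3.0, 0.2, 4.0],
    [2.5, 1.0, 2.0],
    [4.0, 1.1, 5.0],
    [5.0, 0.9, 1.0],
]
y = [0, 0, 0, 1, 1, 1]

print(forward_select(X, y, k=2))
```

On this toy data the loop picks column 1 first and then column 0. Any estimator with a fit and score step could stand in for the scorer; the library version wraps the same loop around a Scikit-learn style model.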
\n", "First, we import SequentialFeatureSelector from the mlxtend.feature_selection package. We will use a decision tree as the predictive model for the feature selection. Here, we set the number of features to select to 5 (k_features=5), and forward=True indicates Sequential Forward Selection. By choosing cv=0, we do not perform any cross-validation, which means the performance (accuracy) is computed entirely on the training set. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Exercise\n", "The lab exercise uses the Auto MPG dataset.

The data contains the technical specifications of cars.\n", "Attribute Information:\n", "1. mpg (miles per gallon): continuous\n", "2. cylinders: multi-valued discrete\n", "3. displacement: continuous\n", "4. horsepower: continuous\n", "5. weight: continuous\n", "6. acceleration: continuous\n", "7. model year: multi-valued discrete\n", "8. origin: multi-valued discrete\n", "9. car name: string (unique for each instance)\n", "\n", "We would like to predict miles per gallon (mpg) using a decision tree regressor. \n", "1. Perform feature reduction on the dataset to reduce its dimensionality.\n", "2. Perform feature selection on the dataset to select the attributes most relevant for predicting mpg. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    mpg  cylinders  displacement  horsepower  weight  acceleration  model_year  origin                   car_name
0  18.0          8         307.0       130.0  3504.0          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0       165.0  3693.0          11.5          70       1          buick skylark 320
2  18.0          8         318.0       150.0  3436.0          11.0          70       1         plymouth satellite
3  16.0          8         304.0       150.0  3433.0          12.0          70       1              amc rebel sst
4  17.0          8         302.0       140.0  3449.0          10.5          70       1                ford torino
" ], "text/plain": [ " mpg cylinders displacement horsepower weight acceleration \\\n", "0 18.0 8 307.0 130.0 3504.0 12.0 \n", "1 15.0 8 350.0 165.0 3693.0 11.5 \n", "2 18.0 8 318.0 150.0 3436.0 11.0 \n", "3 16.0 8 304.0 150.0 3433.0 12.0 \n", "4 17.0 8 302.0 140.0 3449.0 10.5 \n", "\n", " model_year origin car_name \n", "0 70 1 chevrolet chevelle malibu \n", "1 70 1 buick skylark 320 \n", "2 70 1 plymouth satellite \n", "3 70 1 amc rebel sst \n", "4 70 1 ford torino " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "auto_mpg = pd.read_csv(\"auto_mpg.csv\", delim_whitespace=True)\n", "auto_mpg.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SelectKBest(k=5, score_func=)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_selection import SelectKBest\n", "from sklearn.feature_selection import f_regression\n", "kBest = SelectKBest(f_regression, k=5)\n", "kBest.fit(auto_mpg.iloc[:,1:-1], auto_mpg['mpg']) # run the score function on the data" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 2 3 5]\n", "[597.07704785 724.99430337 604.99596063 888.85068265 84.95770025\n", " 199.98200802 184.19963937]\n" ] } ], "source": [ "idx = kBest.get_support(True)\n", "print(idx)\n", "scores = kBest.scores_\n", "print(scores)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "E:\\Programs\\Miniconda3\\envs\\myenv\\lib\\site-packages\\ipykernel_launcher.py:2: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: 
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " \n" ] } ], "source": [ "# idx indexes the columns of auto_mpg.iloc[:, 1:-1], so shift by 1 to skip the mpg column\n", "# Take a copy so the new column is added to an independent DataFrame rather than a slice\n", "auto_mpg_kbest = auto_mpg.iloc[:, idx + 1].copy()\n", "auto_mpg_kbest['mpg'] = auto_mpg['mpg']" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.model_selection import train_test_split\n", "# Features are all columns except the last; the target (mpg) is the last column\n", "# shuffle=False keeps the row order, so random_state has no effect here\n", "X_train, X_test, y_train, y_test = train_test_split(auto_mpg_kbest.iloc[:,0:-1], auto_mpg_kbest.iloc[:,-1], test_size=0.3, random_state=5, shuffle=False)\n", "tree = DecisionTreeRegressor()\n", "# Effective alphas and total leaf impurities along the cost-complexity pruning path\n", "path = tree.cost_complexity_pruning_path(X_train, y_train)\n", "ccp_alphas, impurities = path.ccp_alphas, path.impurities" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Total Impurity vs effective alpha for training set')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from matplotlib import pyplot as plt\n", "fig, ax = plt.subplots()\n", "ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle=\"steps-post\")\n", "ax.set_xlabel(\"effective alpha\")\n", "ax.set_ylabel(\"total impurity of leaves\")\n", "ax.set_title(\"Total Impurity vs effective alpha for training set\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of nodes in the last tree is: 1 with ccp_alpha: 25.178707081037665\n" ] } ], "source": [ "clfs = []\n", "for ccp_alpha in ccp_alphas:\n", " clf = DecisionTreeRegressor(random_state=0, ccp_alpha=ccp_alpha)\n", " clf.fit(X_train, y_train)\n", " clfs.append(clf)\n", "print(\"Number of nodes in the last tree is: {} with ccp_alpha: {}\".format(\n", " clfs[-1].tree_.node_count, ccp_alphas[-1]))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "182" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(ccp_alphas)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "train_scores = [clf.score(X_train, y_train) for clf in clfs]\n", "test_scores = [clf.score(X_test, y_test) for clf in clfs]\n", "idx = 140\n", "fig, ax = plt.subplots(figsize=(10,6))\n", "ax.set_xlabel(\"alpha\")\n", "ax.set_ylabel(\"$R^2$\")\n", "ax.set_title(\"Accuracy vs alpha for training and testing sets\")\n", "#ax.plot(ccp_alphas[:30], train_scores[:30], marker='o', label=\"train\",\n", "# drawstyle=\"steps-post\")\n", "ax.plot(ccp_alphas[:idx], test_scores[:idx], marker='o', label=\"test\",\n", " drawstyle=\"steps-post\")\n", "ax.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(array([0], dtype=int64),)\n", "0.9959869170126309\n" ] }, { "data": { "text/plain": [ "0.004328537170263941" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "idx = 100\n", "print(np.where(train_scores == np.max(train_scores)))\n", "print(train_scores[78])\n", "ccp_alphas[80]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "37.877120949074076" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import mean_squared_error\n", "X_train, X_test, y_train, y_test = train_test_split(auto_mpg_kbest.iloc[:,0:-1], auto_mpg_kbest.iloc[:,-1], test_size=0.3, random_state=5, shuffle=False)\n", "idx = 79\n", "tree = DecisionTreeRegressor(ccp_alpha=ccp_alphas[idx])\n", "tree.fit(X_train, y_train)\n", "y_pred = tree.predict(X_test)\n", "mean_squared_error(y_test, y_pred)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { 
"name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.12" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }