{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Decision Trees Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "We will use the wine quality data set for these exercises. This data set contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality metric (3-9, with higher being better) and a color (red or white). The name of the file is `Wine_Quality_Data.csv`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2017-04-10T00:04:57.164238Z", "start_time": "2017-04-09T20:04:57.158472-04:00" } }, "outputs": [], "source": [ "from __future__ import print_function\n", "import os\n", "data_path = ['..', '..', 'data']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1\n", "\n", "* Import the data and examine the features.\n", "* We will be using all of them to predict `color` (white or red), but the `color` feature will need to be integer encoded."
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "path = 'data/Wine_Quality_data.csv'\n", "data = pd.read_csv(path, sep=',')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "fixed_acidity float64\n", "volatile_acidity float64\n", "citric_acid float64\n", "residual_sugar float64\n", "chlorides float64\n", "free_sulfur_dioxide float64\n", "total_sulfur_dioxide float64\n", "density float64\n", "pH float64\n", "sulphates float64\n", "alcohol float64\n", "quality int64\n", "color object\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dtypes" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Integer-encode the target: white -> 0, red -> 1\n", "data['color'] = data['color'].map({'white': 0, 'red': 1})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(data.color)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2\n", "\n", "* Use `StratifiedShuffleSplit` to split the data into train and test sets that are stratified by wine quality. If possible, preserve the indices of the split for Question 5 below.\n", "* Check the percent composition of each quality level for both the train and test data sets."
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']\n" ] } ], "source": [ "# Every column except the target is a feature\n", "features_cols = [x for x in data.columns if x != 'color']\n", "print(features_cols)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5497\n", "1000\n" ] } ], "source": [ "from sklearn.model_selection import StratifiedShuffleSplit\n", "\n", "# A single stratified split; the stratification labels here are the color column\n", "splitter = StratifiedShuffleSplit(n_splits=1, test_size=1000, random_state=40)\n", "list_split = splitter.split(data[features_cols], data['color'])\n", "train_idx, test_idx = next(list_split)\n", "print(train_idx.size)\n", "print(test_idx.size)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "X_train = data.loc[train_idx, features_cols]\n", "y_train = data.loc[train_idx, 'color']\n", "\n", "X_test = data.loc[test_idx, features_cols]\n", "y_test = data.loc[test_idx, 'color']" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.754\n", "1 0.246\n", "Name: color, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only the last expression (the test split) is displayed by the notebook\n", "y_train.value_counts(normalize=True).sort_index()\n", "y_test.value_counts(normalize=True).sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 3\n", "\n", "* Fit a decision tree classifier with no set limits on maximum depth, features, or leaves.\n", "* Determine how many nodes are present and what the depth of this (very large) tree is.\n", "* Using this tree, measure the prediction error in the train and test data sets. What do you think is going on here based on the differences in prediction error?"
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "\n", "# Note: max_depth=6 caps the tree, whereas Question 3 asks for no limit on depth.\n", "criterion = 'gini'\n", "dt = DecisionTreeClassifier(criterion=criterion, max_depth=6)\n", "dt = dt.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6, 57)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt.tree_.max_depth, dt.tree_.node_count" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score\n", "\n", "def measure_score(y_true, y_pred, label):\n", "    # Each metric uses its own scorer (binary labels: 0 = white, 1 = red).\n", "    df = pd.Series({'accuracy': accuracy_score(y_true, y_pred),\n", "                    'recall': recall_score(y_true, y_pred),\n", "                    'precision': precision_score(y_true, y_pred),\n", "                    'f1_score': f1_score(y_true, y_pred)}, name=label)\n", "    return df" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | train | test |\n", "|---|---|---|\n", "| accuracy | 0.992905 | 0.986 |\n", "| f1_score | 0.992905 | 0.986 |\n", "| precision | 0.992905 | 0.986 |\n", "| recall | 0.992905 | 0.986 |\n", "