In the past two missions, we learned about how decision trees are constructed. We used a modified version of ID3, which is a bit simpler than the most common tree-building algorithms, C4.5 and CART. However, the basics are all the same, so we can apply the principles we learned about how decision trees work to any tree construction algorithm.
In this mission, we'll learn about when to use decision trees, and how to use them most effectively.
We've been using a dataset on US income, which we'll keep using here. The data is from the 1994 Census, and contains information on an individual's marital status, age, type of work, and more. The target column, high_income, indicates whether someone makes less than or equal to 50k a year (0) or more than 50k a year (1).
You can download the data from here.
We can use the scikit-learn package to fit a decision tree. The interface is very similar to other algorithms we've fit in the past.
We use the DecisionTreeClassifier class for classification problems, and DecisionTreeRegressor for regression problems. Both of these classes are in the sklearn.tree package.
In this case, we're predicting a binary outcome, so we'll use a classifier.
The first step is to train the classifier on the data. We'll use the fit method on a classifier to do this.
Fit clf to the income data.
Use income[columns] to only use the named columns as predictors.
Use the high_income column as the target.
from sklearn.tree import DecisionTreeClassifier
# A list of columns to train with.
# All columns have been converted to numeric.
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]
# Instantiate the classifier.
# Set random_state to 1 to keep results consistent.
clf = DecisionTreeClassifier(random_state=1)
# The variable income is loaded, and contains all the income data.
clf.fit(income[columns], income["high_income"])
Now that we've fit a model, we can make predictions. We'll want to split our data into training and testing sets first. If we don't, we'll be making predictions on the same data that we train our algorithm with. This leads to overfitting, and will make our error appear lower than it is.
We covered overfitting in more depth earlier, but a simple explanation is that if you memorize how to perform 3 specific addition problems (2+2, 3+6, 3+3), you'll get those specific problems correct every time. On the other hand, if you're asked 4+4, you won't know how to do it, because you don't know the rules of addition.
If you learn the rules of addition, you'll sometimes get problems wrong (3443343434+24344343 can be hard to do mentally), but you'll be able to do any problem, and you'll get most of them right. Overfitting is the first example, where you memorize the details of the training set, but are unable to generalize to new examples that you're asked to make predictions on.
We can avoid overfitting by always making predictions and evaluating error on data that our algorithm hasn't been trained with. This will show us when we're overfitting by giving us a realistic error on data that the algorithm hasn't seen before.
We can split the data by shuffling the order of the dataframe, then selecting certain rows to be in the training set, and certain rows to be in the testing set.
In this case, we'll make 80% of our rows training data, and the rest testing data.
All the rows in income with a position up to train_max_row (but not including it) will be part of the training set.
Make a dataframe called train containing all of these rows.
Make a dataframe called test containing all of the rows with a position greater than or equal to train_max_row.
import numpy
import math
# Set a random seed so the shuffle is the same every time.
numpy.random.seed(1)
# Shuffle the rows. This first permutes the index randomly using numpy.random.permutation.
# Then, it reindexes the dataframe with this.
# The net effect is to put the rows into random order.
income = income.reindex(numpy.random.permutation(income.index))
train_max_row = math.floor(income.shape[0] * .8)
train = income.iloc[:train_max_row]
test = income.iloc[train_max_row:]
While there are many methods to evaluate error with classification, we'll use AUC, which we covered extensively earlier in the machine learning material. AUC ranges from 0 to 1, and is ideal for binary classification. The higher the AUC, the more accurate our predictions.
We can compute AUC with the roc_auc_score function from sklearn.metrics. This function takes in 2 parameters:
y_true: the true labels
y_score: the predicted labels
and returns the computed AUC value.
Compute the AUC between predictions and the high_income column of test, and assign the result to error.
Use the print function to display error.
from sklearn.metrics import roc_auc_score
clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
error = roc_auc_score(test["high_income"], predictions)
print(error)
The AUC for the predictions on the testing set is about .694. Let's compare this against the AUC for predictions on the training set to see if the model is overfitting.
It's normal for the model to predict the training set better than the testing set. After all, it has full knowledge of that data and the outcomes. However, if the AUC between training set predictions and actual values is significantly higher than the AUC between test set predictions and actual values, it's a sign that the model may be overfitting.
Compute the AUC between predictions and the high_income column of train, and print the result.
predictions = clf.predict(train[columns])
print(roc_auc_score(train["high_income"], predictions))
Our AUC on the training set was .947. The AUC on the test set was .694. There's no hard and fast rule on when overfitting is happening, but our model is predicting the training set much better than it's predicting the test set. Splitting the data into training and testing sets doesn't prevent overfitting -- it just helps us detect it and fix it.
Based on our AUC measurements, it appears that we are in fact overfitting. Let's look a little more into why decision trees might overfit.
In the last mission, we looked at this data:
high_income age marital_status
0 20 0
0 60 2
0 40 1
1 25 1
1 35 2
1 55 1
Here's the full diagram for the decision tree we can build from the above data:
[Diagram: the full decision tree. It splits on age above 37.5, then on age above 25 and age above 55, then on age above 22.5 and age above 47.5, ending in leaves that predict 0 or 1.]
This tree perfectly predicts all of our values. It can always get a right answer on the training set. This is equivalent to memorizing the rules of addition. We've built our tree in such a way that it can perfectly predict the training set -- but, the way the tree has been constructed doesn't make sense when we step back.
The tree above is saying that if you're under 22.5 years old, you have low income. If you're 22.5 to 37.5, high income. If you're 37.5 to 47.5, low income. If you're 47.5 to 55, high income. If you're above 55, low income. These rules are very specific to the training set.
Think about the problem with a real-world lens. Does it make sense to predict that someone who is 20 is low income, someone who is 25 is high income, and someone who is 40 is low income? Intuitively, we know that people who are younger probably make less, people who are middle aged make more, and people who have retired make less.
Our tree has created so many age-based splits in an attempt to perfectly predict everyone's income that each split is effectively meaningless.
Here's a tree that matches up with our intuition better:
[Diagram: the smaller, pruned tree. It splits on age above 37.5, then on age above 25 and age above 55, with leaves predicting 0, 1, .66, and 0.]
All we've done is "pruned" the tree, and removed some of the lower leaves. We've made some of the higher up nodes into leaves instead.
The tree above makes more intuitive sense. If you're under 25, we predict low income. If you're between 25 and 55, we predict high income (the .66 rounds up to 1). If you're above 55, we predict low income.
This actually has lower accuracy on our training set, but it will generalize better to new examples, because it matches reality better.
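In scikit-learn terms, those leaf fractions are what predict_proba returns, while predict picks the majority class at the leaf. Here's a quick sketch (assuming the clf, test, and columns variables from the earlier screens):
# predict_proba returns the fraction of training rows of each class at the leaf
# a row ends up in; predict returns the majority class at that leaf.
probabilities = clf.predict_proba(test[columns])[:5]
labels = clf.predict(test[columns])[:5]
print(probabilities)
print(labels)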
Trees overfit when they have too much depth, and make overly complex rules that match the training data, but aren't able to generalize well to new data.
This may seem to be a strange principle at first, but the more depth a tree has, typically the worse it performs on new data.
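To see this effect concretely, here's a small sketch (assuming the train, test, and columns variables defined earlier) that fits trees at several depths and compares train and test AUC. Training AUC typically keeps rising with depth, while test AUC peaks at some intermediate depth and then falls.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Compare train and test AUC across several tree depths.
for depth in [2, 4, 6, 8, 10, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1)
    clf.fit(train[columns], train["high_income"])
    train_auc = roc_auc_score(train["high_income"], clf.predict(train[columns]))
    test_auc = roc_auc_score(test["high_income"], clf.predict(test[columns]))
    print(depth, round(train_auc, 3), round(test_auc, 3))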
There are three main ways to combat overfitting:
"Prune" the tree after we build it, removing leaves that don't add to prediction accuracy.
Use ensembling to blend the predictions of many trees.
Restrict the depth of the tree while we're building it.
We'll explore all of these, but we'll look at the third method first.
By controlling how deep the tree can go while we build it, we keep the rules more general than they would be otherwise. This prevents the tree from overfitting.
We can restrict how deep the tree is built with a few parameters when we initialize the DecisionTreeClassifier class:
max_depth -- globally restricts how deep the tree can go.
min_samples_split -- the minimum number of rows needed in a node before it can be split. For example, if this is set to 3, then nodes with 2 rows won't be split, and will become leaves instead.
min_samples_leaf -- the minimum number of rows that a leaf must have.
min_weight_fraction_leaf -- the fraction of the input rows that are required to be at a leaf.
max_leaf_nodes -- the maximum number of total leaves. This will cap the count of leaf nodes as the tree is being built.
As you can see, some of these parameters don't make sense together -- for example, using max_depth and max_leaf_nodes at the same time, since both cap how large the tree can grow.
Now that we know what to tweak, let's improve our model.
Set min_samples_split to 13 when creating the DecisionTreeClassifier.
Make predictions on the training set, compute the AUC, and assign it to train_auc.
Make predictions on the test set, compute the AUC, and assign it to test_auc.
# Decision trees model from the last screen.
clf = DecisionTreeClassifier(random_state=1)
clf = DecisionTreeClassifier(min_samples_split=13, random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)
train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)
print(test_auc)
print(train_auc)
By setting min_samples_split to 13, we managed to boost test AUC from .694 to .700. Training set AUC decreased from .947 to .843, showing that the model we built was less overfit to the training set than before:
settings | train AUC | test AUC |
---|---|---|
default | 0.947 | 0.694 |
min_samples_split: 13 | 0.843 | 0.700 |
Let's play around some more with parameters.
Set max_depth to 7 and min_samples_split to 13 when creating the DecisionTreeClassifier.
Make predictions on the training set, compute the AUC, and assign it to train_auc.
Make predictions on the test set, compute the AUC, and assign it to test_auc.
# First decision trees model we trained and tested.
clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)
train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)
print(test_auc)
print(train_auc)
clf = DecisionTreeClassifier(random_state=1, min_samples_split=13, max_depth=7)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)
train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)
print(test_auc)
print(train_auc)
We just improved the AUC again! Test set AUC increased to .744, while the training set AUC decreased to .748:
settings | train AUC | test AUC |
---|---|---|
default (min_samples_split: 2, max_depth: None) | 0.947 | 0.694 |
min_samples_split: 13 | 0.843 | 0.700 |
min_samples_split: 13, max_depth: 7 | 0.748 | 0.744 |
We aren't overfitting anymore, since both AUC values are about the same. Let's tweak the parameters more aggressively and see what happens!
Set max_depth to 2 and min_samples_split to 100 when creating the DecisionTreeClassifier.
Make predictions on the training set, compute the AUC, and assign it to train_auc.
Make predictions on the test set, compute the AUC, and assign it to test_auc.
# First decision trees model we trained and tested.
clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)
train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)
print(test_auc)
print(train_auc)
clf = DecisionTreeClassifier(random_state=1, min_samples_split=100, max_depth=2)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)
train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)
print(test_auc)
print(train_auc)
Our AUC went down on the last screen relative to the screen before:
settings | train AUC | test AUC |
---|---|---|
default (min_samples_split: 2, max_depth: None) | 0.947 | 0.694 |
min_samples_split: 13 | 0.843 | 0.700 |
min_samples_split: 13, max_depth: 7 | 0.748 | 0.744 |
min_samples_split: 100, max_depth: 2 | 0.662 | 0.655 |
This is because we're now underfitting. Underfitting is what happens when our model is too simple to actually explain the relations between the variables.
Let's go back to our tree diagram to explain underfitting.
Here's the data:
high_income age marital_status
0 20 0
0 60 2
0 40 1
1 25 1
1 35 2
1 55 1
And here's the "right fit" tree. This tree explains the data properly, without overfitting:
"Rightfit"treeAgeabove37.5?1NoYesAgeaboveAgeabove25?255?7NYesNYesLeaf(0)Leaf(1)Leaf(.66)Leaf(0)6811
Let's trim this tree even more to show what happens when the model isn't complex enough to explain the data:
[Diagram: the underfit tree. It splits only on age above 37.5, with two leaves predicting .66 and .33.]
In this model, anybody under 37.5 will be predicted to have high income (.66 rounds up), and anyone over 37.5 will be predicted to have low income (.33 rounds down). This model is too simple to model reality -- which is that younger people make less, middle-aged people make more, and elderly people make less.
Thus, this tree underfits the data and will have lower accuracy than the properly fit version.
By artificially restricting the depth of our tree, we prevent it from creating a complex enough model to correctly categorize some of the rows. If we don't perform the artificial restrictions, the tree becomes too complex, and fits quirks in the data that only exist in the training set, but don't generalize to new data.
This is known as the bias-variance tradeoff. Imagine taking many random samples of the training data and building a model from each one. If the models' predictions for the same row are far apart from each other, we have high variance. If the predictions for the same row are close together, but far from the actual value, we have high bias.
High bias can cause underfitting -- if a model is consistently failing to predict the correct value, it may be that it is too simple to actually model the data.
High variance can cause overfitting -- if a model is very susceptible to small changes in the input data, and changes its predictions massively, then it is likely fitting itself to quirks in the training data, and not making a generalizable model.
It's called the bias-variance tradeoff because decreasing one will usually increase the other. This is a limitation of all machine learning algorithms. If you want to read more about the tradeoff, you can look here.
In general, decision trees suffer from high variance. The whole structure of a decision tree can change if you make a minor alteration to its training data. By restricting the depth of the tree, we increase the bias and decrease the variance. If we restrict the depth too much, we increase bias to the point where it will underfit.
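As an illustration of that variance, here's a sketch (assuming the train, test, and columns variables from earlier) that fits several unrestricted trees on different random halves of the training data and compares their predictions for the same test rows. When the trees frequently disagree, that's high variance in action.
from sklearn.tree import DecisionTreeClassifier

# Fit several trees, each on a different random 50% sample of the training data.
sample_predictions = []
for i in range(5):
    sample = train.sample(frac=0.5, random_state=i)
    clf = DecisionTreeClassifier(random_state=1)
    clf.fit(sample[columns], sample["high_income"])
    # Keep each tree's predictions for the same 10 test rows.
    sample_predictions.append(clf.predict(test[columns].head(10)))

# Print the 5 trees' predictions side by side for each row.
for row_predictions in zip(*sample_predictions):
    print(row_predictions)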
Generally, you'll need to use your intuition and manually tweak parameters to get the "right" fit.
We can induce variance and see what happens with a decision tree. To add noise to the data, we'll just add a column of random values. A model with high variance (like a decision tree) will pick up on this noise, and overfit to it. This is because models with high variance are very sensitive to small changes in input data.
Fit the classifier to the training data.
Make predictions on the training set, and compute AUC and assign it to train_auc.
Make predictions on the test set, and compute AUC and assign it to test_auc.
numpy.random.seed(1)
# Generate a column with random numbers from 0 to 4.
income["noise"] = numpy.random.randint(4, size=income.shape[0])
# Adjust columns to include the noise column.
columns = ["noise", "age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]
# Make new train and test sets.
train_max_row = math.floor(income.shape[0] * .8)
train = income.iloc[:train_max_row]
test = income.iloc[train_max_row:]
# Initialize the classifier.
clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)
train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)
print(test_auc)
print(train_auc)
As you can see above, the random noise column causes significant overfitting. Our test set AUC decreases to .691, and our training set AUC increases to .975.
One way to prevent overfitting that we tried before was to prevent the tree from growing beyond a certain depth. Another technique is called pruning. Pruning involves building a full tree, and then removing the leaves that don't add to prediction accuracy. Pruning prevents a model from becoming overly complex, and can make a simpler model with higher accuracy on the testing set.
Pruning is less commonly used than parameter optimization (like we just did), and ensembling. That's not to say that it isn't an important technique, and we'll cover it in more depth down the line.
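As a preview, newer versions of scikit-learn expose one pruning method, cost-complexity pruning, through the ccp_alpha parameter of DecisionTreeClassifier. Here's a minimal sketch (assuming the train, test, and columns variables from earlier; the alpha value is only for illustration and would normally be tuned):
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Larger ccp_alpha values prune away more of the fully grown tree.
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=0.001)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))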
Let's go over the main advantages and disadvantages of decision trees. The main advantages of decision trees are:
Easy to interpret
Relatively fast to fit and make predictions
Able to handle multiple types of data
Able to pick up nonlinearities in the data, and usually fairly accurate
The main disadvantage is a tendency to overfit.
In tasks where it's important to be able to interpret and convey why the algorithm is doing what it's doing, decision trees are a good choice.
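One way to see that interpretability in practice is to print the rules a fitted tree learned. Here's a small sketch using scikit-learn's export_text helper (available in recent versions), fitting a shallow tree so the output stays readable, and assuming the train and columns variables from earlier:
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree so the printed rule list stays short.
clf = DecisionTreeClassifier(max_depth=2, random_state=1)
clf.fit(train[columns], train["high_income"])
print(export_text(clf, feature_names=columns))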
The most powerful way to reduce decision tree overfitting is to create ensembles of trees. A popular algorithm to do this is called random forest. We'll cover random forests in the next mission. In cases where prediction accuracy is the most important consideration, random forests usually perform better.
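As a quick preview (just a sketch -- the next mission covers this properly), the interface is nearly identical to what we've used so far:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# An ensemble of decision trees; n_estimators sets how many trees are averaged.
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))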
In the next mission, we'll explore the random forest algorithm in more depth.