Python Machine Learning Cookbook（Second Edition）

上QQ阅读APP看书，第一时间看更新

How to do it…

Let's see how to estimate housing prices in Python:

Create a new file called housing.py and add the following lines:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn import datasets
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn.utils import shuffle
import matplotlib.pyplot as plt

There is a standard housing dataset that people tend to use to get started with machine learning. You can download it at https://archive.ics.uci.edu/ml/machine-learning-databases/housing/. We will be using a slightly modified version of the dataset, which has been provided along with the code files.
The good thing is that scikit-learn provides a function to directly load this dataset:

housing_data = datasets.load_boston()

Each data point has 12 input parameters that affect the price of a house. You can access the input data using housing_data.data and the corresponding price using housing_data.target. The following attributes are available:

crim: Per capita crime rate by town
zn: Proportion of residential land zoned for lots that are over 25,000 square feet
indus: Proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: Nitric oxides concentration (parts per ten million)
rm: Average number of rooms per dwelling
age: Proportion of owner-occupied units built prior to 1940
dis: Weighted distances to the five Boston employment centers
rad: Index of accessibility to radial highways
tax: Full-value property-tax rate per $10,000
ptratio: Pupil-teacher ratio by town
lstat: Percent of the lower status of the population
target: Median value of owner-occupied homes in $1000

Of these, target is the response variable, while the other 12 variables are possible predictors. The goal of this analysis is to fit a regression model that best explains the variation in target.

Let's separate this into input and output. To make this independent of the ordering of the data, let's shuffle it as well:

X, y = shuffle(housing_data.data, housing_data.target, random_state=7)

The sklearn.utils.shuffle() function shuffles arrays or sparse matrices in a consistent way to do random permutations of collections. Shuffling data reduces variance and makes sure that the patterns remain general and less overfitted. The random_state parameter controls how we shuffle data so that we can have reproducible results.

Let's divide the data into training and testing. We'll allocate 80% for training and 20% for testing:

num_training = int(0.8 * len(X))
X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]

Remember, machine learning algorithms, train models by using a finite set of training data. In the training phase, the model is evaluated based on its predictions of the training set. But the goal of the algorithm is to produce a model that predicts previously unseen observations, in other words, one that is able to generalize the problem by starting from known data and unknown data. For this reason, the data is divided into two datasets: training and test. The training set is used to train the model, while the test set is used to verify the ability of the system to generalize.

We are now ready to fit a decision tree regression model. Let's pick a tree with a maximum depth of 4, which means that we are not letting the tree become arbitrarily deep:

dt_regressor = DecisionTreeRegressor(max_depth=4)
dt_regressor.fit(X_train, y_train)

The DecisionTreeRegressor function has been used to build a decision tree regressor.

Let's also fit the decision tree regression model with AdaBoost:

ab_regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=400, random_state=7)
ab_regressor.fit(X_train, y_train)

The AdaBoostRegressor function has been used to compare the results and see how AdaBoost really boosts the performance of a decision tree regressor.

Let's evaluate the performance of the decision tree regressor:

y_pred_dt = dt_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred_dt)
evs = explained_variance_score(y_test, y_pred_dt)
print("#### Decision Tree performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))

First, we used the predict() function to predict the response variable based on the test data. Next, we calculated mean squared error and explained variance. Mean squared error is the average of the squared difference between actual and predicted values across all data points in the input. The explained variance is an indicator that, in the form of proportion, indicates how much variability of our data is explained by the model in question.

Now, let's evaluate the performance of AdaBoost:

y_pred_ab = ab_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred_ab)
evs = explained_variance_score(y_test, y_pred_ab)
print("#### AdaBoost performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))

Here is the output on the Terminal:

#### Decision Tree performance ####
Mean squared error = 14.79
Explained variance score = 0.82

#### AdaBoost performance ####
Mean squared error = 7.54
Explained variance score = 0.91

The error is lower and the variance score is closer to 1 when we use AdaBoost, as shown in the preceding output.