Implementing linear regression through scikit-learn
As we did in the previous chapter, we will show how you can quickly use scikit-learn to train a linear model straight from a SageMaker notebook instance. First, create the notebook instance, choosing conda_python3 as the kernel.
- We will start by loading the training data into a pandas dataframe:
import pandas as pd

# SRC_PATH points to the location of the dataset (defined earlier in the notebook)
housing_df = pd.read_csv(SRC_PATH + 'train.csv')
housing_df.head()
The preceding code displays the first few rows of the dataframe.
- The last column, medv, stands for median value and represents the variable that we're trying to predict (the dependent variable) based on the values of the remaining columns (the independent variables).
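Before splitting the data, it can be useful to take a quick look at the dataframe itself. The following lines are a minimal, optional check using standard pandas calls:

# Summary statistics of the dependent variable (median home value)
housing_df['medv'].describe()

# Number of rows and columns in the dataset
housing_df.shape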
As usual, we will split the dataset for training and testing:
from sklearn.model_selection import train_test_split

# The target (label) column and the feature columns used to predict it
label = 'medv'
training_features = ['crim', 'zn', 'indus', 'chas', 'nox',
                     'rm', 'age', 'dis', 'tax', 'ptratio', 'lstat']

housing_df_reordered = housing_df[[label] + training_features]
training_df, test_df = train_test_split(housing_df_reordered,
                                        test_size=0.2)
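If you want reproducible splits, train_test_split also accepts a random_state argument. The following optional check simply confirms the sizes of the two sets:

# For reproducible experiments, you can pass random_state to train_test_split,
# for example: train_test_split(housing_df_reordered, test_size=0.2, random_state=42)
print(training_df.shape, test_df.shape)  # roughly an 80/20 split of the rows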
- Once we have these datasets, we will proceed to construct a linear regressor:
from sklearn.linear_model import LinearRegression

regression = LinearRegression()
model = regression.fit(training_df[training_features],
                       training_df['medv'])
We start by constructing an estimator (in this case, LinearRegression) and fit the model by providing the matrix of training values, training_df[training_features], and the labels, training_df['medv'].
- After fitting the model, we can use it to get predictions for every row in our testing dataset. We do this by appending a new column to our existing testing dataframe:
test_df['predicted_medv'] = model.predict(test_df[training_features])
test_df.head()
The preceding code displays the first few rows of the testing dataframe, now including the predicted_medv column.
- It's always useful to check our predictions graphically. One way to do this is by plotting the predicted versus actual values as a scatterplot:
test_df[['medv', 'predicted_medv']].plot(kind='scatter',
                                         x='medv',
                                         y='predicted_medv')
The preceding code displays a scatterplot of the actual (medv) versus predicted (predicted_medv) values.
Note how the values are located mostly on the diagonal. This is a good sign, as a perfect regressor would yield all data points exactly on the diagonal (every predicted value would be exactly the same as the actual value).
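If you want to make the diagonal explicit, you can redraw the scatterplot and add a reference line on top of it with matplotlib. This is an optional sketch; the line's endpoints are simply taken from the range of the actual values:

import matplotlib.pyplot as plt

ax = test_df[['medv', 'predicted_medv']].plot(kind='scatter',
                                              x='medv',
                                              y='predicted_medv')
# Draw the y = x line; a perfect regressor would place every point on it
lims = [test_df['medv'].min(), test_df['medv'].max()]
ax.plot(lims, lims, color='red')
plt.show()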
- In addition to this graphical verification, we can obtain an evaluation metric that tells us how good our model is at predicting the values. In this example, we use the R-squared metric, explained in the previous section, which is available in scikit-learn.
Let's look at the following code block:
from sklearn.metrics import r2_score
r2_score(test_df['medv'], test_df['predicted_medv'])
0.695
A value near 0.7 is decent. If you want to get a sense of what different correlation strengths look like, we recommend playing this game: http://guessthecorrelation.com/.
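As a reminder of what r2_score computes, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The following optional sketch recomputes it by hand and should match the scikit-learn value up to floating-point precision:

# R-squared = 1 - SS_res / SS_tot
ss_res = ((test_df['medv'] - test_df['predicted_medv']) ** 2).sum()
ss_tot = ((test_df['medv'] - test_df['medv'].mean()) ** 2).sum()
print(1 - ss_res / ss_tot)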
Our linear model creates a predicted price by multiplying the value of each feature by its coefficient, summing these products, and adding an independent term, or intercept.
We can find the values of these coefficients and intercept by accessing the data members in the model instance variable:
model.coef_
array([-7.15121101e-02, 3.78566895e-02, -4.47104045e-02, 5.06817970e+00,
-1.44690998e+01, 3.98249374e+00, -5.88738235e-03, -1.73656446e+00,
1.01325463e-03, -6.18943939e-01, -6.55278930e-01])
model.intercept_
32.20
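To make the linear formula described above concrete, we can reproduce one of the model's predictions by hand: a dot product of a row's feature values with the coefficients, plus the intercept. This optional check should agree with model.predict for the same row:

import numpy as np

# Prediction for the first test row, computed directly from the coefficients
first_row = test_df[training_features].iloc[0]
manual_prediction = np.dot(first_row.values, model.coef_) + model.intercept_

print(manual_prediction)
print(model.predict(test_df[training_features].iloc[[0]]))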
It is usually very convenient to examine the coefficients of the different variables, as they can be indicative of the relative importance of the features in terms of their independent predictive ability (keeping in mind that the raw magnitudes are only directly comparable when the features are on similar scales).
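One convenient way to inspect them is to pair each coefficient with its feature name, for example with a pandas Series:

# Coefficients indexed by feature name, sorted by value
pd.Series(model.coef_, index=training_features).sort_values()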
Most linear regression implementations, such as those in scikit-learn or Spark MLlib, offer some degree of preprocessing (for example, scaling the variables so that features with large values don't dominate the fit). These libraries also support regularization parameters and provide options to choose the optimizer used to efficiently search for the coefficients that minimize the loss function (equivalently, maximize the R-squared score on the training data).
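In scikit-learn, scaling and regularization are explicit choices rather than defaults of LinearRegression. The following hedged sketch shows one common way to combine them, using a StandardScaler and an L2-regularized Ridge regressor (the alpha value here is only illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Scale the features, then fit an L2-regularized linear model
regularized_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
regularized_model.fit(training_df[training_features], training_df['medv'])

r2_score(test_df['medv'],
         regularized_model.predict(test_df[training_features]))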