There's more…
Let's see what happens when you include the fourteenth and fifteenth columns in the dataset. In the feature importance graph, the importance of every feature other than these two should drop to nearly zero. The reason is that the output is simply the sum of the fourteenth and fifteenth columns, so the algorithm needs no other features to compute it. Make the following change inside the for loop (the rest of the code remains unchanged):
X.append(row[2:15])
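To check which columns the wider slice picks up, the following sketch applies both slices to a header row. Two things here are assumptions rather than facts from this recipe: the header is taken to match the standard UCI Bike Sharing daily file, and the earlier slice is taken to be row[2:13]. Verify both against your own copy of the data and code.

```python
# Header row assumed to match the UCI Bike Sharing daily file (an assumption;
# compare it with the first line of your own CSV).
header = ("instant,dteday,season,yr,mnth,holiday,weekday,workingday,"
          "weathersit,temp,atemp,hum,windspeed,casual,registered,cnt").split(",")

old_features = header[2:13]   # assumed earlier slice: columns 3 to 13 (1-based)
new_features = header[2:15]   # new slice: columns 3 to 15 (1-based)

# Columns added by widening the slice: the fourteenth and fifteenth.
added = [name for name in new_features if name not in old_features]
print(added)
print(header[-1])  # the last column is still used as the output
```

Because Python slices exclude the end index, row[2:15] stops at the fifteenth column and leaves the final column available as the output.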
If you plot the feature importance graph now, you will see the following:
As expected, only these two features are marked as important. This makes sense intuitively because the final output is simply the sum of these two features, so there is a direct relationship between them and the output value, and the regressor needs no other variable to predict it. Feature importance is therefore a useful tool for identifying redundant variables in your dataset. This is not the only difference from the previous model, however. If we analyze the model's performance, we can see a substantial improvement:
#### Random Forest regressor performance ####
Mean squared error = 22552.26
Explained variance score = 0.99
The model therefore explains 99% of the variance, a very good result, though one to be expected given that the two added columns sum to the output.
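As a reminder of what the two reported numbers measure, here is a minimal pure-Python sketch of both metrics. The function names and the sample values are hypothetical and chosen only for illustration; explained variance is computed as one minus the ratio of the residual variance to the target variance.

```python
from statistics import pvariance

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared) / len(squared)

def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(targets), using population variances."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)

# Hypothetical targets and predictions, not taken from the bike dataset.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

mse = mean_squared_error(y_true, y_pred)
ev = explained_variance(y_true, y_pred)
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(ev, 3))
```

Note that the mean squared error is expressed in squared units of the output, so its magnitude depends on the scale of the counts, whereas the explained variance score is scale-free and has a maximum of 1.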
There is another file, called bike_hour.csv, that contains hourly data on bicycle sharing. We need to consider columns 3 to 14, so let's make this change in the code (the rest of the code remains unchanged):
filename = "bike_hour.csv"
file_reader = csv.reader(open(filename, 'r'), delimiter=',')
X, y = [], []
for row in file_reader:
    X.append(row[2:14])
    y.append(row[-1])
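The loop above collects every row, including the header line, as strings. The sketch below shows one way to separate the header and convert the values to numbers. It reads from an in-memory sample so that it runs on its own; the column layout is assumed to match the UCI Bike Sharing hourly file, and the two data rows are illustrative only.

```python
import csv
import io

# In-memory stand-in for bike_hour.csv. The 17-column layout is an assumption
# and the two data rows are illustrative only.
sample = io.StringIO(
    "instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,"
    "weathersit,temp,atemp,hum,windspeed,casual,registered,cnt\n"
    "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16\n"
    "2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.80,0.0,8,32,40\n"
)

reader = csv.reader(sample, delimiter=",")
header = next(reader)           # the first row holds the column names
feature_names = header[2:14]    # columns 3 to 14 (1-based)

X, y = [], []
for row in reader:
    X.append([float(value) for value in row[2:14]])  # numeric features
    y.append(float(row[-1]))                         # last column is the output

print(feature_names)
print(len(X[0]), y)
```

With the real file, the same loop can be wrapped in a `with open(filename, newline="") as f:` block so that the file is closed automatically.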
If you run the new code, you will see the performance of the regressor displayed, as follows:
#### Random Forest regressor performance ####
Mean squared error = 2613.86
Explained variance score = 0.92
The feature importance graph will look like the following:
This shows that the hour of the day is the most important feature, which makes sense intuitively because demand for bicycles varies strongly over the course of a day. The next most important feature is temperature, which is consistent with our earlier analysis.
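A ranking like the one in the graph can be derived by scaling the importance values and sorting them. The sketch below uses hypothetical feature names and importance values purely for illustration; with a fitted scikit-learn regressor, the values would come from its feature_importances_ attribute.

```python
# Hypothetical importance values for illustration only; real values would be
# read from the fitted regressor.
feature_names = ["season", "hr", "temp", "hum"]
importances = [0.05, 0.60, 0.25, 0.10]

# Scale so that the largest importance is 100, then sort in descending order.
top = max(importances)
ranked = sorted(
    ((name, 100.0 * value / top) for name, value in zip(feature_names, importances)),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name:>8}: {score:6.1f}")
```

Scaling relative to the largest value keeps the bars comparable across models, since only the relative ordering and proportions matter when reading the graph.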