Preprocessing using pipelines
When taking measurements of real-world objects, we can often get features in very different ranges. For instance, if we are measuring the qualities of an animal, we might have several features, as follows:
- Number of legs: This is in the range of 0-8 for most animals, although some have many more!
- Weight: This ranges from just a few micrograms all the way up to a blue whale weighing around 190,000 kilograms!
- Number of hearts: This can be anywhere from zero up to five, in the case of the earthworm.
For a mathematics-based algorithm to compare these features, the differences in scale, range, and units are difficult to handle. If we fed the above features into many algorithms, the weight would probably be the most influential feature, purely because its numbers are larger and not because of anything to do with how effective the feature actually is.
One way to overcome this is to use a process called preprocessing to normalize the features so that they all have the same range, or so that they are put into categories such as small, medium, and large. Once this is done, the large differences between the types of features have less impact on the algorithm, which can lead to large increases in accuracy.
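To make this concrete, here is a toy sketch (the animal measurements are made up, not taken from any dataset in this chapter) showing how the weight feature swamps a Euclidean distance computation until the features are rescaled:

import numpy as np
# Made-up animal features: [number of legs, weight in kg, number of hearts]
animals = np.array([[4, 5000.0, 1],     # elephant-like
                    [4, 4.0, 1],        # cat-like
                    [0, 0.004, 5]])     # earthworm-like
# Raw distances are dominated almost entirely by the weight column
print(np.linalg.norm(animals[0] - animals[1]))  # ~4996
print(np.linalg.norm(animals[1] - animals[2]))  # ~6.9
# After scaling each feature to the range 0 to 1, legs and hearts matter again
scaled = (animals - animals.min(axis=0)) / (animals.max(axis=0) - animals.min(axis=0))
print(np.linalg.norm(scaled[0] - scaled[1]))  # ~1.0
print(np.linalg.norm(scaled[1] - scaled[2]))  # ~1.4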
Preprocessing can also be used to choose only the more effective features, create new features, and so on. Preprocessing in scikit-learn is done through Transformer objects, which take a dataset in one form and return an altered dataset after some transformation of the data. These don't have to be numerical, as Transformers are also used to extract features; however, in this section, we will stick with preprocessing.
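Every transformer follows the same pattern: fit learns any parameters it needs from the data, and transform applies them to produce the altered dataset. A minimal sketch of the interface, using the MinMaxScaler class that we cover properly below:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X)                   # learn each feature's minimum and maximum
X_scaled = scaler.transform(X)  # apply the scaling, returning the altered dataset
# scaler.fit_transform(X) performs both steps in a single call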
An example
We can show an example of the problem by breaking the Ionosphere dataset. While this is only an example, many real-world datasets have problems of this form. First, we create a copy of the array so that we do not alter the original dataset:
X_broken = np.array(X)
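If you are not continuing directly from the earlier notebook, X and y can be rebuilt along the following lines. This is only a sketch: it assumes the UCI Ionosphere data has been downloaded and saved locally as ionosphere.data, and that filename is an assumption rather than something defined in this section.

import csv
import numpy as np
X = np.zeros((351, 34), dtype='float')  # 351 samples, 34 numeric features
y = np.zeros((351,), dtype='bool')
with open("ionosphere.data", 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        X[i] = [float(datum) for datum in row[:-1]]  # all but the last column are features
        y[i] = row[-1] == 'g'  # the last column is the class label, 'g' or 'b'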
Next, we break the dataset by dividing every second feature by 10:
X_broken[:,::2] /= 10
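As a quick sanity check (not part of the original workflow), you can compare the per-column maxima of the two arrays:

# The columns we divided are now ten times smaller; the rest are untouched
print(X[:, ::2].max(), X_broken[:, ::2].max())
print(X[:, 1::2].max(), X_broken[:, 1::2].max())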
In theory, this should not have a great effect on the result. After all, the values within each feature keep the same relative ordering. The major issue is that the scale has changed: the features we divided are now ten times smaller than the ones we left alone. We can see the effect of this by computing the accuracy:
# These imports will already be present if you are following on from earlier in the chapter
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

estimator = KNeighborsClassifier()
original_scores = cross_val_score(estimator, X, y, scoring='accuracy')
print("The original average accuracy is {0:.1f}%".format(np.mean(original_scores) * 100))
broken_scores = cross_val_score(estimator, X_broken, y, scoring='accuracy')
print("The 'broken' average accuracy is {0:.1f}%".format(np.mean(broken_scores) * 100))
This gives a score of 82.3 percent for the original dataset, which drops down to 71.5 percent on the broken dataset. We can fix this by scaling all the features to the range 0 to 1.
Standard preprocessing
The preprocessing we will perform for this experiment is called feature-based normalization, through the MinMaxScaler class. Continuing with the IPython notebook from the rest of this chapter, we first import this class:
from sklearn.preprocessing import MinMaxScaler
This class takes each feature and scales it to the range 0 to 1. The minimum value is replaced with 0, the maximum with 1, and the other values fall somewhere in between.
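Concretely, each value x in a feature is mapped to (x - min) / (max - min), where min and max are computed per feature. A quick sketch with a made-up column of values:

import numpy as np
column = np.array([2.0, 4.0, 10.0])  # a made-up single feature
scaled = (column - column.min()) / (column.max() - column.min())
print(scaled)  # [0.   0.25 1.  ] -- that is, (2-2)/8, (4-2)/8, (10-2)/8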
To apply our preprocessor, we run the transform function on the data. First, though, the transformer has to be fit to the data, in the same way that the classifiers are trained; for MinMaxScaler, fitting simply records each feature's minimum and maximum. We can combine the two steps by running the fit_transform function instead:
X_transformed = MinMaxScaler().fit_transform(X)
Here, X_transformed will have the same shape as X. However, each column will now have a minimum of 0 and a maximum of 1.
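A quick check (not from the original text) confirms this:

# Each column of the transformed data now lies in the range 0 to 1
# (a feature that is constant in X simply stays at a single value)
print(X_transformed.min(axis=0))
print(X_transformed.max(axis=0))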
There are various other forms of normalization along these lines, each effective for different applications and feature types, a few of which are sketched in code after this list:
- Ensure the sum of the values for each sample equals 1, using sklearn.preprocessing.Normalizer
- Force each feature to have a zero mean and a variance of 1, using sklearn.preprocessing.StandardScaler, which is a commonly used starting point for normalization
- Turn numerical features into binary features, where any value above a threshold is 1 and any below is 0, using sklearn.preprocessing.Binarizer
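As a brief sketch of what each of these does, applied to a made-up two-sample array (the values are purely illustrative):

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler, Binarizer
X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 0.0, 8.0]])
print(Normalizer(norm='l1').fit_transform(X_demo))     # each row now sums to 1
print(StandardScaler().fit_transform(X_demo))          # each column has zero mean, unit variance
print(Binarizer(threshold=2.0).fit_transform(X_demo))  # 1 where the value exceeds 2.0, else 0

Note that Normalizer is given norm='l1' here so that the row sums equal 1; its default is the L2 norm, which makes each row have unit length instead.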
We will use combinations of these preprocessors in later chapters, along with other types of Transformer objects.
Putting it all together
We can now create a workflow by combining the code from the previous sections, using the broken dataset previously calculated:
X_transformed = MinMaxScaler().fit_transform(X_broken)
estimator = KNeighborsClassifier()
transformed_scores = cross_val_score(estimator, X_transformed, y, scoring='accuracy')
print("The average accuracy is {0:.1f}%".format(np.mean(transformed_scores) * 100))
This gives us back our score of 82.3 percent accuracy. MinMaxScaler put the features back onto the same scale, meaning that no feature overpowered the others simply by having bigger values. While the Nearest Neighbor algorithm is easily confused by features on larger scales, some algorithms handle scale differences without issue; in contrast, some handle them much worse!
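One caveat with the workflow above is that MinMaxScaler is fit on the whole of X_broken before cross-validation, so each fold's scaling has already seen the test samples. A minimal sketch of an alternative (not code from the original text) chains the scaler and classifier with scikit-learn's Pipeline class, so that the scaler is re-fit on only the training portion of each fold:

from sklearn.pipeline import Pipeline
# cross_val_score now fits the scaler only on each fold's training data
pipeline = Pipeline([('scale', MinMaxScaler()),
                     ('predict', KNeighborsClassifier())])
pipeline_scores = cross_val_score(pipeline, X_broken, y, scoring='accuracy')
print("The pipeline average accuracy is {0:.1f}%".format(np.mean(pipeline_scores) * 100))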