Python Machine Learning Cookbook (Second Edition)

How to do it...

Let's see how to encode data in Python:

Let's take an array with four rows (vectors) and three columns (features):

>> import numpy as np
>> from sklearn import preprocessing
>> data = np.array([[1, 1, 2], [0, 2, 3], [1, 0, 1], [0, 1, 0]])
>> print(data)

The following result is printed:

[[1 1 2]
 [0 2 3]
 [1 0 1]
 [0 1 0]]

Let's analyze the values present in each column (feature):

The first feature has two possible values: 0, 1
The second feature has three possible values: 0, 1, 2
The third feature has four possible values: 0, 1, 2, 3
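The same per-column analysis can be reproduced programmatically. The following is a minimal sketch, assuming NumPy is available; the name distinct_values is introduced here for illustration and does not appear in the recipe:

```python
import numpy as np

data = np.array([[1, 1, 2], [0, 2, 3], [1, 0, 1], [0, 1, 0]])

# Collect the sorted distinct values of each column (feature).
distinct_values = [np.unique(data[:, col]).tolist() for col in range(data.shape[1])]

for col, values in enumerate(distinct_values, start=1):
    print(f"Feature {col}: {len(values)} possible values -> {values}")
```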

So, overall, the total number of distinct values across the three features is 2 + 3 + 4 = 9. This means that 9 entries are required to uniquely represent any vector. In the encoded vector, each feature occupies a contiguous block of entries, starting at the following indexes:

Feature 1 starts at index 0
Feature 2 starts at index 2
Feature 3 starts at index 5
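These starting indexes are just a running total of the number of values in the preceding features. A small sketch of that arithmetic, using only the standard library; category_counts and start_indexes are illustrative names, not part of the recipe:

```python
from itertools import accumulate

# Number of distinct values per feature, taken from the analysis above.
category_counts = [2, 3, 4]

# Each feature's block starts where the previous blocks end.
start_indexes = [0] + list(accumulate(category_counts))[:-1]
total_entries = sum(category_counts)

print(start_indexes)   # [0, 2, 5]
print(total_entries)   # 9
```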

To encode categorical integer features as a one-hot numeric array, the preprocessing.OneHotEncoder class can be used as follows:

>> encoder = preprocessing.OneHotEncoder()
>> encoder.fit(data)

The first line of code creates the encoder; the fit() method then fits the OneHotEncoder object to the data array, learning the distinct values of each feature.
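To check what the encoder learned during fitting, one option is to inspect its categories_ attribute, which is available in scikit-learn 0.20 and later. This is a sketch under that version assumption, not a step from the recipe:

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[1, 1, 2], [0, 2, 3], [1, 0, 1], [0, 1, 0]])

encoder = preprocessing.OneHotEncoder()
encoder.fit(data)

# categories_ holds one array of learned values per feature.
learned = [cats.tolist() for cats in encoder.categories_]
for i, cats in enumerate(learned, start=1):
    print(f"Feature {i}: {cats}")
```

The printed lists should match the per-feature values identified by hand earlier.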

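Now we can transform a vector using one-hot encoding. To do this, the transform() method will be used as follows. Because transform() returns a sparse matrix by default, toarray() is called to convert the result into a dense NumPy array: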

>> encoded_vector = encoder.transform([[1, 2, 3]]).toarray()

If you were to print encoded_vector, the expected output would be:

[[0. 1. 0. 0. 1. 0. 0. 0. 1.]]

The result is clear: the first feature (1) has an index of 1, the second feature (2) has an index of 4, and the third feature (3) has an index of 8. Each index is the starting index of the feature plus the position of its value: 0 + 1 = 1, 2 + 2 = 4, and 5 + 3 = 8. As we can verify, only these positions are occupied by a 1; all the other positions have a 0. Remember that Python indexes the positions starting from 0, so the 9 entries will have indexes from 0 to 8.
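To make the index arithmetic concrete, the encoding of the vector [1, 2, 3] can be rebuilt by hand. The following is an illustrative sketch in plain Python; one_hot_by_hand is a hypothetical helper written for this example and is not how OneHotEncoder is implemented:

```python
def one_hot_by_hand(vector, categories):
    """Encode one sample given the sorted distinct values of each feature."""
    encoded = []
    for value, feature_values in zip(vector, categories):
        # One block per feature, with a single 1.0 at the value's position.
        block = [0.0] * len(feature_values)
        block[feature_values.index(value)] = 1.0
        encoded.extend(block)
    return encoded

# Distinct values per feature, as identified from the data array.
categories = [[0, 1], [0, 1, 2], [0, 1, 2, 3]]

encoded = one_hot_by_hand([1, 2, 3], categories)
hot_positions = [i for i, bit in enumerate(encoded) if bit == 1.0]

print(encoded)
print(hot_positions)
```

The positions holding a 1 come out as 1, 4, and 8, matching the output of the encoder for the same vector.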