Categorical data and multiple classes
Now that you know how to binarize data for different purposes, we can look into other types of data, such as categorical or multi-labeled data, and how to make them numeric. Most deep learning algorithms, in fact, accept only numerical data. This is merely a design choice and not a big deal, because you will learn that there are easy ways to convert categorical data into a meaningful numerical representation.
To begin, we will address the issue of converting string categories to plain numbers, and then we will convert those numbers into a format called one-hot encoding.
Converting string labels to numbers
We will take the MNIST dataset again and use its string labels, 0, 1, ..., 9, and convert them to numbers. We can achieve this in many different ways:
- We could simply map all strings to integers with one simple command, y = list(map(int, mnist.target)), and be done. The variable y would then contain a list of integers, such as [8, 7, 1, 2, ...]. But this only solves the problem for this particular case; you need to learn something that will work for all cases. So, let's not do this.
- We could do the hard work of iterating over the data 10 times – mnist.target = [0 if v == '0' else v for v in mnist.target] – once for every numeral. But again, this (and other similar approaches) will only work for this case. Let's not do this either.
- We could use scikit-learn's LabelEncoder class, which will take any list of labels and map them to numbers. This will work in all cases.
Let's use the scikit method by following these steps:
- Run the following code:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
print(sorted(list(set(mnist.target))))   # the unique labels, sorted
le.fit(sorted(list(set(mnist.target))))  # fit the encoder on those labels
This produces the following output:
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
LabelEncoder()
The sorted(list(set(mnist.target))) command does three things:
- set(mnist.target) retrieves the set of unique values in the data, for example, {'8', '2', ..., '9'}.
- list(set(mnist.target)) simply converts the set into a list because we need a list or an array for the LabelEncoder() method.
- sorted(list(set(mnist.target))) sorts the list, which is important here so that '0' maps to 0 rather than, say, '8' mapping to 0. The result looks like this: ['0', '1', ..., '9'].
The le.fit() method takes a list (or an array) and builds a mapping to be used forward (and backward if needed) to encode labels, or strings, into numbers. It stores this mapping inside the LabelEncoder object, in its classes_ attribute.
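As a minimal sketch, assuming le has been fitted as shown above, you can inspect that stored mapping directly:
print(le.classes_)               # the fitted labels, in sorted order
print(le.transform(['0', '9']))  # each label encodes to its position in classes_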
- Next, we could test the encoding as follows:
print(le.transform(["9", "3", "7"]) )
list(le.inverse_transform([2, 2, 1]))
This will output the following:
[9 3 7]
['2', '2', '1']
The transform() method transforms a string-based label into a number, whereas the inverse_transform() method takes a number and returns the corresponding string label or category.
- Once the LabelEncoder object is fitted and tested, we can simply run the following instruction to encode the data:
print("Before ", mnist.target[:3])
y = le.transform(mnist.target)
print("After ", y[:3])
This will output the following:
Before ['5' '0' '4']
After [5 0 4]
The new encoded labels are now in y and ready to be used.
This methodology works for any labels encoded as strings that can simply be mapped to numbers without losing meaning. In the case of the MNIST dataset, we can map '0' to 0 and '7' to 7 without losing context. Other examples of when you can do this include the following:
- Age groups: ['18-21', '22-35', '36+'] to [0, 1, 2]
- Gender: ['male', 'female'] to [0, 1]
- Colors: ['red', 'black', 'blue', ...] to [0, 1, 2, ...]
- Studies: ['primary', 'secondary', 'high school', 'university'] to [0, 1, 2, 3]
However, we are making one big assumption here: the labels carry no special meaning in themselves. As we mentioned earlier, zip codes could simply be encoded to smaller numbers; however, they have a geographical meaning, and encoding them this way might negatively impact the performance of our deep learning algorithms. Similarly, in the preceding list, if the studies categories carry an ordinal meaning, indicating that a university degree is much higher or more important than a primary one, then perhaps we should consider a different number mapping, as shown in the sketch below. Or perhaps we want our learning algorithms to learn such intricacies by themselves! In such cases, we should use the well-known strategy of one-hot encoding.
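As a minimal sketch of such a custom mapping, we could use scikit-learn's OrdinalEncoder and pass the desired ordering explicitly; the studies list here is a hypothetical example, not part of the MNIST workflow:
from sklearn.preprocessing import OrdinalEncoder
order = ['primary', 'secondary', 'high school', 'university']  # lowest to highest
studies = [['university'], ['primary'], ['high school']]       # hypothetical data
enc = OrdinalEncoder(categories=[order])  # enforce our own ordering
print(enc.fit_transform(studies))         # [[3.] [0.] [2.]]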
Converting categories to one-hot encoding
Converting categories to one-hot encoding is better in most cases in which the categories or labels may have special meanings with respect to each other. In such cases, it has been reported to outperform ordinal encoding [Potdar, K., et al. (2017)].
The idea is to represent each label as a Boolean state in its own independent column. Take, for example, a column with the following data:
Gender
female
male
male
female
female
This can be uniquely transformed, using one-hot encoding, into the following new piece of data:
Gender_female Gender_male
1 0
0 1
0 1
1 0
1 0
As you can see, a bit is hot (is one) only in the column that corresponds to the row's label, and it is zero otherwise. Notice also that we renamed the columns to keep track of which label corresponds to which column; however, this is merely a recommended format and not a formal rule.
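To make the idea concrete, here is a minimal sketch of one-hot encoding done by hand with NumPy, assuming integer labels between 0 and 9:
import numpy as np
labels = np.array([5, 0, 4])  # assumed integer labels
onehot = np.eye(10)[labels]   # row i of the identity matrix has a 1 at column i
print(onehot)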
There are a number of ways we can do this in Python. If your data is in a pandas DataFrame, then you can simply do pd.get_dummies(df, prefix=['Gender']), assuming your column is in df and you want to use Gender as a prefix.
To reproduce the exact results as discussed in the preceding table, follow these steps:
- Run the following command:
import pandas as pd
df = pd.DataFrame({'Gender': ['female', 'male', 'male',
                              'female', 'female']})
print(df)
This will output the following:
Gender
0 female
1 male
2 male
3 female
4 female
- Now simply do the encoding by running the following command:
pd.get_dummies(df, prefix=['Gender'])
And this is produced:
Gender_female Gender_male
0 1 0
1 0 1
2 0 1
3 1 0
4 1 0
For cases in which the data is not in a pandas DataFrame, for example, the MNIST targets, we can use scikit-learn's OneHotEncoder class and its fit() and transform() methods.
A OneHotEncoder object is constructed with reasonable defaults and determines most of its parameters using the fit() method: it looks at the size of the data and the different labels that exist in it, and then creates a dynamic mapping that we can use with the transform() method.
To do a one-hot encoding of the MNIST targets, we can do this:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
y = [list(v) for v in mnist.target]  # reformat as a 2D list for sklearn
enc.fit(y)                           # learn the mapping from the labels
print('Before: ', y[0])
y = enc.transform(y).toarray()       # encode and convert to a dense array
print('After: ', y[0])
print(enc.get_feature_names())       # names of the new binary columns
This will output the following:
Before: ['5']
After: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
['x0_0' 'x0_1' 'x0_2' 'x0_3' 'x0_4' 'x0_5' 'x0_6' 'x0_7' 'x0_8' 'x0_9']
This code includes our classic sanity check, in which we verify that label '5' was in fact converted to a row vector with 10 columns, of which the sixth (index 5) is hot. It works, as expected. The new dimensionality of y is n rows and 10 columns.
The preceding process can be repeated exactly to convert any other columns into one-hot encoding, provided that they contain categorical data.
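As a minimal sketch of that reuse, assuming a hypothetical column of color labels, the same fit/transform pattern applies, and inverse_transform() recovers the original categories:
from sklearn.preprocessing import OneHotEncoder
colors = [['red'], ['black'], ['blue'], ['red']]  # hypothetical labels
enc = OneHotEncoder()
onehot = enc.fit_transform(colors).toarray()      # one binary column per color
print(onehot)
print(enc.inverse_transform(onehot))              # back to the original labels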
Categories, labels, and specific mappings to integers or bits are very helpful when we want to classify input data into those categories, labels, or mappings. But what if we want the input data to map to continuous values? For example, predicting a person's IQ from their responses, or predicting the price of electricity from input data about the weather and the seasons. This is known as data for regression, which we will cover next.