Categorical data and multiple classes
Now that you know how to binarize data for different purposes, we can look into other types of data, such as categorical or multi-labeled data, and how to make them numeric. Most deep learning algorithms, in fact, accept only numerical data. This is merely a design choice and not a big deal, because you will learn that there are easy ways to convert categorical data into a meaningful numerical representation.
To begin, we will address the issue of converting string categories to plain numbers, and then we will convert those numbers into a format called one-hot encoding.
Converting string labels to numbers
We will take the MNIST dataset again and use its string labels, 0, 1, ..., 9, and convert them to numbers. We can achieve this in many different ways:
- We could simply map all strings to integers with one simple command, y = list(map(int, mnist.target)), and be done. The variable y would then contain a list of integers, such as [8, 7, 1, 2, ...]. But this only solves the problem for this particular case; you need to learn something that will work for all cases. So, let's not do this.
- We could do the hard work of iterating over the data 10 times – mnist.target = [0 if v == '0' else v for v in mnist.target] – once for every numeral. But again, this (and other similar approaches) will only work for this case. Let's not do this either.
- We could use scikit-learn's LabelEncoder class, which will take any list of labels and map them to numbers. This will work in all cases.
Let's use the scikit method by following these steps:
- Run the following code:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
print(sorted(list(set(mnist.target))))   # the unique labels, sorted
le.fit(sorted(list(set(mnist.target))))  # fit the encoder on those labels
This produces the following output:
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
LabelEncoder()
The sorted(list(set(mnist.target))) command does three things:
- set(mnist.target) retrieves the set of unique values in the data, for example, {'8', '2', ..., '9'}.
- list(set(mnist.target)) simply converts the set into a list because we need a list or an array for the LabelEncoder() method.
- sorted(list(set(mnist.target))) sorts the list, which is important here so that '0' maps to 0 rather than, say, '8' mapping to 0. The result looks like this: ['0', '1', ..., '9'].
The le.fit() method takes a list (or an array) and builds a mapping to be used forward (and backward if needed) to encode labels, or strings, into numbers. It stores this mapping inside the LabelEncoder object, in its classes_ attribute.
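As a minimal sketch, assuming le has been fitted as shown above, you can inspect that stored mapping directly:
print(le.classes_)               # the fitted labels, in sorted order
print(le.transform(['0', '9']))  # each label encodes to its position in classes_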
- Next, we could test the encoding as follows:
print(le.transform(["9", "3", "7"]) )
list(le.inverse_transform([2, 2, 1]))
This will output the following:
[9 3 7]
['2', '2', '1']
The transform() method transforms a string-based label into a number, whereas the inverse_transform() method takes a number and returns the corresponding string label or category.
- Once the LabelEncoder object is fitted and tested, we can simply run the following instruction to encode the data:
print("Before ", mnist.target[:3])
y = le.transform(mnist.target)
print("After ", y[:3])
This will output the following:
Before ['5' '0' '4']
After [5 0 4]
The new encoded labels are now in y and ready to be used.
This methodology works for any labels encoded as strings that can simply be mapped to numbers without losing meaning. In the case of the MNIST dataset, we can map '0' to 0 and '7' to 7 without losing context. Other examples of when you can do this include the following:
- Age groups: ['18-21', '22-35', '36+'] to [0, 1, 2]
- Gender: ['male', 'female'] to [0, 1]
- Colors: ['red', 'black', 'blue', ...] to [0, 1, 2, ...]
- Studies: ['primary', 'secondary', 'high school', 'university'] to [0, 1, 2, 3]
However, we are making one big assumption here: the labels carry no special meaning in themselves. As we mentioned earlier, zip codes could simply be encoded to smaller numbers; however, they have a geographical meaning, and encoding them this way might negatively impact the performance of our deep learning algorithms. Similarly, in the preceding list, if the studies categories carry an ordinal meaning, indicating that a university degree is much higher or more important than a primary one, then perhaps we should consider a different number mapping, as shown in the sketch below. Or perhaps we want our learning algorithms to learn such intricacies by themselves! In such cases, we should use the well-known strategy of one-hot encoding.
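As a minimal sketch of such a custom mapping, we could use scikit-learn's OrdinalEncoder and pass the desired ordering explicitly; the studies list here is a hypothetical example, not part of the MNIST workflow:
from sklearn.preprocessing import OrdinalEncoder
order = ['primary', 'secondary', 'high school', 'university']  # lowest to highest
studies = [['university'], ['primary'], ['high school']]       # hypothetical data
enc = OrdinalEncoder(categories=[order])  # enforce our own ordering
print(enc.fit_transform(studies))         # [[3.] [0.] [2.]]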
Converting categories to one-hot encoding
Converting categories to one-hot encoding is better in most cases in which the categories or labels may have special meanings with respect to each other. In such cases, it has been reported to outperform ordinal encoding [Potdar, K., et al. (2017)].
The idea is to represent each label as a Boolean state in its own independent column. Take, for example, a column with the following data:
Gender
female
male
male
female
female
This can be uniquely transformed, using one-hot encoding, into the following new piece of data:
Gender_female Gender_male
1 0
0 1
0 1
1 0
1 0
As you can see, a bit is hot (is one) only in the column that corresponds to the row's label, and it is zero otherwise. Notice also that we renamed the columns to keep track of which label corresponds to which column; however, this is merely a recommended format and not a formal rule.
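To make the idea concrete, here is a minimal sketch of one-hot encoding done by hand with NumPy, assuming integer labels between 0 and 9:
import numpy as np
labels = np.array([5, 0, 4])  # assumed integer labels
onehot = np.eye(10)[labels]   # row i of the identity matrix has a 1 at column i
print(onehot)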
There are a number of ways we can do this in Python. If your data is in a pandas DataFrame, then you can simply do pd.get_dummies(df, prefix=['Gender']), assuming your column is in df and you want to use Gender as a prefix.
To reproduce the exact results as discussed in the preceding table, follow these steps:
- Run the following command:
import pandas as pd
df = pd.DataFrame({'Gender': ['female', 'male', 'male',
                              'female', 'female']})
print(df)
This will output the following:
Gender
0 female
1 male
2 male
3 female
4 female
- Now simply do the encoding by running the following command:
pd.get_dummies(df, prefix=['Gender'])
And this is produced:
Gender_female Gender_male
0 1 0
1 0 1
2 0 1
3 1 0
4 1 0
For cases in which the data is not in a pandas DataFrame, for example, the MNIST targets, we can use scikit-learn's OneHotEncoder class and its fit() and transform() methods.
A OneHotEncoder object is constructed with reasonable defaults and determines most of its parameters using the fit() method: it looks at the size of the data and the different labels that exist in it, and then creates a dynamic mapping that we can use with the transform() method.
To do a one-hot encoding of the MNIST targets, we can do this:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
y = [list(v) for v in mnist.target]  # reformat as a 2D list for sklearn
enc.fit(y)                           # learn the mapping from the labels
print('Before: ', y[0])
y = enc.transform(y).toarray()       # encode and convert to a dense array
print('After: ', y[0])
print(enc.get_feature_names())       # names of the new binary columns
This will output the following:
Before: ['5']
After: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
['x0_0' 'x0_1' 'x0_2' 'x0_3' 'x0_4' 'x0_5' 'x0_6' 'x0_7' 'x0_8' 'x0_9']
This code includes our classic sanity check, in which we verify that label '5' was in fact converted to a row vector with 10 columns, of which the sixth (index 5) is hot. It works, as expected. The new dimensionality of y is n rows and 10 columns.
The preceding process can be repeated exactly to convert any other columns into one-hot encoding, provided that they contain categorical data.
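As a minimal sketch of that reuse, assuming a hypothetical column of color labels, the same fit/transform pattern applies, and inverse_transform() recovers the original categories:
from sklearn.preprocessing import OneHotEncoder
colors = [['red'], ['black'], ['blue'], ['red']]  # hypothetical labels
enc = OneHotEncoder()
onehot = enc.fit_transform(colors).toarray()      # one binary column per color
print(onehot)
print(enc.inverse_transform(onehot))              # back to the original labels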
Categories, labels, and specific mappings to integers or bits are very helpful when we want to classify input data into those categories, labels, or mappings. But what if we want the input data to map to continuous values? For example, predicting a person's IQ from their responses, or predicting the price of electricity from input data about the weather and the seasons. This is known as data for regression, which we will cover next.