Binary data and binary classification
In this section, we will focus on preparing data with binary inputs or targets. By binary, of course, we mean values that can be represented as either a 0 or a 1. Notice the emphasis on the words represented as. The reason is that a column may contain data that is not necessarily a 0 or a 1, but could be interpreted as, or represented by, a 0 or a 1.
Consider the following fragment of a dataset:
In this short dataset example, with only four rows, the column x1 has values that are clearly binary: either a 0 or a 1. The column x2, at first glance, may not be perceived as binary, but on close inspection the only values it contains are 5 and 7. This means that the data can be correctly and uniquely mapped to a set of two values; we could map 5 to 0 and 7 to 1, or vice versa, and it does not really matter which.
A similar phenomenon is observed in the target output, y, which also contains only two unique values and can therefore be mapped to a set of size two, say by assigning b to 0 and a to 1.
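A minimal pandas sketch of such a mapping, using a hypothetical four-row fragment with the properties described above (x1 already binary, x2 taking only the values 5 and 7, and y taking only a and b; the row values here are made up for illustration):

```python
import pandas as pd

# Hypothetical four-row fragment: x1 is already binary,
# x2 takes only the values 5 and 7, and y only 'a' and 'b'.
df = pd.DataFrame({'x1': [0, 1, 1, 0],
                   'x2': [5, 7, 5, 7],
                   'y':  ['b', 'a', 'a', 'b']})

# Map each two-valued column onto {0, 1}.
df['x2'] = df['x2'].map({5: 0, 7: 1})
df['y'] = df['y'].map({'b': 0, 'a': 1})

print(sorted(df['x2'].unique()))  # [0, 1]
print(sorted(df['y'].unique()))   # [0, 1]
```

Any consistent assignment works; the only requirement is that the mapping is one-to-one over the two observed values.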
In the next sub-section, we will deal specifically with binary targets using a dataset as a case study.
Binary targets on the Cleveland Heart Disease dataset
The Cleveland Heart Disease (Cleveland 1988) dataset contains patient data for 303 subjects. Some of the columns have missing values; we will deal with this, too. The dataset contains 13 attribute columns, including cholesterol and age, plus a target column.
The target is to detect whether a subject has heart disease or not; thus, the problem is binary. The complication we will deal with is that the target is encoded with values from 0 to 4, where 0 indicates the absence of heart disease and the range 1 to 4 indicates some type of heart disease.
We will use the portion of the dataset identified as Cleveland, which can be downloaded from this link: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
The attributes of the dataset are as follows:
Let's follow the next steps in order to read the dataset into a pandas DataFrame and clean it:
- In our Google Colab, we will first download the data using the wget command as follows:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
This downloads the file processed.cleveland.data to Colab's default directory, which can be verified by inspecting the Files tab on the left side of Colab. Please note that the preceding instruction is a single line that, unfortunately, is very long.
- Next, we load the dataset using pandas to verify that the dataset is readable and accessible.
Run the following commands in Colab to load and display some data:
import pandas as pd
df = pd.read_csv('processed.cleveland.data', header=None)
print(df.head())
The read_csv() function loads a file that is formatted as comma-separated values (CSV). We use the argument header=None to tell pandas that the data does not have any actual headers; if omitted, pandas will use the first row of the data as the names for each column, but we do not want that in this case.
The loaded data is stored in a variable called df. The name could be anything, but df is easy to remember because pandas stores the data in a DataFrame object; thus, df is an appropriate, short, memorable name. However, if we work with multiple DataFrames, it is more convenient to give each one a name that describes the data it contains.
The head() method of a DataFrame is analogous to the Unix head command, which retrieves the first few lines of a file. On a DataFrame, head() returns the first five rows of data. If you wish to retrieve more, or fewer, rows, you can pass an integer argument to the method. For example, to retrieve the first three rows, you would do df.head(3).
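As a quick sketch on a throwaway DataFrame (named toy, with a made-up column, so as not to clobber our df):

```python
import pandas as pd

# A throwaway ten-row DataFrame, just to see head() at work.
toy = pd.DataFrame({'a': range(10)})

print(len(toy.head()))   # 5: the default number of rows
print(len(toy.head(3)))  # 3: an explicit row count
```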
The results of running the preceding code are as follows:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 63. 1. 1. 145. 233. 1. 2. 150. 0. 2.3 3. 0. 6. 0
1 67. 1. 4. 160. 286. 0. 2. 108. 1. 1.5 2. 3. 3. 2
2 67. 1. 4. 120. 229. 0. 2. 129. 1. 2.6 2. 2. 7. 1
3 37. 1. 3. 130. 250. 0. 0. 187. 0. 3.5 3. 0. 3. 0
4 41. 0. 2. 130. 204. 0. 2. 172. 0. 1.4 1. 0. 3. 0
Here are a few things to observe and remember for future reference:
- On the left side, there is an unnamed column that has rows with consecutive numbers, 0, 1, ..., 4. These are the indices that pandas assigns to each row in the dataset. These are unique numbers. Some datasets have unique identifiers, such as a filename for an image.
- On the top, there is a row that goes from 0, 1, ..., 13. These are the column identifiers. They are also unique and can be set explicitly if column names are given to us.
- At the intersection of every row and column, we have values that are either floating-point decimals or integers. The entire dataset contains decimal numbers except for column 13, which is our target and contains integers.
- Because we will use this dataset as a binary classification problem, we now need to change the last column to contain only binary values: 0 and 1. We will preserve the original meaning of 0, that is, no heart disease, and anything greater than or equal to 1 will be mapped to 1, indicating the diagnosis of some type of heart disease. We will run the following instructions:
print(set(df[13]))
The expression df[13] looks at the DataFrame and retrieves all the rows of the column whose index is 13. Then, the built-in set() function, applied to all the rows of column 13, creates a set of the unique elements in the column. In this way, we can see how many distinct values there are so that we can replace them. The output is as follows:
{0, 1, 2, 3, 4}
From this, we know that 0 is no heart disease and 1 implies heart disease. However, 2, 3, and 4 need to be mapped to 1, because they, too, imply positive heart disease. We can make this change by executing the following commands:
df[13].replace(to_replace=[2,3,4], value=1, inplace=True)
print(df.head())
print(set(df[13]))
Here, the replace() method works on the DataFrame column to replace specific values. In our case, it takes three arguments:
- to_replace=[2,3,4] denotes the list of items to search for, in order to replace them.
- value=1 denotes the value that will replace every matched entry.
- inplace=True tells pandas to make the changes directly on the column, rather than returning a modified copy.
Alternatively, the same replacement can be made without inplace by reassigning the result: df[13] = df[13].replace(to_replace=[2,3,4], value=1). This form is common among experienced pandas users, and you should be comfortable doing it either way.
The main stumbling block for people beginning to use pandas is that it does not always behave like an immutable object: some operations return a new object, while others modify data in place. Thus, you should keep the pandas documentation close at hand: https://pandas.pydata.org/pandas-docs/stable/index.html
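The two forms can be compared side by side on a toy column whose values mimic the 0 to 4 encoding of column 13:

```python
import pandas as pd

# Toy stand-in for df[13]: the original 0-4 encoding.
s = pd.Series([0, 1, 2, 3, 4])

# Form 1: modify a copy of the Series in place.
a = s.copy()
a.replace(to_replace=[2, 3, 4], value=1, inplace=True)

# Form 2: replace() returns a new Series; reassign the result.
b = s.replace(to_replace=[2, 3, 4], value=1)

print(list(a))  # [0, 1, 1, 1, 1]
print(list(b))  # [0, 1, 1, 1, 1]
print(list(s))  # [0, 1, 2, 3, 4] -- the original is untouched by form 2
```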
The output for the preceding commands is the following:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 63. 1. 1. 145. 233. 1. 2. 150. 0. 2.3 3. 0. 6. 0
1 67. 1. 4. 160. 286. 0. 2. 108. 1. 1.5 2. 3. 3. 1
2 67. 1. 4. 120. 229. 0. 2. 129. 1. 2.6 2. 2. 7. 1
3 37. 1. 3. 130. 250. 0. 0. 187. 0. 3.5 3. 0. 3. 0
4 41. 0. 2. 130. 204. 0. 2. 172. 0. 1.4 1. 0. 3. 0
{0, 1}
First, notice that when we print the first five rows, column 13 now exclusively contains the values 0 or 1. You can compare this to the original output to verify that the value in row 1 actually changed from 2 to 1. We also verified, with set(df[13]), that the set of all unique values of that column is now only {0, 1}, which is the desired target.
With these changes, we could use the dataset to train a deep learning model and perhaps improve the existing documented performance [Detrano, R., et al. (1989)].
The same methodology can be applied to make any other column have binary values in the set we need. As an exercise, let's do another example with the famous MNIST dataset.
Binarizing the MNIST dataset
The MNIST dataset is well known in the deep learning community (Deng, L. (2012)). It is composed of thousands of images of handwritten digits. Figure 3.1 shows eight samples of the MNIST dataset:
Figure 3.1 – Eight samples of the MNIST dataset. The number on top of each image corresponds to the target class
As you can see, the samples in this dataset are messy, but very realistic. Every image has a size of 28 x 28 pixels, and there are only 10 target classes, one for each digit: 0, 1, 2, ..., 9. The usual complication is that some digits may look similar to others; for example, 1 and 7, or 0 and 6. Nevertheless, most deep learning algorithms have successfully solved this classification problem with high accuracy.
From Figure 3.1, a close inspection will reveal that the pixel values are not exactly zeros and ones, that is, binary. In fact, the images are 8-bit grayscale, in the range [0, 255]. As mentioned earlier, this is no longer a problem for most advanced deep learning algorithms. However, for some algorithms, such as Restricted Boltzmann Machines (RBMs), the input data traditionally needs to be in binary format, {0, 1}, because that is how the algorithm works.
Thus, we will do two things:
- Binarize the images, so as to have binary inputs
- Binarize the targets, to make it a binary classification problem
For this example, we will arbitrarily select two numerals only, 7 and 8, as our target classes.
Binarizing the images
The binarization process is a common step in image processing. It is formally known as image thresholding, because we need a threshold to decide which values become zeros and which become ones. For a full survey of this topic, please consult (Sezgin, M., and Sankur, B. (2004)). This is all to say that there is a science behind picking the ideal threshold, one that minimizes the error of converting the range [0, 255] down to {0, 1}.
However, since this is not a book about image processing, we will arbitrarily set a threshold of 128. Thus, any value below 128 will become a zero, and any value greater than or equal to 128 will become a one.
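As a minimal sketch of this rule on a made-up patch of 8-bit values (the numbers are invented for illustration):

```python
import numpy as np

# Made-up patch of 8-bit grayscale values in [0, 255].
patch = np.array([[0, 30, 127, 128],
                  [200, 255, 90, 140]])

# Values >= 128 become 1; everything below becomes 0.
binary = (patch >= 128).astype(np.uint8)
print(binary)
# [[0 0 0 1]
#  [1 1 0 1]]
```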
This step can be easily done by using indexing in Python. To proceed, we will display a small portion of the dataset to make sure the data is transformed correctly. We will do this by executing the following commands in the next steps:
- To load the dataset and verify its dimensionality (shape), run the following command:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', as_frame=False)
print(mnist.data.shape)
print(mnist.target.shape)
The following is the output:
(70000, 784)
(70000,)
The first thing to notice is that we are using a machine learning library known as scikit-learn, imported in Python as sklearn. It is one of the most widely used libraries for general-purpose machine learning. The MNIST dataset is loaded using the fetch_openml() function, which requires the identifier of the dataset to be loaded, in this case 'mnist_784'. We also pass as_frame=False because, in recent versions of scikit-learn, fetch_openml() returns a pandas DataFrame by default, while the NumPy-style indexing we use in the next steps expects plain arrays. The number 784 comes from the size of MNIST images, which are 28 x 28 pixels and can be interpreted as vectors of 784 elements rather than matrices of 28 rows and 28 columns. By verifying the shape property, we can see that the dataset has 70,000 images represented as vectors of size 784, and that there is one target per image.
- To actually do the binarization by verifying the data before and after, run the following:
print(mnist.data[0].reshape(28, 28)[10:18,10:18])
mnist.data[mnist.data < 128] = 0
mnist.data[mnist.data >=128] = 1
print(mnist.data[0].reshape(28, 28)[10:18,10:18])
This will output the following:
[[ 1. 154. 253. 90. 0. 0. 0. 0.]
[ 0. 139. 253. 190. 2. 0. 0. 0.]
[ 0. 11. 190. 253. 70. 0. 0. 0.]
[ 0. 0. 35. 241. 225. 160. 108. 1.]
[ 0. 0. 0. 81. 240. 253. 253. 119.]
[ 0. 0. 0. 0. 45. 186. 253. 253.]
[ 0. 0. 0. 0. 0. 16. 93. 252.]
[ 0. 0. 0. 0. 0. 0. 0. 249.]]
[[ 0. 1. 1. 0. 0. 0. 0. 0.]
[ 0. 1. 1. 1. 0. 0. 0. 0.]
[ 0. 0. 1. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 1. 1. 0. 0.]
[ 0. 0. 0. 0. 1. 1. 1. 0.]
[ 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 0. 0. 0. 0. 1.]
[ 0. 0. 0. 0. 0. 0. 0. 1.]]
The instruction data[0].reshape(28, 28)[10:18,10:18] is doing three things:
- data[0] returns the first image as a vector of 784 elements.
- reshape(28, 28) rearranges that 784-element vector into a (28, 28) matrix, which is the actual image; this is useful for displaying the data, for example, to produce Figure 3.1.
- [10:18,10:18] takes only a subset of the (28, 28) matrix at positions 10 to 18 for both columns and rows; this more or less corresponds to the center area of the image and it is a good place to look at what is changing.
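The same three steps can be sketched on a synthetic vector (the values are just indices, not real pixel data):

```python
import numpy as np

v = np.arange(784)        # synthetic stand-in for one flattened image
m = v.reshape(28, 28)     # back to a 28 x 28 matrix
center = m[10:18, 10:18]  # 8 x 8 window near the center of the image

print(m.shape)       # (28, 28)
print(center.shape)  # (8, 8)
```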
The preceding is for inspecting the data only; the actual changes are made in the next lines. The line mnist.data[mnist.data < 128] = 0 uses Boolean indexing: the expression mnist.data < 128 returns a multidimensional array of Boolean values, which mnist.data[ ] uses as indices of the positions to set to zero, that is, all values strictly less than 128. The next line does the same for values greater than or equal to 128.
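The mechanics of this Boolean indexing can be sketched on a small made-up array:

```python
import numpy as np

x = np.array([0., 50., 127., 128., 200., 255.])

mask = x < 128   # Boolean array: True where the value is below the threshold
x[mask] = 0      # in-place assignment at the True positions
x[x >= 128] = 1  # same idea for the upper range

print(x)  # [0. 0. 0. 1. 1. 1.]
```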
By inspecting the output, we can confirm that the data has successfully changed and has been thresholded, or binarized.
Binarizing the targets
We will binarize the targets by following the next two steps:
- First, we will discard the image data for all other numerals, keeping only 7 and 8; then, in the next step, we will map 7 to 0 and 8 to 1. The following commands create new variables, X and y, that hold only the numerals 7 and 8:
X = mnist.data[(mnist.target == '7') | (mnist.target == '8')]
y = mnist.target[(mnist.target == '7') | (mnist.target == '8')]
print(X.shape)
print(y.shape)
This will output the following:
(14118, 784)
(14118,)
Notice the use of the OR operator, |, which combines two arrays of Boolean indices element-wise. These indices are then used to select the rows of the new dataset, whose shape shows a little over 14,000 images.
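A small sketch of this element-wise OR selection, using made-up targets and toy one-column features:

```python
import numpy as np

# Made-up targets and toy features; names X_sub and y_sub are invented
# here so as not to clobber the X and y of the main example.
targets = np.array(['5', '7', '8', '7', '9', '8'])
data = np.arange(len(targets)).reshape(-1, 1)

keep = (targets == '7') | (targets == '8')  # element-wise OR of two masks
X_sub = data[keep]
y_sub = targets[keep]

print(y_sub.tolist())  # ['7', '8', '7', '8']
print(X_sub.shape)     # (4, 1)
```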
- To map 7 to 0 and 8 to 1, we can run the following command:
print(y[:10])
y = [0 if v=='7' else 1 for v in y]
print(y[:10])
This outputs the following:
['7' '8' '7' '8' '7' '8' '7' '8' '7' '8']
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
The instruction [0 if v=='7' else 1 for v in y] checks every element of y; if an element is '7', it yields a 0, otherwise (that is, when it is '8') it yields a 1. As the output of the first 10 elements suggests, the data is now binarized to the set {0, 1}.
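If the targets are kept as a NumPy array, an equivalent vectorized form is np.where; a sketch on made-up labels:

```python
import numpy as np

labels = np.array(['7', '8', '7', '8', '7'])  # made-up labels

# np.where yields 0 where the condition holds and 1 elsewhere,
# mirroring the list comprehension above.
binary_labels = np.where(labels == '7', 0, 1)
print(binary_labels.tolist())  # [0, 1, 0, 1, 0]
```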
With this, the dataset is ready to use with binary and general classifiers. But what if we actually want to have multiple classes, for example, to detect all 10 digits of the MNIST dataset and not just 2? Or what if we have features, columns, or inputs that are not numeric but are categorical? The next section will help you prepare the data in these cases.