Data augmentation
Now that you have learned how to process data to have specific distributions, it is important for you to know about data augmentation, which is usually associated with missing data or high-dimensional data. Traditional machine learning algorithms may have problems dealing with data where the number of dimensions surpasses the number of samples available. This problem is not exclusive to deep learning, but some deep learning algorithms have a much more difficult time learning to model a problem that has more variables than samples to learn from. We have a few options to correct that: we can either reduce the dimensions or variables (see the following section) or increase the number of samples in our dataset (this section).
One of the tools for adding more data is known as data augmentation (Van Dyk and Meng, 2001). In this section, we will use the MNIST dataset to exemplify a few techniques for data augmentation that are particular to images but can be conceptually extended to other types of data.
We will cover the basics: adding noise, rotating, and rescaling. That is, from one original example, we will produce three new, different images of numerals. We will use the image processing library known as scikit-image.
Rescaling
We begin by reloading the MNIST dataset as we have done before:
from sklearn.datasets import fetch_openml

# as_frame=False ensures mnist.data is a NumPy array (newer versions of scikit-learn return a DataFrame by default)
mnist = fetch_openml('mnist_784', as_frame=False)
Then we can simply invoke the rescale() function to create a rescaled image. The idea behind rescaling is to shrink an image and then scale it back up to its original size, which makes the result look like a low-resolution version of the original. It loses some of its characteristics in the process, but training on such images can actually produce a more robust deep learning model; that is, a model robust to the scale of objects or, in this case, the scale of numerals:
from skimage.transform import rescale

# take the first sample and reshape it back into its 28x28 image form
x = mnist.data[0].reshape(28, 28)
Once we have x as the original image from which we will augment, we can do the scaling down and up as follows:
s = rescale(x, 0.5, multichannel=False)   # downscale to 14x14
x_ = rescale(s, 2.0, multichannel=False)  # upscale back to 28x28
Here, the augmented (rescaled) image is in x_. Notice that, in this case, the image is downscaled by a factor of two (to 50% of its size) and then upscaled, also by a factor of two (to 200% of the reduced size). The multichannel argument is set to False since the images have only one single channel, meaning they are grayscale. (Newer versions of scikit-image replace this argument with channel_axis; for grayscale images, you can simply omit it.)
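If you want to verify the round trip, a quick check of the array shapes (assuming the 28 x 28 MNIST input above) confirms that the augmented image ends up the same size as the original:
# the intermediate image is half the size; the augmented one matches the original
print(s.shape)   # (14, 14)
print(x_.shape)  # (28, 28)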
Besides rescaling, we can also modify the existing data slightly so as to have variations of the existing data without deviating much from the original, as we'll discuss next.
Adding noise
Similarly, we can also contaminate the original image with additive Gaussian noise. This creates random patterns all over the image to simulate a camera problem or noisy acquisition. Here, we use it to also augment our dataset and, in the end, to produce a deep learning model that is robust against noise.
For this, we use the random_noise() function as follows:
from skimage.util import random_noise

# additive Gaussian noise is the default mode
x_ = random_noise(x)
Once again, the augmented image (noisy) is in x_.
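If you want to control the amount of contamination, random_noise() also accepts the noise mode and, for Gaussian noise, its variance as keyword arguments; the value below is only an illustrative choice:
# explicit Gaussian noise with a chosen variance (the default is var=0.01)
x_ = random_noise(x, mode='gaussian', var=0.05)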
Besides noise, we can also change the perspective of an image slightly so as to preserve the original shape at a different angle, as we'll discuss next.
Rotating
We can use a plain rotation effect on the images to have even more data. The rotation of images is a crucial part of learning good features from images. Larger datasets contain, naturally, many versions of images that are slightly rotated or fully rotated. If we do not have such images in our dataset, we can manually rotate them and augment our data.
For this, we use the rotate() function like so:
from skimage.transform import rotate

# rotate the image by 22 degrees counterclockwise
x_ = rotate(x, 22)
In this example, the number 22 specifies the angle of rotation in degrees. In the resulting figure, the first column is the original numeral of the MNIST dataset; the second column shows the effect of rescaling; the third column shows the original plus additive Gaussian noise; and the last column shows rotations of 20 degrees (top) and -20 degrees (bottom).
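In practice, rather than using a single fixed angle, you would typically draw several random angles per image. Here is a minimal sketch; the range of plus or minus 25 degrees is an arbitrary assumption:
import numpy as np

# produce three rotated variants at random angles within [-25, 25] degrees
angles = np.random.uniform(-25, 25, size=3)
rotated = [rotate(x, angle) for angle in angles]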
Other augmentation techniques
For image datasets, there are other ideas for augmenting data that include the following:
- Changing the projection of the image
- Adding compression noise (quantizing the image)
- Other types of noise besides Gaussian, such as salt-and-pepper or multiplicative noise
- Translating the image by different distances at random
But the most robust augmentation would be a combination of all of these!
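As a rough illustration of what such a combination might look like, the following sketch chains the three techniques from this section into a single function. The parameter ranges are arbitrary assumptions, and the pixel values are normalized to [0, 1] first so that random_noise() behaves as expected:
import numpy as np
from skimage.transform import rescale, rotate
from skimage.util import random_noise

def augment(image):
    # random rotation within [-20, 20] degrees
    image = rotate(image, np.random.uniform(-20, 20))
    # downscale, then upscale back, to simulate a low-resolution capture
    image = rescale(rescale(image, 0.5), 2.0)
    # finish with additive Gaussian noise
    return random_noise(image)

x = mnist.data[0].reshape(28, 28) / 255.0   # normalize pixels to [0, 1]
augmented = [augment(x) for _ in range(5)]  # five new variants of one numeral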
Images are fun because they are highly correlated in local areas. But for general non-image datasets, such as the heart disease dataset, we can augment data in other ways, for example:
- Adding low-variance Gaussian noise
- Adding compression noise (quantization)
- Drawing new points from a probability density function estimated over the data (see the sketch after this list)
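The following sketch illustrates the first and last of these ideas on a placeholder feature matrix (any numeric dataset arranged as samples by features would do); the noise scale of 0.01 and the use of scipy's gaussian_kde are assumptions for illustration:
import numpy as np
from scipy.stats import gaussian_kde

X = np.random.rand(100, 5)  # placeholder (samples, features) matrix

# 1) add low-variance Gaussian noise to every sample
X_noisy = X + np.random.normal(0.0, 0.01, size=X.shape)

# 2) estimate a density over the data and draw new synthetic points from it
kde = gaussian_kde(X.T)      # gaussian_kde expects (features, samples)
X_new = kde.resample(50).T   # 50 new points, transposed back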
For other special datasets, such as text-based data, we can also do the following (a toy sketch follows the list):
- Replace some words with synonyms
- Remove some words
- Add words that contain errors
- Remove punctuation (only if you do not care about preserving proper language structure)
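Here is a toy sketch of the first two ideas; the synonym table is a made-up example, and in practice you would use a lexical resource such as WordNet:
import random

# hypothetical, hand-made synonym table for illustration only
synonyms = {'quick': ['fast', 'rapid'], 'happy': ['glad', 'joyful']}

def augment_text(sentence, drop_prob=0.1):
    words = []
    for word in sentence.split():
        if word in synonyms:
            word = random.choice(synonyms[word])  # synonym replacement
        if random.random() < drop_prob:
            continue                              # randomly drop a word
        words.append(word)
    return ' '.join(words)

print(augment_text('the quick brown fox is happy'))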
For more information on this and many other augmentation techniques, consult online resources on the latest advances pertaining to your specific type of data.
Let's now dive into some techniques for dimensionality reduction that can be used to alleviate the problem of high-dimensional and highly correlated datasets.