Introducing deep learning
A more detailed discussion of learning algorithms is reserved for Chapter 4, Learning from Data. In this section, we will deal with the fundamental concept of a neural network and the developments that led to deep learning.
The model of a neuron
A neuron in the human brain has input connections from other neurons (dendrites, which meet other neurons at junctions called synapses) that receive stimuli in the form of electric charges. Its cell body then determines, depending on how strongly the input stimulates the neuron, whether the neuron activates. At the end of the neuron, the output signal is propagated to other neurons through the axon, thus forming a network of neurons.
The analogy of the human neuron is depicted in Figure 1.3, where the input is represented with the vector x, the activation of the neuron is given by some function z(.), and the output is y. The parameters of the neuron are w and b:
The trainable parameters of a neuron are w and b, and they are unknown. Thus, we can use training data to determine these parameters using some learning strategy. From the picture, x1 multiplies w1, then x2 multiplies w2, and b is multiplied by 1; all these products are added, which can be simplified as follows:
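Written out for an input with n features, the sum described in the text takes the form below. The vector notation and the symbol n are assumptions made here for illustration; the figure itself only shows the individual products.

```latex
w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b \cdot 1
  \;=\; \sum_{i=1}^{n} w_i x_i + b
  \;=\; \mathbf{w}^{\top}\mathbf{x} + b,
\qquad
y = z\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right)
```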
The activation function ensures that the output is within the desired range. Let's say that we want a simple linear activation; then the function z(.) is simply the identity and can be bypassed, as follows:
This is usually the case when we want to solve a regression problem and the output data can range from -∞ to +∞. However, we may want to train the neuron to determine whether a vector x belongs to one of two classes, say -1 and +1. Then we would be better served by a function called a sign activation:
Here, the sign(.) function is defined as follows:
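As a minimal sketch, the neuron with either a linear or a sign activation can be written in a few lines of plain Python. The function names and the sample values of w, b, and x are hypothetical, and returning 0 when the input is exactly zero is an assumption, since conventions for that case differ.

```python
def weighted_sum(w, x, b):
    """Return w1*x1 + ... + wn*xn + b*1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign(a):
    """Sign activation: +1 for positive input, -1 for negative input.

    Returning 0 at exactly zero is an assumption of this sketch.
    """
    if a > 0:
        return 1
    if a < 0:
        return -1
    return 0


def neuron(w, x, b, activation=None):
    """Apply the activation to the weighted sum; None means linear (bypassed)."""
    a = weighted_sum(w, x, b)
    return a if activation is None else activation(a)


# Hypothetical parameters and input, chosen only for illustration.
w = [0.5, -1.0]
b = 0.25
x = [2.0, 1.0]

linear_out = neuron(w, x, b)       # regression-style output: 0.25
class_out = neuron(w, x, b, sign)  # classification-style output: 1
```

The linear output can take any real value, while the sign output is restricted to the class labels.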
There are many other activation functions, but we will introduce those later on. For now, we will briefly show one of the simplest learning algorithms, the perceptron learning algorithm (PLA).
The perceptron learning algorithm
The PLA begins from the assumption that you want to classify data, X, into two different groups, the positive group (+) and the negative group (-). It will find some w and b by training to predict the corresponding correct labels y. The PLA uses the sign( . ) function as the activation. Here are the steps that the PLA follows:
- Initialize w to zeros, and the iteration counter t = 0
- While there are any incorrectly classified examples:
  - Pick an incorrectly classified example, call it x*, whose true label is y*
  - Update w as follows: w_{t+1} = w_t + y* x*
  - Increase the iteration counter, t = t + 1, and repeat
Notice that, for the PLA to work as we want, we have to make an adjustment. What we want is for the bias b to be implied in the expression w^T x. The only way this could work is if we set x = [1, x_1, ..., x_n]^T and w = [b, w_1, ..., w_n]^T. The previous rule seeks w, which implies the search for b.
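The steps of the PLA can be sketched as follows, assuming that a constant 1 is prepended to every input so that the first weight plays the role of b. The function name pla, the max_iters safeguard, the fixed random seed, and the four-point toy dataset are all hypothetical additions for illustration; they are not part of the algorithm as listed.

```python
import random


def sign(a):
    """Return +1, -1, or 0 (the value at zero is an assumption)."""
    return (a > 0) - (a < 0)


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def pla(X, y, max_iters=10_000, seed=0):
    """Perceptron learning algorithm on augmented inputs.

    X: list of feature lists; y: list of labels in {-1, +1}.
    Returns (w, t), where w[0] acts as the bias b and t counts updates.
    """
    rng = random.Random(seed)
    Xa = [[1.0] + list(x) for x in X]  # prepend 1 so that w[0] acts as b
    w = [0.0] * len(Xa[0])             # initialize w to zeros
    t = 0
    while t < max_iters:               # safeguard for non-separable data
        wrong = [i for i, xa in enumerate(Xa) if sign(dot(w, xa)) != y[i]]
        if not wrong:
            break                      # every example is classified correctly
        i = rng.choice(wrong)          # pick a misclassified example
        w = [wj + y[i] * xj for wj, xj in zip(w, Xa[i])]  # w <- w + y* x*
        t += 1
    return w, t


# Hypothetical, linearly separable toy data.
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]

w, t = pla(X, y)
predictions = [sign(dot(w, [1.0] + x)) for x in X]
```

With w initialized to zeros, every example starts out misclassified, because the sign of zero matches neither label.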
To illustrate the PLA, consider the case of the following linearly separable dataset:
This two-dimensional dataset was produced at random using Python tools that we will discuss later on. For now, it should be self-evident that you can draw a line between the two groups and divide them.
Following the steps outlined previously, the PLA can find a solution, that is, a separating line that satisfies the training data target outputs completely, in only four iterations in this particular case. The line found at every update is depicted in the following plots:
At iteration zero, all 100 points are misclassified, but after randomly choosing one misclassified point to make the first update, the new line only misses four points:
After the second update, the line only misses one data point:
Finally, after update number three, all data points are correctly classified. This is just to show that a simple learning algorithm can successfully learn from data. Also, the perceptron model led to much more complicated models such as a neural network. We will now introduce the concept of a shallow network and its basic complexities.
Shallow networks
A neural network consists of multiple neurons connected in different layers. In contrast, a perceptron has only one neuron, and its architecture consists of an input layer and an output layer. In neural networks, there are additional layers between the input and output layers, as shown in Figure 1.4, and they are known as hidden layers:
The example in the figure shows a neural network that has a hidden layer with eight neurons in it. The input size is 10-dimensional, and the output layer has four dimensions (four neurons). This intermediate layer can have as many neurons as your system can handle during training, but it is usually a good idea to keep things to a reasonable number of neurons.
Neural networks can solve more difficult problems than a single neural unit such as the perceptron can. This should feel intuitive. A neural network can solve problems including and beyond those that are linearly separable. For linearly separable problems, we can use both the perceptron model and a neural network. However, for more complex, non-linearly separable problems, the perceptron cannot offer a high-quality solution, while a neural network can.
For example, if we consider the sample two-class dataset and bring the data groups closer together so that they are no longer linearly separable, the perceptron will fail to terminate with a solution, and some other strategy must be used to stop it from running forever. Or, we can switch to a neural network and train it to find the best solution it can possibly find. Figure 1.5 shows an example of training a neural network with 100 neurons in the hidden layer over a two-class dataset that is not linearly separable:
The choice of 100 neurons in the hidden layer was made by experimentation, and you will learn strategies for finding such values in further chapters. However, before we go any further, two newly introduced terms require further explanation: non-separable data and non-linear models, which are defined as follows:
- Non-separable data is data for which no line can separate the groups of data (or classes).
- Non-linear models, or solutions, are those that naturally and commonly occur when the best solution to a classification problem is not a line. For example, it can be a curve described by a polynomial of degree greater than one, as in Figure 1.5.
We will usually be working with non-linear models throughout this book, because this is most likely what you will encounter in the real world. The model is non-linear, in a way, because the problem is non-separable. To achieve this non-linear solution, the neural network model goes through the following mathematical operations.
The input-to-hidden layer
In a neural network, the input vector x is connected to a number of neurons through a weight vector w for each neuron; these weight vectors can now be thought of as forming a matrix W. The matrix W has as many columns as the layer has neurons, and as many rows as x has features (or dimensions). Thus, the output of the hidden layer can be thought of as the following vector:
Here, b is a vector of biases whose elements each correspond to one neural unit, and the size of h is equal to the number of hidden units: for example, eight in Figure 1.4 and 100 in Figure 1.5. However, the activation function z(.) does not have to be the sign(.) function; in fact, it almost never is. Instead, most people use functions that are easily differentiable.
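As a rough sketch, the hidden-layer output h = z(W^T x + b) can be computed as shown here, with W stored with one row per input feature and one column per neuron. The choice of tanh as the differentiable activation is an assumption for illustration (the text does not name one), and the function name and all numbers are hypothetical. Plain lists are used to keep the example free of dependencies.

```python
import math


def hidden_layer(W, b, x, z=math.tanh):
    """Compute h = z(W^T x + b).

    W: list of rows, one row per input feature, one column per neuron.
    b: one bias per neuron. x: one value per input feature.
    """
    n_features, n_neurons = len(W), len(W[0])
    h = []
    for j in range(n_neurons):
        # Weighted sum for neuron j: column j of W times x, plus its bias.
        a = sum(W[i][j] * x[i] for i in range(n_features)) + b[j]
        h.append(z(a))
    return h


# Hypothetical layer with 3 input features and 2 neurons.
W = [[0.1, -0.2],
     [0.0,  0.3],
     [0.5,  0.1]]
b = [0.0, 0.1]
x = [1.0, 2.0, -1.0]

h = hidden_layer(W, b, x)  # one output value per hidden neuron
```

The length of h matches the number of columns of W, that is, the number of hidden units.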
The hidden-to-hidden layer
In a neural network, we could have more than a single hidden layer, and we will work with this kind of network a lot in this book. In such a case, the matrix W can be expressed as a three-dimensional matrix that has as many elements in the third dimension as the network has hidden layers. In the case of the i-th layer, we will refer to that matrix as W_i for convenience.
Therefore, we can refer to the output of the i-th hidden layer as follows:
This holds for i = 2, 3, ..., k-1, where k is the total number of layers. The case of h_1 is computed with the equation given for the first layer (see the previous section), which uses x directly. The index does not go all the way to the last layer, h_k, because that is computed as discussed next.
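Putting the descriptions together, the layer outputs can be written as the following recurrence. The transpose follows from W having one column per neuron; the per-layer bias symbol b_i is an assumed notation, introduced here by analogy with W_i.

```latex
\mathbf{h}_1 = z\!\left(\mathbf{W}_1^{\top}\mathbf{x} + \mathbf{b}_1\right),
\qquad
\mathbf{h}_i = z\!\left(\mathbf{W}_i^{\top}\mathbf{h}_{i-1} + \mathbf{b}_i\right),
\quad i = 2, 3, \dots, k-1
```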
The hidden-to-output layer
The overall output of the network is the output at the last layer:
Here, the last activation function is usually different from the hidden layer activations. The activation function in the last (output) layer traditionally depends on the type of problem we are trying to solve. For example, we would use a linear function for a regression problem, or sigmoid activations for classification problems. We will discuss those later on. For now, it should be evident that the perceptron algorithm will no longer work in the training phase.
While learning still has to be driven by the mistakes the neural network makes, the adjustments cannot be in direct proportion to the data point that is incorrectly classified or predicted. The reason is that the neurons in the last layer are responsible for making the predictions, but they depend on the previous layer, which in turn may depend on earlier layers. As a result, when making adjustments to W and b, the adjustment has to be made differently for each neuron.
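To make the layer-by-layer computation concrete, here is a minimal sketch of a full forward pass in which each layer applies its own activation to W^T h + b. The helper names, the tanh hidden activation, the linear output, and all weights are hypothetical choices for illustration only.

```python
import math


def affine(W, b, v):
    """Return W^T v + b, with W holding one row per input and one column per unit."""
    return [sum(W[i][j] * v[i] for i in range(len(v))) + b[j]
            for j in range(len(b))]


def identity(a):
    """Linear output activation, as would suit a regression problem."""
    return a


def forward(x, layers):
    """Propagate x through a list of (W, b, activation) tuples, first layer first."""
    h = x
    for W, b, z in layers:
        h = [z(a) for a in affine(W, b, h)]
    return h


# Hypothetical network: 2 inputs -> 2 hidden units (tanh) -> 1 output (linear).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], math.tanh),
    ([[2.0], [1.0]], [0.5], identity),
]

output = forward([0.0, 0.0], layers)  # hidden units output 0, so only the bias remains
```

Swapping the last tuple's activation is all it takes to change the output layer for a different type of problem.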
One approach is to apply gradient descent techniques to the neural network. There are many of these techniques, and we will discuss the most popular ones in further chapters. In general, a gradient descent algorithm builds on the notion that, if the derivative of a function with respect to a set of parameters reaches zero, then you have found a point that may be a (local) maximum or minimum of the function for those parameters. The algorithm therefore adjusts the parameters in small steps against the direction of the derivative, moving toward a minimum. For the case of scalars, we call them derivatives, but for vectors or matrices (W, b), we call them gradients.
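The idea can be illustrated on a single scalar parameter. The sketch below repeatedly steps against the derivative until the derivative is close to zero; the function f(w) = (w - 3)^2, the learning rate, and the number of steps are assumptions chosen only for illustration, not values from this book.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly move w a small step against the derivative grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# f(w) = (w - 3)^2 has derivative f'(w) = 2 (w - 3), which is zero at w = 3.
def f_prime(w):
    return 2.0 * (w - 3.0)


w_min = gradient_descent(f_prime, w0=0.0)  # approaches the minimizer w = 3
```

For a neural network, the scalar w is replaced by all the entries of W and b, and the derivative by the gradient of the function being minimized.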
The function whose gradient we take is called a loss function.
We can define a loss function, for example, as follows:
This loss is known as the mean squared error (MSE); it measures how different the target output y is from the predicted output in the output layer, h_k, as the average of the squared differences between their elements. This is a good loss because it is differentiable and easy to compute.
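As a small sketch, the MSE and its derivative with respect to the predictions can be computed as follows. Here the loss is taken to be (1/N) times the sum of (y_i - h_{k,i})^2 over the N output elements; this exact form, the function names, and the sample values are assumptions for illustration.

```python
def mse(y_true, y_pred):
    """Mean of the squared element-wise differences."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def mse_gradient(y_true, y_pred):
    """Derivative of the MSE with respect to each predicted element."""
    n = len(y_true)
    return [2.0 * (p - t) / n for t, p in zip(y_true, y_pred)]


# Hypothetical target y and network output h_k.
y = [1.0, 0.0, 2.0]
h_k = [0.5, 0.0, 3.0]

loss = mse(y, h_k)           # (0.25 + 0.0 + 1.0) / 3
grad = mse_gradient(y, h_k)  # zero wherever the prediction is already correct
```

The gradient is simple to write down, which is part of what makes this loss convenient for gradient-based learning.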
A neural network such as this introduced a great number of possibilities, but relied heavily on a gradient descent technique for learning called backpropagation (Hecht-Nielsen, R. 1992). Rather than explaining backpropagation here (we will reserve that for later), we should remark that it changed the world of ML, although the field did not make much progress for a number of years because backpropagation had some practical limitations; the solutions to these paved the way for deep learning.
Deep networks
On March 27, 2019, the ACM announced that three computer scientists had been awarded the ACM Turing Award, often called the "Nobel Prize of computing", for their achievements in deep learning. Their names are Yoshua Bengio, Yann LeCun, and Geoffrey Hinton; all are very accomplished scientists. One of their major contributions was in the learning algorithm known as backpropagation.
In the official communication, the ACM wrote the following about Dr. Hinton and one of his seminal papers (Rumelhart, D. E. 1985):
Similarly, they wrote the following about Dr. LeCun's paper (LeCun, Y., et al., 1998):
Dr. Hinton was able to show that there was a way to minimize a loss function in neural networks using biologically inspired algorithms, such as the backward and forward adjustment of connections by modifying their importance for particular neurons. Usually, backpropagation is related to feed-forward neural networks, while backward-forward propagation is related to Restricted Boltzmann Machines (covered in Chapter 10, Restricted Boltzmann Machines).
A feed-forward neural network is one whose input is pipelined directly toward the output layer through intermediate layers that have no backward connections, as shown in Figure 1.4. We will refer to these networks throughout this book.
Backpropagation enabled people to train neural networks in a way that was never seen before; however, people had problems training neural networks on large datasets, and on larger (deeper) architectures. If you go ahead and look at neural network papers in the late '80s and early '90s, you will notice that architectures were small in size; networks usually had no more than two or three layers, and the number of neurons usually did not exceed the order of hundreds. These are (today) known as shallow neural networks.
The major problems were with convergence time for larger datasets, and convergence time for deeper architectures. Dr. LeCun's contributions were precisely in this area as he envisioned different ways to speed up the training process. Other advances such as vector (tensor) computations over graphics processing units (GPUs) increased training speeds dramatically.
Thus, over the last few years, we have seen the rise of deep learning, that is, the ability to train deeper neural networks, with more than three or four layers, and in fact with tens or hundreds of layers. Further, we have a wide variety of architectures that can accomplish things that we were not able to do in the last decade.
The deep network shown in Figure 1.6 would have been impossible to train 30 years ago, and it is not that deep anyway:
Regardless of the future of DL, let us now discuss what makes DL so important today.