Hands-On Natural Language Processing with PyTorch 1.x

Overview of machine learning

Fundamentally, machine learning is the algorithmic process used to identify patterns and extract trends from data. By training specific machine learning algorithms on data, a machine learning model may learn insights that aren't immediately obvious to the human eye. A medical imaging model may learn to detect cancer from images of the human body, while a sentiment analysis model may learn that a book review containing the words good, excellent, and entertaining is more likely to be a positive review than one containing the words bad, terrible, and boring.

Broadly speaking, machine learning algorithms fall into two main categories: supervised learning and unsupervised learning.

Supervised learning

Supervised learning covers any task where we wish to use an input to predict an output. Let's say we wish to train a model to predict house prices. We know that larger houses tend to sell for more money, but we don't know the exact relationship between price and size. A machine learning model can learn this relationship by looking at the data:

Figure 1.1 – Table showing housing data

Here, we have been given the sizes of four houses that recently sold, as well as the prices they sold for. Given the data on these four houses, can we use this information to make a prediction about a new house on the market? A simple machine learning model known as a regression can estimate the relationship between these two factors:

Figure 1.2 – Output of the housing data

Given this historical data, we can estimate the relationship between size (X) and price (Y). Now that we have an estimation of this relationship, if we are given a new house for which we know only the size, we can predict its price using the learned function:

Figure 1.3 – Predicting house prices

Therefore, all supervised learning tasks aim to learn some function of the model inputs to predict an output, given many examples of how input relates to output:

Given many (X, y), learn:

f(X) = y

The input to your model can consist of any number of features. Our simple house price model consisted of just a single feature (size), but we may wish to add more features to give an even better prediction (for example, number of bedrooms, size of the garden, and so on). So, more specifically, our supervised model learns a function in order to map a number of inputs to an output. This is given by the following equation:

Given many ([X0, X1, X2,…,Xn], y), learn:

f(X0, X1, X2,…,Xn) = y

In the preceding example, the function that we learn is as follows:

y = θ0 + θ1X

Here, θ0 is the y-axis intercept and θ1 is the slope of the line.
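
To make this concrete, the following is a minimal sketch of learning such a function in Python using NumPy's least-squares line fit. The house sizes and prices are illustrative values, not the exact figures from Figure 1.1:

# A minimal sketch of learning y = theta_0 + theta_1 * X from data.
# The sizes and prices below are illustrative, not the values in Figure 1.1.
import numpy as np

sizes = np.array([60.0, 75.0, 90.0, 100.0])       # X: house size in sq m
prices = np.array([200.0, 240.0, 290.0, 320.0])   # y: price in $1,000s

# np.polyfit returns the least-squares coefficients, highest degree first:
# index 0 is the slope (theta_1), index 1 is the intercept (theta_0).
theta_1, theta_0 = np.polyfit(sizes, prices, deg=1)
print(f"learned function: y = {theta_0:.1f} + {theta_1:.2f} * X")

# Use the learned function to predict the price of a new, unseen house.
new_size = 95.0
print(f"predicted price for {new_size:.0f} sq m: ${theta_0 + theta_1 * new_size:.0f}k")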

Models can consist of millions, even billions, of input features (though you may find that you run into hardware limitations when the feature space becomes too large). The types of inputs to a model may vary as well, with models being able to learn from images:

Figure 1.4 – Model training

As we shall explore in more detail later, they can also learn from text:

I loved this film -> Positive

This movie was terrible -> Negative

The best film I saw this year -> ?
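
As a minimal sketch of what such a text model looks like in code, the following uses scikit-learn (purely for illustration here; later chapters build these models in PyTorch) to turn the two labelled reviews into bag-of-words features and fit a simple classifier. With only two training examples the prediction is not meaningful; the point is only the input-to-output structure of a supervised text model:

# A minimal sketch of supervised learning from text: represent each review as
# a bag-of-words vector and fit a simple classifier. The data is just the two
# labelled reviews above, so this is only a toy illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["I loved this film", "This movie was terrible"]
train_labels = ["Positive", "Negative"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # word-count features

model = LogisticRegression()
model.fit(X_train, train_labels)

# Predict the label of the unseen review.
X_new = vectorizer.transform(["The best film I saw this year"])
print(model.predict(X_new))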

Unsupervised learning

Unsupervised learning differs from supervised learning in that unsupervised learning doesn't use pairs of inputs and outputs (X, y) to learn. Instead, we only provide input data and the model will learn something about the structure or representation of the input data. One of the most common methods of unsupervised learning is clustering.

For example, say we take a dataset of temperature and rainfall readings from four different countries, but have no labels indicating where each reading was taken. We can use a clustering algorithm to identify the distinct clusters (countries) that exist within the data:

Figure 1.5 – Output of the clustering algorithm

Clustering also has uses within the realm of NLP. If we are given a dataset of emails and want to determine how many different languages are being spoken within these emails, a form of clustering could help us identify this. If English words frequently appear alongside other English words within the same email, and Spanish words frequently appear alongside other Spanish words, clustering could tell us how many distinct clusters of words our dataset has and, thus, the number of languages.
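
A minimal sketch of the weather example, using synthetic temperature and rainfall readings (not the data behind Figure 1.5) and scikit-learn's k-means implementation, might look as follows:

# A minimal sketch of clustering unlabelled weather readings with k-means.
# The temperature/rainfall values are synthetic, not the data in Figure 1.5.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Four synthetic "countries", each a cloud of (temperature, rainfall) readings.
centres = np.array([[30.0, 20.0], [10.0, 150.0], [22.0, 80.0], [5.0, 40.0]])
readings = np.vstack([c + rng.normal(scale=2.0, size=(50, 2)) for c in centres])

# We only pass the inputs - there are no labels. k-means groups the readings
# into four clusters based purely on their structure.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(readings)
print(kmeans.labels_[:10])       # cluster assignment of the first ten readings
print(kmeans.cluster_centers_)   # learned cluster centres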

How do models learn?

In order for models to learn, we need some way of evaluating how our model is performing. To do this, we use a concept called loss. Loss is a measure of how close our model's predictions are to their true values. For a given house in our dataset, one measure of loss could be the difference between the true price (y) and the price predicted by our model (ŷ). We could assess the total loss within our system by taking an average of this loss across all houses in the dataset. However, a positive loss could theoretically cancel out a negative loss, so a more common measure of loss is the mean squared error:

MSE = (1/n) Σ (y − ŷ)²

While other models may use different loss functions, regressions generally use mean squared error. Now, we can calculate a measure of loss across our entire dataset, but we still need a way of algorithmically arriving at the lowest possible loss. This process is known as gradient descent.
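
As a small illustration, the mean squared error over a handful of illustrative true and predicted house prices can be computed as follows:

# A minimal sketch of mean squared error: average the squared difference
# between the true prices and the model's predicted prices (values illustrative).
import numpy as np

y_true = np.array([200.0, 240.0, 290.0, 320.0])   # true prices, in $1,000s
y_pred = np.array([210.0, 235.0, 285.0, 330.0])   # the model's predictions

mse = np.mean((y_true - y_pred) ** 2)
print(f"MSE: {mse:.1f}")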

Gradient descent

Here, we have plotted our loss function as it relates to a single learned parameter within our house price model, θ1. We note that when θ1 is set too high, the MSE loss is high, and when θ1 is set too low, the MSE loss is also high. The sweet spot, or the point where the loss is minimized, lies somewhere in the middle. To find this point algorithmically, we use gradient descent. We will see this in more detail when we begin to train our own neural networks:

Figure 1.6 – Gradient descent

We first initialize θ1 with a random value. To reach the point where our loss is minimized, we need to move downhill along our loss function, toward the bottom of the valley. To do this, we first need to know which direction to move in. At our initial point, we use basic calculus to calculate the gradient of the loss with respect to θ1:

gradient = d(MSE) / dθ1

In our preceding example, the gradient at the initial point is positive. This tells us that our value of θ1 is larger than the optimal value, so we update θ1 so that it's lower than our previous value. We gradually iterate this process until θ1 moves closer and closer to the value where the MSE is minimized. This happens at the point where the gradient equals zero.
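
The following is a minimal sketch of this loop for a single parameter, θ1, in the simplified model ŷ = θ1X (the intercept is dropped so that the loss depends on just one parameter, matching the single-parameter loss curve described above). The data values, starting point, learning rate, and number of steps are all illustrative choices:

# A minimal sketch of gradient descent for a single parameter theta_1 in the
# simplified model y_hat = theta_1 * X. All values below are illustrative.
import numpy as np

X = np.array([60.0, 75.0, 90.0, 100.0])      # house sizes in sq m
y = np.array([200.0, 240.0, 290.0, 320.0])   # house prices in $1,000s

theta_1 = 0.0          # arbitrary starting value
learning_rate = 1e-5   # how far we step downhill on each iteration

for step in range(1000):
    y_hat = theta_1 * X
    gradient = 2 * np.mean((y_hat - y) * X)   # d(MSE) / d(theta_1)
    theta_1 -= learning_rate * gradient       # step against the gradient

print(f"theta_1 after training: {theta_1:.3f}")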

Overfitting and underfitting

Consider the following scenario, where a basic linear model is poorly fitted to our data. We can see that our model, denoted by the equation y = θ0 + θ1X, does not appear to be a good predictor:

Figure 1.7 – Example of underfitting and overfitting

When our model does not fit the data well because of a lack of features, a lack of data, or model underspecification, we call this underfitting. We note the increasing gradient of our data and suspect that a model using a polynomial may be a better fit; for example, y = θ0 + θ1X + θ2X². We will see later that, due to the complex architecture of neural networks, underfitting is rarely an issue.

Consider the following example. Here, we fit a function to our house price data using not only the size of the house (X), but also second- and third-order polynomial terms (X², X³). Here, we can see that our new model fits the data points perfectly. However, this does not necessarily result in a good model:

Figure 1.8 – Sample output of overfitting

Say we now have a house of size 110 sq m whose price we wish to predict. Using our intuition, as this house is larger than the 100 sq m house, we would expect it to be more expensive, at around $340,000. Using our fitted polynomial model, we can see that the predicted price is actually lower than that of the smaller house, at around $320,000. Our model fits the data it was trained on well, but it does not generalize well to a new, unseen data point. This is known as overfitting. Because of overfitting, it is important not to evaluate a model's performance on the data it was trained on, so we need to set aside a separate set of data to evaluate our model on.
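
As a rough illustration of this effect, we can compare how a simple linear fit and a cubic fit to the same four training points extrapolate to an unseen 110 sq m house. The data values are illustrative, not the figures behind Figure 1.8; the cubic passes through every training point exactly, yet that does not make its prediction for the new house more trustworthy:

# A minimal sketch of overfitting: a cubic fits the four training points
# exactly, but its shape beyond those points is driven by noise rather than
# the underlying trend. Values are illustrative.
import numpy as np

sizes = np.array([60.0, 75.0, 90.0, 100.0])       # sq m
prices = np.array([200.0, 240.0, 290.0, 320.0])   # $1,000s

linear = np.polyfit(sizes, prices, deg=1)   # simple model, non-zero training error
cubic = np.polyfit(sizes, prices, deg=3)    # zero training error on four points

new_size = 110.0
print("linear prediction:", np.polyval(linear, new_size))
print("cubic prediction: ", np.polyval(cubic, new_size))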

Train versus test

Normally, when training models, we separate our data into two parts: a training set of data and a smaller test set of data. We train the model using the training set of data and evaluate it on the test set of data. This is done in order to measure the model's performance on an unseen set of data. As mentioned previously, for a model to be a good predictor, it must generalize well to a new set of data that the model hasn't seen before, and this is precisely what evaluating on a testing set of data measures. 
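
A minimal sketch of such a split, using scikit-learn's train_test_split helper on some illustrative data, might look as follows:

# A minimal sketch of a train/test split: hold back a portion of the data so
# the model can later be evaluated on examples it never saw during training.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 illustrative input rows
y = 3 * X.ravel() + 7               # illustrative targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)   # 80% training, 20% held-out test

print(len(X_train), "training examples,", len(X_test), "test examples")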

Evaluating models

While we seek to minimize loss in our models, this alone does not give us much information about how good our model is at actually making predictions. Consider an anti-spam model that predicts whether a received email is spam or not and automatically sends spam emails to a junk folder. One simple measure of evaluating performance is accuracy:

Accuracy = number of correct predictions / total number of predictions

To calculate accuracy, we simply take the number of emails that were predicted correctly as spam/non-spam and divide this by the total number of predictions we made. If we correctly predicted 990 emails out of 1,000, we would have an accuracy of 99%. However, a high accuracy does not necessarily mean our model is good:

Figure 1.9 – Table showing data predicted as spam/non-spam

Here, we can see that although our model correctly predicted 990 emails as not spam (known as true negatives), it also predicted 10 emails that were spam as not spam (known as false negatives). Our model simply assumes that all emails are not spam, which does not make for a good anti-spam filter at all! Instead of just using accuracy, we should also evaluate our model using precision and recall. In this scenario, the fact that our model would have a recall of zero (meaning it correctly identifies none of the actual spam emails) would be an immediate red flag:

Precision = true positives / (true positives + false positives)

Recall = true positives / (true positives + false negatives)
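
As a minimal sketch, the counts from the table above (990 true negatives, 10 false negatives, and no true or false positives) can be turned into these metrics in a few lines of Python; the guard against division by zero is only needed because this particular model makes no positive predictions at all:

# A minimal sketch of accuracy, precision, and recall for the spam example:
# the model labels every email as "not spam".
true_positives, false_positives = 0, 0
true_negatives, false_negatives = 990, 10

total = true_positives + false_positives + true_negatives + false_negatives
accuracy = (true_positives + true_negatives) / total   # 990 / 1000 = 99%

predicted_positive = true_positives + false_positives
actual_positive = true_positives + false_negatives
# Precision is undefined (0 / 0) here, so guard the division.
precision = true_positives / predicted_positive if predicted_positive else 0.0
recall = true_positives / actual_positive if actual_positive else 0.0   # 0%

print(f"accuracy:  {accuracy:.2%}")
print(f"precision: {precision:.2%}")
print(f"recall:    {recall:.2%}")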