Hands-On Natural Language Processing with PyTorch 1.x

NLP for PyTorch

Now that we have learned how to build neural networks, we will see how it is possible to build models for NLP using PyTorch. In this example, we will create a basic bag-of-words classifier in order to classify the language of a given sentence.

Setting up the classifier

For this example, we'll take a selection of sentences in Spanish and English:

  1. First, we split each sentence into a list of words and take the language of each sentence as a label. We take a section of sentences to train our model on and keep a small section to one side as our test set. We do this so that we can evaluate the performance of our model after it has been trained:

    ("This is my favourite chapter".lower().split(),\

    "English"),

    ("Estoy en la biblioteca".lower().split(), "Spanish")

    Note that we also transform each word into lowercase, which prevents words from being double-counted in our bag-of-words. If we have the word book and the word Book, we want these to be counted as the same word, so we transform them into lowercase. A minimal sketch of how the full data lists might be laid out is shown below.
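
    The snippet above shows only two example entries. Purely as a sketch of the layout (the placeholder sentences here are not the book's actual dataset, and the real split between the two lists differs), the full training_data and test_data lists might look something like this:

    # Placeholder data for illustration only - not the book's actual dataset
    training_data = [
        ("This is my favourite chapter".lower().split(), "English"),
        ("Estoy en la biblioteca".lower().split(), "Spanish"),
        ("I am reading my book".lower().split(), "English"),
        # ... more (word_list, language) pairs ...
    ]

    test_data = [
        ("Estoy leyendo".lower().split(), "Spanish"),
        ("This is not my book".lower().split(), "English"),
    ]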

  2. Next, we build our word index, which is simply a dictionary of all the words in our corpus, and then create a unique index value for each word. This can be easily done with a short for loop:

    # Assign a unique, incrementing index to every word in our corpus
    word_dict = {}
    i = 0
    for words, language in training_data + test_data:
        for word in words:
            if word not in word_dict:
                word_dict[word] = i
                i += 1
    print(word_dict)

    This results in the following output:

    Figure 2.18 – Setting up the classifier

    Note that here, we looped through all our training data and test data. If we just created our word index on training data, when it came to evaluating our test set, we would have new words that were not seen in the original training, so we wouldn't be able to create a true bag-of-words representation for these words.
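
    To see the problem concretely, here is a small sketch (using the training_data and test_data lists from step 1): an index built from the training data alone has no entry for words that occur only in the test set, so looking them up would raise a KeyError:

    # Sketch: an index built from training_data only cannot place test-only words
    train_only_dict = {}
    for words, _ in training_data:
        for word in words:
            if word not in train_only_dict:
                train_only_dict[word] = len(train_only_dict)

    # Words that appear only in the test sentences have no index at all
    test_words = [word for words, _ in test_data for word in words]
    print([word for word in test_words if word not in train_only_dict])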

  3. Now, we build our classifier in a similar fashion to how we built our neural network in the previous section; that is, by building a new class that inherits from nn.Module.

    Here, we define our classifier so that it consists of a single linear layer with a log softmax activation function, approximating a logistic regression. We could easily extend this to operate as a neural network by adding extra linear layers here, but a single layer of parameters will serve our purpose. Pay close attention to the input and output sizes of our linear layer:

    corpus_size = len(word_dict)
    languages = 2
    label_index = {"Spanish": 0, "English": 1}

    class BagofWordsClassifier(nn.Module):
        def __init__(self, languages, corpus_size):
            super(BagofWordsClassifier, self).__init__()
            # A single linear layer mapping word counts to one score per language
            self.linear = nn.Linear(corpus_size, languages)

        def forward(self, bow_vec):
            return F.log_softmax(self.linear(bow_vec), dim=1)

    The input is of length corpus_size, which is just the total count of unique words in our corpus. This is because each input to our model will be a bag-of-words representation, consisting of the counts of words in each sentence, with a count of 0 if a given word does not appear in our sentence. Our output is of size 2, which is our number of languages to predict. Our final predictions will consist of a probability that our sentence is English versus the probability that our sentence is Spanish, with our final prediction being the one with the highest probability.
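
    We can sanity-check these sizes with a quick sketch (this assumes torch has already been imported, as in the previous section, and that word_dict has been built as in step 2): a batch containing a single bag-of-words vector of length corpus_size should come out as a tensor of shape (1, 2):

    # Quick shape check: one bag-of-words vector in, two log probabilities out
    check_model = BagofWordsClassifier(languages, corpus_size)
    dummy_bow = torch.zeros(1, corpus_size)   # a bag-of-words vector of all zeros
    print(check_model(dummy_bow).shape)       # torch.Size([1, 2])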

  4. Next, we define some utility functions. We first define make_bow_vector, which takes a sentence and transforms it into a bag-of-words representation. We start by creating a vector of all zeros, then loop through the words in the sentence, incrementing the count at each word's index within the bag-of-words vector by one. Finally, we reshape this vector using .view() so that it can be fed into our classifier:

    def make_bow_vector(sentence, word_index):
        # One slot per word in our vocabulary, initialised to zero
        word_vec = torch.zeros(len(word_index))
        for word in sentence:
            word_vec[word_index[word]] += 1
        # Reshape to (1, corpus_size) so the classifier receives a batch of one
        return word_vec.view(1, -1)
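
    As a quick usage sketch (assuming every word in the example sentence appears in word_dict), repeated words simply increase their count in the vector:

    example_vec = make_bow_vector("estoy en la la biblioteca".split(), word_dict)
    print(example_vec.shape)   # (1, corpus_size)
    print(example_vec.sum())   # tensor(5.) - one count per word occurrence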

  5. Similarly, we define make_target, which simply takes the label of the sentence (Spanish or English) and returns its relevant index (0 or 1):

    def make_target(label, label_index):
        return torch.LongTensor([label_index[label]])

  6. We can now create an instance of our model, ready for training. We also define our loss function as negative log likelihood, since our model outputs log softmax values, and then define our optimizer to use standard stochastic gradient descent (SGD):

    model = BagofWordsClassifier(languages, corpus_size)
    loss_function = nn.NLLLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)
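
    As a side note, this pairing is standard: NLLLoss applied to log softmax outputs gives exactly the same value as CrossEntropyLoss applied to the raw linear outputs. A small stand-alone sketch (using made-up logits rather than our model, and assuming torch, nn, and F are imported as before) demonstrates this:

    logits = torch.tensor([[1.5, -0.5]])   # made-up raw scores for two classes
    target = torch.tensor([0])
    print(nn.NLLLoss()(F.log_softmax(logits, dim=1), target))
    print(nn.CrossEntropyLoss()(logits, target))   # identical value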

Now, we are ready to train our model.

Training the classifier

First, we set up a loop consisting of the number of epochs we wish our model to run for. In this instance, we will select 100 epochs.

Within this loop, for each sentence/label pair, we first zero our gradients (as otherwise, PyTorch accumulates gradients across passes) and then transform the sentence and label into a bag-of-words vector and a target, respectively. We then calculate the predicted output for this particular sentence by making a forward pass of our data through the current state of our model.

Using this prediction, we take our predicted and actual labels and call our defined loss_function on the two to obtain a measure of loss for this sentence. By calling backward(), we backpropagate this loss through our model, and by calling step() on our optimizer, we update our model parameters. Finally, we print our loss every 10 epochs:

for epoch in range(100):
    for sentence, label in training_data:
        # Zero the gradients from the previous step, as PyTorch accumulates them
        model.zero_grad()

        bow_vec = make_bow_vector(sentence, word_dict)
        target = make_target(label, label_index)

        # Forward pass, loss calculation, backpropagation and parameter update
        log_probs = model(bow_vec)
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

    if epoch % 10 == 0:
        print('Epoch: ' + str(epoch + 1) + ', Loss: ' + str(loss.item()))

This results in the following output:

Figure 2.19 – Training loss

Here, we can see that our loss is decreasing over time as our model learns. Although our training set in this example is very small, we can still demonstrate that our model has learned something useful, as follows:

  1. We evaluate our model on a couple of sentences from our test data that our model was not trained on. Here, we first wrap our code in torch.no_grad(), which deactivates the autograd engine, since we no longer need to calculate gradients now that we are not training our model. Next, we take our test sentence, transform it into a bag-of-words vector, and feed it into our model to obtain predictions.
  2. We then simply print the sentence, the true label of the sentence, and then the predicted probabilities. Note that we transform the predicted values from log probabilities back into probabilities. We obtain two probabilities for each prediction, but if we refer back to the label index, we can see that the first probability (index 0) corresponds to Spanish, whereas the other one corresponds to English:

    def make_predictions(data):
        with torch.no_grad():
            sentence = data[0]
            label = data[1]
            bow_vec = make_bow_vector(sentence, word_dict)
            log_probs = model(bow_vec)
            print(sentence)
            print(label + ':')
            # Convert the log probabilities back into probabilities
            print(np.exp(log_probs))

    make_predictions(test_data[0])
    make_predictions(test_data[1])

    This results in the following output:

    Figure 2.20 – Predicted output

    Here, we can see that for both our predictions, our model predicts the correct answer, but why is this? What exactly has our model learned? We can see that our first test sentence contains the word estoy, which was previously seen in a Spanish sentence within our training set. Similarly, we can see that the word book was seen within our training set in an English sentence. Since our model consists of a single layer, the parameters on each of our nodes are easy to interpret.

  3. Here, we define a function that takes a word as input and returns the weights on each of the parameters within the layer. For a given word, we get its index from our dictionary and then select the parameters at that same index within the model. Note that our function prints two parameters per word, as we make two predictions; that is, the word's contribution to the Spanish prediction and its contribution to the English prediction:

    def return_params(word):
        index = word_dict[word]
        for p in model.parameters():
            dims = len(p.size())
            # The two-dimensional parameter is the weight matrix of our linear
            # layer (the one-dimensional parameter is its bias)
            if dims == 2:
                print(word + ':')
                print('Spanish Parameter = ' + str(p[0][index].item()))
                print('English Parameter = ' + str(p[1][index].item()))
                print('\n')

    return_params('estoy')
    return_params('book')

    This results in the following output:

Figure 2.21 – Predicted output for the updated function

Here, we can see that for the word estoy, the parameter is positive for the Spanish prediction and negative for the English one. This means that each occurrence of the word "estoy" in a sentence makes that sentence more likely to be Spanish. Similarly, for the word book, we can see that it contributes positively to the prediction that the sentence is English.
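
Because the model is a single linear layer, we can verify this interpretation by hand. The following sketch (assuming model, make_bow_vector, and word_dict exist as defined above, and that every word in the example sentence appears in word_dict) recomputes the scores as a weighted sum of word counts plus a bias, which matches what model.linear produces:

# Sketch: each language's score is the sum of that language's parameters for
# the words in the sentence, plus a bias term
with torch.no_grad():
    bow_vec = make_bow_vector("estoy en la biblioteca".split(), word_dict)
    weight = model.linear.weight              # shape (2, corpus_size)
    bias = model.linear.bias                  # shape (2,)
    manual_scores = bow_vec @ weight.t() + bias
    print(manual_scores)                      # same as model.linear(bow_vec)
    print(F.softmax(manual_scores, dim=1))    # index 0 = Spanish, index 1 = English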

We can also show that our model has only learned from the data it was trained on. If we try to predict a word the model hasn't been trained on, we can see that it is unable to make an accurate decision. In this case, our model thinks that the English word "not" is Spanish:

new_sentence = (["not"],"English")

make_predictions(new_sentence)

This results in the following output:

Figure 2.22 – Final output
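
Since make_predictions was able to build a bag-of-words vector for this sentence, the word not must appear somewhere in our combined corpus and therefore in word_dict, but it was never part of a training sentence, so its weights were never updated; they still hold the layer's random initial values. We can confirm this by inspecting them with the return_params function from earlier:

# These weights were never updated during training, which explains the
# unreliable prediction above
return_params('not')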