Various Steps in NLP
We've talked about the types of computations that are done with natural language. Apart from these basic tasks, you can also design your own tasks as per your requirements. In the coming sections, we will discuss the various preprocessing tasks in detail and demonstrate each of them with an exercise.
To perform these tasks, we will be using a Python library called NLTK (Natural Language Toolkit). NLTK is a powerful open source tool that provides a set of methods and algorithms to perform a wide range of NLP tasks, including tokenizing, parts-of-speech tagging, stemming, lemmatization, and more.
Tokenization
Tokenization refers to the procedure of splitting a sentence into its constituent parts: the words and punctuation that it is made up of. It is different from simply splitting the sentence on whitespace; it actually divides the sentence into its constituent words, numbers (if any), and punctuation, which may not always be separated by whitespace. For example, consider this sentence: "I am reading a book." Here, our task is to extract the words/tokens from this sentence. After passing it to a tokenization program, the extracted tokens would be "I," "am," "reading," "a," "book," and "." Since we extract one token at a time here, such tokens are called unigrams.
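The difference between whitespace splitting and tokenization can be sketched without any NLP library at all, using a simple regular expression as a stand-in for a real tokenizer (this is an illustration only, not how word_tokenize() works internally):

```python
import re

sentence = "I am reading a book."

# Naive approach: splitting on whitespace keeps the period glued to "book".
print(sentence.split())
# ['I', 'am', 'reading', 'a', 'book.']

# A simple regex tokenizer: runs of word characters, or single
# non-space, non-word characters (punctuation) as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['I', 'am', 'reading', 'a', 'book', '.']
```

Note how the period becomes its own token in the second output, matching the unigram example above.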
NLTK provides a method called word_tokenize(), which tokenizes given text into words. It actually separates the text into different words based on punctuation and spaces between words.
To get a better understanding of tokenization, let's solve an exercise based on it in the next section.
Exercise 1.02: Tokenization of a Simple Sentence
In this exercise, we will tokenize the words in a given sentence with the help of the NLTK library. Follow these steps to implement this exercise using the sentence, "I am reading NLP Fundamentals."
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries and download the different types of NLTK data that we are going to use for different tasks in the following exercises:
from nltk import word_tokenize, download
download(['punkt','averaged_perceptron_tagger','stopwords'])
In the preceding code, we are using NLTK's download() method, which downloads the given resources from the NLTK data repository. NLTK data contains different corpora and trained models. In the preceding example, we download the stop word list, the 'punkt' tokenizer models, and the averaged perceptron tagger, which is used for parts-of-speech tagging. The data is downloaded to the nltk_data directory in your home folder, from where it will be loaded in further steps.
- The word_tokenize() method is used to split the sentence into words/tokens. We need to add a sentence as input to the word_tokenize() method so that it performs its job. The result obtained will be a list, which we will store in a words variable. To implement this, insert a new cell and add the following code:
def get_tokens(sentence):
    words = word_tokenize(sentence)
    return words
- To view the list of tokens generated, we use the print() function. Insert a new cell and add the following code to implement this:
print(get_tokens("I am reading NLP Fundamentals."))
This code generates the following output:
['I', 'am', 'reading', 'NLP', 'Fundamentals', '.']
We can see the list of tokens generated with the help of the word_tokenize() method.
Note
To access the source code for this specific section, please refer to https://packt.live/30bGG85.
You can also run this example online at https://packt.live/30dK1mZ.
In the next section, we will see another pre-processing step: Parts-of-Speech (PoS) tagging.
PoS Tagging
In NLP, the term PoS refers to parts of speech. PoS tagging refers to the process of tagging words within sentences with their respective PoS. We extract the PoS of tokens constituting a sentence so that we can filter out the PoS that are of interest and analyze them. For example, if we look at the sentence, "The sky is blue," we get four tokens, namely "The," "sky," "is," and "blue", with the help of tokenization. Now, using a PoS tagger, we tag the PoS for each word/token. This will look as follows:
[('The', 'DT'), ('sky', 'NN'), ('is', 'VBZ'), ('blue', 'JJ')]
The preceding format is an output of the NLTK pos_tag() method. It is a list of tuples in which every tuple consists of the word followed by its PoS tag:
DT = Determiner
NN = Noun, common, singular or mass
VBZ = Verb, present tense, third-person singular
JJ = Adjective
For the complete list of PoS tags in NLTK, you can refer to https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/.
PoS tagging is performed using different techniques, one of which is a rule-based approach that builds a list to assign a possible tag for each word.
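The rule-based, lookup idea can be sketched with a toy tag table (this is purely illustrative; NLTK's actual tagger uses a trained perceptron model with far richer features):

```python
# A toy lookup-based tagger: each known word maps to its most likely tag,
# and unknown words fall back to 'NN' (noun). This illustrates the
# rule-based idea only; it is not NLTK's implementation.
TAG_LOOKUP = {"the": "DT", "sky": "NN", "is": "VBZ", "blue": "JJ"}

def lookup_tag(tokens):
    return [(word, TAG_LOOKUP.get(word.lower(), "NN")) for word in tokens]

print(lookup_tag(["The", "sky", "is", "blue"]))
# [('The', 'DT'), ('sky', 'NN'), ('is', 'VBZ'), ('blue', 'JJ')]
```

A real tagger also uses context (surrounding words and tags) to disambiguate words that can take several tags, which a fixed lookup table cannot do.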
PoS tagging finds application in many NLP tasks, including word sense disambiguation, classification, Named Entity Recognition (NER), and coreference resolution. For example, consider the usage of the word "planted" in these two sentences: "He planted the evidence for the case" and "He planted five trees in the garden." We can see that the PoS tag of "planted" would clearly help us in differentiating between the different meanings of the sentences.
Let's perform a simple exercise to understand how PoS tagging is done in Python.
Exercise 1.03: PoS Tagging
In this exercise, we will find out the PoS for each word in the sentence, "I am reading NLP Fundamentals". We first make use of tokenization in order to get the tokens. Then, we will use the pos_tag() method, which will help us find the PoS for each word/token. Follow these steps to implement this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
from nltk import word_tokenize, pos_tag
- To find the tokens in the sentence, we make use of the word_tokenize() method. Insert a new cell and add the following code to implement this:
def get_tokens(sentence):
    words = word_tokenize(sentence)
    return words
- Print the tokens with the help of the print() function. To implement this, add a new cell and write the following code:
words = get_tokens("I am reading NLP Fundamentals")
print(words)
This code generates the following output:
['I', 'am', 'reading', 'NLP', 'Fundamentals']
- We'll now use the pos_tag() method. Insert a new cell and add the following code:
def get_pos(words):
    return pos_tag(words)
get_pos(words)
This code generates the following output:
[('I', 'PRP'),
('am', 'VBP'),
('reading', 'VBG'),
('NLP', 'NNP'),
('Fundamentals', 'NNS')]
In the preceding output, we can see that for each token, a PoS has been allotted. Here, PRP stands for personal pronoun, VBP for verb, non-third-person singular present, VBG for verb, gerund or present participle, NNP for proper noun, singular, and NNS for noun, plural.
Note
To access the source code for this specific section, please refer to https://packt.live/306WY24.
You can also run this example online at https://packt.live/38VLDpF.
We have learned about assigning appropriate PoS labels to tokens in a sentence. In the next section, we will learn about stop words in sentences and ways to deal with them.
Stop Word Removal
Stop words are the most frequently occurring words in any language; they support the construction of sentences but contribute little to their semantics. So, we can remove stop words from text before an NLP process: they occur very frequently, and their presence doesn't have much impact on the sense of a sentence. Removing them helps us clean our data, making its analysis much more efficient. Examples of stop words include "a," "am," "and," "the," "in," and "of."
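The filtering idea itself can be sketched with a tiny hand-written stop list before turning to NLTK's much more complete one (the list below is illustrative only):

```python
# A toy stop-word filter using a small hand-written list.
# NLTK ships a far more complete list, which the exercise below uses.
STOP_WORDS = {"a", "am", "and", "the", "in", "of"}

def filter_stop_words(tokens):
    # Compare case-insensitively so "The" is removed as well as "the".
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(filter_stop_words(["I", "am", "reading", "a", "book"]))
# ['I', 'reading', 'book']
```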
In the next exercise, we will look at the practical implementation of removing stop words from a given sentence.
Exercise 1.04: Stop Word Removal
In this exercise, we will check the list of stop words provided by the nltk library. Based on this list, we will filter out the stop words included in our text:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
from nltk import download
download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords
- In order to check the list of stop words provided for English, we pass 'english' as a parameter to the words() function. Insert a new cell and add the following code to implement this:
stop_words = stopwords.words('english')
- In the code, the list of stop words provided for English is stored in the stop_words variable. In order to view the list, we make use of the print() function. Insert a new cell and add the following code to view the list:
print(stop_words)
This code prints the complete list of English stop words provided by NLTK.
- To remove the stop words from a sentence, we first assign a string to the sentence variable and tokenize it into words using the word_tokenize() method. Insert a new cell and add the following code to implement this:
sentence = "I am learning Python. It is one of the "\
           "most popular programming languages"
sentence_words = word_tokenize(sentence)
Note
The code snippet shown here uses a backslash (\) to split the logic across multiple lines. When the code is executed, Python ignores the backslash and treats the code on the next line as a direct continuation of the current line.
- To print the list of tokens, insert a new cell and add the following code:
print(sentence_words)
This code generates the following output:
['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one', 'of', 'the', 'most', 'popular', 'programming', 'languages']
- To remove the stop words, we need to loop through each word in the sentence, check whether there are any stop words, and then finally combine them to form a complete sentence. To implement this, insert a new cell and add the following code:
def remove_stop_words(sentence_words, stop_words):
    return ' '.join([word for word in sentence_words
                     if word not in stop_words])
- To check whether the stop words have been filtered out, print the result of calling remove_stop_words(). Insert a new cell and add the following code:
print(remove_stop_words(sentence_words,stop_words))
This code generates the following output:
I learning Python . It one popular programming languages
As you can see in the preceding code snippet, stop words such as "am," "is," "of," "the," and "most" are being filtered out and text without stop words is produced as output.
- Add your own stop words to the stop word list:
stop_words.extend(['I','It', 'one'])
print(remove_stop_words(sentence_words,stop_words))
This code generates the following output:
learning Python . popular programming languages
As we can see from the output, words such as "I," "It," and "one" are now removed, as we have added them to our custom stop word list. We have learned how to remove stop words from a given text.
Note
To access the source code for this specific section, please refer to https://packt.live/3j4KBw7.
You can also run this example online at https://packt.live/3fyYSir.
In the next section, we will focus on normalizing text.
Text Normalization
There are some words that are spelled, pronounced, and represented differently—for example, words such as Mumbai and Bombay, and US and United States. Although they are different, they refer to the same thing. There are also different forms of words that need to be converted into base forms. For example, words such as "does" and "doing," when converted to their base form, become "do." Along these lines, text normalization is a process wherein different variations of text get converted into a standard form. We need to perform text normalization as there are some words that can mean the same thing as each other. There are various ways of normalizing text, such as spelling correction, stemming, and lemmatization, which will be covered later.
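As a sketch, normalization can be driven by a table that maps each variant to a canonical form (the table below is illustrative; the exercise that follows achieves the same effect with Python's replace() method):

```python
# A toy normalization table mapping variants to a canonical form.
# The entries here are illustrative, not an exhaustive standard.
CANONICAL = {"Bombay": "Mumbai",
             "US": "United States",
             "United States of America": "United States"}

def normalize_token(token):
    # Unknown tokens pass through unchanged.
    return CANONICAL.get(token, token)

print([normalize_token(t) for t in ["Bombay", "US", "Delhi"]])
# ['Mumbai', 'United States', 'Delhi']
```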
For a better understanding of this topic, we will look into a practical implementation of text normalization in the next section.
Exercise 1.05: Text Normalization
In this exercise, we will normalize some given text by replacing select words with new words, using the replace() function, and finally produce the normalized text. replace() is a built-in Python function that works on strings and takes two arguments. It returns a copy of the string in which occurrences of the first argument are replaced by the second argument.
Follow these steps to complete this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to assign a string to the sentence variable:
sentence = "I visited the US from the UK on 22-10-18"
- We want to replace "US" with "United States", "UK" with "United Kingdom", and "-18" with "-2018". To do so, chain replace() calls inside a normalize() function. Insert a new cell and add the following code to implement this:
def normalize(text):
    return text.replace("US", "United States")\
               .replace("UK", "United Kingdom")\
               .replace("-18", "-2018")
- To check whether the text has been normalized, insert a new cell and add the following code to print it:
normalized_sentence = normalize(sentence)
print(normalized_sentence)
The code generates the following output:
I visited the United States from the United Kingdom on 22-10-2018
- Add the following code:
normalized_sentence = normalize('US and UK are two superpowers')
print(normalized_sentence)
The code generates the following output:
United States and United Kingdom are two superpowers
In the preceding code, we can see that our text has been normalized.
Note
To access the source code for this specific section, please refer to https://packt.live/2Wm49T8.
You can also run this example online at https://packt.live/2Wm4d5k.
Over the next sections, we will explore various other ways in which text can be normalized.
Spelling Correction
Spelling correction is one of the most important tasks in any NLP project. It can be time-consuming, but without it, there are high chances of losing out on important information.
Spelling correction is executed in two steps:
- Identify the misspelled word, which can be done by a simple dictionary lookup. If there is no match found in the language dictionary, it is considered to be misspelled.
- Replace it or suggest the correctly spelled word. There are many algorithms for this task. One of them is the minimum edit distance algorithm, which chooses the nearest correctly spelled word to a misspelled word. Nearness is defined by the number of edits that need to be made to the misspelled word to reach the correctly spelled word. For example, take the misspelled word "autocorect." To make it "autocorrect," we need to add one "r," while to make it "auto," we need to delete six characters; "autocorrect" is therefore chosen as the correction because it requires the fewest edits.
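The minimum edit distance itself can be computed with standard dynamic programming. The sketch below counts insertions, deletions, and substitutions (the classic Levenshtein distance); real spell checkers combine this with word frequencies and other signals:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    # prev[j] holds the distance between the prefixes a[:i-1] and b[:j]
    # as we sweep row by row over the dynamic-programming table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("autocorect", "autocorrect"))  # 1 (insert one "r")
print(edit_distance("autocorect", "auto"))         # 6 (delete six characters)
```

Since "autocorrect" is only one edit away while "auto" is six, the algorithm picks "autocorrect" as the correction.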
We make use of the autocorrect Python library to correct spellings.
autocorrect is a Python library used to correct the spelling of misspelled words in different languages. It provides a Speller class; a Speller object, when called with a word, returns the correct spelling of that word.
Let's look at the following exercise to get a better understanding of this.
Exercise 1.06: Spelling Correction of a Word and a Sentence
In this exercise, we will perform spelling correction on a word and a sentence, with the help of Python's autocorrect library. Follow these steps in order to complete this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
from nltk import word_tokenize
from autocorrect import Speller
- In order to correct the spelling of a word, pass the wrongly spelled word to the spell object. Before that, you have to create the spell object as an instance of the Speller class, using lang='en' to signify the English language. Insert a new cell and add the following code to implement this:
spell = Speller(lang='en')
spell('Natureal')
This code generates the following output:
'Natural'
- To correct the spelling of a sentence, first tokenize it into tokens. After that, loop through each token in sentence, autocorrect the words, and finally combine the words. Insert a new cell and add the following code to implement this:
sentence = word_tokenize("Ntural Luanguage Processin deals with "\
                         "the art of extracting insightes from "\
                         "Natural Languaes")
- Use the print() function to print all tokens. Insert a new cell and add the following code to print the tokens:
print(sentence)
This code generates the following output:
['Ntural', 'Luanguage', 'Processin', 'deals', 'with', 'the', 'art', 'of', 'extracting', 'insightes', 'from', 'Natural', 'Languaes']
- Now that we have got the tokens, loop through each token in sentence, correct the tokens, and assign them to a new variable. Insert a new cell and add the following code to implement this:
def correct_spelling(tokens):
    sentence_corrected = ' '.join([spell(word)
                                   for word in tokens])
    return sentence_corrected
- To print the correct sentence, insert a new cell and add the following code:
print(correct_spelling(sentence))
This code generates the following output:
Natural Language Procession deals with the art of extracting insights from Natural Languages
In the preceding code snippet, we can see that most of the wrongly spelled words have been corrected. But the word "Processin" was wrongly converted into "Procession." It should have been "Processing." This happened because to change "Processin" to "Procession" or "Processing," an equal number of edits is required. To rectify this, we need to use other kinds of spelling correctors that are aware of context.
Note
To access the source code for this specific section, please refer to https://packt.live/38YVCKJ.
You can also run this example online at https://packt.live/3gVpbj4.
In the next section, we will look at stemming, which is another form of text normalization.
Stemming
In most languages, words get transformed into various forms when being used in a sentence. For example, the word "product" might get transformed into "production" when referring to the process of making something, or into "products" in its plural form. It is necessary to convert these words into their base forms, as they carry the same meaning in any case. Stemming is the process that helps us to do so.
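As a toy illustration of suffix stripping (and emphatically not the actual Porter algorithm, which applies a carefully ordered sequence of conditional rules), the core idea looks like this:

```python
# A naive suffix stripper illustrating the idea behind rule-based
# stemming. Real stemmers (Porter, Snowball) apply ordered rules with
# conditions; this toy version removes the first matching suffix.
SUFFIXES = ("ing", "ion", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip if enough of the word remains to be meaningful.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("products"))    # 'product'
print(naive_stem("production"))  # 'product'
print(naive_stem("coming"))      # 'com' -- cruder than Porter's 'come'
```

The last example shows why naive stripping is not enough: stems are not always valid words, which is exactly the weakness lemmatization addresses later in this chapter.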
To get a better understanding of stemming, let's perform a simple exercise.
In this exercise, we will be using two algorithms, the Porter stemmer and the Snowball stemmer, provided by the NLTK library. The Porter stemmer is a rule-based algorithm that transforms words to their base form by removing suffixes. The Snowball stemmer is an improvement over the Porter stemmer; it is slightly faster and uses less memory. In NLTK, stemming is done with the stem() method provided by the PorterStemmer and SnowballStemmer classes.
Exercise 1.07: Using Stemming
In this exercise, we will pass a few words through the stemming process so that they get converted into their base forms. Follow these steps to implement this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
from nltk import stem
- Now pass the following words as parameters to the stem() method. To implement this, insert a new cell and add the following code:
def get_stems(word, stemmer):
    return stemmer.stem(word)
porterStem = stem.PorterStemmer()
get_stems("production", porterStem)
- When the input is "production", the following output is generated:
'product'
- Similarly, the following code would be used for the input "coming".
get_stems("coming",porterStem)
We get the following output:
'come'
- Similarly, the following code would be used for the input "firing".
get_stems("firing",porterStem)
When the input is "firing", the following output is generated:
'fire'
- The following code would be used for the input "battling".
get_stems("battling",porterStem)
If we give the input "battling", the following output is generated:
'battl'
- The following code will also generate the same output as above, for the input "battling".
stemmer = stem.SnowballStemmer("english")
get_stems("battling",stemmer)
The output will be as follows:
'battl'
As you have seen while using the snowball stemmer, we have to provide the language as "english". We can also use the stemmer for different languages such as Spanish, French, and many more. From the preceding code snippets, we can see that the entered words are converted into their base forms.
Note
To access the source code for this specific section, please refer to https://packt.live/2DLzisD.
You can also run this example online at https://packt.live/30h147K.
In the next section, we will focus on lemmatization, which is another form of text normalization.
Lemmatization
Sometimes, the stemming process leads to incorrect results. For example, in the last exercise, the word "battling" was transformed to "battl", which is not a word. To overcome such problems with stemming, we make use of lemmatization. Lemmatization is the process of converting words to their base grammatical form, as in "battling" to "battle," rather than just crudely truncating them. In this process, an additional check is made by looking through a dictionary to extract the base form of a word. Getting more accurate results requires some additional information; for example, providing PoS tags along with words helps in getting better results.
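The dictionary-lookup idea can be sketched with a toy exception table (WordNet's real lookup routine uses full per-PoS exception lists plus ordered suffix rules; the entries below are illustrative):

```python
# A toy dictionary-backed lemmatizer: irregular forms come from an
# exception table, and otherwise one crude suffix rule is tried.
# This only illustrates the lookup idea behind lemmatization.
EXCEPTIONS = {"battling": "battle", "coming": "come", "does": "do"}

def toy_lemmatize(word):
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]  # crude plural handling
    return word

print(toy_lemmatize("battling"))  # 'battle'
print(toy_lemmatize("products"))  # 'product'
```

Unlike the stemmer's "battl", the dictionary lookup returns the valid word "battle", which is the whole point of lemmatization.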
In the following exercise, we will be using WordNetLemmatizer, which is an NLTK interface of WordNet. WordNet is a freely available lexical English database that can be used to generate semantic relationships between words. NLTK's WordNetLemmatizer provides a method called lemmatize(), which returns the lemma (grammatical base form) of a given word using WordNet.
To put lemmatization into practice, let's perform an exercise where we'll use the lemmatize() function.
Exercise 1.08: Extracting the Base Word Using Lemmatization
In this exercise, we will use the lemmatization process to produce the proper form of a given word. Follow these steps to implement this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
from nltk import download
download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
- Create an object of the WordNetLemmatizer class. Insert a new cell and add the following code to implement this:
lemmatizer = WordNetLemmatizer()
- Bring the word to its proper form by using the lemmatize() method of the WordNetLemmatizer class. Insert a new cell and add the following code to implement this:
def get_lemma(word):
    return lemmatizer.lemmatize(word)
get_lemma('products')
With the input products, the following output is generated:
'product'
- Similarly, use the input as production now:
get_lemma('production')
With the input production, the following output is generated:
'production'
- Similarly, use the input as coming now:
get_lemma('coming')
With the input coming, the following output is generated:
'coming'
Hence, we have learned how to use the lemmatization process to transform a given word into its base form. Note that "coming" is returned unchanged because lemmatize() assumes the word is a noun by default; passing pos='v' would return "come".
Note
To access the source code for this specific section, please refer to https://packt.live/3903ETS.
You can also run this example online at https://packt.live/2Wlqu33.
In the next section, we will look at another preprocessing step in NLP: named entity recognition (NER).
Named Entity Recognition (NER)
NER is the process of extracting important entities, such as person names, place names, and organization names, from some given text. These are usually not present in dictionaries. So, we need to treat them differently. The main objective of this process is to identify the named entities (such as proper nouns) and map them to categories, which are already defined. For example, categories might include names of people, places, and so on.
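The mapping of names to predefined categories can be sketched with a toy gazetteer lookup (real NER systems use trained sequence models rather than fixed lists; the entries below are illustrative):

```python
# A toy gazetteer-based entity spotter: known names map directly to
# categories. Real NER uses trained models and context, since the same
# string can name different entity types in different sentences.
GAZETTEER = {"Packt": "ORGANIZATION", "Birmingham": "LOCATION"}

def spot_entities(tokens):
    return [(t, GAZETTEER[t]) for t in tokens if t in GAZETTEER]

print(spot_entities(["We", "are", "reading", "a", "book",
                     "published", "by", "Packt"]))
# [('Packt', 'ORGANIZATION')]
```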
NER has found use in many NLP tasks, including assigning tags to news articles, search algorithms, and more. NER can analyze a news article, extract the major people, organizations, and places discussed in it, and assign them as tags for the article.
In the case of search algorithms, let's suppose we have to create a search engine, meant specifically for books. If we were to submit a given query for all the words, the search would take a lot of time. Instead, if we extract the top entities from all the books using NER and run a search query on the entities rather than all the content, the speed of the system would increase dramatically.
To get a better understanding of this process, we'll perform an exercise. Before moving on to the exercise, let me introduce you to chunking, which we are going to use in the following exercise. Chunking is the process of grouping words together into chunks, which can be further used to find noun groups and verb groups, or can also be used for sentence partitioning.
Exercise 1.09: Treating Named Entities
In this exercise, we will find the named entities in a given sentence. Follow these steps to implement this exercise using the following sentence:
"We are reading a book published by Packt which is based out of Birmingham."
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
from nltk import download
from nltk import pos_tag
from nltk import ne_chunk
from nltk import word_tokenize
download('maxent_ne_chunker')
download('words')
- Declare the sentence variable and assign it a string. Insert a new cell and add the following code to implement this:
sentence = "We are reading a book published by Packt "\
           "which is based out of Birmingham."
- To find the named entities from the preceding text, insert a new cell and add the following code:
def get_ner(text):
    i = ne_chunk(pos_tag(word_tokenize(text)), binary=True)
    return [a for a in i if len(a) == 1]
get_ner(sentence)
This code generates the following output:
[Tree('NE', [('Packt', 'NNP')]), Tree('NE', [('Birmingham', 'NNP')])]
In the preceding output, we can see that the code identifies the named entities "Packt" and "Birmingham" and wraps each in an NE (named entity) chunk; "NNP" is the PoS tag for a singular proper noun.
Note
To access the source code for this specific section, please refer to https://packt.live/3ezeukC.
You can also run this example online at https://packt.live/32rsOJs.
In the next section, we will focus on word sense disambiguation, which helps us to identify the right sense of any word.