Resources for accessing free corpora
Obtaining a corpus can be a challenging task, but in this section, I will provide you with some links from which you can download free corpora and use them to build NLP applications.
The nltk library provides some built-in corpora. To list all the corpus names, execute the following commands:
import nltk.corpus
dir(nltk.corpus)         # Python shell
print(dir(nltk.corpus))  # PyCharm IDE syntax
In Figure 2.2, you can see the output of the preceding code; the highlighted part indicates the names of the corpora that are already installed.
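The raw output of dir() also contains private names that start with an underscore, so it can be hard to read. The following is a minimal sketch of one way to tidy it up, not part of nltk itself: the helper names list_public_names and list_corpus_names are hypothetical and were chosen only for this illustration. The sketch assumes that nltk is installed; if it is not, the second helper simply returns an empty list.

```python
import importlib


def list_public_names(module):
    """Return the sorted attribute names of a module, skipping names that start with '_'."""
    return sorted(name for name in dir(module) if not name.startswith("_"))


def list_corpus_names(module_name="nltk.corpus"):
    """Return the public names in nltk.corpus, or an empty list if the module cannot be imported.

    Hypothetical helper for illustration only; it inspects names and does not load any corpus data.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return []
    return list_public_names(module)


if __name__ == "__main__":
    names = list_corpus_names()
    print(len(names), "public names found in nltk.corpus")
    print(names[:10])
```

Note that this only lists the names exposed by the nltk.corpus module. The data behind a given name may still need to be fetched with nltk.download() before you can read it.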
If you want to explore more corpus resources, take a look at Big Data: 33 Brilliant and Free Data Sources for 2016, Bernard Marr (https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/#53369cd5b54d).
So far, we have covered a lot of the basics. Now let me give you an idea of how we can prepare a dataset for a natural language processing application that will be developed with the help of machine learning.