Python Natural Language Processing

Resources for accessing free corpora

Getting a corpus is a challenging task, but in this section, I will provide some links from which you can download free corpora and use them to build NLP applications.

The nltk library provides some built-in corpora. To list all the corpus names, execute the following commands:

    import nltk.corpus
    dir(nltk.corpus)         # Python shell: the result is echoed automatically
    print(dir(nltk.corpus))  # script or PyCharm IDE: print the result explicitly
  

In Figure 2.2, you can see the output of the preceding code; the highlighted part indicates the names of the corpora that are available through nltk:

Figure 2.2: List of all available corpora in nltk
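
The output of dir(nltk.corpus) also contains reader classes and helper names, so it can be handy to filter it down to the corpus objects themselves. The following is a minimal sketch, not the book's own code: the helper names list_corpus_names and is_downloaded are hypothetical, and the sketch assumes that nltk reports missing corpus data with a LookupError when a corpus is first accessed.

```python
import nltk.corpus
from nltk.corpus.reader import CorpusReader
from nltk.corpus.util import LazyCorpusLoader


def list_corpus_names():
    """Return the public names in nltk.corpus that refer to corpus objects.

    A corpus starts out as a LazyCorpusLoader and turns into a CorpusReader
    once it has been loaded, so both types are accepted here.
    """
    names = []
    for name in dir(nltk.corpus):
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        obj = getattr(nltk.corpus, name)
        if isinstance(obj, (LazyCorpusLoader, CorpusReader)):
            names.append(name)
    return sorted(names)


def is_downloaded(name):
    """Return True if the data for the named corpus can be loaded locally.

    Assumption: accessing a corpus whose data is missing raises LookupError.
    """
    try:
        getattr(nltk.corpus, name).fileids()  # forces the lazy loader to load
    except LookupError:
        return False
    return True


corpus_names = list_corpus_names()
print(len(corpus_names), "corpus names found")
print("brown downloaded:", is_downloaded("brown"))
# If a corpus is missing, nltk.download("brown") fetches it.
```

A name appearing in the list only means that nltk knows how to read that corpus; the data itself may still need to be downloaded before you can use it.
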
If you want to use an IDE to develop NLP applications in Python, you can use the PyCharm Community edition. Its installation steps are available at the following URL: https://github.com/jalajthanaki/NLPython/blob/master/ch2/Pycharm_installation_guide.md

If you want to explore more corpus resources, take a look at Big Data: 33 Brilliant and Free Data Sources for 2016 by Bernard Marr (https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/#53369cd5b54d).

So far, we have covered the basics. Next, let me give you an idea of how to prepare a dataset for a natural language processing application that will be developed with the help of machine learning.