Python Natural Language Processing

Web scraping

To develop a web scraping tool, we can use libraries such as beautifulsoup and scrapy. Here, I'm giving some basic code for web scraping.

Take a look at the code snippet in Figure 2.6, which is used to develop a basic web scraper using beautifulsoup:

Figure 2.6: Basic web scraper tool using beautifulsoup

Figure 2.7 demonstrates the output:

Figure 2.7: Output of basic web scraper using beautifulsoup
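The book's full script is linked below; as a minimal sketch of what such a beautifulsoup scraper looks like, the following snippet inlines the HTML so it runs offline. On a real page, you would first download the content with urllib or the requests library:

```python
# A minimal beautifulsoup scraper sketch. The HTML is inlined here so the
# example runs offline; in practice you would fetch the page first, e.g.
# with urllib.request.urlopen() or requests.get().
from bs4 import BeautifulSoup

# Stand-in for the HTML you would download from the target site.
html = """
<html><body>
  <h1>Example page</h1>
  <a href="https://example.com/a">First link</a>
  <a href="https://example.com/b">Second link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract the page heading and every hyperlink with its anchor text.
print(soup.h1.get_text())
for link in soup.find_all("a"):
    print(link.get("href"), "->", link.get_text())
```

The same `find_all` and `get_text` calls work on any parsed page; only the fetching step changes.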

You can find the installation guide for beautifulsoup and scrapy at this link:

https://github.com/jalajthanaki/NLPython/blob/master/ch2/Chapter_2_Installation_Commands.txt.

You can find the code at this link:

https://github.com/jalajthanaki/NLPython/blob/master/ch2/2_2_Basic_webscraping_byusing_beautifulsuop.py.

If you see any warnings while running the script, that's fine; you can safely ignore them.

Now, let's do some web scraping using scrapy. For that, we need to create a new scrapy project.

To create the scrapy project, execute the following command in your terminal:

  $ scrapy startproject project_name
  

I'm creating a scrapy project named web_scraping_test; the command is as follows:

  $ scrapy startproject web_scraping_test
  

Once you execute the preceding command, you can see the output as shown in Figure 2.8:

Figure 2.8: Output when you create a new scrapy project

After creating a project, perform the following steps:

  1. Edit your items.py file, which has been created already.
  2. Create the WebScrapingTestspider file inside the spiders directory.
  3. Go to the website page that you want to scrape and select the xpath of the elements you need. You can read more about xpath selectors at this link:
    https://doc.scrapy.org/en/1.0/topics/selectors.html

Take a look at the code snippet in Figure 2.9. The code is available at this GitHub URL:

https://github.com/jalajthanaki/NLPython/tree/master/web_scraping_test

Figure 2.9: The items.py file where we have defined items we need to scrape

The code in Figure 2.10 develops a basic web scraper using scrapy:

Figure 2.10: Spider file containing actual code

Figure 2.11 demonstrates the output, which is in the form of a CSV file:

Figure 2.11: Output of scraper is redirected to a CSV file

If you get any SSL-related warnings, refer to the answer at this link:

https://stackoverflow.com/questions/29134512/insecureplatformwarning-a-true-sslcontext-object-is-not-available-this-prevent

You can develop a web scraper that bypasses AJAX and scripts, but you need to be very careful when you do this, and you must keep in mind that you are not doing anything unethical. So we are not going to cover bypassing AJAX and scripts to scrape data here; out of curiosity, you can search the web for how people actually do this. You can also use the Selenium library to automate clicks and other web events.