Parsing HTML with lxml
Another powerful, fast, and flexible parser is the HTML parser that comes with lxml. lxml is an extensive library written for parsing both XML and HTML documents, and it can handle malformed tags in the process.
Let's start with an example.
Here, we will use the requests module to retrieve the web page and parse it with lxml:
# Importing modules
from lxml import html
import requests

response = requests.get('http://packtpub.com/')
tree = html.fromstring(response.content)
Now the whole HTML is saved to tree in a nice tree structure that we can inspect in two different ways: XPath or CSS selectors. XPath is used to navigate through elements and attributes to find information in structured documents such as HTML or XML.
We can use any of the page inspection tools, such as Firebug or the Chrome developer tools, to get the XPath of an element.
If we want to get the book names and prices from the list, we need to find the following section in the page source:
<div class="book-block-title" itemprop="name">Book 1</div>
From this, we can create the XPath expression as follows:
# Create the list of books
books = tree.xpath('//div[@class="book-block-title"]/text()')
Then we can print the list using the following code:
print(books)
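As mentioned earlier, the tree can also be queried with CSS selectors instead of XPath. The following is a minimal sketch; it assumes the cssselect package is installed, which lxml's cssselect() method relies on:

# Same query using a CSS selector (requires the cssselect package)
titles = [div.text_content() for div in tree.cssselect('div.book-block-title')]
print(titles)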
Note
Learn more about lxml at http://lxml.de.
Scrapy
Scrapy is an open-source framework for web scraping and web crawling. It can be used to parse whole websites and, as a framework, it helps build spiders for specific requirements. Besides Scrapy, we can use mechanize to write scripts that can fill in and submit forms.
We can utilize the command-line interface of Scrapy to create the basic boilerplate for new spider scripts. Scrapy can be installed with pip.
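For example, on most systems it can be installed with the following command:

$ pip install scrapy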
To create a new spider, we have to run the following command in the terminal after installing Scrapy:
$ scrapy startproject testSpider
This will generate a project folder named testSpider in the current working directory. It will also create a basic structure and the files needed for our spider inside that folder.
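The exact files vary slightly between Scrapy versions, but the generated project layout typically looks something like this:

testSpider/
    scrapy.cfg
    testSpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py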
Scrapy also has CLI commands to create a spider. To create one, we have to enter the folder generated by the startproject command:
$ cd testSpider
Then we have to run the genspider command:
$ scrapy genspider pactpub pactpub.com
This will generate the required files for the spider inside the spiders folder, and the folder structure will be updated accordingly.
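The spiders folder will then typically contain the new spider module (the exact contents can vary by Scrapy version):

testSpider/testSpider/spiders/
    __init__.py
    pactpub.py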
Now open the items.py file and define a new item subclass called TestspiderItem:
from scrapy.item import Item, Field


class TestspiderItem(Item):
    # Define the fields for your item here
    book = Field()
Most of the crawling logic is provided by Scrapy in the generated pactpub spider, so we can extend it to write our own spider. To do this, we have to edit the pactpub.py file in the spiders folder.
Inside the pactpub.py file, we first import the required modules:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from pprint import pprint
from testSpider.items import TestspiderItem
Then, we have to extend Scrapy's Spider class to define our PactpubSpider class. Here, we can define the allowed domain and the initial URLs for crawling:
# Extend the Spider class
class PactpubSpider(Spider):
    name = "pactpub"
    allowed_domains = ["pactpub.com"]
    start_urls = (
        'https://www.pactpub.com/all',
    )
After that, we have to define the parse method, which creates an instance of the TestspiderItem() we defined in the items.py file for each matching element and appends it to the items list. The data to extract can be selected with XPath or CSS-style selectors. Here, we are using an XPath selector:
    # Define the parse method
    def parse(self, response):
        res = Selector(response)
        items = []
        for sel in res.xpath('//div[@class="book-block"]'):
            item = TestspiderItem()
            # Use a relative XPath (.//) so we only get the title inside this block
            item['book'] = sel.xpath('.//div[@class="book-block-title"]/text()').extract()
            items.append(item)
        return items
Now we are ready to run the spider. We can run it using the following command:
$ scrapy crawl pactpub --output results.json
This will start Scrapy with the URLs we defined; each crawled page is parsed into a TestspiderItem, with a new instance created for each item, and the results are written to results.json.
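The output file contains the scraped items serialized as JSON. With the book field defined above, each entry holds the extracted title text; a hypothetical excerpt might look like this (the titles here are placeholders, not actual results):

[
    {"book": ["Example Book Title 1"]},
    {"book": ["Example Book Title 2"]}
]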
E-mail gathering
Using the Python modules discussed previously, we can gather e-mails and other information from the web.
To get e-mail IDs from a website, we may have to write customized scraping scripts.
Here, we discuss a common method of extracting e-mails from a web page with Python.
Let's go through an example. Here, we are using BeautifulSoup and the requests module:
# Importing modules
from bs4 import BeautifulSoup
import requests
import requests.exceptions
import urlparse   # Python 2 module; on Python 3 use urllib.parse instead
from collections import deque
import re
Next, we will provide the list of URLs to crawl:
# List of URLs to be crawled
urls = deque(['https://www.packtpub.com/'])
Next, we store the processed URLs in a set so as not to process them twice:
# URLs that we have already crawled
scraped_urls = set()
Collected e-mails are also stored in a set:
# Crawled e-mails
emails = set()
When we start scraping, we will take a URL from the queue, process it, and add it to the set of processed URLs. We will repeat this until the queue is empty:
# Scrape URLs one by one until the queue is empty
while len(urls):
    # Move the next URL from the queue to the set of scraped URLs
    url = urls.popleft()
    scraped_urls.add(url)
With the urlparse module, we will get the base URL. This will be used to convert relative links into absolute links:
    # Get the base URL
    parts = urlparse.urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url
The content of the URL is fetched inside a try-except block. In case of an error, we skip to the next URL:
    # Get the URL's content
    print("Scraping %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # Ignore errors and move on to the next URL
        continue
Inside the response, we will search for e-mail addresses and add any we find to the emails set:
    # Search for e-mail addresses and add them to the output set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
After scraping the page, we will parse it with BeautifulSoup, collect all the links to other pages, and update the URL queue:
    # Parse the page so we can extract the anchors
    soup = BeautifulSoup(response.text, "html.parser")

    # Find and process all the anchors
    for anchor in soup.find_all("a"):
        # Extract the link URL
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # Resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # Add the new URL to the queue if we have not seen it before
        if link not in urls and link not in scraped_urls:
            urls.append(link)
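Once the loop finishes (in practice you may want to bound the crawl or catch KeyboardInterrupt, since the queue can grow quickly), the collected addresses are available in the emails set and can be printed after the loop, for example:

# Print the collected e-mail addresses, one per line
for email in emails:
    print(email)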