Crawling the web
Given the hyperlinked nature of web pages, starting from a known place and following links to other pages is a very important tool in your arsenal when scraping the web.
To do so, we crawl a page looking for a short phrase, and we print any paragraph that contains it. We will search only pages that belong to a single site, for example, only URLs starting with www.somesite.com. We won't follow links to external sites.
Getting ready
This recipe builds on the concepts introduced so far: it involves downloading and parsing pages to find links, and then continuing to download the linked pages.
When crawling the web, remember to set limits when downloading. It's very easy to crawl over too many pages. As anyone checking Wikipedia can confirm, the internet is potentially limitless.
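For example, a minimal sketch of such limits, independent of the recipe's script, could cap the number of downloads and pause briefly between requests (the names MAX_PAGES and DELAY_SECONDS here are only illustrative):

import time
import requests

MAX_PAGES = 10        # hard cap on the number of pages to download
DELAY_SECONDS = 1.0   # polite pause between requests

def fetch_limited(urls):
    '''Download at most MAX_PAGES of the given URLs, pausing between each.'''
    responses = []
    for url in urls[:MAX_PAGES]:
        responses.append(requests.get(url))
        time.sleep(DELAY_SECONDS)
    return responses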
We'll use a prepared example, available in the GitHub repo at https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/tree/master/Chapter03/test_site. Download the whole site and run the included script:
$ python simple_delay_server.py
This serves the site at the URL http://localhost:8000. You can check it in a browser. It's a simple blog with three entries.
Most of it is uninteresting, but we added a couple of paragraphs that contain the keyword python:
Figure 3.1: A screenshot of the blog
How to do it...
1. The full script, crawling_web_step1.py, is available on GitHub at the following link: https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/blob/master/Chapter03/crawling_web_step1.py. The most relevant bits are displayed here:

...

def process_link(source_link, text):
    logging.info(f'Extracting links from {source_link}')
    parsed_source = urlparse(source_link)
    result = requests.get(source_link)
    # Error handling. See GitHub for details
    ...
    page = BeautifulSoup(result.text, 'html.parser')
    search_text(source_link, page, text)
    return get_links(parsed_source, page)


def get_links(parsed_source, page):
    '''Retrieve the links on the page'''
    links = []
    for element in page.find_all('a'):
        link = element.get('href')
        # Validate it is a valid link. See GitHub for details
        ...
        links.append(link)
    return links
2. Search for references to python to return a list of the URLs that contain the term, together with the paragraph where it appears. Notice there are a couple of errors because of broken links:

$ python crawling_web_step1.py http://localhost:8000/ -p python
Link http://localhost:8000/: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/5eabef23f63024c20389c34b94dee593-1.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/33714fc865e02aeda2dabb9a42a787b2-0.html: --> This is the actual bit with a python reference that we are interested in.
Link http://localhost:8000/files/archive-september-2018.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/index.html: --> A smaller article , that contains a reference to Python
3. Another good search term is crocodile. Try it out:

$ python crawling_web_step1.py http://localhost:8000/ -p crocodile
How it works...
Let's see each of the components of the script:
1. A loop that goes through all the found links, in the main function:

def main(base_url, to_search):
    checked_links = set()
    to_check = [base_url]
    max_checks = 10

    while to_check and max_checks:
        link = to_check.pop(0)
        links = process_link(link, text=to_search)
        checked_links.add(link)
        for link in links:
            if link not in checked_links:
                checked_links.add(link)
                to_check.append(link)
        max_checks -= 1
Note that there's a retrieval limit of 10 pages, and that the code checks that any new link hasn't been added already. These two elements act as limits for the script: we won't download the same link twice, and we'll stop at some point.
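As a small design note, to_check.pop(0) removes items from the front of a list, which gets slower as the queue grows. A minimal sketch of the same pattern using collections.deque (a variation of ours, assuming the recipe's process_link function is available) could look like this:

from collections import deque

def crawl(base_url, to_search, max_checks=10):
    '''Same visited-set plus page-cap pattern, with an O(1) queue.'''
    checked_links = set()
    to_check = deque([base_url])
    while to_check and max_checks:
        link = to_check.popleft()
        checked_links.add(link)
        for new_link in process_link(link, text=to_search):
            if new_link not in checked_links:
                checked_links.add(new_link)
                to_check.append(new_link)
        max_checks -= 1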
2. Downloading and parsing the link, in the process_link function:

def process_link(source_link, text):
    logging.info(f'Extracting links from {source_link}')
    parsed_source = urlparse(source_link)
    result = requests.get(source_link)
    if result.status_code != http.client.OK:
        logging.error(f'Error retrieving {source_link}: {result}')
        return []

    if 'html' not in result.headers['Content-type']:
        logging.info(f'Link {source_link} is not an HTML page')
        return []

    page = BeautifulSoup(result.text, 'html.parser')
    search_text(source_link, page, text)
    return get_links(parsed_source, page)
The code here downloads the file and checks that the status is correct, to skip errors such as broken links. It also checks that the type (as described in Content-Type) is an HTML page, to skip PDFs and other formats. Finally, it parses the raw HTML into a BeautifulSoup object.
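Note that result.headers['Content-type'] will raise a KeyError if the server doesn't return that header at all. A slightly more defensive variation of the check (our own sketch, not the code as printed above) uses .get() with a default:

# In process_link, a more defensive version of the Content-Type check
content_type = result.headers.get('Content-Type', '')
if 'html' not in content_type:
    logging.info(f'Link {source_link} is not an HTML page')
    return []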
object.The code also parses the source link using
urlparse
, so later, in step 4, it can skip all the references to external sources.urlparse
divides a URL into its constituent elements:>>> from urllib.parse import urlparse >>> urlparse('http://localhost:8000/files/b93bec5d9681df87e6e8d5703ed7cd81-2.html') ParseResult(scheme='http', netloc='localhost:8000', path='/files/b93bec5d9681df87e6e8d5703ed7cd81-2.html', params='', query='', fragment='')
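Related to this, get_links (shown in step 4) uses urljoin to resolve relative links against the path of the current page. A quick interpreter session illustrates that behavior (the file names are only examples):

>>> from urllib.parse import urljoin
>>> # A relative link is resolved against the directory of the current page
>>> urljoin('/files/archive-september-2018.html', 'index.html')
'/files/index.html'
>>> # An absolute path replaces the whole path
>>> urljoin('/files/archive-september-2018.html', '/index.html')
'/index.html'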
3. The code finds the text to search, in the search_text function:

def search_text(source_link, page, text):
    '''Search for an element with the searched text and print it'''
    for element in page.find_all(text=re.compile(text, flags=re.IGNORECASE)):
        print(f'Link {source_link}: --> {element}')
This searches the parsed object for the specified text. Note that the search is done as a regex, and only in the text of the page. It prints the resulting matches, including source_link, referencing the URL where the match was found:

for element in page.find_all(text=re.compile(text, flags=re.IGNORECASE)):
    print(f'Link {source_link}: --> {element}')
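Because the search term is treated as a regular expression, you can make the match more precise if you need to. For example, a small variation of ours (not part of the recipe) that escapes any special characters in the term and matches only whole words:

import re

def build_pattern(text):
    '''Compile a case-insensitive, whole-word pattern from a plain search term.'''
    return re.compile(rf'\b{re.escape(text)}\b', flags=re.IGNORECASE)

# Usage inside search_text:
# for element in page.find_all(text=build_pattern(text)):
#     print(f'Link {source_link}: --> {element}')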
4. The get_links function retrieves all links on a page:

def get_links(parsed_source, page):
    '''Retrieve the links on the page'''
    links = []
    for element in page.find_all('a'):
        link = element.get('href')
        if not link:
            continue
        # Avoid internal, same page links
        if link.startswith('#'):
            continue
        if link.startswith('mailto:'):
            # Ignore other links like mailto
            # More cases like ftp or similar may be included here
            continue
        # Always accept local links
        if not link.startswith('http'):
            netloc = parsed_source.netloc
            scheme = parsed_source.scheme
            path = urljoin(parsed_source.path, link)
            link = f'{scheme}://{netloc}{path}'
        # Only parse links in the same domain
        if parsed_source.netloc not in link:
            continue
        links.append(link)
    return links
This searches the parsed page for all <a> elements and retrieves their href attributes, keeping only elements that have an href and that are either a fully qualified URL (starting with http) or a local link. This removes links that are not a URL, such as a '#' link, which points within the same page.
Keep in mind that some references could have other effects, for example, the mailto: scheme. There is a check to avoid mailto: schemes, but other schemes, such as ftp or irc, could also appear, though they are rarely seen in practice.
An extra check verifies that the links have the same source as the original link; only then are they registered as valid links. The netloc attribute detects whether a link comes from the same domain as the parsed URL generated in step 2. We won't follow links that point to a different address (for example, http://www.google.com).
Finally, the links are returned, where they'll be added to the loop described in step 1.
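One detail worth noting: the domain check above is a substring test (parsed_source.netloc not in link), which works for this example but could accept an external URL that happens to contain the same text. A stricter variation (our own sketch, not the recipe's code) compares the parsed components directly:

from urllib.parse import urlparse

def is_same_site(parsed_source, link):
    '''Accept only http/https links whose network location matches exactly.'''
    parsed_link = urlparse(link)
    return (parsed_link.scheme in ('http', 'https')
            and parsed_link.netloc == parsed_source.netloc)

# Usage inside get_links, instead of the substring check:
# if not is_same_site(parsed_source, link):
#     continue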
There's more...
Further filters could be enforced; for example, all links that end in .pdf could be discarded, as they likely refer to PDF files:

# In get_links
if link.endswith('.pdf'):
    continue
The Content-Type header can also be used to decide how to parse the returned object in different ways. Keep in mind that Content-Type won't be available without making the request, which means the code cannot skip links without requesting them. A PDF result (Content-Type: application/pdf) won't have a valid response.text value to be parsed, but it can be parsed in other ways. The same goes for other types, such as a CSV file (Content-Type: text/csv) or a ZIP file that may need to be decompressed (Content-Type: application/zip). We'll see how to deal with those later.
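To illustrate the idea, here is a rough sketch of ours (not the recipe's code; the downloaded.pdf file name is just an example) that dispatches on the Content-Type of an already downloaded response:

import csv
import io

def handle_response(result):
    '''Decide how to treat a downloaded requests response based on its type.'''
    content_type = result.headers.get('Content-Type', '')
    if 'html' in content_type:
        return result.text                        # parse with BeautifulSoup
    if 'text/csv' in content_type:
        return list(csv.reader(io.StringIO(result.text)))
    if 'application/pdf' in content_type:
        with open('downloaded.pdf', 'wb') as fp:  # binary content, not .text
            fp.write(result.content)
        return None
    return None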
See also
- The Downloading web pages recipe, earlier in this chapter, to learn the basics of requesting web pages.
- The Parsing HTML recipe, earlier in this chapter, to learn how to parse elements in HTML.