Crawling the web
Given the hyperlinked nature of web pages, starting from a known place and following links to other pages is a very important tool in your arsenal when scraping the web.
To do so, we crawl a page looking for a short phrase, and we print any paragraph that contains it. We will search only pages that belong to a single site, for example, only URLs starting with www.somesite.com. We won't follow links to external sites.
Getting ready
This recipe builds on the concepts introduced so far: it involves downloading and parsing pages to find links, and then continuing to download the linked pages.
When crawling the web, remember to set limits when downloading. It's very easy to crawl over too many pages. As anyone checking Wikipedia can confirm, the internet is potentially limitless.
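For example, a minimal sketch of such limits, independent of the recipe's script, could cap the number of downloads and pause briefly between requests (the names MAX_PAGES and DELAY_SECONDS here are only illustrative):

import time
import requests

MAX_PAGES = 10        # hard cap on the number of pages to download
DELAY_SECONDS = 1.0   # polite pause between requests

def fetch_limited(urls):
    '''Download at most MAX_PAGES of the given URLs, pausing between each.'''
    responses = []
    for url in urls[:MAX_PAGES]:
        responses.append(requests.get(url))
        time.sleep(DELAY_SECONDS)
    return responses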
We'll use a prepared example, available in the GitHub repo at https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/tree/master/Chapter03/test_site. Download the whole site and run the included script:
$ python simple_delay_server.py
This serves the site at the URL http://localhost:8000. You can check it in a browser. It's a simple blog with three entries.
Most of it is uninteresting, but we added a couple of paragraphs that contain the keyword python:
Figure 3.1: A screenshot of the blog
How to do it...
1. The full script, crawling_web_step1.py, is available on GitHub at the following link: https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/blob/master/Chapter03/crawling_web_step1.py. The most relevant bits are displayed here:

...

def process_link(source_link, text):
    logging.info(f'Extracting links from {source_link}')
    parsed_source = urlparse(source_link)
    result = requests.get(source_link)
    # Error handling. See GitHub for details
    ...
    page = BeautifulSoup(result.text, 'html.parser')
    search_text(source_link, page, text)
    return get_links(parsed_source, page)


def get_links(parsed_source, page):
    '''Retrieve the links on the page'''
    links = []
    for element in page.find_all('a'):
        link = element.get('href')
        # Validate it is a valid link. See GitHub for details
        ...
        links.append(link)
    return links
2. Search for references to python to return a list of the URLs that contain the term, together with the paragraph where it appears. Notice there are a couple of errors because of broken links:

$ python crawling_web_step1.py http://localhost:8000/ -p python
Link http://localhost:8000/: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/5eabef23f63024c20389c34b94dee593-1.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/33714fc865e02aeda2dabb9a42a787b2-0.html: --> This is the actual bit with a python reference that we are interested in.
Link http://localhost:8000/files/archive-september-2018.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/index.html: --> A smaller article , that contains a reference to Python
3. Another good search term is crocodile. Try it out:

$ python crawling_web_step1.py http://localhost:8000/ -p crocodile
How it works...
Let's see each of the components of the script:
1. A loop that goes through all the found links, in the main function:

def main(base_url, to_search):
    checked_links = set()
    to_check = [base_url]
    max_checks = 10

    while to_check and max_checks:
        link = to_check.pop(0)
        links = process_link(link, text=to_search)
        checked_links.add(link)
        for link in links:
            if link not in checked_links:
                checked_links.add(link)
                to_check.append(link)
        max_checks -= 1
Note that there's a retrieval limit of 10 pages, and that the code checks that any new link hasn't been added already. These two elements act as limits for the script: we won't download the same link twice, and we'll stop at some point.
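As a small design note, to_check.pop(0) removes items from the front of a list, which gets slower as the queue grows. A minimal sketch of the same pattern using collections.deque (a variation of ours, assuming the recipe's process_link function is available) could look like this:

from collections import deque

def crawl(base_url, to_search, max_checks=10):
    '''Same visited-set plus page-cap pattern, with an O(1) queue.'''
    checked_links = set()
    to_check = deque([base_url])
    while to_check and max_checks:
        link = to_check.popleft()
        checked_links.add(link)
        for new_link in process_link(link, text=to_search):
            if new_link not in checked_links:
                checked_links.add(new_link)
                to_check.append(new_link)
        max_checks -= 1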
2. Downloading and parsing the link, in the process_link function:

def process_link(source_link, text):
    logging.info(f'Extracting links from {source_link}')
    parsed_source = urlparse(source_link)
    result = requests.get(source_link)
    if result.status_code != http.client.OK:
        logging.error(f'Error retrieving {source_link}: {result}')
        return []

    if 'html' not in result.headers['Content-type']:
        logging.info(f'Link {source_link} is not an HTML page')
        return []

    page = BeautifulSoup(result.text, 'html.parser')
    search_text(source_link, page, text)
    return get_links(parsed_source, page)
The code here downloads the file and checks that the status is correct, to skip errors such as broken links. It also checks that the type (as described in Content-Type) is an HTML page, to skip PDFs and other formats. Finally, it parses the raw HTML into a BeautifulSoup object.
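Note that result.headers['Content-type'] will raise a KeyError if the server doesn't return that header at all. A slightly more defensive variation of the check (our own sketch, not the code as printed above) uses .get() with a default:

# In process_link, a more defensive version of the Content-Type check
content_type = result.headers.get('Content-Type', '')
if 'html' not in content_type:
    logging.info(f'Link {source_link} is not an HTML page')
    return []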
object.The code also parses the source link using
urlparse
, so later, in step 4, it can skip all the references to external sources.urlparse
divides a URL into its constituent elements:>>> from urllib.parse import urlparse >>> urlparse('http://localhost:8000/files/b93bec5d9681df87e6e8d5703ed7cd81-2.html') ParseResult(scheme='http', netloc='localhost:8000', path='/files/b93bec5d9681df87e6e8d5703ed7cd81-2.html', params='', query='', fragment='')
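Related to this, get_links (shown in step 4) uses urljoin to resolve relative links against the path of the current page. A quick interpreter session illustrates that behavior (the file names are only examples):

>>> from urllib.parse import urljoin
>>> # A relative link is resolved against the directory of the current page
>>> urljoin('/files/archive-september-2018.html', 'index.html')
'/files/index.html'
>>> # An absolute path replaces the whole path
>>> urljoin('/files/archive-september-2018.html', '/index.html')
'/index.html'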
3. The code finds the text to search, in the search_text function:

def search_text(source_link, page, text):
    '''Search for an element with the searched text and print it'''
    for element in page.find_all(text=re.compile(text, flags=re.IGNORECASE)):
        print(f'Link {source_link}: --> {element}')
This searches the parsed object for the specified text. Note that the search is done as a regex, and only in the text of the page. It prints the resulting matches, including source_link, referencing the URL where the match was found:

for element in page.find_all(text=re.compile(text, flags=re.IGNORECASE)):
    print(f'Link {source_link}: --> {element}')
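Because the search term is treated as a regular expression, you can make the match more precise if you need to. For example, a small variation of ours (not part of the recipe) that escapes any special characters in the term and matches only whole words:

import re

def build_pattern(text):
    '''Compile a case-insensitive, whole-word pattern from a plain search term.'''
    return re.compile(rf'\b{re.escape(text)}\b', flags=re.IGNORECASE)

# Usage inside search_text:
# for element in page.find_all(text=build_pattern(text)):
#     print(f'Link {source_link}: --> {element}')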
4. The get_links function retrieves all links on a page:

def get_links(parsed_source, page):
    '''Retrieve the links on the page'''
    links = []
    for element in page.find_all('a'):
        link = element.get('href')
        if not link:
            continue
        # Avoid internal, same page links
        if link.startswith('#'):
            continue
        if link.startswith('mailto:'):
            # Ignore other links like mailto
            # More cases like ftp or similar may be included here
            continue
        # Always accept local links
        if not link.startswith('http'):
            netloc = parsed_source.netloc
            scheme = parsed_source.scheme
            path = urljoin(parsed_source.path, link)
            link = f'{scheme}://{netloc}{path}'
        # Only parse links in the same domain
        if parsed_source.netloc not in link:
            continue
        links.append(link)
    return links
This searches the parsed page for all <a> elements and retrieves their href attributes, keeping only elements that have an href and that are either a fully qualified URL (starting with http) or a local link. This removes links that are not a URL, such as a '#' link, which points within the same page.
Keep in mind that some references could have other effects, for example, the mailto: scheme. There is a check to avoid mailto: schemes, but other schemes, such as ftp or irc, could also appear, though they are rarely seen in practice.
An extra check verifies that the links have the same source as the original link; only then are they registered as valid links. The netloc attribute detects whether a link comes from the same domain as the parsed URL generated in step 2. We won't follow links that point to a different address (for example, http://www.google.com).
Finally, the links are returned, where they'll be added to the loop described in step 1.
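One detail worth noting: the domain check above is a substring test (parsed_source.netloc not in link), which works for this example but could accept an external URL that happens to contain the same text. A stricter variation (our own sketch, not the recipe's code) compares the parsed components directly:

from urllib.parse import urlparse

def is_same_site(parsed_source, link):
    '''Accept only http/https links whose network location matches exactly.'''
    parsed_link = urlparse(link)
    return (parsed_link.scheme in ('http', 'https')
            and parsed_link.netloc == parsed_source.netloc)

# Usage inside get_links, instead of the substring check:
# if not is_same_site(parsed_source, link):
#     continue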
There's more...
Further filters could be enforced; for example, all links that end in .pdf could be discarded, as they likely refer to PDF files:

# In get_links
if link.endswith('.pdf'):
    continue
The Content-Type header can also be used to decide how to parse the returned object in different ways. Keep in mind that Content-Type won't be available without making the request, which means the code cannot skip links without requesting them. A PDF result (Content-Type: application/pdf) won't have a valid response.text value to be parsed, but it can be parsed in other ways. The same goes for other types, such as a CSV file (Content-Type: text/csv) or a ZIP file that may need to be decompressed (Content-Type: application/zip). We'll see how to deal with those later.
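To illustrate the idea, here is a rough sketch of ours (not the recipe's code; the downloaded.pdf file name is just an example) that dispatches on the Content-Type of an already downloaded response:

import csv
import io

def handle_response(result):
    '''Decide how to treat a downloaded requests response based on its type.'''
    content_type = result.headers.get('Content-Type', '')
    if 'html' in content_type:
        return result.text                        # parse with BeautifulSoup
    if 'text/csv' in content_type:
        return list(csv.reader(io.StringIO(result.text)))
    if 'application/pdf' in content_type:
        with open('downloaded.pdf', 'wb') as fp:  # binary content, not .text
            fp.write(result.content)
        return None
    return None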
See also
- The Downloading web pages recipe, earlier in this chapter, to learn the basics of requesting web pages.
- The Parsing HTML recipe, earlier in this chapter, to learn how to parse elements in HTML.