
Create a Simple Python Web Scraper to Get Pricing Data

There are many ways to create a web scraper in Python. One of the simplest is to combine the third-party requests library (to download web pages) with the Beautiful Soup library (to parse the pages and extract data). With my book, Python Business Intelligence Cookbook, being published soon, I was curious how, or if, the pricing my publisher sets changes over time. In order to track it, I created a simple web scraper. Code below…

But First An Explanation

The code is heavily commented and should explain what I did. However, the most difficult part of any web scraper, aside from running it in such a way that you aren't banned for “attacking” a website, is knowing how to extract the data from the webpage. This is where the Beautiful Soup library (the beautifulsoup4 package) comes in.

BeautifulSoup turns a webpage into a parse tree, making the different elements on the page accessible to you. Here's an example line of code that explains what that means:

price_ebook = soup.select('.book-top-pricing-main-ebook-price ')[1].get_text()

In this piece of code, we get the text of the second element on the page that matches the CSS class selector “.book-top-pricing-main-ebook-price ” and assign it to the variable “price_ebook”. Note that the selector as written has trailing whitespace, copied from the class attribute on the page. What is essentially a typo in the markup can easily throw you off. I kept the whitespace so the selector matches what I saw on the page, although CSS selector parsers normally ignore trailing whitespace, so it should not change which elements match.

When I was parsing the Packt webpage, I noticed there were two elements with the exact same CSS class. Using [1] lets me get the second of those elements. Why does [1] mean the second? select() returns a list of matching elements, and Python lists are zero-based, meaning the first element is [0], the second is [1], and so on.
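To make the indexing concrete, here is a minimal sketch that runs against a small inline HTML snippet instead of the live page. The markup and the prices are made up for illustration; only the class name comes from the line of code above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two elements share the same class, as described above.
# The prices are invented for this example.
html = """
<div class="book-top-pricing-main-ebook-price">$5.00</div>
<div class="book-top-pricing-main-ebook-price">$31.99</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list-like ResultSet of every matching element,
# in the order the elements appear in the document.
matches = soup.select(".book-top-pricing-main-ebook-price")

first_price = matches[0].get_text()  # index 0 is the first match
price_ebook = matches[1].get_text()  # index 1 is the second match

print(first_price, price_ebook)
```

If the page only had one matching element, matches[1] would raise an IndexError, so it is worth checking len(matches) before indexing on a page you don't control.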

For more on CSS classes and selectors, I recommend the W3Schools CSS tutorials.

And Now Here's The Code
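What follows is a minimal sketch of this kind of scraper, not my full script. It assumes a placeholder URL (BOOK_URL) and two hypothetical helper names (fetch_html and extract_ebook_price); the selector is the one discussed above.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL, used only for illustration; replace it with the page to track.
BOOK_URL = "https://www.example.com/python-business-intelligence-cookbook"


def fetch_html(url: str) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_ebook_price(html: str) -> str:
    """Return the text of the second element with the ebook price class."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select(".book-top-pricing-main-ebook-price")
    # The page described above has two matching elements; fail loudly otherwise.
    if len(matches) < 2:
        raise ValueError("expected at least two price elements on the page")
    # strip() is an addition here to drop whitespace around the price text.
    return matches[1].get_text().strip()
```

Calling extract_ebook_price(fetch_html(BOOK_URL)) would then return the price text. Keeping the download and the parsing in separate functions means the parsing step can be tried out on saved HTML without hitting the website again.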

Comments

  1. Very nice post on the power of the requests library! One thing, in my opinion, is you should wrap remove_all_whitespace, trim_the_ends, and remove_unneeded_chars all in a single function called something like clean_string(). That way you can remove a lot of duplication.

    Cool stuff!
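    The suggestion above could be sketched roughly as follows. This is only an illustration: what the three helper functions actually do is assumed from their names, and the default set of characters to remove is a guess (currency symbols).

```python
import re


def clean_string(raw: str, unneeded_chars: str = "£$€") -> str:
    """Drop unneeded characters, collapse whitespace, and trim the ends."""
    # Remove every character listed in unneeded_chars
    # (assumed here to be currency symbols).
    without_chars = raw.translate(str.maketrans("", "", unneeded_chars))
    # Collapse each run of spaces, tabs, and newlines into a single space.
    collapsed = re.sub(r"\s+", " ", without_chars)
    # Trim leading and trailing whitespace.
    return collapsed.strip()


print(clean_string("\n  £ 23.99 \t"))
```

    A single function like this gives every scraped value the same cleanup in one call, which is where the duplication goes away.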
