There are many ways to create a web scraper in Python. One of the simplest is to combine the third-party requests library (to obtain web pages) with the Beautiful Soup library (to parse the pages and extract data). With my book, Python Business Intelligence Cookbook, being published soon, I was curious how, or if, the pricing my publisher sets changes over time. In order to track it, I created a simple web scraper. Code below…
But First An Explanation
The code is heavily commented and should explain what I did. However, the most difficult part of any web scraper, aside from running it in such a way that you aren't banned for “attacking” a website, is knowing how to extract the data from the webpage. This is where the BeautifulSoup4 library comes in.
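On the point about not getting banned, here is a minimal sketch of what a polite fetch could look like with requests. The function name fetch_page, the User-Agent string, and the two-second delay are assumptions made for illustration, not values taken from my scraper or required by any site.

```python
import time

import requests

# Hypothetical identifier so the site owner can see who is making requests.
USER_AGENT = "price-tracker-example/0.1"


def fetch_page(url, session=None, delay_seconds=2.0, timeout=10):
    """Fetch one page and return its HTML as text.

    Pauses before the request so that repeated calls do not hammer the server.
    """
    session = session or requests.Session()
    time.sleep(delay_seconds)  # assumed delay; tune it to the site you scrape
    response = session.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

Calling fetch_page once a day for a single product page is about as gentle as scraping gets. The delay matters more if you loop over many pages.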
BeautifulSoup turns a webpage into a parse tree, making the different elements on the page accessible to you. Here's an example line of code that explains what that means:
price_ebook = soup.select('.book-top-pricing-main-ebook-price ')[1].get_text()
In this piece of code, we are getting the text of the second element on the page with the CSS class “book-top-pricing-main-ebook-price” and assigning it to the variable “price_ebook”. Note that in the page source this class attribute has trailing whitespace, which I kept in the selector. What is essentially a typo in the markup can easily throw you off. The whitespace is not part of the class name itself, so what matters when extracting the data is that the name in your selector matches the one on the page exactly.
When I was parsing the Packt webpage, I noticed there were two elements with the exact same CSS class. Using the [1] index allows me to get the second of those elements. Why does [1] equal the second? select() returns a list of matching elements, and that list is zero-based, meaning the first element is [0], the second [1], and so on.
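To make the indexing concrete, here is a small sketch that runs against a made-up HTML snippet instead of the live Packt page. The markup and the prices in it are invented for illustration; only the class name comes from the example above.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page: two elements share one class.
html = """
<div class="book-top-pricing-main-ebook-price">$10.00</div>
<div class="book-top-pricing-main-ebook-price">$31.99</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every matching element, in document order.
matches = soup.select(".book-top-pricing-main-ebook-price")

first_price = matches[0].get_text().strip()  # first match, index 0
price_ebook = matches[1].get_text().strip()  # second match, index 1
```

If the page ever changes so that only one element carries the class, matches[1] will raise an IndexError, so it is worth checking len(matches) before indexing in a scraper that runs unattended.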
For more on CSS classes, I recommend the W3Schools CSS tutorials.