In this article, I walk you through building a program that takes online book data from books.toscrape.com and stores it in a local SQLite database. You can see all the code for the completed project here.
Introduction
Data exists somewhere, but we want it somewhere else. This is the fundamental problem of data engineering, and it comes up again and again. In some cases (like ours) the data currently lives on a website and we want to organize it in a tabular way and store it locally (or in the cloud) so that it can easily be retrieved later.
Websites are made of code. When you visit a webpage, you’re requesting access to that code. Our browsers translate this code into the interactive visual experiences most of us understand to be websites. If you want to see what is really going on behind the curtain, all you need to do is right-click somewhere on a web page and click “Inspect” or “View page source”, and you will instantly be greeted by nested trees of information containing text, links, images, and tables.

For this project, I used books.toscrape.com as the data source. This website is designed to be scraped which makes it the perfect place to practice.
Implementation
One way to get a page’s source code in Python is to use the requests library.
import requests
page_url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
page = requests.get(page_url)
We want to check that the website actually accepted our request. If the website admins block our IP address from accessing content, or if we mistype the URL, we will receive a different status code. You can read more about status codes here.
if page.status_code == 200:  # 200 means everything went okay!
    pass
else:
    raise Exception("Invalid Status Response Code")
Unfortunately, requests leaves us with an unwieldy mess. Having the source code for an entire web page is not extremely helpful by itself. Thankfully, we can use a different library called BeautifulSoup to parse (or sift through) all the HTML and extract the useful nuggets of information we want.
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html.parser')

Now that we have our “soup” we can search for specific tags and then iterate over them. It sometimes takes a little extra wrangling with common string methods like replace (swaps certain characters for others) and strip (removes whitespace from either side).
for book_tree in soup.find_all(class_='col-sm-6 product_main'):
    title = book_tree.find("h1").contents[0]
    price = book_tree.find(class_="price_color").contents[0]
    price = price.replace("Â", "")  # remove the encoding artifact that precedes the £ sign
    stock_status = book_tree.find(class_="instock availability")
    stock_status = stock_status.contents[2].replace("\n", "")
    stock_status = stock_status.lstrip().rstrip()
Once we have this data, we first want to store it in a dictionary (in memory) for clarity, flexibility, and ease of use. We include all the relevant info we want to keep, like the page URL or when we scraped the page (web pages are sometimes updated!).
from datetime import datetime

# returned from the book-scraping function (scrape_single_book_page)
return {"url": url,
        "title": title,
        "price": monetary(price),  # monetary turns the price str into a float
        "stock_status": stock_status,
        "date_scraped": datetime.now().strftime("%m/%d/%y")
        }
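The monetary helper isn’t shown here; a minimal sketch, assuming the cleaned-up price string looks like "£51.77", could be:
def monetary(price_str):
    # assumed helper: drop the currency symbol ("£51.77" -> 51.77) and convert to float
    return float(price_str.replace("£", "").strip())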
Finally, we want to store that dictionary somewhere more permanent. One option is to use pandas and store the data directly in a CSV (comma-separated values) file. However, a CSV file is not a real database and can cause issues when you try to read and write at the same time, or when inserting new rows.
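For comparison, the pandas route might look something like the sketch below, where book_dict stands for the dictionary returned above and books.csv is a made-up filename:
import pandas as pd

# each dictionary key becomes a column; each scraped book becomes a row
df = pd.DataFrame([book_dict])
df.to_csv("books.csv", index=False)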
Common industrial databases like MySQL or PostgreSQL are designed to handle online requests from many users simultaneously; setting them up is a chore and is only worth doing if you actually need that functionality.
My choice is to use SQLite, a lightweight local relational database that’s included with Python (via the sqlite3 module). We first need to create a table, and if the table already exists, our code should handle that scenario.
import sqlite3

conn = sqlite3.connect(db_path)
# cursor object
cur = conn.cursor()
# Creating the table
table = """
CREATE TABLE Books (
    URL TEXT NOT NULL,
    Title TEXT NOT NULL,
    Price£ FLOAT NOT NULL,
    Stock_Status TEXT NOT NULL,
    Date_Scraped TEXT NOT NULL
);
"""
try:
    # if the table does not exist yet, the line below creates it
    cur.execute(table)
    if verbose:
        print("Table created.")
except sqlite3.OperationalError as e:
    # if the table already exists, sqlite3 raises an OperationalError
    if verbose:
        print(e)
# Close the database connection
conn.close()
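If you would rather skip the try/except, SQLite also supports a CREATE TABLE IF NOT EXISTS clause that silently does nothing when the table is already there; a sketch of that variant:
import sqlite3

conn = sqlite3.connect(db_path)
cur = conn.cursor()
# IF NOT EXISTS turns the statement into a no-op when the table already exists
cur.execute("""
CREATE TABLE IF NOT EXISTS Books (
    URL TEXT NOT NULL,
    Title TEXT NOT NULL,
    Price£ FLOAT NOT NULL,
    Stock_Status TEXT NOT NULL,
    Date_Scraped TEXT NOT NULL
);
""")
conn.close()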
More complicated databases should have primary and foreign keys to reduce redundancy, but here the data is so simple that there’s no reason to implement a more elaborate structure.
Now that we have a database and a table, we can insert data into it like so:
conn = sqlite3.connect(db_path)
sql = '''
INSERT INTO Books(URL, Title,
                  Price£, Stock_Status, Date_Scraped)
VALUES(?,?,?,?,?)
'''
cur = conn.cursor()
data_tuple = (input_dict["url"],
              input_dict["title"],
              input_dict["price"],
              input_dict["stock_status"],
              input_dict["date_scraped"])
cur.execute(sql, data_tuple)
conn.commit()
# Close the database connection
conn.close()
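To sanity-check that the insert worked, you can read the rows straight back; a quick sketch:
import sqlite3

conn = sqlite3.connect(db_path)
cur = conn.cursor()
# print every row currently stored in the Books table
for row in cur.execute("SELECT * FROM Books"):
    print(row)
conn.close()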
If you can scrape one book, you can scrape a whole lot of them. Using a similar methodology to the single-book scrape, you grab the links to all of the books on a browsing page.
base_page_url= "http://books.toscrape.com/catalogue/page-"+str(page)+".html"
url_list = []
page = requests.get(base_page_url)
soup = BeautifulSoup(page.text, 'html.parser')
for book_tree in soup.find_all(class_='image_container'):
for link in book_tree.find_all('a'):
url = "http://books.toscrape.com/catalogue/" + str(link.get('href'))
url_list.append(url)
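The snippet above covers a single browsing page; to walk the whole catalogue you can wrap the same logic in a loop over page numbers. A sketch, where collect_book_urls is a hypothetical helper and the 50-page default is an assumption about the catalogue’s size:
import requests
from bs4 import BeautifulSoup

def collect_book_urls(last_page=50):  # hypothetical helper; the page count is an assumption
    url_list = []
    for page in range(1, last_page + 1):
        base_page_url = "http://books.toscrape.com/catalogue/page-" + str(page) + ".html"
        soup = BeautifulSoup(requests.get(base_page_url).text, 'html.parser')
        for book_tree in soup.find_all(class_='image_container'):
            for link in book_tree.find_all('a'):
                url_list.append("http://books.toscrape.com/catalogue/" + str(link.get('href')))
    return url_list

url_list = collect_book_urls()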
And all that’s left is to go through each URL and scrape that particular book! It’s a pretty simple implementation using a for-loop. One more detail, though: add a pause between requests. We don’t want to spam the website and look like a denial-of-service attack.
import time

for url in url_list:
    add_book_to_database(db_path=db_path, input_dict=scrape_single_book_page(url))  # scrape each url
    time.sleep(seconds_to_wait_between_scrapes)  # wait a little
Done!
Conclusion
Now that wasn’t too hard, was it? Data scraping is an essential tool in any data scientist’s or engineer’s arsenal. Note that I didn’t include any code to prevent duplicate rows. For example, if you only wanted to scrape a website once a day, you could check that no row already exists for that URL with today’s date. However, in the real world there might be reasons to scrape the same data multiple times per day, which is why I’m leaving it as-is.
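If you did want a once-per-day guarantee, a minimal sketch of such a check (the helper name is hypothetical; the table and date format match the code above) could look like:
import sqlite3
from datetime import datetime

def already_scraped_today(db_path, url):
    # True if a row for this URL with today's date already exists
    today = datetime.now().strftime("%m/%d/%y")
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("SELECT 1 FROM Books WHERE URL = ? AND Date_Scraped = ? LIMIT 1", (url, today))
    row = cur.fetchone()
    conn.close()
    return row is not None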
You can find the full repo here. Thanks for reading!