Web Scraping in Python using BeautifulSoup [Tutorial with Code]

Web scraping can help you collect information from websites at scale. In the following tutorial I will explain how you can get started with web scraping in Python using the Requests and BeautifulSoup modules.

If you want to learn the basics of web scraping, or you are looking for a way to scrape without coding, you can check out Web Scraping without coding – easiest way to build your own Web Scraper.

Before you get started with web scraping in Python, you need to install the following Python libraries:

  • Requests
  • BS4 (which includes BeautifulSoup)

You may also want to install the Pandas library to export the scraped data to an Excel or CSV file.
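All of these can be installed with pip. The sketch below also includes lxml (the parser used later in the tutorial) and openpyxl (which Pandas needs for Excel export):

```shell
pip install requests beautifulsoup4 pandas lxml openpyxl
```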

Once the installations are done, you can easily follow the rest of the tutorial.

Getting started with web scraping in Python using BeautifulSoup

The code can be divided into 4 parts:

  • Importing the required libraries and modules
  • Building the Web Scraper function
  • Using the Web Scraper function to scrape data from a list of URLs
  • Capturing the scraped data in an Excel file

To keep things simple, our Python code will scrape the following details from a list of 5 URLs: the Title, H1 and H2 tags.

Importing the required libraries and modules

We need to import Requests, BeautifulSoup and Pandas. The code for the imports is short and simple:

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame

Building the Web Scraper function

To build the web scraper function, we would need to make use of Requests and BeautifulSoup.

Requests fetches a URL the way a user's browser does and returns a response. BeautifulSoup then uses the content of that response to actually scrape the required data from the page HTML. The following code does it (note the comments):

#pass the url to be scraped as an input to requests.
#the response is stored in a variable
r=requests.get(url)                 
#content attribute of requests response is 
#stored and prepared for parsing by BeautifulSoup
soup=BeautifulSoup(r.content,"lxml")
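To see what BeautifulSoup does with that content, here's a small offline sketch. A tiny inline page stands in for `r.content`, and the stdlib `"html.parser"` is used so the snippet runs without a network connection or an lxml install (the tutorial itself uses `"lxml"`, which is faster):

```python
from bs4 import BeautifulSoup

# A tiny inline page stands in for r.content, so this sketch runs offline.
html = "<html><head><title>Demo page</title></head><body><h1>Hello</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# The parsed tree exposes tags as attributes
print(soup.title.text)  # Demo page
print(soup.h1.text)     # Hello
```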

BeautifulSoup then does the magic. True to its name – it's beautiful! The BeautifulSoup module contains a function called "find_all", which literally finds all HTML tags that meet the specification. The following code will "find" the title, H1s and H2s:

title=soup.find_all("title")
h1=soup.find_all("h1")
h2=soup.find_all("h2")

The find_all function returns Python lists (since, in theory, there can be more than one instance of each tag). Each item in the list still includes the surrounding tags. To strip the tags and keep only the text within, we use the "text" attribute. However, remember that the attribute must be applied to each member of the list (not the list itself). We can thus modify the above code as follows:

title=[item.text for item in soup.find_all("title")]
h1=[item.text for item in soup.find_all("h1")]
h2=[item.text for item in soup.find_all("h2")]
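To see the list behaviour concretely, here is a sketch against an inline snippet with two h2 tags (the snippet and variable names are purely for illustration):

```python
from bs4 import BeautifulSoup

# Inline snippet with two h2 tags, purely for illustration
snippet = "<body><h2>First</h2><h2>Second</h2></body>"
soup = BeautifulSoup(snippet, "html.parser")

tags = soup.find_all("h2")            # a list of Tag objects, markup included
texts = [item.text for item in tags]  # a list of plain strings

print(texts)  # ['First', 'Second']
```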

Now let’s consolidate all the code to a single function. We will later call this function when we want to scrape the content from a particular web page:

def web_scraper(url):
    r=requests.get(url)  
    soup=BeautifulSoup(r.content,"lxml") 
    title=[item.text for item in soup.find_all("title")]
    h1=[item.text for item in soup.find_all("h1")]
    h2=[item.text for item in soup.find_all("h2")] 
    scraped_content=[url,title,h1,h2]
    return scraped_content
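The function above assumes every request succeeds. A slightly hardened variant is sketched below – the timeout value, the `parse_page` helper and the empty fallback row are my own choices, not part of the tutorial. Separating fetching from parsing also makes the parsing step testable offline:

```python
import requests
from bs4 import BeautifulSoup

def parse_page(html, url):
    # Parsing separated from fetching; "html.parser" is used here so the
    # sketch needs no lxml install (the tutorial itself uses "lxml").
    soup = BeautifulSoup(html, "html.parser")
    title = [item.text for item in soup.find_all("title")]
    h1 = [item.text for item in soup.find_all("h1")]
    h2 = [item.text for item in soup.find_all("h2")]
    return [url, title, h1, h2]

def web_scraper(url):
    try:
        r = requests.get(url, timeout=10)  # don't hang forever on a slow site
        r.raise_for_status()               # treat 4xx/5xx responses as errors
    except requests.RequestException:
        return [url, [], [], []]           # empty row for URLs that fail
    return parse_page(r.text, url)
```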

In the next section we will learn to use the web_scraper function that we just created to scrape 5 URLs.

Using the Web Scraper function to scrape data from a list of URLs

We will scrape 5 random Investopedia articles. I will first store these URLs in a Python list which I will name – you guessed it right – urls!

urls=["https://www.investopedia.com/hedge-funds-pile-into-health-care-4768394",
      "https://www.investopedia.com/peloton-s-risky-business-model-4768768",
      "https://www.investopedia.com/articles/exchangetradedfunds/11/advantages-disadvantages-etfs.asp",
      "https://www.investopedia.com/terms/c/comparative-market-analysis.asp",
      "https://www.investopedia.com/purchasing-a-home-4689702"]

To scrape each of these URLs, we will have to loop through them and then apply the web_scraper function that we created above.

for url in urls:
    x=web_scraper(url)
    print(x)

For now, we will just print the results. In the next section we will learn how to capture the results and store them in an Excel sheet.

When we print the results, the output looks like:

Web scraping using BeautifulSoup

Capturing the scraped data in Excel or CSV

Wow! We now know that our scraper is working. In some cases (especially the H2s) there are no results, but that's nothing to worry about – those tags are simply missing from the pages.

How do we capture this data in an Excel file? This is where the Pandas library (which we have already imported) comes in.

What we need to do first is modify the above code slightly. Instead of printing the results, we store them in a list. We then convert this list to a DataFrame using the DataFrame function in Pandas.

#store the scraped data in a list
all_results=[]
for url in urls:
    x=web_scraper(url)
    all_results.append(x)
#convert the list to a DataFrame
df = DataFrame(all_results,columns=["URL","Title","H1","H2"])
df 

Now let’s take a look at the output:

Transferring web-scraped data by BeautifulSoup to a Pandas DataFrame

Pretty neat, huh? The data is ready in a table, waiting to be exported. A single line of code exports it to Excel (note that to_excel needs an Excel writer such as openpyxl installed):

df.to_excel("name_the_file.xlsx")

If we want to export the data to CSV instead, the code is as follows:

df.to_csv("name_the_file.csv")

Once the execution completes, you will find an Excel or CSV file containing all the above data in the same folder as the code.

Here’s a screen-shot of the excel:

Data scraped by BeautifulSoup exported to Excel
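One quirk of the export: each Title/H1/H2 cell holds a Python list, so Excel shows values like ['First', 'Second']. If you prefer plain text cells, one option (a sketch – the sample row and the "; " separator are my own choices) is to join each list into a single string before exporting:

```python
from pandas import DataFrame

# Hypothetical scraped rows in the same [url, title, h1, h2] shape
all_results = [
    ["https://example.com", ["Example"], ["Heading"], ["One", "Two"]],
]
df = DataFrame(all_results, columns=["URL", "Title", "H1", "H2"])

# Join each list cell into a single "; "-separated string
for col in ["Title", "H1", "H2"]:
    df[col] = df[col].apply("; ".join)

print(df.loc[0, "H2"])  # One; Two
```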

That's it, folks. We have successfully created a web scraper in Python using BeautifulSoup, Requests and Pandas that captures data from multiple URLs and stores it in an Excel file. For convenience, I am putting the full code together below.

Simple Web Scraper in Python with BeautifulSoup (full code)

#Import modules & libraries 

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame

# Building the Scraper function 

def web_scraper(url):
    r=requests.get(url)  
    soup=BeautifulSoup(r.content,"lxml") 
    title=[item.text for item in soup.find_all("title")]
    h1=[item.text for item in soup.find_all("h1")]
    h2=[item.text for item in soup.find_all("h2")] 
    scraped_content=[url,title,h1,h2]
    return scraped_content

# Scraping a list of URLs and storing them in a DataFrame

urls=["https://www.investopedia.com/hedge-funds-pile-into-health-care-4768394",
      "https://www.investopedia.com/peloton-s-risky-business-model-4768768",
      "https://www.investopedia.com/articles/exchangetradedfunds/11/advantages-disadvantages-etfs.asp",
      "https://www.investopedia.com/terms/c/comparative-market-analysis.asp",
      "https://www.investopedia.com/purchasing-a-home-4689702"]
all_results=[]
for url in urls:
    x=web_scraper(url)
    all_results.append(x)
df = DataFrame(all_results,columns=["URL","Title","H1","H2"])

# export to excel 

df.to_excel("scraped_output.xlsx")

FAQs : BeautifulSoup Web Scraping

How do I scrape a website using Python?

You can scrape a website using Python with the Requests and BeautifulSoup modules: Requests fetches the page and BeautifulSoup parses the HTML.

Can I scrape dynamic pages using BeautifulSoup?

BeautifulSoup parses static HTML content. It cannot execute JavaScript, so content that is rendered dynamically in the browser will not appear in the parsed page. To scrape dynamic content, you need a browser-based tool such as Selenium or Playwright.

