Web scraping can help you collect information from websites at scale. In the following tutorial I will explain how you can get started with web scraping in Python using the Requests and BeautifulSoup modules.
If you want to learn the basics of web scraping, or if you are looking for a way to scrape without coding, you can check out Web Scraping without coding – easiest way to build your own Web Scraper.
Before you get started with web scraping in Python, you need to install the following Python libraries:
– Requests
– BS4 (which includes BeautifulSoup)
– lxml (the parser we will pass to BeautifulSoup below)
You may also need to install the pandas library (plus openpyxl, which pandas uses for Excel export) if you want to export the scraped data to an Excel or CSV file.
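Depending on your setup, the installation usually boils down to a single pip command (note that the PyPI package names differ slightly from the import names – bs4 is installed as beautifulsoup4):
pip install requests beautifulsoup4 lxml pandas openpyxl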
Once the installations are done, you can easily follow the rest of the tutorial.
Getting started with web scraping in Python using BeautifulSoup
The code can be divided into 4 parts:
- Importing the required libraries and modules
- Building the Web Scraper function
- Using the Web Scraper function to scrape data from a list of URLs
- Capturing the scraped data in an Excel file
To keep things simple, our Python code will scrape the following details from a list of 5 URLs: the title, the H1 and the H2s.
Importing the required libraries and modules
We need to import Requests, BeautifulSoup and pandas. The code for getting the imports done is short and simple:
import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
Building the Web Scraper function
To build the web scraper function, we make use of Requests and BeautifulSoup.
Requests fetches a URL much like a user's browser does and returns a response. BeautifulSoup then uses the content of that response to actually scrape the required data from the page's HTML. The following code does it (note the comments):
#pass the URL to be scraped as an input to requests;
#the response is stored in a variable
r=requests.get(url)
#the content attribute of the response is
#handed to BeautifulSoup for parsing
soup=BeautifulSoup(r.content,"lxml")
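As a small safeguard (not part of the original flow, but good practice), you can make the request give up on slow servers and raise an error on bad responses; both options are built into Requests:
r = requests.get(url, timeout=10)   # give up after 10 seconds instead of hanging
r.raise_for_status()                # raise an HTTPError for 4xx/5xx responses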
BeautifulSoup then does the magic. True to its name – it's beautiful! The BeautifulSoup module contains a function called “find_all”, which literally finds all HTML tags that meet the specification. The following code will “find” the title, H1s and H2s:
title=soup.find_all("title")
h1=soup.find_all("h1")
h2=soup.find_all("h2")
The find_all function returns Python lists (since, in theory, there can be more than one instance of each tag). Each instance in the list is still wrapped in its tags. To strip the tags and get the text within, we need to use the “text” attribute. However, remember that you have to apply the attribute to each member of the list (and not to the list itself). We can thus modify the above code as follows:
title=[item.text for item in soup.find_all("title")]
h1=[item.text for item in soup.find_all("h1")]
h2=[item.text for item in soup.find_all("h2")]
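To see the difference the “text” attribute makes, here is a minimal illustration on a hand-written HTML snippet (the string is made up purely for demonstration, with BeautifulSoup imported as above):
demo = BeautifulSoup("<h2>First</h2><h2>Second</h2>", "lxml")
print(demo.find_all("h2"))                          # [<h2>First</h2>, <h2>Second</h2>]
print([item.text for item in demo.find_all("h2")])  # ['First', 'Second']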
Now let’s consolidate all the code into a single function. We will later call this function when we want to scrape the content from a particular web page:
def web_scraper(url):
    r=requests.get(url)
    soup=BeautifulSoup(r.content,"lxml")
    title=[item.text for item in soup.find_all("title")]
    h1=[item.text for item in soup.find_all("h1")]
    h2=[item.text for item in soup.find_all("h2")]
    scraped_content=[url,title,h1,h2]
    return scraped_content
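As a quick sanity check, you can call the function on a single page first (this is one of the URLs we will scrape below):
print(web_scraper("https://www.investopedia.com/purchasing-a-home-4689702"))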
In the next section we will learn to use the web_scraper function that we just created to scrape 5 URLs.
Using the Web Scraper function to scrape data from a list of URLs
We will scrape 5 random Investopedia articles. I will first store these URLs in a Python list which I will name – you guessed it – urls!
urls=["https://www.investopedia.com/hedge-funds-pile-into-health-care-4768394",
"https://www.investopedia.com/peloton-s-risky-business-model-4768768",
"https://www.investopedia.com/articles/exchangetradedfunds/11/advantages-disadvantages-etfs.asp",
"https://www.investopedia.com/terms/c/comparative-market-analysis.asp",
"https://www.investopedia.com/purchasing-a-home-4689702"]
To scrape each of these URLs, we will have to loop through them and then apply the web_scraper function that we created above.
for url in urls:
    x=web_scraper(url)
    print(x)
For now, we just print the results; in the next section we will learn how to capture them and store them in an Excel sheet. When we print the results, each line of output is a list of the form [url, [title], [h1s], [h2s]].
Capturing the scraped data in an Excel or CSV
Wow! We now know that our scraper is working. In some cases (especially for H2s) we get empty lists, but there is no need to worry: it simply means those tags are missing from the page.
How do we capture this data in an Excel file? Here's where the pandas library (which we have already imported) comes into play.
What we need to do first is modify the above code slightly. Instead of printing the results, we store them in a list. We then convert this list to a DataFrame using the DataFrame function in pandas.
#store the scraped data in a list
all_results=[]
for url in urls:
    x=web_scraper(url)
    all_results.append(x)
#convert the list to a DataFrame
df = DataFrame(all_results,columns=["URL","Title","H1","H2"])
print(df)
Now let's take a look at the output: the DataFrame has one row per URL, with the Title, H1 and H2 columns each holding a list of the scraped text.
Pretty neat – huh? The data is ready in a table, waiting to be exported. We need only a single line of code for the export to Excel:
df.to_excel("name_the_file.xlsx")
On the other hand, if we want to export the data to CSV, the code is as follows:
df.to_csv("name_the_file.csv")
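Two optional tweaks worth knowing (both standard pandas options): to_excel relies on an Excel engine such as openpyxl being installed, and both exporters accept index=False if you don't want pandas' row numbers written to the file:
df.to_excel("name_the_file.xlsx", index=False)  # drop the index column
df.to_csv("name_the_file.csv", index=False)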
Once the execution has completed, you will find an Excel or CSV file containing all the above data in the same folder as your code.
That's it, folks. We have successfully created a web scraper in Python using BeautifulSoup, Requests and pandas that captures data from multiple URLs and stores it in an Excel file. For convenience, I am putting the full code together below.
Simple Web Scraper in Python with BeautifulSoup (full code)
#Import modules & libraries
import requests
from bs4 import BeautifulSoup
from pandas import DataFrame

# Building the scraper function
def web_scraper(url):
    r=requests.get(url)
    soup=BeautifulSoup(r.content,"lxml")
    title=[item.text for item in soup.find_all("title")]
    h1=[item.text for item in soup.find_all("h1")]
    h2=[item.text for item in soup.find_all("h2")]
    scraped_content=[url,title,h1,h2]
    return scraped_content

# Scraping a list of URLs and storing the results in a DataFrame
urls=["https://www.investopedia.com/hedge-funds-pile-into-health-care-4768394",
      "https://www.investopedia.com/peloton-s-risky-business-model-4768768",
      "https://www.investopedia.com/articles/exchangetradedfunds/11/advantages-disadvantages-etfs.asp",
      "https://www.investopedia.com/terms/c/comparative-market-analysis.asp",
      "https://www.investopedia.com/purchasing-a-home-4689702"]
all_results=[]
for url in urls:
    x=web_scraper(url)
    all_results.append(x)
df = DataFrame(all_results,columns=["URL","Title","H1","H2"])

# Export to Excel
df.to_excel("scraped_output.xlsx")
FAQs: BeautifulSoup Web Scraping
How do you scrape a website using Python?
You can scrape a website in Python using the Requests and BeautifulSoup modules: Requests fetches the page and BeautifulSoup parses the HTML.
Can BeautifulSoup scrape dynamic content?
BeautifulSoup is used for scraping static, HTML-based page content. You cannot scrape dynamically rendered (JavaScript-generated) content with BeautifulSoup alone; to scrape dynamic content, you need a browser-based scraper.
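As a rough sketch of what a browser-based scraper looks like, here is a minimal example using Selenium (not covered in this tutorial; it assumes Selenium 4.6+ with a local Chrome installation and BeautifulSoup imported as above, and example.com is just a placeholder URL). The browser renders the JavaScript, and the resulting HTML can then be handed to BeautifulSoup exactly as before:
from selenium import webdriver

driver = webdriver.Chrome()           # Selenium 4.6+ manages the driver automatically
driver.get("https://example.com")     # placeholder URL
html = driver.page_source             # the HTML after JavaScript has run
driver.quit()
soup = BeautifulSoup(html, "lxml")    # parse with BeautifulSoup as before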