How to use IMPORXML to scrape web content to Google Sheets?
IMPORTXML function in Google Sheets is an easy and smart method to capture data from any web URL to a spreadsheet. Unfortunately , not many Google Sheet users know how to use this function well and end up wasting valuable time copy pasting data in a mundane way.
To be honest, IMPORTXML is really easy to use. All you need to understand is the very basics of how HTML or XML is structured. This structure is also really simple to understand.
Basics of HTML/XML
If you check the source code of any web-page (right click & choose “view-source”) you can see the HTML code used to build the page. There are a whole bunch of standard tags used to build a web-page. Some examples are as follows:
<title>Page Title goes here</title>
<body> Wraps the body content</body>
<H1> The main header of the page</H1>
<P> Paragraph content </P>
Capturing a certain bit of content from a page (also known as Web-scraping) requires you to target the tag that wraps the content. For example, if you need to capture the page title – you need to target the <title> tag. IMPORTXML function makes the job really easy & quick .
Formula: IMPORTXML function in Google Sheets
The syntax of the IMPORTXML function is as follows:
= IMPORTXML( URL, XPATH)
URL : URL of the page from where you want to capture the content.
XPATH: The tag on the page that wraps the required content.
To understand better , let’s check the following examples.
Using IMPORTXML to capture Page Titles.
Consider a simple task. You have a list of URLs & you have to scrape the page titles of each. Here’s how you do it using IMPORTXML:
The X-path (in this case “//title”) , tells Google-Sheets, that you are interested in capturing the content within the title tag.
In the below example, I will scrape the page titles of the Wiki pages of a few reputed Indian Cricketers. I have listed the page URLs in Column A. In Column B , I am using the IMPORTXML function to capture the page titles :
Using IMPORTXML to Capture H1 content
Just like the example above, you can capture the main header of any webpage by targeting the H1 tag. The formula again is really simple:
In the above sheet, I will capture the H1 values in column C. Let’s see what’s in store:
Using IMPORTXML to Scrape Span Class content to Google Sheets
Now let’s try to do something that’s a bit more difficult but way too cool. What if we wanted to capture the date of birth value for each of the cricketers – can we do it using IMPORTXML ? The answer is yes! Just that we need to take a different strategy.
Do note that in the previous example, we captured the content inside pre-defined tags. In this case we know the content that we are after – but we don’t know the tag that wraps it. We thus need to follow a few more steps to build the IMPORTXML formula:
Step 1: Locate the Date of Birth Value on the Wiki-page:
Step 2: Using “Inspect Element” find out the tag wrapping the Date of Birth value
We need to right-click on the date of birth value & then choose the “Inspect Element” option. That exposes , the source code of the content & tell’s us which tag wraps the content.
Inspect Element reveals the following source code:
The Date of Birth Value is wrapped by a <span> class. Unlike title & h1, we can have multiple spans in a web-page. To capture this particular span we need to call it out using an identifier. In this case , the identifier is the class name (“bday”).
Step 3: Pass the identified tag to IMPORTXML function
We now need to build the formula. The IMPORTXML function has to target a particular span class using the identifier.
The X-path for the same would be : “//span[@class=’bday’]”
Thus the formula works out to : IMPORTXML(URL, “//span[@class=’bday’]”)
In column D of the same spreadsheet, I am going to capture the Date of birth Values:
The formula is working ! We have successfully captured the Date of Birth Values using a simple IMPORTXML formula!
What is Web Scraping?
Web Scraping is a technique used for extracting large amounts of data from websites. You can then save and analyze the extracted data. Web Scraping has a whole bunch of uses. In fact search engines use web scraping techniques to understand the content of a website . This understanding, helps them rank pages for search queries.
If you are in the field of SEO, web scraping will help you understand your competitors on-page elements in bulk.
Is Web Scraping legal?
Web Scraping as a technique isn’t illegal. Google wouldn’t have become one of world’s top-10 company, had the core technology (web scraping) behind their top earning product been illegal. However, it’s better to avoid scraping websites who don’t appreciate being scraped. Scraping content behind log-in screens or scraping a user’s personal data is also not recommended at all.
While web scraping, it’s also important to ensure that you don’t put too much burden on a website’s server. Most sophisticated websites have methods to identify whether the traffic to their pages is from humans or robots. It they have doubts that they are being pinged by a web scraper they may just block the scraper or put up a captcha screen.
Best Web Scraping tools & codes are successful in mimicking a human user.
IMPORTXML is great for web scraping if the task is relatively simple. For more complicated Web Scraping tasks, you will need to write a code. Python is one of the most popular languages for web scraping. If you are familiar with python. you may check out the this article on web scraping using python.
FAQs : IMPORTXML in Google Sheets
You can pull data from a website to Google Sheets using the IMPORTXML function.
Web Scraping using IMPORTXML allows us to import the on-page elements of our SEO competitors in bulk to Google Sheets, allowing easy view & analysis.