IMPORTXML in Google Sheets – easiest way to build your own Web Scraper

How to use IMPORTXML to scrape web content into Google Sheets?

The IMPORTXML function in Google Sheets is an easy, smart way to capture data from any web URL into a spreadsheet. Unfortunately, not many Google Sheets users know how to use this function well, and they end up wasting valuable time copy-pasting data by hand.

To be honest, IMPORTXML is really easy to use. All you need to understand is the very basics of how HTML or XML is structured, and that structure is itself quite simple.

Basics of HTML/XML

If you check the source code of any web page (right-click and choose "View Page Source"), you can see the HTML code used to build the page. A whole bunch of standard tags are used to build a web page. Some examples are as follows:

<title>Page title goes here</title>
<body>Wraps the body content</body>
<h1>The main header of the page</h1>
<h2>Sub-headings</h2>
<p>Paragraph content</p>

Capturing a specific bit of content from a page (also known as web scraping) requires you to target the tag that wraps that content. For example, if you need to capture the page title, you target the <title> tag. The IMPORTXML function makes this job really easy and quick.

Formula: IMPORTXML function in Google Sheets

The syntax of the IMPORTXML function is as follows:
=IMPORTXML(URL, XPATH)

URL: The URL of the page from which you want to capture the content.

XPATH: The XPath query that identifies the tag wrapping the required content.

To understand this better, let's walk through the following examples.

Using IMPORTXML to capture Page Titles.

Consider a simple task. You have a list of URLs and need to scrape the page title of each. Here's how you do it using IMPORTXML:

=IMPORTXML(URL, "//title")

The XPath (in this case "//title") tells Google Sheets that you want to capture the content within the title tag.
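The same XPath query behaves identically in any XPath-aware tool. As a rough illustration, here is how "//title" works in Python's standard library; the HTML snippet is hand-written for the example, not fetched from a live Wikipedia page:

```python
import xml.etree.ElementTree as ET

# A minimal, well-formed HTML snippet standing in for a real page
html = """
<html>
  <head><title>Sachin Tendulkar - Wikipedia</title></head>
  <body><h1>Sachin Tendulkar</h1></body>
</html>
"""

root = ET.fromstring(html)
# ElementTree supports a limited XPath subset; ".//title" matches
# the same element that IMPORTXML's "//title" query targets.
title = root.find(".//title").text
print(title)  # Sachin Tendulkar - Wikipedia
```

Google Sheets does all of this for you behind the scenes: it fetches the page, parses it, runs the query, and drops the matching text into the cell.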

In the example below, I will scrape the page titles of the Wikipedia pages of a few well-known Indian cricketers. I have listed the page URLs in column A. In column B, I am using the IMPORTXML function to capture the page titles:

IMPORTXML used for capturing page titles in Google Sheets

Using IMPORTXML to Capture H1 content

Just like the example above, you can capture the main header of any web page by targeting the H1 tag. The formula is again really simple:

=IMPORTXML(URL, "//h1")

In the same sheet, I will capture the H1 values in column C. Let's see what's in store:

IMPORTXML used for capturing H1 in Google Sheets

Using IMPORTXML to Scrape Span Class content to Google Sheets

Now let's try something a bit more difficult, but a lot cooler. What if we wanted to capture the date of birth of each of the cricketers? Can we do it using IMPORTXML? The answer is yes! We just need a different strategy.

Do note that in the previous examples, we captured content inside pre-defined tags. In this case we know the content we are after, but we don't know the tag that wraps it. We therefore need to follow a few more steps to build the IMPORTXML formula:

Step 1: Locate the Date of Birth value on the Wiki page

The date of birth is located in the side block titled "Personal Information".

Step 2: Using "Inspect Element", find the tag wrapping the Date of Birth value

Right-click on the date of birth value and choose the "Inspect Element" option. That exposes the source code of the element and tells us which tag wraps the content.

Inspect Element reveals the following source code:

IMPORTXML used for capturing Span Class content in Google Sheets

The date of birth value is wrapped in a <span> tag. Unlike title and h1, a web page can contain many spans. To capture this particular span, we need to call it out using an identifier. In this case, the identifier is the class name ("bday").

Step 3: Pass the identified tag to IMPORTXML function

We now need to build the formula. The IMPORTXML function has to target a particular span class using the identifier.

The XPath for this would be: "//span[@class='bday']"

Thus the formula works out to: =IMPORTXML(URL, "//span[@class='bday']")
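To see why the attribute predicate works, here is the same query run in Python's standard library against a simplified stand-in for the Wikipedia infobox markup (the snippet below is illustrative, not the actual Wikipedia source):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the infobox markup found via Inspect Element
html = """
<div>
  <span class="nickname">Sachin Tendulkar</span>
  <span class="bday">1973-04-24</span>
</div>
"""

root = ET.fromstring(html)
# The same predicate used in the Sheets formula: match a span
# whose class attribute equals "bday", ignoring all other spans.
bday = root.find(".//span[@class='bday']").text
print(bday)  # 1973-04-24
```

The predicate is what disambiguates among the many spans on the page: without it, the query would simply return the first span it finds.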

In column D of the same spreadsheet, I am going to capture the date of birth values:

Column D captures the Date of Birth Values

The formula works! We have successfully captured the date of birth values using a simple IMPORTXML formula!

What is Web Scraping?

Web scraping is a technique for extracting large amounts of data from websites. You can then save and analyze the extracted data. Web scraping has a whole bunch of uses. In fact, search engines use web scraping techniques to understand the content of a website, and that understanding helps them rank pages for search queries.

If you are in the field of SEO, web scraping will help you understand your competitors' on-page elements in bulk.

Is Web Scraping legal?

Web scraping as a technique isn't illegal. Google wouldn't have become one of the world's top companies had the core technology (web scraping) behind its top-earning product been illegal. However, it's better to avoid scraping websites that don't appreciate being scraped. Scraping content behind log-in screens or scraping a user's personal data is also strongly discouraged.

While web scraping, it's also important to ensure that you don't put too much load on a website's server. Most sophisticated websites have methods to identify whether the traffic to their pages comes from humans or robots. If they suspect they are being pinged by a web scraper, they may block the scraper or put up a CAPTCHA screen.
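If you graduate from IMPORTXML to your own scraper, the simplest courtesy is to pause between requests. A minimal sketch of that idea, where the URLs, delay value, and fetch callback are all illustrative placeholders:

```python
import time

def fetch_politely(urls, delay_seconds=2.0, fetch=None):
    """Call fetch(url) for each URL, sleeping between requests
    so we don't hammer the target server."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        results.append(fetch(url) if fetch else url)
    return results

# Hypothetical usage: fetch would be a real download function in practice
pages = fetch_politely(
    ["https://en.wikipedia.org/wiki/Sachin_Tendulkar",
     "https://en.wikipedia.org/wiki/Rahul_Dravid"],
    delay_seconds=2.0,
)
```

Real-world scrapers also respect robots.txt and set an honest User-Agent, but a delay between requests is the bare minimum.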

The best web scraping tools and scripts succeed by mimicking a human user.

Final thoughts:

IMPORTXML is great for web scraping if the task is relatively simple. For more complicated web scraping tasks, you will need to write code. Python is one of the most popular languages for web scraping. If you are familiar with Python, you may check out this article on web scraping using Python.
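As a small taste of what Python-based scraping looks like, here is a sketch using only the standard library's html.parser (a real project would more likely use requests plus BeautifulSoup or lxml; the sample HTML here is hand-written rather than fetched from the web):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> tag it sees."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

p = TitleParser()
p.feed("<html><head><title>Sachin Tendulkar - Wikipedia</title></head></html>")
print(p.title)  # Sachin Tendulkar - Wikipedia
```

Unlike strict XML parsers, html.parser tolerates the messy, unclosed tags found on real pages, which is why dedicated scraping code usually takes this route.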

FAQs: IMPORTXML in Google Sheets

How do I pull data from a website in Google Sheets?

You can pull data from a website to Google Sheets using the IMPORTXML function.

Why is web scraping using IMPORTXML useful for SEO?

Web scraping using IMPORTXML lets you import the on-page elements of your SEO competitors in bulk into Google Sheets, making them easy to view and analyze.

