Different way to scrape stats

As you folks know, I’ve been struggling with collecting all the stats from all the grids every month, and, earlier this year, have almost fully automated the process, using an extremely jury-rigged Filemaker script.

Today, I accidentally stumbled on the fact that Google Spread can be used to do the same thing, without the heavy database workload.

The key is two Google functions — ImportHTML and ImportXML. The former works well if the data is in a table or a list, and you can use the Index function to pull out the table or list element you want. The latter looks for particular “class” tags in the CSS.

Take, for example, the stats page for 3rd Rock Grid, which uses a table.

The following function pulls out the total number of regions, where A1 is where the URL for the stats page is stored:

=index(importhtml(A1,"table",1),3,2)

The stats page for Eureka World doesn’t have a table, so I used the HTML inspector as in this tutorial to get the XPath for the region data.

=index(importxml(A2,"//span[@class='nowrap']"),2)

I’ve created a public Google Sheet with the current stats pages of the active grids. I don’t have time to go through and figure out the syntax for each grid. Since each grid displays its data in slightly different ways, you’d need a different formula for each one.

I went through and did about 20, out of some 155. As you can see in the image below, Genesis Global Journey puts asterisks around their numbers, so I’d have to add some more formula in there to clean that up. And Eureka World’s formula pulls in “Regions: 53” so I’ll have to do some clean-up on those, as well.

A grid stats scraper built in Google Sheets. (Image by Maria Korolov.)

Does anyone want to help out with this? If so, everyone could have a copy of this for their own use, if they want one, and it would make it a lot easier for me to collect the data once a month.

I might even be able to collect the data more often, and pull the statistics automatically into a table that I could then embed in articles, instead of manually having to export the data from my database each month and format it for the web.

One thing I’ve noticed is that Google Sheets sometimes shows the correct looked up value, and other times doesn’t, giving an “#REF!” error message instead for the same formula. Apparently, there is a limit of how many lookups a spreadsheet can do, and I’ve been running it ragged today playing around with the formulas.

Does anyone have any advice on getting this to work better? Thanks!

P.S. If you’re going to play around with the sheet, please make your own working copy first so you have room to work in.

Related Posts

maria@hypergridbusiness.com'

Maria Korolov

Maria Korolov is editor and publisher of Hypergrid Business. She has been a journalist for more than twenty years and has worked for the Chicago Tribune, Reuters, and Computerworld and has reported from over a dozen countries, including Russia and China. Follow me on Twitter @MariaKorolov.

  • Visitor

    Well done, good job 🙂

    You can use SPLIT to cleanup the * or other unwanted characters
    For exemple to remove the : before result for OsGrid total region stats you can use formulas:
    =INDEX(SPLIT(IMPORTXML(B156;”//*[@id=’stats2′]/b/font[2]/font/font[2]/font/font[2]/font[2]/font[2]/font[2]/text()”);”:”);;2)

    PS: /! The syntax of the formulas is different with old / new version of google /!

  • Akshay

    I am able to fetch data from multiple URL’s listed in a single column in google sheets using importhtml/importxml formula. However, when I drag the column with the formula it overwrites the existing data of the previous results. The error that I get is Array result was not expanded because it would overwrite data in the C4.

    Here is a live example to test:: https://docs.google.com/spreadsheets/d/1eFrLYk9T7mNUGRRFpcP_dP0v6tptDZCPTzGdAe4RVlM/edit?usp=sharing

    Sheet1 contains the formula and error I see. (cell – C3,C4,C5)

    Sheet2 is the desired output using importxml/importhtml formula.

    How can obtain the desired result for 100+ URLs using a formula? Thank you in advance.