SCITYLANA: “Hello BigQuery”

Sign up and get raw hit-level Google Analytics data on scitylana.com

The clouds are coming in…

Today we announce the BETA version of our first Cloud destination – BigQuery!

But what does this mean?

Well, it means that from today we extract hit-level, unsampled data from your favorite FREE GA view and upload it automatically to your BigQuery project. Just like the big boys do with GA 360, only less expensive.

Swoosh… !

It updates the data automatically behind the scenes every morning. And with NO HANDS – YAY!

Now take that fine reporting tool of yours with a BigQuery connector and put it to use: connect it to your SCITYLANA-transformed, hit-level, unsampled FREE GA data in BigQuery and build your reports on it.

Be it Microsoft Power BI, Tableau, Google Data Studio or some other tool.

Let’s start the engine…

To get started, log in to your www.scitylana.com account and, in your data extraction settings, authorize a BigQuery account with modify permissions (Owner, Editor or Admin). Click the Authorize BigQuery Access button:

BigQueryAuthorize

Now enter your Google project ID; in my case it is scitylana-1048

BigQueryProjectID

Save it – and off you go. If you already have extracted data on the disk, these files will immediately be uploaded to BigQuery.

Do note that the app will continue to download the files from GA onto your hard drive. From there, the SCITYLANA app uploads the files to BigQuery.

While we wait for the files to be uploaded, let's set up the first report.

Connect to BigQuery from Power BI

The easiest way to do this is through our Power BI template for BigQuery. Simply download the template, open it (you must have Microsoft Power BI) and sign in to BigQuery when prompted.

You can also build your report from scratch in the following way.

1. Open Power BI Desktop and click Get Data

PowerBIGetData

2. Select Google BigQuery (Beta) connector

PBIBigQuery

3. Click Connect

4. Expand your project ID to see your tables and views.

PBIBQDataset

5. Select the second view from the top and click Load

6. Select DirectQuery

PBIDirectQuery

7. Click OK

8. Power BI now creates a model and should end up listing all the dimensions and metrics in the Fields list.

PBIBQFields

9. Right-click the table called “VIEW ya-da ya-da” and select New Measure

10. Write M_users = DISTINCTCOUNT(VIEW ya-da ya-da…[sl_userId]), where “VIEW ya-da ya-da” is the name of your own view. E.g. M_users = DISTINCTCOUNT(VIEW136604982DAYS007[sl_userId])

11. Check your new measure M_users in the Fields list. (Find it using the search.) If you want to sanity-check the number it returns, see the sketch right after this list.

12. And VOILÀ – you get your first chart

PBIBQChart

13. Now search for the date dimension and check it. Now we get…

PBIBQChartTime

14. I hope you can take the rest from here. 🙂
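
As promised in step 11, here is a way to sanity-check the number M_users returns by running the equivalent distinct count directly against BigQuery. This is a minimal Python sketch using the google-cloud-bigquery client; the project ID is the example from this post, and the dataset and view names are placeholders you should replace with your own. It is an illustration, not part of the SCITYLANA setup.

```python
# Sanity check: the same distinct user count that the M_users DAX measure
# computes, run directly against BigQuery.
# Requires: pip install google-cloud-bigquery, plus credentials with read access.
from google.cloud import bigquery

PROJECT_ID = "scitylana-1048"        # example from this post, use your own project ID
DATASET = "your_dataset"             # placeholder: the dataset SCITYLANA created in your project
VIEW_NAME = "VIEW136604982DAYS007"   # example, use the name of your own helper view

client = bigquery.Client(project=PROJECT_ID)

query = f"""
SELECT COUNT(DISTINCT sl_userId) AS users
FROM `{PROJECT_ID}.{DATASET}.{VIEW_NAME}`
"""

for row in client.query(query).result():
    print("Distinct users in the last 7 days:", row.users)
```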

Nerdy details for the interested

OK, what have we done with your BigQuery account?

Well in the BigQuery Cloud console you can see we have added the following:

BigQueryOnline

(The screenshot displays the output for Google Analytics View 136604982 – scitylana.com)

For each view you’ll get a date-partitioned table (the one with the blue icon) named after your Google Analytics view ID. We partition the data by date because Google recommends it for better query performance. Currently, date is the only way BigQuery can partition a table.

Table Schema

Here is a snippet of the schema definition.

BigQuerySchema

The sl_ dimensions (e.g. sl_userId) are some of the extra stuff SCITYLANA adds to the dataset. _PARTITIONTIME is a BigQuery-internal column used for date partitioning. Read more about it here.

All Google Analytics metrics are renamed from ga: to M_, e.g. M_bounces is the renamed version of ga:bounces. All dimensions simply have the ga: prefix dropped, e.g. ga:userType is called userType.
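
To see the renamed columns and the date partitioning in action, here is a small Python sketch that totals bounces per user type over the last 30 days. The column names M_bounces, userType and the pseudo-column _PARTITIONTIME come from the schema described above; the project ID is the example from this post, while the dataset and table names are placeholders for whatever SCITYLANA created in your project.

```python
# Illustration: query the renamed GA columns and filter on _PARTITIONTIME so
# BigQuery only scans the partitions (days) you actually need.
from google.cloud import bigquery

PROJECT_ID = "scitylana-1048"   # example from this post, use your own project ID
DATASET = "your_dataset"        # placeholder
TABLE = "your_hit_table"        # placeholder: the date-partitioned table named after your GA view ID

client = bigquery.Client(project=PROJECT_ID)

query = f"""
SELECT
  userType,                  -- was ga:userType
  SUM(M_bounces) AS bounces  -- was ga:bounces
FROM `{PROJECT_ID}.{DATASET}.{TABLE}`
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY userType
ORDER BY bounces DESC
"""

for row in client.query(query).result():
    print(row.userType, row.bounces)
```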

BigQuery Views to help

Additionally, we created 7 BigQuery views. A BigQuery view is essentially a stored SQL query that presents a subset of your data. We made these helper views for your convenience.

In the data view you get integration between the SCITYLANA hit-level table and BigQuery’s public date table called bigquery-public-data.common_us.date_greg.

This view comes in 6 variations: latest 7, 14, 30, 90, 180 and 365 days. Limiting the date range like this gives better query performance when the views are used with a DirectQuery connection in e.g. Microsoft Power BI. The views are named VIEWXXXXXXXXXDAYSYYY, where XXXXXXXXX is the GA view ID and YYY is the number of days it queries. E.g. VIEW120558169DAYS030 returns the last 30 days of hit-level data for the GA view with ID 120558169.
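
We have not shown the actual SQL behind these views here, but the general idea of a rolling N-day window over an ingestion-time-partitioned table can be sketched as follows. Treat it strictly as a guess at the concept, not as SCITYLANA's real view definition; the dataset and table names are placeholders.

```python
# A guess at the idea behind the VIEWxxxxxxxxxDAYSyyy helper views: a rolling
# window over the ingestion-time partitions of the hit-level table.
# This is NOT the actual view definition, just an illustration of the concept.
from google.cloud import bigquery

PROJECT_ID = "scitylana-1048"   # example from this post, use your own project ID
DATASET = "your_dataset"        # placeholder
DAYS = 30                       # corresponds to the YYY part of the view name
TABLE = "your_hit_table"        # placeholder: the date-partitioned hit-level table

client = bigquery.Client(project=PROJECT_ID)

rolling_window_sql = f"""
SELECT *
FROM `{PROJECT_ID}.{DATASET}.{TABLE}`
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(
  TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY), INTERVAL {DAYS} DAY)
"""

# Count the hits the window covers; if the helper views work the way this sketch
# assumes, querying e.g. VIEW120558169DAYS030 should give a similar number.
count_sql = f"SELECT COUNT(*) AS hits FROM ({rolling_window_sql}) AS t"
for row in client.query(count_sql).result():
    print("Hits in the last", DAYS, "days:", row.hits)
```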

Another view lets you see the data partitions behind the partitioned table for your convenience. E.g. LIST120558169PARTITIONS

This view is not super important, but it’s practical when you need to get an idea of how many days have been uploaded to the partitioned table.
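
If you would rather inspect the partitions yourself instead of going through the LIST view, a query grouped on _PARTITIONTIME gives the same kind of overview. A minimal Python sketch, again with placeholder dataset and table names:

```python
# List the days (partitions) that have been loaded into the partitioned table,
# similar in spirit to the LISTxxxxxxxxxPARTITIONS helper view.
from google.cloud import bigquery

PROJECT_ID = "scitylana-1048"   # example from this post, use your own project ID
DATASET = "your_dataset"        # placeholder
TABLE = "your_hit_table"        # placeholder: the date-partitioned hit-level table

client = bigquery.Client(project=PROJECT_ID)

query = f"""
SELECT DATE(_PARTITIONTIME) AS day, COUNT(*) AS hits
FROM `{PROJECT_ID}.{DATASET}.{TABLE}`
GROUP BY day
ORDER BY day
"""

for row in client.query(query).result():
    print(row.day, row.hits)
```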

 

Please feel free to comment here or write us at support@scitylana.com.


Why Google Analytics connectors are useless

I see many users of graphical dashboard tools (like Power BI, QlikView, and Tableau) using the built-in Google Analytics connectors and having trouble understanding what the numbers really mean – and why they differ from the Google Analytics dashboard.

Maybe you’ve had the same frustrating experience as many others:

  • A user of Tableau reports a “mismatch” between GA and Tableau
  • A user of Power BI complains that his GA data “don’t make sense”
  • A user of QlikView wonders why his GA data “is not matching”
  • A user of Power BI reports that his imported GA data “is wrong”

If you have observed something similar, or if you’re just wondering what the heck is going on here, then please read on.

I’m going to settle once and for all why GA connectors are so confusing, and why you should probably avoid them altogether.

The core of the problem

The reason why Power BI, QlikView, and Tableau have become so popular in recent years is not only because they are superior at visual reporting, but also because they use so-called “in-memory databases” which work blazingly fast, giving the user an exploration tool with sub-second response time.

For these in-memory databases to work as intended, data-sets should be as “original” or “raw” as possible. No grouping, aggregating, or filtering. Just the events, line-by-line, imported from your sales tables, hit tables, etc. This gives you maximum freedom of analysis while the in-memory database will happily crunch the growing volume of data.

The problem? GA connectors do not deliver raw data!

Instead, they import GA metrics that are already aggregated (summed, counted, calculated, etc.). And while this works fine in a few cases, users are potentially misled to think these metrics are always reliable to use in their dashboard tool.

They are NOT!

You need the non-aggregated data behind these numbers to avoid faulty double aggregations.

Let me explain.

Beware of double aggregation!

Consider this overview of how two people, let’s call them John and Carla, visit a website during a week. While Carla visits only once, John makes four visits on four different days, as seen in the table below:

Day         Number of Users   The Names of Users
Monday      2                 John, Carla
Tuesday     1                 John
Wednesday   0
Thursday    0
Friday      0
Saturday    1                 John
Sunday      1                 John
Total       5                 2

Now ask yourself, “What is the total number of users during this week?” Looking at the names of users in the rightmost column, the total is obviously only two: John and Carla.

However, this is not the picture we get from GA connectors. The GA connector only shows the other columns: Day and Number of Users (or “Users per Day”). If we try to roll them up to the week, the result is five (2 + 1 + 0 + 0 + 0 + 1 + 1 = 5).

Do you see what happened here? Summing up users across days leads to the wrong result – in this case, five instead of two!

Why is that?

Semi-additive metrics and distinct count

Here’s the thing: Users is an example of what database geeks (like me) call semi-additive metrics.

It’s a metric you can sum meaningfully across some dimensions but not others, and time is one of the dimensions you can’t sum across.

The technical explanation for this is that Users per Day constitutes a data stream with repeated elements (i.e., a user may repeatedly appear during the week, like John). Calculating users across days will, therefore, most likely result in multiple counts of the same user, leading to a meaningless total such as the final number 5 in the table above.

The way to handle data streams with repeated elements is to use distinct count instead of sum.

This is what Google Analytics does behind the scenes when it returns a query involving users, and that’s why you will always see the correct totals in Google Analytics.

The problem is that distinct count is only possible if you have access to the raw data (e.g., cookie IDs of users or, in my example, the names in the table above). You can’t derive a distinct count from a table which is already aggregated. Yet this is exactly what users of Power BI, Tableau, QlikView, etc., are led to believe when they use the tools’ built-in GA connectors.

And that’s the reason they get so confused and frustrated!
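
To make the arithmetic concrete, here is a tiny Python sketch of the John and Carla example above. It shows how summing the already-aggregated “users per day” numbers gives 5, while a distinct count over the raw visit records gives the correct answer, 2.

```python
# The John/Carla example: raw visit records vs. pre-aggregated "users per day".
visits = [
    ("Monday", "John"), ("Monday", "Carla"),
    ("Tuesday", "John"),
    ("Saturday", "John"),
    ("Sunday", "John"),
]

# What a GA connector gives you: users already counted per day.
users_per_day = {}
for day, user in visits:
    users_per_day.setdefault(day, set()).add(user)
aggregated = {day: len(users) for day, users in users_per_day.items()}
print(aggregated)      # {'Monday': 2, 'Tuesday': 1, 'Saturday': 1, 'Sunday': 1}

# Wrong: summing the daily counts double-counts John.
print(sum(aggregated.values()))            # 5

# Right: a distinct count over the raw records.
print(len({user for _, user in visits}))   # 2
```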

Is there a workaround?

“Okay,” you say, “But can’t I just avoid importing Users per Day? Can’t I just import the totals I need directly from Google Analytics?”

Well, yes, you can – but you’ll be severely limited in what you can then do in your dashboard.

Obviously, you wouldn’t be able to add a time-period selector in your dashboard, which can be annoying to end-users. But you would also have to be careful about other types of filters, since Users is a non-additive metric not just across time, but across several other dimensions as well.

For example, you can’t sum Users meaningfully across traffic sources such as your ad campaigns, emails, organic traffic, etc. You also can’t sum Users across landing pages or other types of URLs.

And mind you, Users is not the only semi- or non-additive metric in Google Analytics. Sessions is another example, as well as all calculated rates and ratios such as your Goal Conversion Rate, Bounce Rate, Average Session Duration, Average Order Value, and so on.

So, yes, you can try to circumvent the problem of semi- and non-additive metrics in your dashboard, but it would prevent you from taking full advantage of its features to the point where it probably becomes useless.

Un-aggregate your GA data with Scitylana

The frustrating experience many people have using GA connectors in dashboards is one reason we created Scitylana.

This tool lets you extract data from your free Google Analytics account, transforming them into a raw, hit-level form. This is the perfect format to have when you load data into Power BI, Tableau, QlikView, Targit, and others.

It gives you maximum freedom to analyze, visualize, and integrate GA data with other data sources. It enables you to slice and dice the data and to use whichever dimension you like as a filter.

With Scitylana you never have to worry about semi-additive metrics or distinct count.

The best thing? Signing up is free, and you can do it right now in just a couple of seconds: www.scitylana.com

Validating your SCITYLANA implementation

When I test a SCITYLANA implementation for one of our customers, I start out by checking the front page of the site in question and investigating what is sent to Google Analytics.

I expect to find a custom dimension with a special SCITYLANA value set. We say it’s tagged with SCITYLANA. In this post I will describe how I would check the implementation on scitylana.com (our own site).

I know we are using custom dimension index 1 (cd1) for the SCITYLANA tag (you can find your index in either your data extraction config or in the GA Admin). So I expect to find a hit sent to GA with a URL parameter containing a structure like

cd1: sl=1u=...&t=...

The … just represents values we don’t care about in this context, so please ignore them for now.

To investigate I will use Google Chrome. So I open up Chrome at https://www.scitylana.com

I press F12 to open the Developer Tools. The toolbar consists of multiple tabs; I’ll select the one called Network.

Chrome network collect

I type analytics.com/collect in the filter box (this matches the requests sent to google-analytics.com/collect).

I click the first request/hit in the list and scroll to the bottom of the right-hand pane. I get this view…

Chrome Network hit details

Now I see the line marked in yellow, which shows that the SCITYLANA tag is set in custom dimension 1 (cd1).

Now I will check the second hit sent to GA the same way.
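
If you have many hits to check, you can also copy a request URL from the Network tab (right-click the request and copy its link address) and inspect it with a small script. Here is a hedged Python sketch; the URL below is made up for illustration, and your custom dimension index may be something other than cd1.

```python
# Check whether a Google Analytics /collect hit carries the SCITYLANA tag in cd1.
# The URL below is a made-up example; paste in one copied from Chrome's Network tab.
from urllib.parse import urlparse, parse_qs

hit_url = (
    "https://www.google-analytics.com/collect"
    "?v=1&t=pageview&tid=UA-XXXXXXX-1&cid=1234567890.1234567890"
    "&dp=%2F&cd1=sl%3D1..."   # cd1 is the custom dimension index used in this post
)

params = parse_qs(urlparse(hit_url).query)
cd1 = params.get("cd1", [""])[0]

# Assuming, as shown above, that the tag value starts with "sl=".
if cd1.startswith("sl="):
    print("SCITYLANA tag found:", cd1)
else:
    print("No SCITYLANA tag on this hit -- check your tagging setup.")
```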

Next: Test special events

We have scroll tracking enabled on our site so I would need to ensure that the hit sent on scrolling past a section on the page is also tagged with the SCITYLANA custom dimension value.

Chrome Network scroll hit details

After scrolling past a section on the page the site sent another GA hit. And I can see that the hit has the SCITYLANA tag set.

But what if the SCITYLANA tag wasn’t set?

You can implement the SCITYLANA tag in 2 primary ways:

GTM (Universal Analytics tags)

You need to ensure that ALL your Universal Analytics tags have the SCITYLANA custom dimension set.

SCITYLANA GA tracking plugin

You need to identify all the Universal Analytics tracking scripts you have on your site. Maybe you have multiple “master” templates each having a copy of the Universal Analytics tracking script. Make sure the SCITYLANA GA tracking plugin is in place in all templates.

Maybe your site is spread across different platforms: a blog on WordPress, support on Zendesk, etc. Add the SCITYLANA GA tracking plugin code everywhere.

If you are in need of help, shout out to us at support@scitylana.com or write a comment.

 

What if I could just Ctrl+Copy ALL of my Google Analytics data and Ctrl+Paste them anywhere I would like?

copytable

That would be great, certainly fun and probably very powerful!

Well… some people have already been doing this for a while.

The challenge: GA data are aggregated and not well suited for moving anywhere else. I’m not saying that you can’t move it already, just that aggregated data has a lot of limitations. Behind the scenes, GA tracking data (hits) are counted, grouped, packed and stored in a fixed number of buckets. This is actually pretty smart and practical… when these buckets fit my particular needs.

The frustration: But what if they don’t fit my needs?

Tada…

The solution: SCITYLANA makes sure you can access the original hits and enables you to Ctrl+copy/paste non-aggregated GA data to a destination of your choice. We recommend storing the data in a database and making some really great business decisions using your favorite visualization tool. We have already seen our customers using the hit-level data in Microsoft Power BI and Excel. We hope to see it in Tableau and Data Studio soon too.

Psst…! You can try it here