Uncategorized

what really happens when you unnest customDimentions in ga_sessions_* (hint data inaccuracy)

I wanted to share with a problem i found while working in BigQuery and google analytics 360 data: ga_sessions_*

note: a public data set of google analytics, you can replace it in the queries below if you want:

bigquery-public-data:google_analytics_sample.ga_sessions_20170101

The following query

select count(distinct fullvisitorid) as users,
count(distinct CONCAT(cast(visitid as string) ,fullVisitorID )) as number_of_sessions,
DATE_TRUNC(parse_DATE(‘%Y%m%d’,date),month) AS month_year
from `myProject.ga_sessions_*` t, t.hits h , h.customDimensions as hcd
where (_TABLE_SUFFIX BETWEEN ‘20181231’ and ‘20181231’)
and not (device.operatingSystem = ‘(not set)’
AND totals.screenviews = 0)
group by 3

returns 1724266 distinct fullvisitorid rows.

when use the same query only  remove the “, h.customDimensions as hcd” from the FROM section :

select count(distinct fullvisitorid) as users,
count(distinct CONCAT(cast(visitid as string) ,fullVisitorID )) as number_of_sessions,
DATE_TRUNC(parse_DATE(‘%Y%m%d’,date),month) AS month_year
from `myProject.ga_sessions_*` t, t.hits h
where (_TABLE_SUFFIX BETWEEN ‘20181231’ and ‘20181231’)
and not (device.operatingSystem = ‘(not set)’
AND totals.screenviews = 0)
group by 3

i get 1754580 distinct fullvisitorid rows.

What is the reason unnset of a custom dimension causes this distinct values difference?

what happens when i do select * from a,b,c in bigQuery?

As you probably know when you add “, something” to the FROM clause you are doing a CROSS JOIN between the first element and the recently added. For example:

select * FROM table tablename as a, table tablename2 as b

This will perform a cross join between the tables tablename and tablename2.

Then, when you do a CROSS JOIN with null values, this rows will be suppressed.
Let me show you an example:

#standardSQL
WITH
data AS (
(SELECT
“primes under 15” AS description,
[1,      3,      5,      7,      11,      13] AS primes_array)
UNION ALL
SELECT
“primes under 16” AS description,
[2] AS primes_array)

SELECT
description, prime
FROM
data
CROSS JOIN
UNNEST (primes_array) AS prime

This query returns 7 rows, however the next one returns only 6:

#standardSQL
WITH
data AS (
(SELECT
“primes under 15” AS description,
[1,      3,      5,      7,      11,      13] AS primes_array)
UNION ALL
SELECT
“primes under 16” AS description,
[ ] AS primes_array)

SELECT
description, prime
FROM
data
CROSS JOIN
UNNEST (primes_array) AS prime

As you can see the only difference is in the line “[2] AS primes_array” which changes to a null value, “[ ] AS primes_array”. Then, this data is discarded with the CROSS JOIN, so when you do the query make sense that you see a difference in the results if there’s any NULL value within that “customDimensions”.


Credit for finding the problem: Evyatar Aviram

good read about unnest:

http://firebase.googleblog.com/2017/03/bigquery-tip-unnest-function.html

Contact me:

Want more big data quality content? Join our meetup, subscribe to youtube channels

——————————————————————————————————————————

I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

Leave a Reply