I wanted to share with a problem i found while working in BigQuery and google analytics 360 data: ga_sessions_*
note: a public data set of google analytics, you can replace it in the queries below if you want:
bigquery-public-data:google_analytics_sample.ga_sessions_20170101
The following query
select count(distinct fullvisitorid) as users,
count(distinct CONCAT(cast(visitid as string) ,fullVisitorID )) as number_of_sessions,
DATE_TRUNC(parse_DATE(‘%Y%m%d’,date),month) AS month_year
from `myProject.ga_sessions_*` t, t.hits h , h.customDimensions as hcd
where (_TABLE_SUFFIX BETWEEN ‘20181231’ and ‘20181231’)
and not (device.operatingSystem = ‘(not set)’
AND totals.screenviews = 0)
group by 3
returns 1724266 distinct fullvisitorid rows.
when use the same query only remove the “, h.customDimensions as hcd” from the FROM section :
select count(distinct fullvisitorid) as users,
count(distinct CONCAT(cast(visitid as string) ,fullVisitorID )) as number_of_sessions,
DATE_TRUNC(parse_DATE(‘%Y%m%d’,date),month) AS month_year
from `myProject.ga_sessions_*` t, t.hits h
where (_TABLE_SUFFIX BETWEEN ‘20181231’ and ‘20181231’)
and not (device.operatingSystem = ‘(not set)’
AND totals.screenviews = 0)
group by 3
i get 1754580 distinct fullvisitorid rows.
What is the reason unnset of a custom dimension causes this distinct values difference?
what happens when i do select * from a,b,c in bigQuery?
As you probably know when you add “, something” to the FROM clause you are doing a CROSS JOIN between the first element and the recently added. For example:
select * FROM table tablename as a, table tablename2 as b
This will perform a cross join between the tables tablename and tablename2.
Then, when you do a CROSS JOIN with null values, this rows will be suppressed.
Let me show you an example:
#standardSQL
WITH
data AS (
(SELECT
“primes under 15” AS description,
[1, 3, 5, 7, 11, 13] AS primes_array)
UNION ALL
SELECT
“primes under 16” AS description,
[2] AS primes_array)SELECT
description, prime
FROM
data
CROSS JOIN
UNNEST (primes_array) AS prime
This query returns 7 rows, however the next one returns only 6:
#standardSQL
WITH
data AS (
(SELECT
“primes under 15” AS description,
[1, 3, 5, 7, 11, 13] AS primes_array)
UNION ALL
SELECT
“primes under 16” AS description,
[ ] AS primes_array)SELECT
description, prime
FROM
data
CROSS JOIN
UNNEST (primes_array) AS prime
As you can see the only difference is in the line “[2] AS primes_array” which changes to a null value, “[ ] AS primes_array”. Then, this data is discarded with the CROSS JOIN, so when you do the query make sense that you see a difference in the results if there’s any NULL value within that “customDimensions”.
Credit for finding the problem: Evyatar Aviram
good read about unnest:
http://firebase.googleblog.com/2017/03/bigquery-tip-unnest-function.html
Contact me:
- Omid Vahdaty
- +972–54–2384178
Want more big data quality content? Join our meetup, subscribe to youtube channels
- https://www.meetup.com/AWS-Big-Data-Demystified/
- https://www.meetup.com/Big-Data-Demystified/
- Big Data Demystified YouTube
- AWS Big Data Demystified YouTube
- https://big-data-demystified.ninja/
——————————————————————————————————————————
I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me: