Uncategorized

what is the cheapest ways to parse GA_sessions at big query? What is the fastest way to parse GA_sessions at big query? What is the simplest way to parse GA_sessions at big query?

Before we begin , be sure to be familiar with the Google analytics schema. Notice the different Record types, especially the hits record.

Option one: sub query

The simplest solution would be using sub query for each custom dimension, processing 3.18 GB. The problem is that if you you a view on top of it, the cost does not behave as you would expect. so be careful, and test everything.

select
fullVisitorId ,
(select value from h.customDimensions where index = 1) as UID,
(select distinct value from h.customDimensions where index = 36) as App_Edition
FROM
1234.ga_sessions_20180501 t, t.hits h

For example

Option 2: User Defined Function:

using used defined function with SQL. it involves baseis coding. This would be the slowest solution, but unlikely to hit a resources problem.

The query below processed 3.18 GB

CREATE TEMPORARY FUNCTION getCustomDimension(cd ARRAY>, index INT64)
RETURNS STRING
LANGUAGE js AS “””
for(var i = 0; i < cd.length; i++) {
var item = cd[i];
if(item.index == index) {
return item.value
}
}
return ”;
“””;

SELECT
gs_sessions.fullVisitorId,
getCustomDimension(hits.customDimensions, 1) AS customDimension1_uid,
getCustomDimension(hits.customDimensions, 36) AS customDimension36_edition
FROM 1234.ga_sessions_20180501 gs_sessions,
UNNEST(gs_sessions.hits) hits

Option 3: With table as select

this option is the cheapest… 570 mb was processed, However. with a large number of custom dimensions , you might hit an error of “Error: Query exceeded resource limits. 86676.81298774751 CPU seconds were used,this query must use less than 55000.0 CPU seconds.” and there is nothing you can do about it . furthermore – the solution is not elegant in terms of maintenance, and it will slow down the more custom dimensions you have.

WITH CD AS (
SELECT
with_ga_sessions.fullVisitorId,
with_ga_sessions.visitId,
with_hits.hitNumber,
with_cd.index,
with_cd.value
FROM 1234.ga_sessions_20180501 with_ga_sessions,
UNNEST(with_ga_sessions.hits) with_hits,
UNNEST(with_ga_sessions.customDimensions) with_cd
WHERE with_cd.index IN (1, 36)
)
SELECT
ga_sessions.visitId,
CD_1.value AS customDimension1_uid,
CD_36.value AS customDimension36_edition
FROM 1234.ga_sessions_20180501 ga_sessions,
UNNEST(ga_sessions.hits) hits
LEFT JOIN CD CD_1 ON ga_sessions.fullVisitorId = CD_1.fullVisitorId
AND ga_sessions.visitId = CD_1.visitId
AND hits.hitNumber = CD_1.hitNumber
AND CD_1.index = 1
LEFT JOIN CD CD_36 ON ga_sessions.fullVisitorId = CD_36.fullVisitorId
AND ga_sessions.visitId = CD_36.visitId
AND hits.hitNumber = CD_36.hitNumber
AND CD_36.index =36

Option4: Flatten everything using AirFlow or any other orchestration tool.

This is the most expensive way, as each day we overwrite the entire history, as the custom dimension may change in some use case. if they don’t, then this approach may be be simple, cheap and fast. The idea is simple – convert each record to a bunch of columns and utilize the columnar power of bigquery. make sure the target flat table is partitioned by date. Notice that currently each table in BigQuery has a limitation of 2000 partition which means about 5.5 years. so the below is not good enough , consider sharded tables based on years. The storage here will be flattened, which means, anot more storage will be saved in BQ compared to GA_sessions storage only (about doble), but the benefit will be in compute time as the the table is columnar, and most queries require a short amount of columns.

CREATE  table `myProject.myDataSet.GA_FLAT` 
  ( date date,
  
 hits_type string,
 
 hits_page_pagePath string,
 hits_page_pagePathLevel1	string,
 hits_page_pagePathLevel2	string,
 hits_page_pagePathLevel3		string,
 hits_page_pagePathLevel4		string,
 
 hits_page_hostname string,
 hits_page_pageTitle string,
 hits_appInfo_appId string,
 hits_appInfo_appName string,
 hits_appInfo_screenName string,
 
 hits_eventInfo_eventCategory string,
 hits_eventInfo_eventAction string,
 hits_eventInfo_eventLabel string,
 hits_eventInfo_eventValue string,
 
    uid string,
    app_edition string,
 
 
    clientId string,
    fullVisitorId string,
    visitorId string,
    userId string,
    visitNumber string,
    visitId string,
    visitStartTime string,
    totals_bounces string,
    totals_hits string,
    totals_newVisits string,
    totals_pageviews string,
    totals_screenviews string,
    totals_sessionQualityDim string,
    totals_timeOnScreen string,
    totals_timeOnSite string,
    totals_totalTransactionRevenue string,
    totals_transactionRevenue string,
    totals_transactions string,
    totals_UniqueScreenViews string,
    totals_visits string,
    trafficSource_adContent string,
    rafficSource_adwordsClickInfo_adGroupId string,
    trafficSource_adwordsClickInfo_adNetworkType string,
    trafficSource_adwordsClickInfo_campaignId string,
    trafficSource_adwordsClickInfo_creativeId string,
    trafficSource_adwordsClickInfo_criteriaId string,
    trafficSource_adwordsClickInfo_criteriaParameters string,
    trafficSource_adwordsClickInfo_customerId string,
    trafficSource_adwordsClickInfo_gclId string,
    trafficSource_adwordsClickInfo_isVideoAd string,
    trafficSource_adwordsClickInfo_page string,
    trafficSource_adwordsClickInfo_slot string,
    trafficSource_adwordsClickInfo_targetingCriteria_boomUserlistId string,
    trafficSource_campaign string,
    trafficSource_campaignCode string,
    trafficSource_isTrueDirect string,
    trafficSource_keyword string,
    trafficSource_medium string,
    trafficSource_referralPath string,
    trafficSource_source string,
    socialEngagementType string,
    channelGrouping string,
    device_browser string,
    device_browserSize string,
    device_browserVersion string,
    device_deviceCategory string,
    device_mobileDeviceInfo string,
    device_mobileDeviceModel string,
    device_mobileInputSelector string,
    device_mobileDeviceMarketingName string,
    device_operatingSystem string,
    device_operatingSystemVersion string,
    device_isMobile string,
    device_mobileDeviceBranding string,
    device_flashVersion string,
    device_javaEnabled string,
    device_language string,
    device_screenColors string,
    device_screenResolution string,
    geoNetwork_continent string,
    geoNetwork_subContinent string,
    geoNetwork_country string,
    geoNetwork_region string,
    geoNetwork_metro string,
    geoNetwork_city string,
    geoNetwork_cityId string,
    geoNetwork_latitude string,
    geoNetwork_networkDomain string
 )
 partition by date

SELECT
PARSE_DATE('%Y%m%d', REPLACE(_TABLE_SUFFIX, "intraday_", "")) as date, 
type as hit_type,
 h.page.pagePath 	  as hits_page_pagePath,
 h.page.pagePathLevel1 as hits_page_pagePathLevel1,
 h.page.pagePathLevel2 as hits_page_pagePathLevel2,
 h.page.pagePathLevel3 as hits_page_pagePathLevel3,
 h.page.pagePathLevel4 as hits_page_pagePathLevel4,
h.page.hostname as hit_page_hostname,
h.page.pageTitle as hits_page_pageTitle,
h.appInfo.appId as hit_appInfo_appId,
h.appInfo.appName as hit_appInfo_appName,
h.appInfo.screenName as hit_appInfo_screenName,
h.eventInfo.eventCategory as hits_eventInfo_eventCategory ,
h.eventInfo.eventAction   as hits_eventInfo_eventAction ,
h.eventInfo.eventLabel    as hits_eventInfo_eventLabel ,
h.eventInfo.eventValue    as hits_eventInfo_eventValue ,
(select value from h.customDimensions where index = 1) as CD_1,
(select distinct value from h.customDimensions where index = 35) as CD_35,
clientId	,
fullVisitorId	,
visitorId	,
userId	,
visitNumber	,
visitId	,
visitStartTime	,
totals.bounces as totals_bounces,
totals.hits	as totals_hits,
totals.newVisits as totals_newVisits	,
totals.pageviews	as totals_pageviews,
totals.screenviews	as totals_screenviews,
totals.sessionQualityDim	as totals_sessionQualityDim,
totals.timeOnScreen	as totals_timeOnScreen,
totals.timeOnSite	as totals_timeOnSite,
totals.totalTransactionRevenue	as totals_totalTransactionRevenue,
totals.transactionRevenue	as totals_transactionRevenue,
totals.transactions	 as totals_transactions,
totals.UniqueScreenViews	as totals_UniqueScreenViews,
totals.visits  as totals_visits,
trafficSource.adContent as trafficSource_adContent	,
trafficSource. adwordsClickInfo.adGroupId as trafficSource_adwordsClickInfo_adGroupId	,
trafficSource. adwordsClickInfo.adNetworkType as trafficSource_adwordsClickInfo_adNetworkType	,
trafficSource. adwordsClickInfo.campaignId as trafficSource_adwordsClickInfo_campaignId	,
trafficSource. adwordsClickInfo.creativeId as trafficSource_adwordsClickInfo_creativeId	,
trafficSource. adwordsClickInfo.criteriaId as trafficSource_adwordsClickInfo_criteriaId	,
trafficSource. adwordsClickInfo.criteriaParameters as trafficSource_adwordsClickInfo_criteriaParameters,
trafficSource. adwordsClickInfo.customerId	as trafficSource_adwordsClickInfo_customerId,
trafficSource. adwordsClickInfo.gclId	as trafficSource_adwordsClickInfo_gclId,
trafficSource. adwordsClickInfo.isVideoAd	as trafficSource_adwordsClickInfo_isVideoAd,
trafficSource. adwordsClickInfo.page as trafficSource_adwordsClickInfo_page	,
trafficSource. adwordsClickInfo.slot as trafficSource_adwordsClickInfo_slot	,
trafficSource. adwordsClickInfo.targetingCriteria.boomUserlistId	as trafficSource_adwordsClickInfo_targetingCriteria_boomUserlistId,
trafficSource.campaign	as trafficSource_campaign,
trafficSource.campaignCode as trafficSource_campaignCode	,
trafficSource.isTrueDirect as trafficSource_isTrueDirect	,
trafficSource.keyword	as trafficSource_keyword,
trafficSource.medium	as trafficSource_medium,
trafficSource.referralPath as trafficSource_referralPath	,
trafficSource.source	as trafficSource_source,
socialEngagementType	  ,
channelGrouping	,
device.browser as device_browser	,
device.browserSize	 as device_browserSize	,
device.browserVersion	 as device_browserVersion,
device.deviceCategory	 as device_deviceCategory,
device.mobileDeviceInfo	as device_mobileDeviceInfo,
device.mobileDeviceModel	as device_mobileDeviceModel,
device.mobileInputSelector	as device_mobileInputSelector,
device.mobileDeviceMarketingName	as device_mobileDeviceMarketingName,
device.operatingSystem	as device_operatingSystem,
device.operatingSystemVersion	as device_operatingSystemVersion,
device.isMobile	as device_isMobile,
device.mobileDeviceBranding	as device_mobileDeviceBranding,
device.flashVersion as device_flashVersion	,
device.javaEnabled	as device_javaEnabled,
device.language	 as device_language,
device.screenColors	as device_screenColors	,
device.screenResolution	 as device_screenResolution	, 
geoNetwork.continent	as geoNetwork_continent	,
geoNetwork.subContinent	as geoNetwork_subContinent	,
geoNetwork.country	as geoNetwork_country	,
geoNetwork.region	as geoNetwork_region	,
geoNetwork.metro	as geoNetwork_metro	,
geoNetwork.city	as geoNetwork_city	,
geoNetwork.cityId	as geoNetwork_cityId	,
geoNetwork.latitude	as geoNetwork_latitude	,
geoNetwork.longitude as geoNetwork_longitude	,
geoNetwork.networkDomain as geoNetwork_networkDomain,
geoNetwork.networkLocation as geoNetwork_networkLocation
FROM
  `myProject.MyDataSet.ga_sessions_*` t, t.hits h
  

——————————————————————————————————————————

I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

Leave a Reply