Before we begin , be sure to be familiar with the Google analytics schema. Notice the different Record types, especially the hits record.
Option one: sub query
The simplest solution would be using sub query for each custom dimension, processing 3.18 GB. The problem is that if you you a view on top of it, the cost does not behave as you would expect. so be careful, and test everything.
select
fullVisitorId ,
(select value from h.customDimensions where index = 1) as UID,
(select distinct value from h.customDimensions where index = 36) as App_Edition
FROM
1234.ga_sessions_20180501 t, t.hits h
For example
Option 2: User Defined Function:
using used defined function with SQL. it involves baseis coding. This would be the slowest solution, but unlikely to hit a resources problem.
The query below processed 3.18 GB
CREATE TEMPORARY FUNCTION getCustomDimension(cd ARRAY>, index INT64)
RETURNS STRING
LANGUAGE js AS “””
for(var i = 0; i < cd.length; i++) {
var item = cd[i];
if(item.index == index) {
return item.value
}
}
return ”;
“””;
SELECT
gs_sessions.fullVisitorId,
getCustomDimension(hits.customDimensions, 1) AS customDimension1_uid,
getCustomDimension(hits.customDimensions, 36) AS customDimension36_edition
FROM 1234
.ga_sessions_20180501
gs_sessions,
UNNEST(gs_sessions.hits) hits
Option 3: With table as select
this option is the cheapest… 570 mb was processed, However. with a large number of custom dimensions , you might hit an error of “Error: Query exceeded resource limits. 86676.81298774751 CPU seconds were used,this query must use less than 55000.0 CPU seconds.” and there is nothing you can do about it . furthermore – the solution is not elegant in terms of maintenance, and it will slow down the more custom dimensions you have.
WITH CD AS (
SELECT
with_ga_sessions.fullVisitorId,
with_ga_sessions.visitId,
with_hits.hitNumber,
with_cd.index,
with_cd.value
FROM 1234
.ga_sessions_20180501
with_ga_sessions,
UNNEST(with_ga_sessions.hits) with_hits,
UNNEST(with_ga_sessions.customDimensions) with_cd
WHERE with_cd.index IN (1, 36)
)
SELECT
ga_sessions.visitId,
CD_1.value AS customDimension1_uid,
CD_36.value AS customDimension36_edition
FROM 1234
.ga_sessions_20180501
ga_sessions,
UNNEST(ga_sessions.hits) hits
LEFT JOIN CD CD_1 ON ga_sessions.fullVisitorId = CD_1.fullVisitorId
AND ga_sessions.visitId = CD_1.visitId
AND hits.hitNumber = CD_1.hitNumber
AND CD_1.index = 1
LEFT JOIN CD CD_36 ON ga_sessions.fullVisitorId = CD_36.fullVisitorId
AND ga_sessions.visitId = CD_36.visitId
AND hits.hitNumber = CD_36.hitNumber
AND CD_36.index =36
Option4: Flatten everything using AirFlow or any other orchestration tool.
This is the most expensive way, as each day we overwrite the entire history, as the custom dimension may change in some use case. if they don’t, then this approach may be be simple, cheap and fast. The idea is simple – convert each record to a bunch of columns and utilize the columnar power of bigquery. make sure the target flat table is partitioned by date. Notice that currently each table in BigQuery has a limitation of 2000 partition which means about 5.5 years. so the below is not good enough , consider sharded tables based on years. The storage here will be flattened, which means, anot more storage will be saved in BQ compared to GA_sessions storage only (about doble), but the benefit will be in compute time as the the table is columnar, and most queries require a short amount of columns.
CREATE table `myProject.myDataSet.GA_FLAT` ( date date, hits_type string, hits_page_pagePath string, hits_page_pagePathLevel1 string, hits_page_pagePathLevel2 string, hits_page_pagePathLevel3 string, hits_page_pagePathLevel4 string, hits_page_hostname string, hits_page_pageTitle string, hits_appInfo_appId string, hits_appInfo_appName string, hits_appInfo_screenName string, hits_eventInfo_eventCategory string, hits_eventInfo_eventAction string, hits_eventInfo_eventLabel string, hits_eventInfo_eventValue string, uid string, app_edition string, clientId string, fullVisitorId string, visitorId string, userId string, visitNumber string, visitId string, visitStartTime string, totals_bounces string, totals_hits string, totals_newVisits string, totals_pageviews string, totals_screenviews string, totals_sessionQualityDim string, totals_timeOnScreen string, totals_timeOnSite string, totals_totalTransactionRevenue string, totals_transactionRevenue string, totals_transactions string, totals_UniqueScreenViews string, totals_visits string, trafficSource_adContent string, rafficSource_adwordsClickInfo_adGroupId string, trafficSource_adwordsClickInfo_adNetworkType string, trafficSource_adwordsClickInfo_campaignId string, trafficSource_adwordsClickInfo_creativeId string, trafficSource_adwordsClickInfo_criteriaId string, trafficSource_adwordsClickInfo_criteriaParameters string, trafficSource_adwordsClickInfo_customerId string, trafficSource_adwordsClickInfo_gclId string, trafficSource_adwordsClickInfo_isVideoAd string, trafficSource_adwordsClickInfo_page string, trafficSource_adwordsClickInfo_slot string, trafficSource_adwordsClickInfo_targetingCriteria_boomUserlistId string, trafficSource_campaign string, trafficSource_campaignCode string, trafficSource_isTrueDirect string, trafficSource_keyword string, trafficSource_medium string, trafficSource_referralPath string, trafficSource_source string, socialEngagementType string, channelGrouping string, device_browser string, device_browserSize string, device_browserVersion string, device_deviceCategory string, device_mobileDeviceInfo string, device_mobileDeviceModel string, device_mobileInputSelector string, device_mobileDeviceMarketingName string, device_operatingSystem string, device_operatingSystemVersion string, device_isMobile string, device_mobileDeviceBranding string, device_flashVersion string, device_javaEnabled string, device_language string, device_screenColors string, device_screenResolution string, geoNetwork_continent string, geoNetwork_subContinent string, geoNetwork_country string, geoNetwork_region string, geoNetwork_metro string, geoNetwork_city string, geoNetwork_cityId string, geoNetwork_latitude string, geoNetwork_networkDomain string ) partition by date
SELECT PARSE_DATE('%Y%m%d', REPLACE(_TABLE_SUFFIX, "intraday_", "")) as date, type as hit_type, h.page.pagePath as hits_page_pagePath, h.page.pagePathLevel1 as hits_page_pagePathLevel1, h.page.pagePathLevel2 as hits_page_pagePathLevel2, h.page.pagePathLevel3 as hits_page_pagePathLevel3, h.page.pagePathLevel4 as hits_page_pagePathLevel4, h.page.hostname as hit_page_hostname, h.page.pageTitle as hits_page_pageTitle, h.appInfo.appId as hit_appInfo_appId, h.appInfo.appName as hit_appInfo_appName, h.appInfo.screenName as hit_appInfo_screenName, h.eventInfo.eventCategory as hits_eventInfo_eventCategory , h.eventInfo.eventAction as hits_eventInfo_eventAction , h.eventInfo.eventLabel as hits_eventInfo_eventLabel , h.eventInfo.eventValue as hits_eventInfo_eventValue , (select value from h.customDimensions where index = 1) as CD_1, (select distinct value from h.customDimensions where index = 35) as CD_35, clientId , fullVisitorId , visitorId , userId , visitNumber , visitId , visitStartTime , totals.bounces as totals_bounces, totals.hits as totals_hits, totals.newVisits as totals_newVisits , totals.pageviews as totals_pageviews, totals.screenviews as totals_screenviews, totals.sessionQualityDim as totals_sessionQualityDim, totals.timeOnScreen as totals_timeOnScreen, totals.timeOnSite as totals_timeOnSite, totals.totalTransactionRevenue as totals_totalTransactionRevenue, totals.transactionRevenue as totals_transactionRevenue, totals.transactions as totals_transactions, totals.UniqueScreenViews as totals_UniqueScreenViews, totals.visits as totals_visits, trafficSource.adContent as trafficSource_adContent , trafficSource. adwordsClickInfo.adGroupId as trafficSource_adwordsClickInfo_adGroupId , trafficSource. adwordsClickInfo.adNetworkType as trafficSource_adwordsClickInfo_adNetworkType , trafficSource. adwordsClickInfo.campaignId as trafficSource_adwordsClickInfo_campaignId , trafficSource. adwordsClickInfo.creativeId as trafficSource_adwordsClickInfo_creativeId , trafficSource. adwordsClickInfo.criteriaId as trafficSource_adwordsClickInfo_criteriaId , trafficSource. adwordsClickInfo.criteriaParameters as trafficSource_adwordsClickInfo_criteriaParameters, trafficSource. adwordsClickInfo.customerId as trafficSource_adwordsClickInfo_customerId, trafficSource. adwordsClickInfo.gclId as trafficSource_adwordsClickInfo_gclId, trafficSource. adwordsClickInfo.isVideoAd as trafficSource_adwordsClickInfo_isVideoAd, trafficSource. adwordsClickInfo.page as trafficSource_adwordsClickInfo_page , trafficSource. adwordsClickInfo.slot as trafficSource_adwordsClickInfo_slot , trafficSource. adwordsClickInfo.targetingCriteria.boomUserlistId as trafficSource_adwordsClickInfo_targetingCriteria_boomUserlistId, trafficSource.campaign as trafficSource_campaign, trafficSource.campaignCode as trafficSource_campaignCode , trafficSource.isTrueDirect as trafficSource_isTrueDirect , trafficSource.keyword as trafficSource_keyword, trafficSource.medium as trafficSource_medium, trafficSource.referralPath as trafficSource_referralPath , trafficSource.source as trafficSource_source, socialEngagementType , channelGrouping , device.browser as device_browser , device.browserSize as device_browserSize , device.browserVersion as device_browserVersion, device.deviceCategory as device_deviceCategory, device.mobileDeviceInfo as device_mobileDeviceInfo, device.mobileDeviceModel as device_mobileDeviceModel, device.mobileInputSelector as device_mobileInputSelector, device.mobileDeviceMarketingName as device_mobileDeviceMarketingName, device.operatingSystem as device_operatingSystem, device.operatingSystemVersion as device_operatingSystemVersion, device.isMobile as device_isMobile, device.mobileDeviceBranding as device_mobileDeviceBranding, device.flashVersion as device_flashVersion , device.javaEnabled as device_javaEnabled, device.language as device_language, device.screenColors as device_screenColors , device.screenResolution as device_screenResolution , geoNetwork.continent as geoNetwork_continent , geoNetwork.subContinent as geoNetwork_subContinent , geoNetwork.country as geoNetwork_country , geoNetwork.region as geoNetwork_region , geoNetwork.metro as geoNetwork_metro , geoNetwork.city as geoNetwork_city , geoNetwork.cityId as geoNetwork_cityId , geoNetwork.latitude as geoNetwork_latitude , geoNetwork.longitude as geoNetwork_longitude , geoNetwork.networkDomain as geoNetwork_networkDomain, geoNetwork.networkLocation as geoNetwork_networkLocation FROM `myProject.MyDataSet.ga_sessions_*` t, t.hits h
——————————————————————————————————————————
I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me: