I took BigQuery ML for a test drive. Below are the details of a basic POC with k-means clustering. The idea was to divide the users of Investing.com into a number of groups and see how the different groups of users behave in terms of basic web traffic KPIs.
If you want to learn more about k-means clustering, check out our blog about Machine Learning Essentials.
BigQuery ML k-means clustering – a step-by-step example:
Naturally, all code snippets below are committed to our Big Data Demystified GitHub.
The basic BQ ML k-means training:

```sql
CREATE OR REPLACE MODEL BQ_ML.users_clusters
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 4,
  standardize_features = TRUE
) AS (
  SELECT
    SUM(PVs) AS pvs,
    SUM(sessions) AS sessions,
    COUNT(*) AS visits
  FROM `MyProject.DATA.Daily_Stats`
  WHERE date < '2019-12-01' AND date > '2019-11-01'
  GROUP BY users
)
```
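Before predicting, it is worth sanity-checking the clustering itself. BigQuery ML supports `ML.EVALUATE` for k-means models; for this model type it returns the Davies–Bouldin index and the mean squared distance to the assigned centroid (lower is better for both). A minimal sketch against the model above:

```sql
-- Evaluate the trained k-means model (Davies-Bouldin index,
-- mean squared distance). Lower values indicate tighter clusters.
SELECT *
FROM ML.EVALUATE(MODEL BQ_ML.users_clusters)
```

Re-running the training with different `num_clusters` values and comparing these metrics is a cheap way to pick a reasonable cluster count.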
After the k-means model is created, prediction is used as follows:

```sql
SELECT
  CENTROID_ID,
  pvs,
  sessions,
  visits
FROM ML.PREDICT(
  MODEL BQ_ML.users_clusters,
  (
    SELECT
      SUM(PVs) AS pvs,
      SUM(sessions) AS sessions,
      COUNT(*) AS visits
    FROM `MyProject.DATA.Daily_Stats`
    WHERE date < '2019-11-01' AND date > '2019-10-01'
    GROUP BY users
    LIMIT 100000000
  )
)
```
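The analysis queries below read from a table of saved predictions, `BQ_ML.test3`. The original snippets don't show how that table was created, so this is an assumption: one way to materialize it is a `CREATE TABLE ... AS` over `ML.PREDICT` (setting a destination table in the BigQuery console works just as well):

```sql
-- Hypothetical step: persist predictions into BQ_ML.test3,
-- the table the analysis queries below read from.
CREATE OR REPLACE TABLE BQ_ML.test3 AS
SELECT CENTROID_ID, pvs, sessions, visits
FROM ML.PREDICT(
  MODEL BQ_ML.users_clusters,
  (
    SELECT SUM(PVs) AS pvs, SUM(sessions) AS sessions, COUNT(*) AS visits
    FROM `MyProject.DATA.Daily_Stats`
    WHERE date < '2019-11-01' AND date > '2019-10-01'
    GROUP BY users
  )
)
```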
To profile the clusters, aggregate the saved predictions per centroid (the commented-out lines show optional min/max features; note that on a single line the `#` comments would swallow the rest of the query, so keep them on their own lines):

```sql
SELECT
  CENTROID_ID,
  #min(pvs) as min_pvs,
  CAST(AVG(pvs) AS int64) AS avg_pvs,
  #max(pvs) as max_pvs,
  #min(sessions) as min_sessions,
  CAST(AVG(sessions) AS int64) AS avg_sessions,
  #max(sessions) as max_sessions,
  CAST(AVG(visits) AS int64) AS avg_visits
FROM BQ_ML.test3
GROUP BY CENTROID_ID
```
Another way to analyze the results (based on the GCP documentation):

```sql
WITH T AS (
  SELECT
    centroid_id,
    ARRAY_AGG(STRUCT(feature AS name, ROUND(numerical_value, 1) AS value)
              ORDER BY centroid_id) AS cluster
  FROM ML.CENTROIDS(MODEL BQ_ML.users_clusters)
  GROUP BY centroid_id
)
SELECT
  CONCAT('Cluster#', CAST(centroid_id AS STRING)) AS centroid,
  (SELECT value FROM UNNEST(cluster) WHERE name = 'pvs') AS pvs,
  (SELECT value FROM UNNEST(cluster) WHERE name = 'sessions') AS sessions,
  (SELECT value FROM UNNEST(cluster) WHERE name = 'visits') AS visits
FROM T
ORDER BY centroid
```
Review the results (note that the numbers are not real, they are just an example):

```sql
SELECT CENTROID_ID, pvs, sessions, visits
FROM BQ_ML.test3
LIMIT 10
```
By the way, a very important note about BQ ML costs:

When creating a model, the BigQuery console only displays the amount of data scanned. In my case it was 53.5GB, which sounds cheap. However, what Google doesn't display outright is how many times it reads the data during training – which is several times over.
If you use superQuery, the cost will be shown clearly, and it will probably surprise you:
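If you prefer to stay inside BigQuery, a rough alternative for cost visibility is querying `INFORMATION_SCHEMA.JOBS_BY_PROJECT`, which reports bytes billed per job. This is a sketch under assumptions: the region qualifier (`region-us` here) must match where your data lives, and the `CREATE_MODEL` statement-type filter targets training jobs specifically:

```sql
-- Sketch: inspect actual bytes billed for recent model-training jobs.
-- Adjust `region-us` to your dataset's region.
SELECT
  job_id,
  statement_type,
  total_bytes_processed,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE statement_type = 'CREATE_MODEL'
ORDER BY creation_time DESC
LIMIT 10
```

Comparing `total_bytes_billed` against the bytes-scanned estimate in the console makes the repeated-read effect visible.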
My takeaways:

- BQ ML is a very nice way to quickly get insights from data without needing a data scientist. A recommended use case for BQ ML: benchmarking against the model written by your in-house data scientist.
- Use BQ ML only for exploration, not for production (recommended by Google as well).
- Watch out for the costs of using BQ ML – superQuery is a nice way to get visibility into them.
I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me: