I took BigQuery ML for a test drive. Below are the details of a basic POC with k-means clustering. The idea was to divide the users of Investing.com into a number of groups and see how the different groups of users behave in terms of basic web-traffic KPIs.
If you want to learn more about k-means clustering, check out our blog about Machine Learning Essentials.
BigQuery ML k-means clustering: a step-by-step example
Naturally, all code snippets below are committed to our Big Data Demystified GitHub.
The basic BQ ML k-means training:
CREATE OR REPLACE MODEL
BQ_ML.users_clusters OPTIONS (model_type='kmeans',
num_clusters=4,
standardize_features = TRUE) AS (
SELECT
  SUM(PVs) AS pvs,
SUM(sessions) AS sessions,
COUNT(*) AS visits
FROM
`MyProject.DATA.Daily_Stats`
WHERE
date < '2019-12-01'
AND date > '2019-11-01'
GROUP BY
users )
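Before running predictions, it is worth sanity-checking the model with ML.EVALUATE, which for a k-means model returns the Davies-Bouldin index and the mean squared distance to the closest centroid (lower is better for both). A minimal sketch:
SELECT
  *
FROM
  ML.EVALUATE(MODEL BQ_ML.users_clusters)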
After the k-means model is created, prediction is done as follows:
SELECT
CENTROID_ID,
pvs,
sessions,
visits
FROM
ML.PREDICT( MODEL BQ_ML.users_clusters,
(
SELECT
  SUM(PVs) AS pvs,
SUM(sessions) AS sessions,
COUNT(*) AS visits
FROM
`MyProject.DATA.Daily_Stats`
WHERE
date < '2019-11-01'
AND date > '2019-10-01'
GROUP BY
users
LIMIT
100000000 ) )
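The analysis queries below read from a table BQ_ML.test3, whose creation is not shown above. Presumably it is the prediction output saved to a table; a sketch, assuming the same input query as the prediction step (the table name BQ_ML.test3 simply matches the queries below):
CREATE OR REPLACE TABLE
  BQ_ML.test3 AS
SELECT
  CENTROID_ID,
  pvs,
  sessions,
  visits
FROM
  ML.PREDICT( MODEL BQ_ML.users_clusters,
    (
    SELECT
      SUM(PVs) AS pvs,
      SUM(sessions) AS sessions,
      COUNT(*) AS visits
    FROM
      `MyProject.DATA.Daily_Stats`
    WHERE
      date < '2019-11-01'
      AND date > '2019-10-01'
    GROUP BY
      users ) )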
Analyze the results:
SELECT
  CENTROID_ID,
  #MIN(pvs) AS min_pvs,
  CAST(AVG(pvs) AS INT64) AS avg_pvs,
  #MAX(pvs) AS max_pvs,
  #MIN(sessions) AS min_sessions,
  CAST(AVG(sessions) AS INT64) AS avg_sessions,
  #MAX(sessions) AS max_sessions,
  CAST(AVG(visits) AS INT64) AS avg_visits
FROM
  BQ_ML.test3
GROUP BY
  CENTROID_ID
Another way to analyze the results (based on the GCP documentation):
WITH
T AS (
SELECT
centroid_id,
ARRAY_AGG(STRUCT(feature AS name,
ROUND(numerical_value,1) AS value)
ORDER BY
centroid_id) AS cluster
FROM
ML.CENTROIDS(MODEL BQ_ML.users_clusters)
GROUP BY
centroid_id )
SELECT
CONCAT('Cluster#', CAST(centroid_id AS STRING)) AS centroid,
(
SELECT
value
FROM
UNNEST(cluster)
WHERE
name = 'pvs') AS pvs,
(
SELECT
value
FROM
UNNEST(cluster)
WHERE
name = 'sessions') AS sessions,
(
SELECT
value
FROM
UNNEST(cluster)
WHERE
name = 'visits') AS visits
FROM
T
ORDER BY
centroid
Review the results (note that the results shown are not real, just an example):
SELECT CENTROID_ID, pvs, sessions, visits FROM BQ_ML.test3 LIMIT 10

By the way, a very important note about BQ ML costs:
When creating a model, the BigQuery Console only displays the amount of data scanned. In my case it was 53.5 GB, which sounds cheap. However, what Google doesn't show up front is how many times the training reads that data, which can be several passes over the full input.
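One way to verify what you were actually billed, independent of any third-party tool, is BigQuery's INFORMATION_SCHEMA job views. A sketch, assuming your jobs run in the US region (adjust the region qualifier and time window as needed):
SELECT
  job_id,
  total_bytes_processed,
  total_bytes_billed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY
  total_bytes_billed DESC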

If you use superQuery, the cost is shown clearly, and it will probably surprise you:

Summary
- BQ ML is a very nice way to quickly get insights from data without involving a data scientist. A recommended use case for BQ ML: benchmarking against the model written by your in-house data scientist.
- Use BQ ML for exploration only, not for production (Google recommends the same).
- Keep an eye on the costs of using BQ ML; superQuery is a nice way to get visibility into them.

——————————————————————————————————————————
I put a lot of thought into these blogs to share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:
