I took BigQuery ML for a test drive. Below are the details of a basic POC with k-means clustering. The idea was to divide the users of Investing.com into a number of groups and see how the different groups of users behave in terms of basic web traffic KPIs.
If you want to learn more about k-means clustering, check out our blog about Machine Learning Essentials.
BigQuery ML k-means clustering – a step-by-step example:
Naturally, all code snippets below are committed to our Big Data Demystified GitHub.
The basic BQ ML k-means training:

```sql
CREATE OR REPLACE MODEL BQ_ML.users_clusters
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 4,
  standardize_features = TRUE
) AS (
  SELECT
    SUM(PVs) AS pvs,
    SUM(sessions) AS sessions,
    COUNT(*) AS visits
  FROM `MyProject.DATA.Daily_Stats`
  WHERE date < '2019-12-01' AND date > '2019-11-01'
  GROUP BY users
)
```
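Before predicting, it is worth sanity-checking the clustering itself. BigQuery ML supports `ML.EVALUATE` for k-means models; for this model type it returns the Davies–Bouldin index and the mean squared distance to the assigned centroid (lower is better for both). A minimal sketch against the model above:

```sql
-- Evaluate the trained k-means model (Davies-Bouldin index,
-- mean squared distance). Lower values indicate tighter clusters.
SELECT *
FROM ML.EVALUATE(MODEL BQ_ML.users_clusters)
```

Re-running the training with different `num_clusters` values and comparing these metrics is a cheap way to pick a reasonable cluster count.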
After the k-means model is created, prediction is used as follows:

```sql
SELECT
  CENTROID_ID,
  pvs,
  sessions,
  visits
FROM ML.PREDICT(
  MODEL BQ_ML.users_clusters,
  (
    SELECT
      SUM(PVs) AS pvs,
      SUM(sessions) AS sessions,
      COUNT(*) AS visits
    FROM `MyProject.DATA.Daily_Stats`
    WHERE date < '2019-11-01' AND date > '2019-10-01'
    GROUP BY users
    LIMIT 100000000
  )
)
```
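The analysis queries below read from a table of saved predictions, `BQ_ML.test3`. The original snippets don't show how that table was created, so this is an assumption: one way to materialize it is a `CREATE TABLE ... AS` over `ML.PREDICT` (setting a destination table in the BigQuery console works just as well):

```sql
-- Hypothetical step: persist predictions into BQ_ML.test3,
-- the table the analysis queries below read from.
CREATE OR REPLACE TABLE BQ_ML.test3 AS
SELECT CENTROID_ID, pvs, sessions, visits
FROM ML.PREDICT(
  MODEL BQ_ML.users_clusters,
  (
    SELECT SUM(PVs) AS pvs, SUM(sessions) AS sessions, COUNT(*) AS visits
    FROM `MyProject.DATA.Daily_Stats`
    WHERE date < '2019-11-01' AND date > '2019-10-01'
    GROUP BY users
  )
)
```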
To profile the clusters, aggregate the saved predictions per centroid (the commented-out lines show optional min/max features; note that on a single line the `#` comments would swallow the rest of the query, so keep them on their own lines):

```sql
SELECT
  CENTROID_ID,
  #min(pvs) as min_pvs,
  CAST(AVG(pvs) AS int64) AS avg_pvs,
  #max(pvs) as max_pvs,
  #min(sessions) as min_sessions,
  CAST(AVG(sessions) AS int64) AS avg_sessions,
  #max(sessions) as max_sessions,
  CAST(AVG(visits) AS int64) AS avg_visits
FROM BQ_ML.test3
GROUP BY CENTROID_ID
```
Another way to analyze the results (based on the GCP documentation):

```sql
WITH T AS (
  SELECT
    centroid_id,
    ARRAY_AGG(STRUCT(feature AS name, ROUND(numerical_value, 1) AS value)
              ORDER BY centroid_id) AS cluster
  FROM ML.CENTROIDS(MODEL BQ_ML.users_clusters)
  GROUP BY centroid_id
)
SELECT
  CONCAT('Cluster#', CAST(centroid_id AS STRING)) AS centroid,
  (SELECT value FROM UNNEST(cluster) WHERE name = 'pvs') AS pvs,
  (SELECT value FROM UNNEST(cluster) WHERE name = 'sessions') AS sessions,
  (SELECT value FROM UNNEST(cluster) WHERE name = 'visits') AS visits
FROM T
ORDER BY centroid
```
Review the results (note that the numbers are not real, they are just an example):

```sql
SELECT CENTROID_ID, pvs, sessions, visits
FROM BQ_ML.test3
LIMIT 10
```
By the way, a very important note about BQ ML costs:

When creating a model, the BigQuery console only displays the amount of data scanned. In my case it was 53.5GB, which sounds cheap. However, what Google doesn't display outright is how many times it reads the data during training – which is several times over.
If you use superQuery, the cost will be shown clearly, and it will probably surprise you:
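If you prefer to stay inside BigQuery, a rough alternative for cost visibility is querying `INFORMATION_SCHEMA.JOBS_BY_PROJECT`, which reports bytes billed per job. This is a sketch under assumptions: the region qualifier (`region-us` here) must match where your data lives, and the `CREATE_MODEL` statement-type filter targets training jobs specifically:

```sql
-- Sketch: inspect actual bytes billed for recent model-training jobs.
-- Adjust `region-us` to your dataset's region.
SELECT
  job_id,
  statement_type,
  total_bytes_processed,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE statement_type = 'CREATE_MODEL'
ORDER BY creation_time DESC
LIMIT 10
```

Comparing `total_bytes_billed` against the bytes-scanned estimate in the console makes the repeated-read effect visible.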
My takeaways:

- BQ ML is a very nice way to quickly get insights from data without needing a data scientist. A recommended use case for BQ ML: benchmarking against the model written by your in-house data scientist.
- Use BQ ML only for exploration, not for production (recommended by Google as well).
- Watch out for the costs of using BQ ML – superQuery is a nice way to get visibility into them.
I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me: