Blog

Big Query, BQ ML, Machine Learning

K Means Via BQ ML Demystified

I took BigQuery ML for a test drive. Below are the details of a basic POC with k-means clustering. The idea was to divide the users of Investing.com into a number of groups and see how the different groups behave in terms of basic web traffic KPIs.

If you want to learn more about k-means clustering, check out our blog about Machine Learning Essentials.

BigQuery ML k-means clustering: a step-by-step example

Naturally, all the code snippets below are committed to our Big Data Demystified GitHub.

The basic BQ ML k-means training:

CREATE OR REPLACE MODEL
  BQ_ML.users_clusters OPTIONS (model_type='kmeans',
    num_clusters=4,
    standardize_features = TRUE) AS (
  SELECT
    SUM(PVs) AS pvs,
    SUM(sessions) AS sessions,
    COUNT(*) AS visits
  FROM
    `MyProject.DATA.Daily_Stats`
  WHERE
    date < '2019-12-01'
    AND date > '2019-11-01'
  GROUP BY
    users )
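
Optionally, before running predictions, you can sanity-check the clustering with ML.EVALUATE. This is just a minimal sketch; for k-means models it returns metrics such as the Davies-Bouldin index and the mean squared distance to the closest centroid:

SELECT
  *
FROM
  ML.EVALUATE(MODEL BQ_ML.users_clusters)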

After the k-means model is created, prediction is run as follows:

SELECT
  CENTROID_ID,
  pvs,
  sessions,
  visits
FROM
  ML.PREDICT( MODEL BQ_ML.users_clusters,
    (
    SELECT
      SUM(PVs) AS pvs,
      SUM(sessions) AS sessions,
      COUNT(*) AS visits
    FROM
      `MyProject.DATA.Daily_Stats`
    WHERE
      date < '2019-11-01'
      AND date > '2019-10-01'
    GROUP BY
      users
    LIMIT
      100000000 ) )
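
The analysis queries below read the prediction output from a table (BQ_ML.test3 in this post). A minimal sketch of how such a table could be created, assuming the same source table and date range as the prediction above:

CREATE OR REPLACE TABLE
  BQ_ML.test3 AS
SELECT
  CENTROID_ID,
  pvs,
  sessions,
  visits
FROM
  ML.PREDICT( MODEL BQ_ML.users_clusters,
    (
    SELECT
      SUM(PVs) AS pvs,
      SUM(sessions) AS sessions,
      COUNT(*) AS visits
    FROM
      `MyProject.DATA.Daily_Stats`
    WHERE
      date < '2019-11-01'
      AND date > '2019-10-01'
    GROUP BY
      users ) )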

Analyze the results:

SELECT
  CENTROID_ID,
  #min(pvs) as min_pvs,
  CAST(AVG(pvs) AS int64) AS avg_pvs,
  #max(pvs) as max_pvs,
  #min(sessions) as min_sessions,
  CAST(AVG(sessions) AS int64) AS avg_sessions,
  #max(sessions) as max_sessions
  CAST(AVG(visits) AS int64) AS avg_visits
FROM
  BQ_ML.test3
GROUP BY
  CENTROID_ID

Another way to analyze the results (based on the GCP documentation):

WITH
  T AS (
  SELECT
    centroid_id,
    ARRAY_AGG(STRUCT(feature AS name,
        ROUND(numerical_value,1) AS value)
    ORDER BY
      centroid_id) AS cluster
  FROM
    ML.CENTROIDS(MODEL BQ_ML.users_clusters)
  GROUP BY
    centroid_id )
SELECT
  CONCAT('Cluster#', CAST(centroid_id AS STRING)) AS centroid,
  (
  SELECT
    value
  FROM
    UNNEST(cluster)
  WHERE
    name = 'pvs') AS pvs,
  (
  SELECT
    value
  FROM
    UNNEST(cluster)
  WHERE
    name = 'sessions') AS sessions,
  (
  SELECT
    value
  FROM
    UNNEST(cluster)
  WHERE
    name = 'visits') AS visits
FROM
  T
ORDER BY
  centroid

Review the results (note: the numbers are not real, just an example):

SELECT
  CENTROID_ID,
  pvs,
  sessions,
  visits
FROM
  BQ_ML.test3
LIMIT
  10

By the way, a very important note about BQ ML costs:

When creating a model, if you are only using the BigQuery console, Google will only display the amount of data scanned. In my case it was 53.5 GB, which sounds cheap. However, what Google doesn't display outright is how many times it reads the data during training, which is several times over.
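
If you want to check what a CREATE MODEL statement actually billed, one option is to query the BigQuery jobs metadata. A minimal sketch, assuming the jobs ran in the US multi-region and that you have permission to read INFORMATION_SCHEMA.JOBS_BY_PROJECT:

SELECT
  job_id,
  creation_time,
  statement_type,
  ROUND(total_bytes_processed / POW(2, 30), 1) AS gb_processed,
  ROUND(total_bytes_billed / POW(2, 30), 1) AS gb_billed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  statement_type = 'CREATE_MODEL'
ORDER BY
  creation_time DESC
LIMIT
  10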

If you use superQuery, the cost is shown clearly, and it will probably surprise you:

Summary

  1. BQ ML is a very nice way to quickly get insights from data without needing a data scientist. A recommended use case for BQ ML: benchmarking against the model written by your in-house data scientist.
  2. Use BQ ML only for exploration, not for production (recommended by Google as well).
  3. Watch the costs of using BQ ML closely; superQuery is a nice way to get visibility into them.

——————————————————————————————————————————

I put a lot of thought into these blogs so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

NoSQL

Couchbase Data Platform | Big Data Demystified

Couchbase is a popular open-source NoSQL platform used by giants like Apple, LinkedIn, Walmart, Visa and many others, and it runs on-premises or in a public, hybrid, or multi cloud.
Couchbase has a sub-millisecond key/value cache integrated with a document-based DB, plus many more unique services and features.
In this session we will talk about the unique architecture of Couchbase and its N1QL language, a SQL-like language that is ANSI compliant, cover the services and features Couchbase offers, and demonstrate some of them live.
We will also discuss what makes Couchbase different from other popular NoSQL platforms like MongoDB, Cassandra, Redis and DynamoDB.
At the end we will talk about the next version of Couchbase (6.5), to be released later this year, and about Couchbase 7.0, planned for next year.

——————————————————————————————————————————


airflow

Airflow Demystified | Big Data Demystified

In this lecture I will share how we use Airflow at Investing.com.

Some more blogs about Airflow

——————————————————————————————————————————


avro, Big Query

AVRO and BQ example

Creating the schema from an Avro file can be done using a PythonOperator [1].

It will be quite similar to the process you follow in step 6 of the blog linked in [2], but instead of specifying avro.schema.url we will specify avro.schema.literal.

First, we have to extract the Avro schema from the Avro file located in GCS. The Python code in [3] outputs the JSON of the Avro schema.

Now that we have the schema, we create a Hive job to create the table. I made a small test with the Dataproc jobs API [4]; the REST request should look similar to [5].

────────────────────

[1]: https://cloud.google.com/composer/docs/how-to/using/writing-dags#pythonoperator

[2]: https://big-data-demystified.ninja/2018/05/27/how-to-export-data-from-google-big-query-into-aws-s3-emr-hive/

[3]:

from google.cloud import storage
from avro.datafile import DataFileReader
from avro.io import DatumReader


bucket_name = "<Bucket Name>"
blob_name = "<Bucket Path>/<AVRO FILE NAME>"
downloaded_blob = "downloaded_blob.avro"

# Download the Avro file from GCS to a local file
client = storage.Client()
bucket = client.lookup_bucket(bucket_name)
blob = bucket.blob(blob_name)
blob.download_to_filename(downloaded_blob)

# Read the schema embedded in the Avro file header and print it as JSON
with DataFileReader(open(downloaded_blob, 'rb'), DatumReader()) as avf:
    schema = avf.meta['avro.schema'].decode("utf-8")
    print(schema)  # example: {"type":"record","name":"Root","fields":[{"name":"string_field_0","type":["null","string"]}]}


[4]: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.jobs/submit

[5]:

{
  "job": {
    "hiveJob": {
      "queryList": {
        "queries": [
          "CREATE EXTERNAL TABLE transactions \nSTORED AS AVRO LOCATION 'gs://<BUCKET NAME>/<PATH TO AVRO FOLDER>/*' TBLPROPERTIES ('avro.schema.literal'='<OUTPUT OF THE PYTHON SCRIPT>')"
        ]
      }
    },
    "placement": {
      "clusterName": "<CLUSTER NAME>"
    },
    "reference": {
      "jobId": "<JOB NAME>"
    }
  }
}
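
For readability, this is the HiveQL embedded in the request above, unescaped (the placeholders are the same as in the request):

CREATE EXTERNAL TABLE transactions
STORED AS AVRO
LOCATION 'gs://<BUCKET NAME>/<PATH TO AVRO FOLDER>/*'
TBLPROPERTIES ('avro.schema.literal'='<OUTPUT OF THE PYTHON SCRIPT>');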

——————————————————————————————————————————


AWS EMR, Hive

Cherry pick source files in Hive external table example

A cool way to filter the source files in your bucket for an external table in Hive!

CREATE EXTERNAL TABLE mytable1 ( a string, b string, c string )
STORED AS TEXTFILE
LOCATION 's3://my.bucket/' 'folder/2009.*\.bz2$';

——————————————————————————————————————————
