BigQuery, Cost Reduction

BigQuery Cheat Sheet

How much data is in a specific dataset in GCP BigQuery?

SELECT SUM(size_bytes)/(1024*1024*1024) AS size_GB
FROM `DATA_LAKE_TRANSFORMATION.__TABLES__`

How much data is in a BQ project?

SELECT
  dataset_id,
  count(*) AS tables,
  SUM(row_count) AS total_rows,
  SUM(size_bytes) AS size_bytes
FROM (
  SELECT * FROM `dataset1.__TABLES__` UNION ALL
  SELECT * FROM `dataset2.__TABLES__`
)
GROUP BY 1
ORDER BY size_bytes DESC
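
If you don’t want to list every dataset manually, the region-qualified INFORMATION_SCHEMA.TABLE_STORAGE view can give the same project-wide picture in one query. A minimal sketch, assuming your data lives in the US multi-region (adjust region-us to your dataset location):

SELECT
  table_schema AS dataset_id,
  COUNT(*) AS tables,
  SUM(total_rows) AS total_rows,
  SUM(total_logical_bytes) AS size_bytes
FROM `region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
GROUP BY 1
ORDER BY size_bytes DESC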

Parsing Google Analytics custom dimensions

SELECT
  (SELECT DISTINCT value FROM h.customDimensions WHERE index = 4) AS cd_4
FROM
  `MyProject.MyDataSet.ga_sessions_*` t, t.hits h

Note that here you may want to keep an eye on the costs. See the relevant article on different approaches to parsing Google Analytics data.
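
One way to keep the cost of this pattern down is to restrict the wildcard scan with _TABLE_SUFFIX so you only pay for the dates you actually need. A minimal sketch, with placeholder project, dataset and dates:

SELECT
  (SELECT DISTINCT value FROM h.customDimensions WHERE index = 4) AS cd_4
FROM
  `MyProject.MyDataSet.ga_sessions_*` t, t.hits h
WHERE _TABLE_SUFFIX BETWEEN '20210101' AND '20210107'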

Uncategorised

GCP Cost Reduction in a nutshell

Author: Ilan Rosen

In this article we will cover the key factors in reducing BigQuery costs, from writing queries properly to building a data architecture that supports one goal: reducing costs.

Queries in BQ Cost Money

Pricing in BQ follows a pay-as-you-go model: $5 per 1 TB of data scanned when executing your query.
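
For example, a query that scans 200 GB is billed roughly 200 / 1024 TB × $5 ≈ $0.98, and a query that scans 10 TB costs about $50.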

Rule of thumb – in most cases, if your query costs more than $1 you are probably doing something wrong.

Be in control of your cost – this tool can help you dramatically with that:

Do’s and Don'ts

  1. Try to avoid using “SELECT *”.
    This way you avoid paying to scan columns that you don’t really need in your query result (see the sketch after this list).
  2. Using the “LIMIT” clause on a query will not affect the amount of data scanned.
  3. Use the “Table Preview” option in order to see a few rows of the table (instead of using LIMIT) – this is free.
  4. There is also a more extreme protection level under the Query settings – Advanced options section: “Maximum bytes billed”.
    Using this will limit the bytes billed for the query. If the query would bill bytes beyond this limit, it fails (without incurring a charge).
  5. Use the default table expiration time to remove data when it’s no longer needed – this will save storage costs (see the sketch after this list).
  6. Use streaming inserts only if your data must be immediately available for querying. In other words – don’t use streaming inserts unless you have to! They cost money.
  7. Quotas – you can define a hard limit on the amount of data scanned per day, at the project level or even per user. It’s good practice to set a daily budget that keeps you in the safe zone in terms of cost.
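
A minimal sketch of points 1 and 5, assuming a hypothetical my_dataset.events table with event_date, user_id and revenue columns:

-- Point 1: select only the columns you need instead of SELECT *
SELECT event_date, user_id, revenue
FROM `my_dataset.events`
WHERE event_date = '2021-01-01';

-- Point 5: set a default expiration (here 30 days) on a dataset so that
-- new tables are removed automatically and stop incurring storage costs
ALTER SCHEMA my_dataset
SET OPTIONS (default_table_expiration_days = 30);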

Performance (& Cost) Key Factors

  1. Partition your tables – at least the significantly big ones!
    This will improve both query performance and cost.
  2. There are several ways to partition tables; the most common, and usually also the most cost- and performance-effective, is the time-partitioned tables method (see the sketch after this list).
  3. Denormalize when possible – BigQuery by its nature is designed for one huge table. Implementing normalized table relations will result in many joins and overhead in terms of performance and cost.
  4. Use “ORDER BY” only in the outermost query or within window clauses (analytic functions). Push complex operations to the end of the query.
  5. Avoid self-joins. Use a window function instead (see the sketch after this list).
  6. The same applies to cross-joins.
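
A minimal sketch of points 1–2 and 5, assuming a hypothetical my_dataset.raw_events table with a DATE column event_date and a user_id column:

-- Points 1-2: create a time-partitioned copy of the table; queries that filter
-- on the partition column only scan (and pay for) the matching partitions
CREATE TABLE `my_dataset.events_partitioned`
PARTITION BY event_date AS
SELECT * FROM `my_dataset.raw_events`;

-- The filter on the partition column prunes the scan to a single day
SELECT COUNT(*)
FROM `my_dataset.events_partitioned`
WHERE event_date = '2021-01-01';

-- Point 5: rank events per user with a window function instead of a self-join
SELECT user_id, event_date,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_date) AS event_rank
FROM `my_dataset.events_partitioned`;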

Cost-Reducing Data Architecture Guidelines

  1. As mentioned at the top of the article, the key principle that guides us when designing a cost-effective architecture is to reduce the amount of data scanned by the end user.
  2. In most cases the end user will be an analyst or a business user working through a BI platform.
  3. We want to reduce the amount of data scanned both in terms of cost and, let’s not forget, performance.
    You don’t want your CEO to wait more than a few seconds for a dashboard to be presented with fresh data.
  4. The way to reduce the data is a simple layering method that starts with the raw data (the ingestion layer) and ends with a very small portion of it, designed and grouped specifically for the BI tool of your choice (the presentation layer).
  5. In between those layers you should do all the cleansing / transforming / joining / aggregating needed so that the top layer is optimized for fast and cheap queries for your BI visualizations.
  6. We strongly recommend decoupling your compute and storage.
    A common way to ensure that is to use Airflow to orchestrate all of your data pipelines.
  7. The data operations on each level can be executed by a tool of your choice. In order to simplify the process, a good practice is to start by implementing it solely with internal views (see the sketch below).
  8. This is how it looks eventually:
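
As a minimal sketch of point 7, assuming hypothetical raw, staging and presentation datasets and an orders table. Plain views only organize the logic; to actually cut the bytes scanned by the BI tool, you would later materialize these layers as tables (for example on an Airflow schedule, per point 6):

-- Staging layer: a cleaned, typed view over the raw ingestion table
CREATE OR REPLACE VIEW `staging.orders` AS
SELECT
  CAST(order_id AS INT64)  AS order_id,
  DATE(created_at)         AS order_date,
  CAST(amount AS NUMERIC)  AS amount
FROM `raw.orders`
WHERE order_id IS NOT NULL;

-- Presentation layer: a small daily aggregate for the BI dashboard
CREATE OR REPLACE VIEW `presentation.daily_revenue` AS
SELECT order_date, SUM(amount) AS revenue, COUNT(*) AS orders
FROM `staging.orders`
GROUP BY order_date;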

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, or questions, or you need someone to consult with,

feel free to contact me via LinkedIn:

AWS S3

How to resolve s3fuse writing files as root?

When you install s3fuse to mount an AWS S3 bucket as a local folder on your Linux machine, you may encounter a scenario where files uploaded to the S3 bucket show up in your folder as owned by the root user instead of the normal ubuntu user.

The problem is that s3fuse has mounted the files as the root user and no other user was configured. Another possible source of the problem is an old version.

The solution to s3fuse permissions problems:

  1. The first step is to edit the fuse config file and uncomment “user_allow_other”:
sudo nano /etc/fuse.conf

2. Find out the uid and gid of your ubuntu Linux user:

id -u ubuntu
id -g ubuntu

3. Unmount the partition if it is already mounted, then re-mount it with -o allow_other and the uid and gid from step 2:

s3fs myBucket:/ /home/ubuntu/airflow/dags -o allow_other -o use_rrs -o use_cache=/tmp -o uid=1000 -o gid=1000
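
In this example the bucket name, mount point, and uid/gid of 1000 are placeholders for your own values: allow_other lets users other than the one who mounted the bucket access it (which is why user_allow_other had to be enabled in /etc/fuse.conf), use_cache points s3fs at a local cache directory, and uid/gid make the files show up as owned by your ubuntu user instead of root.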

4. Test by uploading a file to the S3 bucket and checking the permissions on the same file at the Linux OS level:

ls -la

Sources:

https://kb.iu.edu/d/adwf

https://www.howtoinstall.me/ubuntu/18-04/s3fs/

https://stackoverflow.com/questions/9764401/change-user-ownership-of-s3fs-mounted-buckets

——————————————————————————————————————————


Uncategorised

Data Engineering Use Cases Demystified | Online meetup

What will you learn:

– we are going to cover the basics of creating an architecture, and then we will build our own architecture LIVE on a made-up use case.

– we are going to deal with the following questions:
Suggest a Data lake architecture – how would you do it?
Batch / Streaming?
What are the Orchestration challenges?
Which Technology stack to implement the above?
Are the existing data sources good enough to get started?
Is your use case ROI positive?

Meetup video

Meetup slides:


——————————————————————————————————————————

Register for our meetup:

Subscribe to our YouTube channel:

Power BI

How to Connect Power BI and BigQuery Billing?

Working with BigQuery as my data source is great, but connecting it to PBI can be a bit discouraging sometimes: you have to prepare “views” in BigQuery and make sure the data is set up EXACTLY how you want it; otherwise you’ll have to do some fancy data juggling in PBI.

Recently I came across a new problem where my PBI dashboard stopped working. Just like that, out of the blue, I received the following message:

Data source error: {"error":{"code":"ModelRefresh_ShortMessage_ProcessingError","pbi.error":{"code":"ModelRefresh_ShortMessage_ProcessingError","parameters":{},"details":[{"code":"Message","detail":{"type":1,"value":"ODBC: ERROR [HY000] [Microsoft][BigQuery] (100) Error interacting with REST API: Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"}}],"exceptionCulprit":1}}} Table: myTable.

This is weird – what do you mean by “Your project exceeded quota for free query bytes scanned”? My BigQuery project has a billing account and I pay every cent; since when is anything “free” with these things?

A short investigation led me to the PBI documentation on the BigQuery/Power BI connection. There I discovered that, by default, “Power BI uses the first project from the list returned for the user”, which means my dashboard was connected to the BigQuery public (and yes, free) datasets. Great, now how do I fix it? This is where things get weird, since all they give you is this cryptic line:

“Specifying the following option in the underlying M in the Source step, which can be customized by using Power Query Editor in Power BI Desktop:”

Source = GoogleBigQuery.Database([BillingProject="Include-Billing-Project-Id-Here"])

What is “underlying M”, and where exactly do I need to add this line?!?! After further investigation I came across this article, which finally shed some light on the matter. The solution is as follows: under “Transform Data”, open the “Advanced Editor” under the “View” tab.

In the Advanced Editor, change the (null) value, as the PBI documentation explains, to [BillingProject="Include-Billing-Project-Id-Here"], so the Source step becomes Source = GoogleBigQuery.Database([BillingProject="Include-Billing-Project-Id-Here"]).
WAIT, there’s more – you need to do this for each table you have (fun) and manually set the billing project on all the tables that are connected to BigQuery.

————————————————————————————

Author: Or Blan, Director of Data, Ynet.