GCP Cost Reduction in a nutshell

Author: Ilan Rosen

In this article we will cover key factors in reducing costs over BigQuery from writing queries properly and building data architecture to support one goal – reducing costs.

Queries in BQ Cost Money

Pricing in BQ is in a pay as you go model : $5 per 1TB scanned when executing your query

Rule of thumb – in most of the cases, if your query costs more than $1 you apparently doing something wrong.

Be in control of your Cost – This tool can help you dramatically with that :

BigQuery Mate – (free chrome extension)

Do’s and Don'ts

Try to avoid using “Select *”,
This way you avoid paying for data scan of columns that you don’t really need in your query result.
Using the “Limit” clause on a query will not affect the amount of data scanned.
Use the “Table Preview” option in order to see few lines of the table (instead of using limit) – This is Free
There is also a more extreme protection level under the Query settings – Advanced options section : “Maximum bytes billed”
Using this will limit the bytes billed for this query. If this query will have bytes billed beyond this limit, the query will be failed (without incurring a charge).
use the default table expiration time to remove the data when it’s no longer needed. – This will save Storage Costs.
Use streaming inserts only if your data must be immediately available to query from. In other words – Don’t use streaming inserts unless you have to ! It costs money.
Quotas – You can define a hard limit of data scan of a daily level or even a project level – but not on a user level. That’s a good practice to set a daily budget that will keep you in the safe zone in terms of cost.

Performance (& Cost) Key Factors

Partition your tables – the significantly big ones !
This will improve both query performance and cost.
There are several ways to partition tables – the most common and most of the time also most cost and performance effective is the time-partitioned tables method.
Denormalize when possible. – Bigquery by its nature is designed for one huge table. Implementing a normalized table relations will result with many joins and overhead in terms of performance and cost.
Use “Order By” only in the outermost query or within window clauses (analytic functions). Push complex operations to the end of the query.
Avoid self-joins. Use a window function instead.
Same thing applies on Cross-Joins

Cost reducing Data Architecture Guidelines

As mentioned at the top of the article – The key principle that guides us when designing a cost effective architecture is to reduce the amount of data scanned by the end user.
In most cases the end user might be the analyst or a business user using a BI platform.
We want to reduce the amount of data scanned both in terms of cost and lets not forget performance.
You don’t want your CEO to wait for a dashboard more than a few seconds until it’s presented with fresh data.
The way of reducing the data is by a simple layering method that starts with the raw data – ingestion layer and ends with a very small portion of it designed and grouped specifically for the BI tool of your choice – The Presentation layer
In between those layers you should do all the cleansing / transforming /joining /aggregating in order to support the top layer to be optimized for fast and cheap queries for your BI visualizations.
We strongly recommend decoupling your compute and storage.
A common solution to make sure of that is by using Airflow to orchestrate all of your data pipelines.
The data operations on each level can be executed by a tool of your choice. In order to simplify the process, a good practice is to start with implementing it solely with internal views.
This is how it looks like eventually :

——————————————————————————————————————————

I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn: