Big Query, cost reduction

BigQuery Cheat Sheet

How much data is in a specific dataset in GCP BigQuery?

SELECT SUM(size_bytes)/(1024*1024*1024) AS size_GB
FROM DATA_LAKE_TRANSFORMATION.__TABLES__

How much data is in a BigQuery project?

SELECT
  dataset_id,
  count(*) AS tables,
  SUM(row_count) AS total_rows,
  SUM(size_bytes) AS size_bytes
FROM ( 
  SELECT * FROM `dataset1.__TABLES__` UNION ALL
  SELECT * FROM `dataset2.__TABLES__`
)
GROUP BY 1
ORDER BY size_bytes DESC

Parsing Google Analytics custom dimensions

SELECT
  (SELECT DISTINCT value FROM h.customDimensions WHERE index=4) AS cd_4
FROM
  `MyProject.MyDataset.ga_sessions_*` t, t.hits h

Note that you may want to keep an eye on the costs here. See the relevant article on different approaches to parsing Google Analytics data.
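One way to check the cost up front is a dry run. A minimal sketch with the bq CLI (assuming the same project and dataset names as above), which reports how many bytes the query would process without actually running it:

bq query --use_legacy_sql=false --dry_run '
SELECT
  (SELECT DISTINCT value FROM h.customDimensions WHERE index=4) AS cd_4
FROM
  `MyProject.MyDataset.ga_sessions_*` t, t.hits h'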

Big Query

BigQuery CLI Cheat Sheet

This blog will be updated from time to time. This is a "cut the bullshit, give me what I need" blog.

Load CSV to BigQuery, skip header, allow quoted newlines, truncate before loading example
bq --location US load --source_format CSV --replace=true --skip_leading_rows 1 --allow_quoted_newlines --quote "" MyDataset.myTable gs://myBucket/*

Load to BigQuery with schema definition example

bq --location=us load --replace=true --skip_leading_rows 1 --source_format=CSV myProject:DATA_LAKE.tablr gs://bucket/* col3:STRING,col2:STRING,col1:STRING
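A variant of the above (a sketch, assuming a local file named schema.json): instead of an inline schema, bq load also accepts the path to a JSON schema file as the last argument:

# schema.json contains: [{"name":"col3","type":"STRING"},{"name":"col2","type":"STRING"},{"name":"col1","type":"STRING"}]
bq --location=us load --replace=true --skip_leading_rows 1 --source_format=CSV myProject:DATA_LAKE.tablr gs://bucket/* ./schema.json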

Load JSON to BigQuery example

bq --location=us load --replace=true --autodetect --source_format=NEWLINE_DELIMITED_JSON myProject:DATA_LAKE.tablr gs://bucket/* 				


——————————————————————————————————————————

I put a lot of thought into these blogs, so I can share the information in a clear and useful way. If you have any comments, thoughts, questions, or need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

Big Query

How to debug “Error while reading data, error message: Failed to parse JSON: Unexpected end of string; Unexpected end of string; Expected key” ?

I read this post, and there is a suggested solution; this should explain the error:

BigQuery only accepts new-line delimited JSON, which means one complete JSON object per line

Also read this article about newline-delimited JSON: What is JSON new line format?

Regular JSON:

[
{"key":"value1"},
{"key":"value2"}
]

Newline-delimited JSON format:

{"key":"value1"}
{"key":"value2"}

If the above post is not enough to get the data loaded into BigQuery, and using jq alone is not enough, I suggest some debugging options:

  1. Ensure the JSON is valid; search on Google for "json online viewer" (see the jq sketch after the examples below).
  2. You may have an invalid JSON key. Remove keys until the load to BigQuery succeeds.
  3. Use regex: remove all spaces/newlines/special characters until you get something like "{key:value}" without redundant spaces, for example:
sed 's/ //g' example.json > example_output.json

or

cat example.json | tr '\n' ' ' > example_output.json
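To quickly check validity locally before loading, a small sketch using jq (it reads both a single JSON document and newline-delimited JSON, and exits with an error on the first line it cannot parse):

jq empty example_output.json && echo "JSON is valid"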

After you apply the above transformations, try to load the JSON to BQ again; it will either succeed or surface the next problem in the JSON.

Common JSON parsing errors for BigQuery; in my case the causes were:

  1. In BigQuery, nested arrays (an array of arrays) are not supported as a data type, even though they are valid JSON (see the sketch after this list).
  2. The JSON {} contained a newline inside a line/value; simply replace it with a space.
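To illustrate the first point, the input below is valid JSON but BigQuery rejects the array of arrays. One possible workaround (a sketch with jq; the field names a and v are hypothetical) is to wrap each inner array in an object, so BigQuery sees an array of records instead:

# input line:  {"a": [[1,2],[3,4]]}   <- valid JSON, but fails to load
jq -c '.a |= map({v: .})' example.json > example_output.json
# output line: {"a":[{"v":[1,2]},{"v":[3,4]}]}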


——————————————————————————————————————————

I put a lot of thought into these blogs, so I can share the information in a clear and useful way. If you have any comments, thoughts, questions, or need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

architecture, Big Query

When to use BigQuery? When not to use BigQuery?

When to use BigQuery?

  1. 80% of all data use cases would be a good fit for BigQuery.
  2. If you are not sure, start with BigQuery.
  3. If you are using Google Analytics or Firebase.
  4. When you have a direct connector that loads the data into BigQuery for you.
  5. When your production is in GCP.

When not to use BigQuery [notice the below are the other 20% or less: extreme cases which are not common]:

  1. When you are processing more than one 5GB compressed file per hour (GCP BigQuery's per-file load limit). Try opening a 6GB compressed file and splitting it into smaller files… your one-hour window of time will be gone as the file uncompresses into something huge of ±20 GB (see the sketch after this list).
  2. When you are hitting an error in BigQuery that says your query is consuming too many resources, on a weekly basis, and there is nothing you can do about it.
  3. When you need to self-join billions of records on a regular basis.
  4. When you are using complex window functions. You are likely to get an error that too many resources are being used by your query, and there is nothing you can do except rewrite it.
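For the first point, a rough sketch of one way around the per-file limit (assuming a gzipped CSV named big.csv.gz and a chunk size that fits your data): decompress, split, recompress, and load the smaller parts:

gunzip -c big.csv.gz | split -l 1000000 - part_
gzip part_*
# copy the parts to GCS (e.g. gsutil cp part_*.gz gs://bucket/) and load them with bq load as in the CLI cheat sheet above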

So what are the alternatives to BigQuery?

  1. Hadoop ecosystem: Dataproc / Cloudera
  2. SQream DB, a database designed to handle huge files and massive joins with surprising speed, simplicity, and cost effectiveness.

Big Query

BigQuery Error: UPDATE or DELETE statement over table would affect rows in the streaming buffer, which is not supported

In case you get this error, you are most likely trying to delete data which falls within the window of time that the streaming insert is using.

Streaming insert is a service that allows ingestion of events into BigQuery in real time.

Sometimes it takes up to 90 minutes to ingest the data, so try to delete data from further back in time, for example:

DELETE FROM `MyProject.MyDataset.MyTable` 
WHERE id LIKE '%BigDataDemystified%'
AND time < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 180 MINUTE)