Blog

architecture, Big Query, cost reduction, GCP Big Data Demystified, superQuery

80% Cost Reduction in Google Cloud BigQuery

The second in a series of lectures, GCP Big Data Demystified. In this lecture I will share how I saved 80% of the monthly BigQuery bill of investing.com. Lecture slides:

Videos from the meetup:

Link to previous lecture GCP Big Data Demystified #1

——————————————————————————————————————————

I put a lot of thought into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

AWS Redshift

Redshift Spectrum

Redshift Spectrum is a feature that allows you to query tables stored in S3 using your Redshift cluster.

To start using Redshift Spectrum, we first need to create an external schema in Redshift (usually referred to as a database).

First we will create an IAM role that allows our Redshift cluster to access the Athena data catalog.

Step 1: create a schema (database)

create external schema spectrum
from data catalog
database 'spectrumdb'
region 'us-east-1'
iam_role 'arn:aws:iam::986725234098:role/Spectrum_test'
create external database if not exists;

Be aware that the schema name you chose (spectrum) is the name you will use to query the database in the future.

Now we can create a table in the schema (database).

Step 2: create table

create external table spectrum.workers(
Name varchar,
Sex varchar,
Age integer,
Height integer,
Weight integer,
DOB date
)
row format delimited
fields terminated by ','
stored as textfile
location 's3://folder/folder/'
table properties ('skip.header.line.count'='1');

Here we create a table named workers in the schema (database) spectrum.

In the parentheses we define the columns of the table and their data types (you can see all the data types in the link below).

fields terminated by defines the delimiter of the data (for example, a comma for CSV).

stored as defines the file format, e.g. textfile or parquet (you can read more about Parquet here).

location specifies where the data is located in S3.

table properties offers many options, such as defining the compression, skipping headers, or setting the row count; to view all the properties, go to the link below.

https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html#r_CREATE_EXTERNAL_TABLE-parameters
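For example, here is a minimal sketch of the same table backed by Parquet files instead of CSV (the Parquet folder path and the row count used in table properties are hypothetical):

create external table spectrum.workers_parquet(
Name varchar,
Sex varchar,
Age integer,
Height integer,
Weight integer,
DOB date
)
stored as parquet
location 's3://folder/parquet-folder/'
table properties ('numRows'='10000');

Note that a Parquet table does not need the row format / fields terminated by clauses, since the file format is self-describing.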

Step 3: query the table

Once we understand the concept, we can query the external table just like a normal table:

select * from spectrum.workers;
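Because Spectrum tables behave like regular tables in queries, you can also join them with local Redshift tables. A small sketch, assuming a hypothetical local table local_departments that maps worker names to departments:

select d.department, avg(w.age) as avg_age
from spectrum.workers w
join local_departments d on d.worker_name = w.name
group by d.department;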

A nice video from AWS that explains how to use Redshift Spectrum (in Hebrew).

airflow

Airflow Performance and Best Practices Demystified

Airflow Performance and Best Practices Demystified

Lecturer: Omid Vahdaty 4.5.2021

The lecture covers Airflow architecture, performance tuning for an unstable cluster, cost implications, the various configuration options available to resolve odd Airflow issues, and how to use CloudWatch to monitor Airflow performance.

Video

Slides


——————————————————————————————————————————
I put a lot of thought into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn – Omid Vahdaty:

https://www.linkedin.com/in/omid-vahdaty/

AWS Redshift

Data Engineering Course-Redshift Demystified

Data Engineering Course | Redshift Demystified

Author: Omid Vahdaty 2.5.2021


——————————————————————————————————————————
I put a lot of thought into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn – Omid Vahdaty:

https://www.linkedin.com/in/omid-vahdaty/

Big Query

bigquery bq load error- “cannot determine table described”

BigQuery bq load error- "cannot determine table described"

Author: Omid Vahdaty 21.4.2021

If you are getting this error, it is an authentication and authorization issue; simply log out and log in again. E.g., if you are using Cloud Shell, close it and reopen it.

Commands like the ones below will describe your project and dataset, but still won't send the command to the BigQuery API:

bq show mydataset.my_test
bq show mydataset 

You can also try adding an explicit project ID as follows:

bq show projectid:mydataset.my_test


——————————————————————————————————————————
I put a lot of thought into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn – Omid Vahdaty:

https://www.linkedin.com/in/omid-vahdaty/

airflow

Airflow Exception: “raise InvalidToken cryptography.fernet.InvalidToken”

Airflow Exception: "raise InvalidToken cryptography.fernet.InvalidToken"

Author: Omid Vahdaty 5.4.2021

If you get this invalid-token error, it is because Airflow uses Fernet. Airflow encrypts all the passwords for its connections in the backend database.

Somehow the Airflow backend is still using a previous Fernet key, while the connection you created was encrypted with a newly generated key.

My recommendation is to do the following first:

This will delete all the existing records in your backend DB. NOTICE: this will delete all the Airflow connections and variables you entered manually:

airflow resetdb
airflow initdb

This will initialize the backend DB like a fresh install. Airflow may complain about missing variables.
Start Airflow and enter the missing variables one by one.

Then start the Airflow webserver and scheduler.


——————————————————————————————————————————
I put a lot of thought into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn – Omid Vahdaty:

https://www.linkedin.com/in/omid-vahdaty/