Redshift Spectrum

11th May 202111th May 2021 arielyos

Redshift spectrum is a feature that allows you to query tables from s3 storage using Redshift cluster.

To start use Redshift spectrum we need first to create Redshift schema (usually refer as database)

First we will create IAM role to allow us to connect our Redshift to Athena data catalog

Step 1: create a schema(database)

create external schema spectrum

from data catalog

database ‘spectrumdb‘

region ‘us-east-1’

iam_role ‘arn:aws:iam::986725234098:role/Spectrum_test’

create external database if not exists;

Be aware that the schema name you made (spectrum) is how you query in the future the database

Now we can create table in the schema(database)

Step 2: create table

create external table spectrum.workers(

Name varchar,

Sex varchar,

Age integer,

Height integer,

Weight integer,

DOB date

)

row format delimited

fields terminated by ‘,’

stored as textfile

location ‘s3://folder/folder/’

table properties (‘skip.header.line.count’=’1’);

Here we create table name workers in schema(database) spectrum

In the parenthesis we define the columns of the table and their data type (can see all data type in the link below)

Fields terminated by what is the delimiter of the data. (example csv is comma)

Stored as textfile/parquet (you can read more about parquet here)

location specified where the data is located

table properties here there are many option to do like:define the compression, skip headers, row count to view all the properties go to the link below

https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html#r_CREATE_EXTERNAL_TABLE-parameters

Step 3: query the table

After we understand the concept we can query the table just like we query a normal table

Select * from spectrum.workers

A nice video that explain how to use Redshift spectrum from AWS in Hebrew

AWS athena

AWS Athena how to work with JSON

17th June 20206th August 2020 arielyos

AWS Athena how to work with JSON

Author: Ariel Yosef

How to query a nested json in AWS Athena

Json can contain nested values. those values can be represented as “key=value” or “array of values” or “array of key=value”

How to query “key=value”

name	age	cars
John	30	{car1=Ford, car2=BMW, car3=Fiat}
maria	21	{car1=Ford, car2=BMW, car3=Fiat}
sonia	25	{car1=Ford, car2=BMW, car3=Fiat}

We can see the column “cars” has a column with “keys and values”

In order to query that, we will simply query the keys and values column (cars) and add “.key”.

Let’s say we want to extract the car2 from column cars

SELECT name , age ,cars.car2 as car2, FROM our_table

note: “as” statement just gives the column name.

Note: the car2 is the key we want to get from the original table.

name	age	car2
John	30	BMW
maria	21	BMW
sonia	25	BMW

How to Query an array:

department	users
engineering	[Sharon, John, Bob, Sally]

We need to unnest the array column and to connect it with the original table.

SELECT department, names FROM dataset
CROSS JOIN UNNEST(users) as t(names)

department	users
engineering	Sharon
engineering	John
engineering	Bob
engineering	Sally

Here we extract the values from the array and attach them to the old table.

How to query Nested JSON with array of key values:

city	worker
NYC	[{name=Sally, department=engineering, age=31}, {name=John, department=finance, age=27}, {name=Amy, department=devops, age=41}]

To query that kind of data, we need first to unnest the array values and then select the column we want.

First we will add the unnested column by cross join and unnest:

select city,worker_unnested from dataset
cross join unnest(dataset.worker) as t(worker2)

city	worker_unnested
NYC	{name=Sally, department=engineering, age=31}
NYC	{name=John, department=finance, age=27}
NYC	{name=Amy, department=devops, age=41}

Now we can add to our query the specific column we want to query by using the “.key”:

select city,worker_unnested.name,worker_unnested.department,worker_unnested.age from dataset
cross join unnest(dataset.worker) as t(worker_unnested)

city	worker_unnested.name	worker_unnested.department	worker_unnested.age
NYC	Sally	engineering	31
NYC	John	finance	27
NYC	Amy	devops	41

Watch another relevant post: AWS Athena Cheat Sheet

——————————————————————————————————————————

I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn:

AWS athena

AWS Athena Cheat sheet

17th June 202012th October 2020 arielyos

AWS Athena Cheat sheet

Author: Ariel Yosef

In AWS Athena the application reads the data from S3 and all you need to do is define the schema and the location the data is stored in s3, i.e create tables.

AWS Athena also saves the results of the queries you make , So you will be asked to define the results bucket before you start working with AWS Athena.

How to create a table over CSV in AWS Athena (read data from s3 csv)

CREATE EXTERNAL TABLE <table_name>(
  `col1` string, 
  `col2` int, 
  `col3` date (yyyy-mm-dd format), 
  `col4` timestamp (yyyy-mm-dd  hh:mm:ss format),
  `col5` boolean)
  ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ','
  LOCATION
  's3://bucket/folder'

NOTE: ROW FORMAT defines how Athena is going to read the file.

You can also define a SERDE to tell Athena how to parse the data.

You can watch all supported SERDE in Amazon Athena here

How to create a table over json files – AWS Athena

CREATE EXTERNAL TABLE <table_name>(
  `col1` string, 
  `col2` int, 
  `col3` date (yyyy-mm-dd format), 
  `col4` timestamp (yyyy-mm-dd  hh:mm:ss format),
  `col5` boolean)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
  LOCATION
  's3://bucket/folder'

To delete table

DROP TABLE <table_name>

How to create json with nested values on AWS Athena

CREATE EXTERNAL TABLE <table_name>(
  `col1` struct<car1:string,car2:string,car3:string>, 
  `col2` int, 
  `col3` date (yyyy-mm-dd format), 
  `col4` timestamp (yyyy-mm-dd  hh:mm:ss format),
  `col5` boolean)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
  LOCATION
  's3://bucket/folder'

There are few more things Athena can do. It can transform your files into another hadoop storage format such as parquet, avro, and orc. We will use it for our queries, to reduce costs and improve performance.
Before we get into that, we need to understand CTAS.

CTAS – How to create a table from another table

CREATE TABLE <table_name> as 
Select * from <database.table_name>

Now after we understand that we can use CTAS and apply the transformation and compression queries, we can ran the following example:

CREATE TABLE <table_name>
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://bucket/folder/')
AS SELECT * 
FROM <database.table_name>;

Notice:

Available compressions are: 4mc, snappy, lzo, lz4, bzip2 and gzip
Available storage formants: parquet, avro and orc
PPT to compare the different file formats.
CTAS examples of AWS Athena documentation.

	AVRO	PRAQUET	ORC
Schema Evlution	Most efficient, as the schema is stores as JSON with the file score: 10/10	Scheman evolution is expensive as it needs to be read and merged across all the parquet files score:3/10	Schema evolution here is limited to adding new columns and a few cases of column type-widening score:5/10
Compression	less efficient score:5/10	Best with Snappy score:7/10	Best with ZLIB score:5/10
Splitability	Partially score:7/10	Partially score:7/10	Fully score:10/10
ROW/Column Oriented	Row	Row	Column
Optimied Processing Engines	Kafka	Spark	Hive, Presto
Read/Write	Write	Read	Read

Insert into

insert into  select * from <database.table_name>

Example for partitioning table done by Athena:

CREATE TABLE 
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://bucket/folder/'
      partitioned_by = ARRAY['col1']['col2']
)
AS SELECT * 
FROM <database.table_name>;

Watch our Post AWS Athena how to work with JSON

——————————————————————————————————————————

I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn:

Qlik installation on AWS cloud windows-server

24th May 202020th April 2021 arielyos

Qlik installation on AWS/Google cloud windows-server

Author: Ariel Yosef 20.4.2021

The download for qlik installation

https://da3hntz84uekx.cloudfront.net/QlikSense/13.72/0/_MSI/Qlik_Sense_setup.exe

The download qlik update

https://da3hntz84uekx.cloudfront.net/QlikSense/13.72/1/_MSI/Qlik_Sense_Update.exe

Before you start installing qlik create a new user account

server manager >> tools >> computer management >> local users and groups >> users >> right click >> “new user” >> give it name and pass >> right click on the new user >> properties >> go to tab member of >> click add >> Administrators >> click ok >> mark User >> click remove DONE!

Small notes about the guide when installing with AWS/GCP

1.Make sure the hostname is correct

On step 10 you are required to place your hostname.

Qlik does not recognize the hostname properly

Make sure you place the local machine hostname

To do so, go to cmd write “hostname” and the result past to the installation

2. the installation should be on Administrator user
On step 15 you need the full path to the user you want the services will run on.
Copy the hostname add “/” and add the username you just created.
Should look like that host_name/new_user

Beside that, the installation is mostly clicking next

A full guide for the Qlik installation

——————————————————————————————————————————

I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn: