Redshift Spectrum | Big Data Demystified

Redshift spectrum is a feature that allows you to query tables from s3 storage using Redshift cluster.

To start use Redshift spectrum we need first to create Redshift schema (usually refer as database)

First we will create IAM role to allow us to connect our Redshift to Athena data catalog

Step 1: create a schema(database)

create external schema spectrum

from data catalog

database ‘spectrumdb‘

region ‘us-east-1’

iam_role ‘arn:aws:iam::986725234098:role/Spectrum_test’

create external database if not exists;

Be aware that the schema name you made (spectrum) is how you query in the future the database

Now we can create table in the schema(database)

Step 2: create table

create external table spectrum.workers(

Name varchar,

Sex varchar,

Age integer,

Height integer,

Weight integer,

DOB date

)

row format delimited

fields terminated by ‘,’

stored as textfile

location ‘s3://folder/folder/’

table properties (‘skip.header.line.count’=’1’);

Here we create table name workers in schema(database) spectrum

In the parenthesis we define the columns of the table and their data type (can see all data type in the link below)

Fields terminated by what is the delimiter of the data. (example csv is comma)

Stored as textfile/parquet (you can read more about parquet here)

location specified where the data is located

table properties here there are many option to do like:define the compression, skip headers, row count to view all the properties go to the link below

https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html#r_CREATE_EXTERNAL_TABLE-parameters

Step 3: query the table

After we understand the concept we can query the table just like we query a normal table

Select * from spectrum.workers

A nice video that explain how to use Redshift spectrum from AWS in Hebrew

Leave a ReplyCancel reply

Discover more from Big Data Demystified