AWS EMR

AWS EMR Demystified

AWS EMR Demystified - Parts 1-4

Lecturer: Omid Vahdaty, November 2022

I will teach AWS EMR / Hadoop inside out and answer any questions you may have.

1. Introduction to AWS EMR (the managed Hadoop service).
2. Introduction to AWS networking, S3, the different types of Hadoop storage, and the AWS jargon needed to follow this session.
3. Introduction to AWS Glue and Athena, and how it all connects.
4. Running your first PySpark job.
5. Challenges with transforming data using AWS EMR.
6. Pros and cons of the AWS EMR architecture in your data lake.

Part 1

Video

Slides

Part 2

Video

Slides

Part 3

Video

Slides

Part 4

Video


——————————————————————————————————————————
I put a lot of thought into these blogs so that I could share the information in a clear and useful way.
If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me via LinkedIn – Omid Vahdaty:

AWS EMR, dataproc

Jupyter Demystified

Author: Omid Vahdaty, 13.10.2021

Following my confusion around the managed Jupyter offerings in AWS and GCP, I did some simple research to clarify the differences between Jupyter Notebook, JupyterHub, and JupyterLab. In addition, I added some bootstrap script instructions for managing admin users. Based on this research, I had the pleasure of correcting the official AWS documentation. Let me know if you find this useful!



apache, architecture, AWS EMR, EMR, Hive, presto, Spark, zeppelin

AWS EMR and Hadoop Demystified – Comprehensive training program suggestion for Data Engineers in 200KM/h

This blog assumes prior knowledge. It is meant to help the reader design a training program for newcomers to AWS EMR and Hadoop. Naturally, my big data perspective is applied here. This blog is FAR FROM PERFECT.

Learn the following in rising order of importance (in my humble opinion).

Quick introduction to big data in 200 KM/h

Beyond the basics…

Hive vs Presto Demystified

Hive Demystified

EMR Zeppelin & Zeppelin

EMR Yarn Demystified

EMR Spark Demystified

EMR Livy demystified

EMR Spark and Zeppelin demystified

RStudio and SparkR demystified

EMR Spark application logging

EMR Monitoring Demystified | EMR Ganglia

EMR Spark tuning demystified

EMR Oozie demystified (not common, use Airflow instead)

——————————————————————————————————————————

I put a lot of thought into these blogs so that I could share the information in a clear and useful way. If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

architecture, AWS athena, AWS Aurora, AWS Big Data Demystified, AWS EMR, AWS Redshift, AWS S3, EMR, SageMaker, Security

AWS Demystified – Comprehensive training program suggestion for newbies in 200KM/h

This blog assumes prior knowledge. It is meant to help the reader design a training program for newcomers to AWS. Naturally, my big data perspective is applied here.

Learn the following in rising order of importance (in my humble opinion).

General lightweight introduction to AWS & AWS Big Data:

  1. Create an AWS user on an AWS account

Start with this to get the obvious out of the way.

  1. AWS S3 (the AWS counterpart of Google Cloud Storage), AWS S3 CLI

Be sure to understand the basics: upload, download, copy, move, and sync (rsync-style), from both the GUI and the AWS CLI. Only then move on to advanced features such as lifecycle policies, storage tiers, encryption, etc.
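As a rough illustration of the CLI basics above, here is a minimal Python sketch that only builds the matching `aws s3` command lines (`cp`, `mv`, `sync`) and prints them; nothing is executed. The bucket name, prefixes, and the `s3_command` helper are hypothetical and exist only for this example.

```python
import shlex

# Hypothetical bucket, for illustration only.
BUCKET = "s3://my-example-bucket"


def s3_command(action, src, dst, recursive=False):
    """Build an `aws s3` command as an argument list (nothing is executed)."""
    if action not in {"cp", "mv", "sync"}:
        raise ValueError(f"unsupported action: {action}")
    args = ["aws", "s3", action, src, dst]
    # `sync` already walks the whole prefix; `cp` and `mv` need --recursive for that.
    if recursive and action != "sync":
        args.append("--recursive")
    return args


commands = {
    "upload": s3_command("cp", "./data.csv", f"{BUCKET}/raw/data.csv"),
    "download": s3_command("cp", f"{BUCKET}/raw/data.csv", "./data.csv"),
    "copy": s3_command("cp", f"{BUCKET}/raw/", f"{BUCKET}/backup/", recursive=True),
    "move": s3_command("mv", f"{BUCKET}/raw/data.csv", f"{BUCKET}/archive/data.csv"),
    "sync": s3_command("sync", "./local-dir", f"{BUCKET}/raw/"),
}

for name, args in commands.items():
    print(f"{name:9s} {shlex.join(args)}")
```

Running the printed commands by hand, first against a scratch bucket, is probably the quickest way to get a feel for the difference between `cp --recursive` and `sync`.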

  1. AWS EC2 (Elastic Compute Cloud): how to create a machine and how to connect via SSH

Be sure to use a T instance to play around, and choose Amazon Linux or Ubuntu. Notice the different OS user names required to SSH into each machine.
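To make the user-name point concrete, the sketch below builds the `ssh` command for each OS: the default login is `ec2-user` on Amazon Linux and `ubuntu` on Ubuntu. It can also add a local port forward (an SSH tunnel) with `-L`. The host address, key file name, port, and the `ssh_command` helper are assumptions made for this example, and the commands are only printed, not run.

```python
import shlex

# Default login user per AMI family.
DEFAULT_USER = {"amazon-linux": "ec2-user", "ubuntu": "ubuntu"}


def ssh_command(os_name, host, key_path, tunnel_port=None):
    """Build an ssh command as an argument list (nothing is executed)."""
    user = DEFAULT_USER[os_name]
    args = ["ssh", "-i", key_path]
    if tunnel_port is not None:
        # -N: do not run a remote command.
        # -L: forward a local port to the same port on the remote machine.
        args += ["-N", "-L", f"{tunnel_port}:localhost:{tunnel_port}"]
    args.append(f"{user}@{host}")
    return args


# Hypothetical host (documentation address range) and key file.
plain = ssh_command("ubuntu", "203.0.113.10", "my-key.pem")
tunnel = ssh_command("amazon-linux", "203.0.113.10", "my-key.pem", tunnel_port=8080)

print(shlex.join(plain))
print(shlex.join(tunnel))
```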

Be sure to understand the following:

  1. SSH
  2. SSH tunnel
  1. AWS security groups: how to add an IP and port

Without this section you won't be able to access machines over web or SSH.
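"Adding an IP and port" to a security group comes down to one inbound rule. The sketch below builds such a rule in the shape that the EC2 `AuthorizeSecurityGroupIngress` API takes in its `IpPermissions` list, and validates the CIDR locally. The address, the description, and the `ingress_rule` helper are hypothetical, and no AWS call is made.

```python
import ipaddress


def ingress_rule(cidr, port, description=None):
    """Build one inbound TCP rule allowing `cidr` to reach `port`."""
    network = ipaddress.ip_network(cidr)  # raises ValueError on a malformed CIDR
    if network.version != 4:
        raise ValueError("this sketch only handles IPv4 ranges")
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    ip_range = {"CidrIp": str(network)}
    if description:
        ip_range["Description"] = description
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [ip_range],
    }


# Allow SSH (port 22) from a single hypothetical address.
rule = ingress_rule("203.0.113.10/32", 22, "my laptop")
print(rule)

# With boto3 installed and credentials configured, the rule could be applied
# roughly like this (the group ID is a placeholder):
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-0123456789abcdef0", IpPermissions=[rule])
```

Restricting the rule to a `/32` rather than `0.0.0.0/0` keeps the machine reachable only from your own address.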

  1. AWS VPC (Virtual Private Cloud): only if you feel comfortable around network architecture; otherwise skip this topic.
  1. AWS RDS (MySQL, Aurora)

Create a MySQL instance, log in, create a table, insert data from S3, and export data to S3.

Understand the difference between AWS RDS Aurora and AWS RDS MySQL.

  1. AWS Redshift: learn how to create a cluster, connect to the cluster, and run a query; understand the basic architecture.
  2. AWS Athena: how to create an external table, how to insert data, partitions, MSCK REPAIR TABLE, Parquet, Avro, querying nested data.
  1. AWS Glue

Be sure to understand the two roles of AWS Glue: shared metastore and automatic ETL features.
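For the Athena item above, the following sketch generates the three statements you would typically practise with: a partitioned external table stored as Parquet, `MSCK REPAIR TABLE` to discover partitions, and an explicit `ALTER TABLE ... ADD PARTITION`. The table name, columns, S3 locations, and helper functions are all hypothetical. Note that `MSCK REPAIR TABLE` only finds partitions laid out in Hive style (for example `.../dt=2022-11-01/`).

```python
def create_external_table_ddl(table, columns, partition_columns, location):
    """Return a CREATE EXTERNAL TABLE statement for a partitioned Parquet table."""
    column_sql = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    partition_sql = ", ".join(f"{name} {dtype}" for name, dtype in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {column_sql}\n)\n"
        f"PARTITIONED BY ({partition_sql})\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


def repair_table_sql(table):
    """Return the statement that scans the table location for Hive-style partitions."""
    return f"MSCK REPAIR TABLE {table}"


def add_partition_sql(table, partition, location):
    """Return the statement that registers a single partition explicitly."""
    spec = ", ".join(f"{key} = '{value}'" for key, value in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


# Hypothetical table and bucket.
ddl = create_external_table_ddl(
    table="events",
    columns=[("user_id", "string"), ("amount", "double")],
    partition_columns=[("dt", "string")],
    location="s3://my-example-bucket/events/",
)
repair = repair_table_sql("events")
add_one = add_partition_sql(
    "events", {"dt": "2022-11-01"}, "s3://my-example-bucket/events/dt=2022-11-01/"
)

for statement in (ddl, repair, add_one):
    print(statement + ";\n")
```

A table created this way is registered in the Glue Data Catalog, which is the "shared metastore" role of Glue mentioned above.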

  1. AWS Kinesis: Streams, Firehose, Analytics. You need to understand messaging and streaming. I also covered off-topic subjects here, such as Flume and Kafka.
  1. AWS EMR – (need to talk about this with Omid).

This is HARD CORE material: highly advanced material for enterprise-grade lectures.

Be sure to understand when to use Hive, when to use SparkSQL, and when to use Spark Core. Also understand the difference between the node types: master node, core node, and task node.
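One way to keep the three node types apart is to look at how they appear in a cluster definition. The sketch below builds an instance group list in the shape the EMR `RunJobFlow` API takes under `Instances.InstanceGroups`. The master node coordinates the cluster, core nodes run tasks and store HDFS data, and task nodes only run tasks. The instance type, the counts, the choice of Spot for task nodes, and the `instance_groups` helper are assumptions made for this example, and no cluster is created.

```python
def instance_groups(core_count, task_count=0, instance_type="m5.xlarge"):
    """Build an EMR instance group list with master, core, and optional task nodes."""
    if core_count < 1:
        raise ValueError("at least one core node is assumed in this sketch")
    groups = [
        {
            # Master: coordinates the cluster (resource manager, name node).
            "Name": "Master",
            "InstanceRole": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "Market": "ON_DEMAND",
        },
        {
            # Core: runs tasks AND stores HDFS data, so losing one can lose data.
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
            "Market": "ON_DEMAND",
        },
    ]
    if task_count > 0:
        groups.append(
            {
                # Task: compute only, no HDFS, so Spot capacity is a common choice.
                "Name": "Task",
                "InstanceRole": "TASK",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
                "Market": "SPOT",
            }
        )
    return groups


groups = instance_groups(core_count=2, task_count=4)
for group in groups:
    print(group["InstanceRole"], group["InstanceCount"], group["Market"])
```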

  1. AWS IAM: groups, users, role-based architecture, encryption at rest and in motion, resource-based policy, identity-based policy. By now you should have a basic understanding of what IAM is about: identity, authentication, and authorization. However, there are many fine-grained security issues you need to understand. At the very minimum, be sure to understand the difference between a role and a user, how to write a custom policy, and what a resource-based policy is versus an identity-based policy.
  1. AWS Cloud best practices, lightweight
  1. AWS ELB, ALB, Auto Scaling. [Advanced]
  1. AWS Route53, TBD
  2. AWS security
  1. AWS Lambda
  2. AWS SageMaker
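For the IAM item in the list above, the difference between an identity-based and a resource-based policy is easiest to see side by side. In the sketch below, the identity-based policy has no `Principal`, because it is attached to a user, group, or role. The resource-based policy (here in the form of an S3 bucket policy) names the `Principal` it grants access to. The bucket name, account ID, and role name are made up for illustration.

```python
import json

# Hypothetical ARNs, for illustration only.
BUCKET_ARN = "arn:aws:s3:::my-example-bucket"
ROLE_ARN = "arn:aws:iam::123456789012:role/example-emr-role"

# Identity-based: attached to a user, group, or role, so no Principal element.
identity_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"{BUCKET_ARN}/*",
        }
    ],
}

# Resource-based: attached to the bucket, so it must say WHO gets access.
resource_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": ROLE_ARN},
            "Action": "s3:GetObject",
            "Resource": f"{BUCKET_ARN}/*",
        }
    ],
}

print(json.dumps(identity_policy, indent=2))
print(json.dumps(resource_policy, indent=2))
```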

TBD subjects:

API Gateway

AWS Cognito.



AWS athena, AWS Aurora, AWS Big Data Demystified, AWS EMR, AWS Lambda, AWS Redshift, Hive, meetup, Uncategorised

200KM/h overview on Big Data in AWS – Part 2

In this lecture we are going to cover the AWS Big Data PaaS technologies used to model and visualize data, using a suggested architecture and some basic big data architecture rules of thumb.

For more meetups:
https://www.meetup.com/Big-Data-Demystified/
