Installing an AWS EMR cluster tutorial
Authors: Omid Vahdaty & Ariel Yosef 30.7.2020
What is Amazon EMR?
Short description of the Apache open source projects supported by EMR
Core Hadoop technologies
Hadoop – An open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Tez – An extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data.
Zookeeper – An open source Apache project that provides a centralized service for providing configuration information, naming, synchronization and group services over large clusters in distributed systems. The goal is to make these systems easier to manage with improved, more reliable propagation of changes.
Hive – A data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Spark Core – The main Spark API for writing code in languages such as Scala, Java, and Python (PySpark).
Spark SQL – An API for writing Hive-style SQL queries, which are automatically and seamlessly translated into Spark Core operations. It is the simplest way to use Spark for non-developers.
Transient Cluster – A cluster that boots up only for a specific automation task and terminates when the task is done (the opposite of a 24x7 cluster).
When starting to work with EMR, I recommend knowing, at least in general terms, what each product does.
Optional Web Client
Hue – A web interface for analyzing data via SQL, configured to work natively with Hive, Presto, and Spark SQL.
Zeppelin – An open source web-based notebook that lets you orchestrate data pipelines combining technologies such as Bash, Spark SQL, Hive, and Spark Core. It also offers collaboration, graph visualization of query results, and basic scheduling.
Peripheral Technologies
Ganglia – A scalable, distributed monitoring tool for high-performance computing systems, clusters and networks. The software is used to view either live or recorded statistics covering metrics such as CPU load averages or network utilization for many nodes.
TensorFlow – An open-source software library for high-performance numerical computation that is used mostly for deep learning and other computationally intensive machine learning tasks.
Pig – A high-level platform for creating programs that run on Apache Hadoop.
Sqoop – A Java-based, console-mode application designed for transferring bulk data between Apache Hadoop and non-Hadoop datastores, such as relational databases, NoSQL databases and data warehouses.
EMR Installation Wizard
To start the EMR installation, go to the EMR console here and click “Go to advanced options”.
EMR Wizard step 1 – Software and Steps
- Software Configuration – Choose the EMR release and the products you want to install.
- Multiple Master Nodes – On 24x7 clusters, mark this checkbox: if you lose the only master node, you lose the cluster and the data on HDFS, and your data pipeline may produce unexpected results. On transient clusters it is less important.
- AWS Glue Data Catalog – Choose whether to use the AWS Glue Data Catalog as the metastore for your tables. With Glue, the same table definitions can also be queried from other services, for example Athena.
- Edit software settings – To change the default installation configuration, read the instructions here.
- Steps – You can configure automation steps.
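If you script cluster creation instead of clicking through the wizard, the choices in this step map onto a few plain data structures. The sketch below is a minimal illustration, assuming the request shape used by boto3's `run_job_flow` call. The release label, step name, and S3 path are placeholder values, not settings from this tutorial. The code only builds the dictionaries and does not call AWS.

```python
import json

# Release and products to install (placeholder values, pick your own).
release_label = "emr-5.30.1"
applications = [{"Name": name} for name in ("Hadoop", "Hive", "Spark")]

# "Edit software settings" takes a list of configuration classifications.
# These two entries point Hive and Spark at the Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)
configurations = [
    {
        "Classification": classification,
        "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
    }
    for classification in ("hive-site", "spark-hive-site")
]

# "Steps": one automation step that submits a Spark job.
# The script location is hypothetical.
steps = [
    {
        "Name": "example-spark-step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/job.py"],
        },
    }
]

software_request = {
    "ReleaseLabel": release_label,
    "Applications": applications,
    "Configurations": configurations,
    "Steps": steps,
}
print(json.dumps(software_request, indent=2))
```

Keeping these settings as data makes it easy to review them in version control before any cluster is started.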
EMR Wizard step 2 – Hardware
1. Cluster Composition
This section lets you configure the cluster's instances:
a. Uniform instance groups – The recommended option, because identical instances make executor (CPU and RAM) utilization easier to configure.
With this option you select one instance configuration (you choose the number of instances in the Cluster Nodes and Instances section).
b. Instance fleets – You can mix several instance types and purchasing options within each node type.
1.1. Allocation Strategy – Relevant only to instance fleets. More details here.
2. Networking
Pick the VPC and subnet for the cluster.
3. Cluster Nodes and Instances
i. Master Node –
Choose On-Demand to ensure cluster stability. If you lose the master node, you lose the cluster and the data on HDFS, and your data pipeline may produce unexpected results.
ii. Core Node –
1. There is no need for more than 2 instances, as most of the data is expected to be on AWS S3. Unfortunately, you cannot remove the core nodes completely.
2. The replication factor is relevant to HDFS, not to S3.
iii. Task Node –
Use task nodes for auto scaling: they have no local HDFS storage and are used only to add more compute power dynamically.
iv. General resources recommendations –
1. Use EC2 R family (memory-optimized) instances. The reason is that about 50% of the cluster memory is spent on Java, YARN, and OS overhead.
2. Be consistent in the instance types across the cluster. Do not mix types, as this may hurt cluster utilization.
v. Cost & Performance tuning
1. This will be discussed at a later stage. For your reference:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
2. Basically, there are several ways to tune cost and performance, and each comes down to configuring the cluster.
3. Keep spot instances for later stages of cost optimization. The complexity of using EMR is already high!
4. Skip autoscaling for now. We will cover it at a later stage, for performance and cost optimization.
5. Master node HA is available. It is not needed for transient clusters, only for 24x7 clusters with local data on HDFS.
4. Cluster scaling
Choose the minimum and maximum number of core and task nodes. EMR will resize the cluster automatically within these limits, depending on demand.
5. EBS Root Volume
Choose the size of the root disk for the cluster instances.
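The hardware choices in this step can also be written down as data. The following is a rough sketch, again assuming the argument shapes of boto3's `run_job_flow` (`Instances`, `ManagedScalingPolicy`, `EbsRootVolumeSize`). The instance type, counts, subnet ID, and scaling limits are illustrative assumptions, not recommendations beyond what the text above says. Nothing here contacts AWS.

```python
# Uniform instance groups: one R-family type across the whole cluster,
# an on-demand master, two core nodes, and task nodes for extra compute.
INSTANCE_TYPE = "r5.xlarge"  # assumed type, chosen only for illustration


def instance_group(role, count, market="ON_DEMAND"):
    """Return one entry of the InstanceGroups list."""
    return {
        "Name": f"{role.lower()}-group",
        "InstanceRole": role,      # MASTER, CORE or TASK
        "Market": market,          # ON_DEMAND or SPOT
        "InstanceType": INSTANCE_TYPE,
        "InstanceCount": count,
    }


instances = {
    "InstanceGroups": [
        instance_group("MASTER", 1),
        instance_group("CORE", 2),
        instance_group("TASK", 2),
    ],
    "Ec2SubnetId": "subnet-00000000",     # placeholder subnet ID
    "KeepJobFlowAliveWhenNoSteps": True,  # False for a transient cluster
}

# Cluster scaling: minimum and maximum limits, counted in instances.
managed_scaling_policy = {
    "ComputeLimits": {
        "UnitType": "Instances",
        "MinimumCapacityUnits": 2,
        "MaximumCapacityUnits": 10,
        "MaximumCoreCapacityUnits": 2,  # growth beyond this goes to task nodes
    }
}

ebs_root_volume_size_gib = 20  # would be passed as EbsRootVolumeSize

for group in instances["InstanceGroups"]:
    print(group["InstanceRole"], group["InstanceCount"], group["Market"])
```

Capping the core capacity at 2 keeps the HDFS footprint small, so any scale-out happens on task nodes, which matches the node-type advice above.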
EMR Wizard step 3 – General Cluster Settings
EMR Wizard step 4 – Security
- EC2 key pair – Choose the key pair used to connect to the cluster.
- Permissions – Choose the roles for the cluster (EMR will create new ones if you do not specify any).
- Security configuration – Skip for now. It is used to set up encryption at rest and in transit.
- EC2 security groups – Choose the security groups for the master instance and for the slave (core and task) instances (EMR will create new ones if you do not specify any).
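For a scripted setup, the security choices above correspond to a handful of fields. This is a minimal sketch assuming boto3's `run_job_flow` field names. The key pair name and security group IDs are hypothetical placeholders, and the role names are the defaults that EMR creates. The small helper is my own illustration, not part of any AWS library.

```python
security_settings = {
    # Instance-level settings (these belong inside the "Instances" argument).
    "Instances": {
        "Ec2KeyName": "my-key-pair",                     # hypothetical key pair
        "EmrManagedMasterSecurityGroup": "sg-11111111",  # placeholder ID
        "EmrManagedSlaveSecurityGroup": "sg-22222222",   # placeholder ID
    },
    # Permissions: the default role names created by EMR.
    "ServiceRole": "EMR_DefaultRole",
    "JobFlowRole": "EMR_EC2_DefaultRole",
}


def missing_security_fields(settings):
    """Return the names of expected fields that are absent or empty."""
    expected_top = ("ServiceRole", "JobFlowRole")
    expected_instances = ("Ec2KeyName",)
    missing = [k for k in expected_top if not settings.get(k)]
    inst = settings.get("Instances", {})
    missing += [k for k in expected_instances if not inst.get(k)]
    return missing


print(missing_security_fields(security_settings))
```

A check like this is handy before launching, because a cluster created without a key pair cannot be reached over SSH later.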
EMR Wizard – Good to know
You can export your EMR cluster creation settings as an AWS CLI command by pressing the “AWS CLI EXPORT” button in the console.
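To give a feel for what such a command looks like, here is a small sketch that assembles a simplified `aws emr create-cluster` invocation as a string. The command the console actually exports will be longer and will reflect your own choices. The cluster name, release label, instance type, count, and key pair below are assumed placeholder values. The script only prints the command and does not run it.

```python
import shlex

# Arguments for a simplified create-cluster call (all values are placeholders).
args = [
    "aws", "emr", "create-cluster",
    "--name", "tutorial-cluster",
    "--release-label", "emr-5.30.1",
    "--applications", "Name=Hadoop", "Name=Hive", "Name=Spark",
    "--instance-type", "r5.xlarge",
    "--instance-count", "3",
    "--ec2-attributes", "KeyName=my-key-pair",
    "--use-default-roles",
]

# Quote each argument so the result can be pasted safely into a shell.
command = " ".join(shlex.quote(arg) for arg in args)
print(command)
```

Storing the exported command in a script or repository gives you a repeatable way to recreate the cluster, which is especially useful for transient clusters.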