Installing an AWS EMR cluster tutorial

Authors: Omid Vahdaty & Ariel Yosef 30.7.2020

What is Amazon EMR?

Amazon EMR is a managed cluster platform that simplifies running Hadoop frameworks. 

EMR contains a long list of Apache open source products.

To see the full list of supported products and their variations, click here.

You can find the AWS documentation for EMR products here.

Short descriptions of the Apache open source projects supported by EMR

Core Hadoop technologies

Hadoop – An open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Tez – An extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data.

Zookeeper – An open source Apache project that provides a centralized service for providing configuration information, naming, synchronization and group services over large clusters in distributed systems. The goal is to make these systems easier to manage with improved, more reliable propagation of changes.

Hive – A data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Spark Core – The main Spark API for writing code in languages such as Scala, Java, and Python (PySpark).

Spark SQL – An API for writing Hive SQL queries, which are automatically and seamlessly translated into Spark Core operations. It is the simplest way to use Spark for non-developers.

Transient Cluster – A cluster that boots up only for a specific automation task and terminates when the task is done (the opposite of a 24x7 cluster).

When starting to work with EMR, I recommend knowing, at least in general, what each product does. For more information about using EMR in your architecture, see the link here.
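The difference between a transient cluster and a 24x7 cluster can be sketched as a single setting. The snippet below is a minimal illustration in Python, assuming the cluster is created through the EMR `RunJobFlow` API (for example via boto3), where `KeepJobFlowAliveWhenNoSteps` and `TerminationProtected` are fields of the `Instances` argument. The helper name `instances_lifecycle` is hypothetical, and turning on termination protection for 24x7 clusters is my own choice here, not something the wizard requires.

```python
def instances_lifecycle(transient: bool) -> dict:
    """Build the lifecycle part of the `Instances` argument (hypothetical helper)."""
    return {
        # False: the cluster terminates once all of its steps have finished (transient).
        # True: the cluster keeps running until it is terminated explicitly (24x7).
        "KeepJobFlowAliveWhenNoSteps": not transient,
        # Assumption: protect long-running clusters from accidental termination.
        "TerminationProtected": not transient,
    }


transient_cluster = instances_lifecycle(transient=True)
long_running_cluster = instances_lifecycle(transient=False)
print(transient_cluster)
print(long_running_cluster)
```

The resulting dictionaries are only fragments; they would be merged with the hardware settings chosen later in the wizard.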

Optional Web Client

Hue – A web interface for analyzing data via SQL, configured to work natively with Hive, Presto, and SparkSQL.

Zeppelin – An open source, web-based notebook that enables data pipeline orchestration using a combination of technologies such as Bash, SparkSQL, Hive, and Spark Core. It also offers features such as collaboration, graph visualization of query results, and basic scheduling.

 

Peripheral Technologies

Ganglia – A scalable, distributed monitoring tool for high-performance computing systems, clusters and networks. The software is used to view either live or recorded statistics covering metrics such as CPU load averages or network utilization for many nodes.

TensorFlow – An open-source software library for high-performance numerical computation that is used mostly for deep learning and other computationally intensive machine learning tasks.

Pig – A high-level platform for creating scripted programs that run on Apache Hadoop.

Sqoop – A Java-based, console-mode application designed for transferring bulk data between Apache Hadoop and non-Hadoop datastores, such as relational databases, NoSQL databases and data warehouses.

EMR Installation Wizard

To start the EMR installation, go to the EMR console here and click “Go to advanced options”.

EMR Wizard step 1 – Software and Steps

    1. Software Configuration – Choose the EMR release and the products you want to install.
    2. Multiple Master Nodes – On 24x7 clusters, mark this checkbox: if you lose a single master node, you lose the cluster and the data on HDFS, and your data pipeline may produce unexpected results. On transient clusters it is less important.
    3. AWS Glue Data Catalog – Choose whether to use the Glue Data Catalog as the metastore for your tables. The same table definitions can then be queried from other services, for example from Athena.
    4. Edit software settings – To change the installation configuration, read the instructions here.
    5. Steps – You can configure automation steps that run on the cluster.
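The choices in this wizard step map onto a few fields of the EMR `RunJobFlow` request. The sketch below builds those fields as plain Python dictionaries without calling AWS. The release label, the script path, and the helper names `software_settings` and `spark_step` are assumptions for illustration. The `hive.metastore.client.factory.class` property is the setting AWS documents for pointing Hive and Spark at the Glue Data Catalog.

```python
RELEASE_LABEL = "emr-5.30.1"  # assumption: use any release label available to you

GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def software_settings(app_names, use_glue_catalog=False):
    """Build ReleaseLabel, Applications and Configurations (hypothetical helper)."""
    configurations = []
    if use_glue_catalog:
        # One classification for Hive and one for Spark SQL.
        for classification in ("hive-site", "spark-hive-site"):
            configurations.append(
                {
                    "Classification": classification,
                    "Properties": {
                        "hive.metastore.client.factory.class": GLUE_FACTORY
                    },
                }
            )
    return {
        "ReleaseLabel": RELEASE_LABEL,
        "Applications": [{"Name": name} for name in app_names],
        "Configurations": configurations,
    }


def spark_step(name, script_s3_path):
    """Build one automation step that submits a Spark script (hypothetical helper)."""
    return {
        "Name": name,
        # Assumption: on a transient cluster, a failed step ends the cluster.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path],
        },
    }


settings = software_settings(["Hadoop", "Hive", "Spark"], use_glue_catalog=True)
steps = [spark_step("daily-job", "s3://my-bucket/jobs/daily_job.py")]  # placeholder path
print(settings)
print(steps)
```

Keeping the settings as data makes it easy to compare them with what the wizard shows before anything is launched.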

EMR Wizard step 2 – Hardware

1. Cluster Composition

 This section lets you choose how the cluster instances are configured:
a. Uniform instance groups – The recommended option, because identical instances make executor (CPU and RAM) utilization easier to configure. With this option you select one instance configuration per node type (you choose the number of instances in a later section).
b. Instance fleets – You can configure a mix of instance types instead of a single one per group.
   1.1. Allocation Strategy – Relates to instance fleets. More details here.

 

2. Networking

 Pick the VPC and subnet for the cluster.

3. Cluster Nodes and Instances

           i. Master Node
                Choose On-Demand to ensure cluster stability. If you lose the master node, you lose the cluster and the data on HDFS,
                and your data pipeline may produce unexpected results.
           ii. Core Node
             1. There is no need for more than 2 instances, as most of the data is expected to be on AWS S3. Unfortunately, you
                  cannot remove the core nodes completely.
             2. The replication factor is relevant to HDFS, not to S3.
           iii. Task Node
                 Use task nodes for auto scaling: they have no local HDFS storage and are used only to add compute power
                 dynamically.
           iv. General resource recommendations
                 1. Use EC2 R family (memory optimized) instances. The reason is that about 50% of the cluster memory is spent on
                     Java, YARN, and OS overhead.
                 2. Be consistent in the instance types across the cluster. Do not mix types, as this may hurt cluster utilization.
            v. Cost & Performance tuning
                        1. Will be discussed at a later stage. For your reference:
                            https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
                        2. Basically, there are several ways to do so, and they all come down to configuring the cluster.
                        3. Keep spot instances for later stages of cost optimization; the complexity of using EMR is already high.
                        4. Skip auto scaling for now. We will cover it at a later stage for performance and cost optimization.
                        5. Master node HA is available. It is not needed for a transient cluster, only for 24x7 clusters with local
                             data on HDFS.
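The node recommendations above can be sketched as the `Instances` part of a `RunJobFlow` request using uniform instance groups. This is a minimal sketch under stated assumptions: the instance type `r5.xlarge`, the subnet ID, and the helper name `hardware_settings` are placeholders I chose, and everything is On-Demand because spot instances are left for a later stage.

```python
def hardware_settings(subnet_id, instance_type="r5.xlarge", core_count=2, task_count=0):
    """Build uniform instance groups for the cluster (hypothetical helper)."""

    def group(name, role, count):
        # The same instance type everywhere keeps executor sizing consistent.
        return {
            "Name": name,
            "InstanceRole": role,
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    groups = [
        group("Master", "MASTER", 1),
        group("Core", "CORE", core_count),
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data and only add compute power.
        groups.append(group("Task", "TASK", task_count))

    return {"Ec2SubnetId": subnet_id, "InstanceGroups": groups}


# Placeholder subnet ID, not a real resource.
hardware = hardware_settings("subnet-0123456789abcdef0", task_count=2)
for g in hardware["InstanceGroups"]:
    print(g["InstanceRole"], g["InstanceType"], g["InstanceCount"])
```

Switching the task group to `"Market": "SPOT"` later is a one-line change, which is one reason to keep the group definition in a single helper.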

4. Cluster scaling

Choose the minimum and maximum number of core and task nodes. AWS EMR will resize the cluster automatically within these limits, depending on demand.
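As a rough sketch, the minimum and maximum values can be expressed as an EMR managed scaling policy. I am assuming here that the wizard option corresponds to EMR managed scaling, whose `ComputeLimits` count core and task nodes only. The helper name `managed_scaling_policy` and the numbers are illustrative.

```python
def managed_scaling_policy(min_units, max_units, max_core_units):
    """Build a ManagedScalingPolicy dictionary (hypothetical helper)."""
    if not 1 <= min_units <= max_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    if max_core_units > max_units:
        raise ValueError("max_core_units cannot exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",  # limits are counted in instances
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            # Cap core nodes so that growth beyond this happens on task nodes.
            "MaximumCoreCapacityUnits": max_core_units,
        }
    }


policy = managed_scaling_policy(min_units=2, max_units=10, max_core_units=2)
print(policy)
```

Capping the core nodes at 2 matches the earlier recommendation to keep HDFS small and scale with task nodes.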

5. EBS Root Volume

Choose the size of the root disk for the cluster instances.

 

EMR Wizard step 3 – General Cluster Settings

  1. General Options – Nothing to add.
  2. Tags – Create tags. Read about them here.
  3. Additional Options – Read here.
  4. Bootstrap Actions – You can configure scripts that run on every instance while it is starting.

For example, a bootstrap script can install an extra package:
sudo yum install -y any-program
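A bootstrap action is registered as a script stored in S3. The sketch below shows a small script body and the matching `BootstrapActions` entry of a `RunJobFlow` request. The bucket path, the package name `htop`, and the helper name `bootstrap_action` are placeholders chosen for illustration.

```python
# Script body to upload to S3 (for example as install_tools.sh).
BOOTSTRAP_SCRIPT = """#!/bin/bash
set -e
sudo yum install -y htop
"""


def bootstrap_action(name, script_s3_path, args=()):
    """Build one BootstrapActions entry (hypothetical helper)."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,  # the script must already exist in S3
            "Args": list(args),
        },
    }


actions = [
    bootstrap_action("install-tools", "s3://my-bucket/bootstrap/install_tools.sh")
]
print(actions)
```

Because the script runs on every node, it is worth keeping it short and making it fail fast (`set -e`) so that a broken install is visible at startup.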

EMR Wizard step 4 – Security

  1. EC2 key pair – Choose the key pair used to connect to the cluster.
  2. Permissions – Choose the roles for the cluster (EMR will create new ones if you do not specify any).
  3. Security configuration – Skip for now; it is used to set up encryption at rest and in transit.
  4. EC2 security groups – Choose the security groups for the master instance and the slave (core and task) instances (EMR will create new ones if you do not specify any).
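These security choices correspond to a handful of `RunJobFlow` fields. The sketch below assumes the default role names that EMR creates (`EMR_DefaultRole` and `EMR_EC2_DefaultRole`); the key pair name, the security group IDs, and the helper name `security_settings` are placeholders.

```python
def security_settings(key_name, master_sg=None, slave_sg=None):
    """Build the security-related request fields (hypothetical helper)."""
    instances = {"Ec2KeyName": key_name}
    # When no group is given, the field is omitted and EMR creates a managed group.
    if master_sg:
        instances["EmrManagedMasterSecurityGroup"] = master_sg
    if slave_sg:
        instances["EmrManagedSlaveSecurityGroup"] = slave_sg
    return {
        "ServiceRole": "EMR_DefaultRole",      # role used by the EMR service
        "JobFlowRole": "EMR_EC2_DefaultRole",  # instance profile for the EC2 nodes
        "Instances": instances,
    }


# Placeholder names, not real resources.
security = security_settings("my-key-pair", master_sg="sg-0123456789abcdef0")
print(security)
```

The `Instances` fragment here would be merged with the hardware settings from step 2 before the request is sent.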

EMR Wizard – Good to know

You can export your EMR cluster creation process to an AWS CLI command by pressing the “AWS CLI EXPORT” button on the console.
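To give a feel for what such an exported command looks like, the sketch below assembles a simplified `aws emr create-cluster` command as a string. It is not the console's actual output, which usually contains more options. The cluster name, key pair, and helper name `create_cluster_command` are placeholders, and only flags I know the AWS CLI supports are used.

```python
import shlex


def create_cluster_command(name, release_label, applications,
                           instance_type, instance_count, key_name):
    """Assemble a simplified create-cluster command line (hypothetical helper)."""
    args = [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", *[f"Name={app}" for app in applications],
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--ec2-attributes", f"KeyName={key_name}",
        "--use-default-roles",
    ]
    # shlex.join (Python 3.8+) quotes arguments that contain spaces.
    return shlex.join(args)


command = create_cluster_command(
    name="demo-cluster",
    release_label="emr-5.30.1",  # assumption: any available release label
    applications=["Hadoop", "Hive", "Spark"],
    instance_type="r5.xlarge",
    instance_count=3,
    key_name="my-key-pair",
)
print(command)
```

Keeping the exported command in version control is a simple way to recreate the same cluster later or to turn a manual setup into an automated one.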

——————————————————————————————————————————
I put a lot of thought into these blogs so that I can share the information in a clear and useful way.
If you have any comments, thoughts, or questions, or you need someone to consult with, feel free to contact me via LinkedIn.
