All you need to know about AWS EMR Presto

Author: Omid Vahdaty 2.7.2018

A list of good reading materials to help you getting started on Presto

JDBC, in memory, sometimes faster than AWS Athena.

https://aws.amazon.com/blogs/big-data/analyze-data-with-presto-and-airpal-on-amazon-emr/

Presto uses External table in s3

http://www.jiayul.me/tutorial/2016/07/19/querying-s3-data-using-hive-and-presto.html

Syntax limitations compared with hive

INSERT OVERWRITE Statements are NOT Supported.
Presto does not currently support INSERT OVERWRITE Statements. Please delete table before INSERT INTO. See the detail here.
Presto announced support cost-based JOIN optimizations meaning, JOINs are automatically reordered based on table size. Unless you are using the latest version, Please make sure that smaller tables are on the right hand size of JOIN, and they must fit in memory, Otherwise out of memory exceptions will cause the query to fail.

Best practices

http://docs.qubole.com/en/latest/user-guide/presto/best-practices.html

Partitions

https://stackoverflow.com/questions/20185271/is-presto-hive-partition-aware

If you are looking for Hive like Dynamic partitions? It’s Not supported in presto 🙁

Tuning Articles:

https://qubole.zendesk.com/hc/en-us/articles/210266303-How-To-Presto-Tuning

https://docs.treasuredata.com/articles/presto-performance-tuning

How to use Presto with AWS EMR:

Presto with Airpal– Airpal has many helpful features like highlighting syntax, export results to CSV for download etc. Airpal provides the ability to find tables, see metadata, browse sample rows, write and edit queries, then submit queries all in a web interface. Please note that running an extra Airpal server will lead to extra EC2 costs.
Presto with Hue– You can use Presto with hue (hue-4.0.1) on EMR (version 5.9.0 or later). Hue provides SQL editor for running your presto queries in a web interface similar to Airpal. (There may be a difference in features provided by hue as compared to Airpal). Hue is a better option than using Airpal as per my understanding, as you can install hue as a part of EMR installation.
Presto on EMR CLI– You can run presto using command line interface and monitor your queries using presto web UI. You can open “MASTER_NODE_IP:8889“(default) to monitor your cluster details. To enable web interfaces for EMR cluster, kindly refer (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-ui-console.html)
Use Athena instead of Presto on EMR– You can also use AWS Athena if you want to process data present in S3. Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon S3. Athena internally uses Presto as SQL query engine.
Use Presto when you want to reduce costs on your AWS Athena service.

Presto reading Hive partitions including dynamic partitioning

Presto has full support for Hive partitions including dynamic partitioning.

https://github.com/prestodb/presto/blob/1e49d9b125b6897d5014b64f38355605dfe9318d/presto-hive/src/main/java/com/facebook/presto/hive/HivePartitionKey.java

https://github.com/prestodb/presto/blob/886cdf90f4e5b331afcebdde91eae5cfe2a2834d/presto-hive/src/main/java/com/facebook/presto/hive/HiveWriterFactory.java

On EMR, when you install Presto on your cluster, EMR installs Hive as well. Presto uses the Hive metastore to map database tables to their underlying files.

The INSERT query into an external table on S3 is also supported by the service. To query data from Amazon S3, you will need to use the Hive connector that ships with the Presto installation.

Scheduling job in Presto

As per my understanding, you can use one of the following methods:

You can create a shell script and submit it as a step to the cluster. For example, you can create a script. For more details on submitting step to a cluster.

=====
#!/bin/bash
presto-cli –catalog hive –schema default –execute “select count(*) from TABLE_NAME;”
=====
Use a shell action to schedule an oozie workflow on EMR cluster(oozie needs to be installed as part of EMR cluster).
This blog explains how to use oozie workflows.
You can save your queries in hue and then run those saved queries in hue console.

Working example with Hive and Presto:

Create table via hive.
Select via Presto.

Presto-CLI documentations

presto-cli –catalog hive –execute “select * from t”

Good lecture on new features

Cost reduction on Presto:

Try my cost reduction article on AWS Athena, it may have usefull tips.

——————————————————————————————————————————
I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way.
If you have any comments, thoughts, questions, or you need someone to consult with,

feel free to contact me via LinkedIn:

architecture, AWS, Security

AWS Enterprise grade Networking & security| what are your options to protect your big data?

27th June 201813th November 2019 Omid

VPC features

Static private IP address
Elastic Network Interfaces : possible to bind multiple Elastic Network Interfaces to a single instance
Internal Elastic Load Balancers
Advanced Network Access Control
Setup a secure bastion host
DHCP options
Predictable internal IP ranges
Moving NICs and internal IPs between instances
VPN connectivity
Heightened security etc

VPC examples

Public facing VPC
Public and Private setup VPC
VPC with Public and Private Subnets and Hardware VPN Access
VPC with Private Subnets and Hardware VPN Access
Software based VPN access.

VPN options at AWS

Hardware based (virtual private Gateway)
- Public and private subnets
- Private subnet only
Software Based:
- AWS does not provide or maintain software VPN appliances; however, you can choose from a range of products provided by partners and open source communities.
- Requires an instance
Cloudhub (many site to site connections )
DirectConnect , private network connection (DC to DC).
Notice you could have redundant tunnels 🙂
You can use BGP for dynamic routing.

Options to upload securely FROM VPC to Outside

VPC endpoint (NAT, and ACL rules)
- You can define Routing. E.g from VPC to s3.
VPC peering
- to connect to VPC groups. Even on different accounts.
NAT gateways,
- to enable instances in a private subnet to connect to the Internet or other AWS services, but prevent the Internet from initiating a connection with those instance
Internet Gateways
- An Internet gateway is a horizontally scaled, redundant, and highly available VPC component that allows communication between instances in your VPC and the Internet. It therefore imposes no availability risks or bandwidth constraints on your network traffic.

Secured connection from WWW to VPC

Bastion host: server in the middle
- Simple, straight forward IP+port to easily pass a FW from a DC
- Non HA.
- Increased latency, 2 hops architecture… 🙁
Proxy server (socks)
- Will be good for any future usage such as Streaming.
- need to maintain a proxy cluster
VPN tunnel
- Need to maintain private LAN IP’s on both end points
- Slower in upload
Endpoints

VPC private subnet + Virtual Private Gateway

Virtual Private Gateway. Simple VPN connector from AWS side.
Customer Gateway: simple VPN connection from client side. (physical or software)
customer gateway must initiate the tunnels
If your VPN connection experiences a period of idle time (usually 10 seconds, depending on your configuration), the tunnel may go down
To prevent this, you can use a network monitoring tool to generate keepalive pings; for example, by using IP SLA
Start here: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_VPN.html
https://docs.openvpn.net/how-to-tutorialsguides/administration/extending-vpn-connectivity-to-amazon-aws-vpc-using-aws-vpc-vpn-gateway-service/

VPC peering

Apparently VPC peering is available only for connecting between VPCs in the same region (it can be cross accounts but has to be in the same region):

http://docs.aws.amazon.com/AmazonVPC/latest/PeeringGuide/Welcome.html

For connecting VPCs in different regions there are several architectural options you can read about it in the following blog:

https://aws.amazon.com/answers/networking/aws-multiple-region-multi-vpc-connectivity/

Great Read | How to Define VPC with private and public subnet

http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html

This blog , if you go through it step by step and implement it – you will know aws networking inside out!

VPC best practices

Get your Amazon VPC combination right: Select the right Amazon VPC architecture first. You need to decide the right Amazon VPC & VPN setup combination based on your current and future requirements
Choose your CIDR Blocks: Amazon VPC can have contain from 16 to 65536 IP addresses
Isolate according to your Use case:
- Create separate Amazon VPC for Development , Staging and Production environment
- Create one Amazon VPC with Separate Subnets/Security/isolated NW groups for Production
Securing Amazon VPC :
- Secure your Amazon VPC using Firewall virtual appliance,
- You can configure Intrusion Prevention or Intrusion Detection virtual appliances and secure the protocols and take preventive/corrective actions in your VPC
Configure VM encryption tools which encrypts your root and additional EBS volumes.
Configure Privileged Identity access management solutions
Enable the cloud trail to audit in the VPC environments ACL policy’s.
Apply anti virus for cleansing specific EC2 instances inside VPC.
Configure Site to Site VPN for securely transferring information between Amazon VPC in different regions or between Amazon VPC to your On premise Data center
Follow the Security Groups and NW ACL’s best practices listed below

Always span your Amazon VPC across multiple subnets in Multiple Availability zones inside a Region.
Good security practice is that to have only public subnet with route table which carries route to internet gateway. Apply this wherever applicable.
Keep your Data closer

Allow and Deny Network ACL
- First network ACL: Allow all the HTTP and HTTPS outbound traffic on public internet facing subnet.
- Second network ACL: Deny all the HTTP/HTTPS traffic. Allow all the traffic to Squid proxy serve

Restricting Network ACL : Block all the inbound and outbound ports. Only allow application request ports.

- Create route tables only when needed and use the Associations option to map subnets to the route table in your Amazon VPC
- Use Amazon VPC Peering

Security group – least privileges by design.

Technical Notes to pay attention on AWS VPC networking | Summery

Public subnet
- Must have access to to WWW
- Must have auto assign public IP
- Has a private IP as well 🙂
Route table per subnet (private/public)
Dont forget to associate subnet to routing table.
Security Group→ instance level→ for white list → cross AZ → statefull – both directions in one rule
Network ACL → subnet level → blacklist → not cross AZ→ stateless , one definition per one direction
NAT gateway must be defined in the public subnet — > needs access to WWW.
Dont forget to add AZ’s.
Dont forget to Add S3 Endpoint

Summery

there many good options to protect your data, simply knowing VPC features is not enough, you have to know your big data components as well.
designing a network is non trivial, consult someone before your start , very hard to change something once your are deep inside the process.
have you data in mind, security and access management when you design your network and security.

Need to learn more about aws big data (demystified)?

Contact me via linked in Omid Vahdaty
website: https://amazon-aws-big-data-demystified.ninja/
Join our meetup, FB group and youtube channel
- Join our meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
- Join our facebook group https://www.facebook.com/groups/amazon.aws.big.data.demystified/
- subscribe to our youtube channel https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

——————————————————————————————————————————

I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

https://www.linkedin.com/in/omid-vahdaty/

AWS, Security

AWS S3 Security Introduction and Access management

27th June 201813th November 2019 Omid

General Security Concepts | Good to know!

- protecting data while
  - in-transit (as it travels to and from Amazon S3) , 2 ways:
    - by using SSL
  - at rest (while it is stored on disks in Amazon S3 data centers) 2 ways:
    - Server Side encryption. (SSE)
    - client-side encryption.
  - In use:
    - Hashing…. (dictionary attack?)
    - Hashing with key
    - Any Encryption

S3 Encryption Types

Server Side
- encrypt your object before saving it on S3 disks
- decrypt it when you download the objects from S3.
Client Side
- Client-side encryption refers to encrypting data before sending it to Amazon S3
  - Use an AWS KMS-managed customer master key
  - Use a client-side master key (you can use AWS JAVA SDK)
  - Disadvantage: Less matching the AWS ecosystem. You need to manage keys.

Client side master key

Your client-side master keys and your unencrypted data are never sent to AWS
manage your own encryption keys
If you lose them, you won’t be able to decrypt your data.
When uploading an object
- You provide a client-side master key to the Amazon S3 encryption client
- for each object,encryption client locally generates a one-time-use symmetric key
- The client uploads the encrypted data key and its material description as part of the object metadata
- The material description helps the client later determine which client-side master key to use for decryption
- The client then uploads the encrypted data to Amazon S3 and also saves the encrypted data key as object metadata
When downloading an object
- The client first downloads the encrypted object from Amazon S3 along with the metadata
- Using the material description in the metadata, the client first determines which master key to use to decrypt the encrypted data key.

Server Side Encryption (SSE)

- Server-side encryption is about data encryption at rest
  - Amazon S3 encrypts your data at the object level as it writes it to disks
  - decrypts it for you when you access it.
  - As long as you authenticate your request and you have access permissions
  - You can’t apply different types of server-side encryption to the same object simultaneously.

3 methods for SSE

Server-Side Encryption with Customer-Provided Keys (SSE-C)
You manage the encryption keys and Amazon S3 manages the encryption, as it writes to disks, and decryption, when you access your objects
- S3-Managed Keys (SSE-S3)
- AWS KMS-Managed Keys (SSE-KMS)

S3-Managed Keys (SSE-S3)

Each object is encrypted with a unique key employing strong multi-factor encryption
it encrypts the key itself with a master key that it regularly rotates
256-bit Advanced Encryption Standard (AES-256), to encrypt
most compatible with ecosystem solutions, and provides strong encryption mechanism.

No limitations on velocity and free of charge

AWS KMS-Managed Keys (SSE-KMS)

Similar to SSE-S3
There are separate permissions for the use of an envelope key (that is, a key that protects your data’s encryption key)
provides you with an audit trail of when your key was used and by whom
you have the option to create and manage encryption keys yourself, or use a default key that is unique to you, the service you’re using, and the region you’re working in.
- Costs money: https://aws.amazon.com/kms/pricing/

Limited in velocity : http://docs.aws.amazon.com/kms/latest/developerguide/limits.html

Client Side KMS–Managed Customer Master Key (CMK)

you provide only an AWS KMS customer master key ID (CMK ID)
you don’t have to worry about providing any encryption keys to the Amazon S3 encryption client (for example, the AmazonS3EncryptionClient in the AWS SDK for Java). 2options
- A plain text version
- A cipher blob
unique data encryption key for each object it uploads.

Additional AWS s3 Safeguards

VPN (site to site)
Identity based policy (IAM)
IP
resource based policy , e.g. :Write Only permissions.

Resource Based Policy on s3

Adding Bucket level permission (not object level) policy example

{

     “Effect”: “Allow”,

     “Principal”: {

               “AWS”: “arn:aws:iam::21111111:root”

           },

     “Action”: [“s3:ListBucket“],

     “Resource”: [“arn:aws:s3:::bucketName“]

   },

Deny Headers of unencrypted objects policy example

{
               “Sid”: “DenyUnEncryptedObjectUploads“,
               “Effect”: “Deny“,
               “Principal”: “*”,
               “Action”: “s3:PutObject“,
               “Resource”: “arn:aws:s3:::<bucket_name>/*”,
               “Condition”: {
                       “Null”: {
                              “s3:x-amz-server-side-encryption“: true
                       }
              }
          }

Deny non AWS s3 SSE encryption policy example

{
               “Sid”: “DenyIncorrectEncryptionHeader”,
               “Effect”: “Deny“,
               “Principal”: “*”,
               “Action”: “s3:PutObject“,
               “Resource”: “arn:aws:s3:::<bucket_name>/*”,
               “Condition”: {
                       “StringNotEquals”: {
                              “s3:x-amz-server-side-encryption”: “AES256“
                        }
               }
          },

Deny non KMS objects policy example

{
               “Sid”: “DenyIncorrectEncryptionHeader“,
               “Effect”: “Deny“,
               “Principal”: “*”,
               “Action”: “s3:PutObject“,
               “Resource”: “arn:aws:s3:::<bucket_name>/*”,
               “Condition”: {
                   “StringNotEquals”: {
                         “s3:x-amz-server-side-encryption”: “aws:kms“
                            }
                  }
          },

Identity based policy via IAM on s3

allow s3 read only on all buckets

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": "*"
    }
  ]
}

allow s3 write only on spesific buckets

 {
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Action": "s3:*",
 "Resource": [
 "arn:aws:s3:::myBucekt1",
 "arn:aws:s3:::myBucekt1/*",
 "arn:aws:s3:::myBucekt2",
 "arn:aws:s3:::myBucekt2/*",
 ]
 }
 ]
}

Protecting s3 bucket from accidental delete (protect bucket delete, and policy delete)

{

    “Version”: “2012-10-17”,

    “Statement”: [

        {

            “Sid”: “Stmt1503850588772”,

            “Effect”: “Deny”,

            “Principal”: “*”,

            “Action”: “s3:DeleteBucket”,

            “Resource”: “arn:aws:s3:::walla-mail-bigfiles-eu-west-1-sse”,

            “Condition”: {

                “StringNotLike”: {

                    “aws:userId”: [

                        “xxxxxx:*”,

                        “12345”

                    ]

                }

            }

        },

        {

            “Effect”: “Deny”,

            “Principal”: “*”,

            “Action”: “s3:PutBucketPolicy”,

            “Resource”: “arn:aws:s3:::walla-mail-bigfiles-eu-west-1-sse”,

            “Condition”: {

                “StringNotLike”: {

                    “aws:userId”: [

                        “xxxxxxxx:*”,

                        “12345”

                    ]

                }

            }

        }

    ]

}

Policy to Deny put / delete of s3 policy from anyone but the admin

{
“Sid”: “Stmt1503999310000”,
“Effect”: “Deny”,
“NotPrincipal”: {
“AWS”: “arn:aws:iam::506754145427:user/omid”
},
“Action”: [
“s3:PutBucketPolicy”,
“s3:DeleteBucketPolicy”
],
“Resource”: “arn:aws:s3:::walla-anagog-eu-west-1”
}

Note : json validator to help debug syntax errors in jso

https://jsonformatter.curiousconcept.com/

A quick note about life cycle policy + no delete bucket policy

I would like to confirm that if you have a policy that denies any user/principle the ability to delete an object or it’s version, you can still have a lifecycle rule to expire these objects and the policy will not prevent the lifecycle rule to execute. Lifecycle policy works from backend to process objects and will not engage API calls. So that it is not affected by bucket policy. In your scenario, even if bucket policy is defined to deny all delete operations, lifecycle policy will delete objects after it expires in 120 days and not before the set lifecycle time of 120days

Conclusion

you have many ways to protect your data
we covered all the encryption options in AWS.
resource based policy and identity based policy are great, but you can also read about account segregation to have some more flexibility in terms of Data governance.

Need to learn more about aws big data (demystified)?

Contact me via linked in Omid Vahdaty
website: https://amazon-aws-big-data-demystified.ninja/
Join our meetup, FB group and youtube channel
- Join our meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
- Join our facebook group https://www.facebook.com/groups/amazon.aws.big.data.demystified/
- subscribe to our youtube channel https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

——————————————————————————————————————————

https://www.linkedin.com/in/omid-vahdaty/

AWS EMR

How to increase disk space on master node root partition in EMR

21st June 201813th November 2019 Omid

I have tested the following solution on my side using the r4.4xlarge instance type

Please find the steps-

Step-1) Increase the root volume of the EMR master node

To navigate to the EMR cluster’s master node’s root EBS volume the following steps can be taken:

– Open the EMR cluster in the EMR console
– Expand the hardware dropdown
– Click on the Instance Group ID that is labelled as the MASTER
– Click on the EC2 instance ID shown in this table, which will open the master node in the EC2 console
– In the “Description” tab in the information panel at the bottom of the console, scroll down and click on the linked device for the “Root device” entry
– The EBS volume will now open in the EC2 console, this is the root EBS Volume for the master node, and should look like the screenshot below
– Now you should be able to choose the “Modify Volume” action from the “Actions” dropdown, and change the volume size!

In this case, I adjusted the size of the EBS volume from 10GB to 50GB, simple as that! Alternatively, you could give the customer the exact CLI command to do this rather than trying to guide them through this process on the console:

aws ec2 modify-volume –region us-east-1 –volume-id vol-xxxxxxxxxxxxxxxxxxxx –size 50 –volume-type gp2

Step-2) Login to the Master node with SSH and check run the following command to check the newly attached size information under /dev/xvda.

lsblk
df -h

Step-3) However, it is important to note that at this point you will still not see the additional space on the file system (root volume /dev/xvda1). Run the following command to add the following space to root volume.

sudo /usr/bin/cloud-init -d single -n growpart
sudo /usr/bin/cloud-init -d single -n resizefs

Step-4) Now run below command to see “/” volume is increased to 50GB.

df -h
lsblk

Step-5) After the volume is increased, you can do a test create a sample test file and see the size of “/” volume increased its usage.

sudo fallocate -l 10G /test

(This will create a test file inside “/” with 10GB in size)

df -h

(Verify root volume mount point increased its usage)

Step-6) After verifying, just delete sample /test file.

sudo rm -rf /test

Note: Please back up any important file or configuration files before performing the operation.

Want to get more content about big data?

Contact me via linked in Omid Vahdaty
website: https://amazon-aws-big-data-demystified.ninja/
subscribe to our AWS Big Data Demystified youtube channel
subscribe to our Big Data Demystified youtube channel

——————————————————————————————————————————

https://www.linkedin.com/in/omid-vahdaty/

architecture, AWS, Data Engineering

Questions and answers on AWS EMR Jupiter

11th June 201813th November 2019 Omid

1. Can we connect from the jupiter notebook to: Hive, SparkSQL, Presto

EMR release 5.14.0 is the first to include JupyterHub. You can see all available applications within EMR Release 5.14.0 listed here [1].

2. Are there any interpreters for scala, pyspark

When you create a cluster with JupyterHub on EMR, the default Python 3 kernel for Jupyter, and the PySpark, SparkR, and Spark kernels for Sparkmagic are installed on the Docker container. You can use these kernels to run ad-hoc Spark code and interactive SQL queries using Python, R, and Scala. You can install additional kernels within the Docker container manually i.e. you can install additional kernels, additional libraries and packages and then import them for the appropriate shell [2].

3. Is there any option to connect from jupiter notebook via JDBC / secured JDBC connection?

The latest JDBC drivers can be found here [3]. You will also find an example here that uses SQL Workbench/J as a SQL client to connect to a Hive cluster in EMR.

You can download and install the necessary drivers from the links available here [4]. You can add JDBC connectors at cluster launch using the configuration classifications. An example of presto classifications and an example of configuring a cluster with the PostgreSQL JDBC can be seen here [5].

4. What would be steps to bootstrap cluster with jupyter notebooks

aws dedicated blog post states [6], aws provide a bootstrap action [7] to install Jupyter on the following path:

‘s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh’

5. Any way to save the jupiter notebook on a persistent storage like s3 automatically like in zeppelin?

By default, this is not available, however, you may be able to create your own script to achieve this.

EMR enables you to run a script at any time during step processing in your cluster. You specify a step that runs a script either when you create your cluster or you can add a step if your cluster is in the WAITING state [8].

6. Is there a way to add HTTPS to the Jupiter notebook GUI? if so how?

By default, JupyterHub on EMR uses a self-signed certificate for SSL encryption using HTTPS. Users are prompted to trust the self-signed certificate when they connect.

You can use a trusted certificate and keys of your own. Replace the default certificate file, server.crt, and key file server.key in the /etc/jupyter/conf/ directory on the master node with certificate and key files of your own. Use the c.JupyterHub.ssl_key and c.JupyterHub.ssl_cert properties in the jupyterhub_config.py file to specify your SSL materials [9].

You can read more about this in the Security Settings section of the JupyterHub documentation [10].

7. Is there a way to work with API & CMD of jupyter?

As is the case with all AWS services, you can create an EMR cluster with JupyterHub using the AWS Management Console, AWS Command Line Interface, or the EMR API [11].

8. Where is the config path of jupiter nootbook ?

/etc/jupyter/conf/

You can customize the configuration of JupyterHub on EMR and individual user notebooks by connecting to the cluster master node and editing configuration files [12].

As mentioned above, we provide a bootstrap action [7] to install Jupyter on the following path:

‘s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh’

9. Any common issues with jupyter?

here are a number of considerations you need to consider:

User notebooks and files are saved to the file system on the master node. This is ephemeral storage that does not persist through cluster termination. When a cluster terminates, this data is lost if not backed up. We recommend that you schedule regular backups using cron jobs or another means suitable for your application.

In addition, configuration changes made within the container may not persist if the container restarts. We recommend that you script or otherwise automate container configuration so that you can reproduce customizations more readily [13].

10. Orchestration options for jupiter notebook ? i.e how to schedule a notebook to run daily

JupyterHub and related components run inside a Docker container named jupyterhub that runs the Ubuntu operating system. There are several ways for you to administer components running inside the container [14].

Please note that customisations you perform within the container may not persist if the container restarts. We recommend that you script or otherwise automate container configuration so that you can reproduce customisations more readily.

11. User / Group / credentials management in jupiter notebook?

You can use one of two methods for users to authenticate to JupyterHub so that they can create notebooks and, optionally, administer JupyterHub.

The easiest method is to use JupyterHub’s pluggable authentication module (PAM). However, JupyterHub on EMR also supports the LDAP Authenticator Plugin for JupyterHub for obtaining user identities from an LDAP server, such as a Microsoft Active Directory server [15].

You can find instructions and examples for adding users with PAM here [16] and LDAP here [17].

12. notebook collaborations features?

TBD.

13. import/export options?

As stated above, you can install additional kernels within the Docker container manually i.e. you can install additional kernels, additional libraries and packages and then import them for the appropriate shell [2].

14. any other connections build in Jupyter?

As stated above, EMR release 5.14.0 is the first to include JupyterHub and will include all available EMR applications within EMR Release 5.14.0.

15. Working seamlessly with AWS GLUE in terms share meta store?

If you are asking for example with regards to configuring Hive to use the Glue Data Catalog as its metastore, you can indeed do this since EMR version 5.8.0 or later [18].

Finally, I have included the following for your reference:

1. JupyterHub Components

The following diagram depicts the components of JupyterHub on EMR with corresponding authentication methods for notebook users and the administrator [19].

2. Sagemaker

As you are more than likely aware. AWS have recently launched a ML notebook service called SageMaker which uses Jupyter notebooks only. As Sagemaker is integrated with other AWS services you can achieve greater control. For example, with Sagemaker you can utilize the IAM service to control user access. You can also connect to it from an EMR cluster, for example EMR version 5.11.0 [20] added the aws-sagemaker-spark-sdk component to Spark, which installs Amazon SageMaker Spark and associated dependencies for Spark integration with Amazon SageMaker. You can read more

You can use Amazon SageMaker Spark to construct Spark machine learning (ML) pipelines using Amazon SageMaker stages. If this is of interest to you, you can read more about it here [21] and on the SageMaker Spark Readme on GitHub [22].

Resources

Need to learn more about aws big data (demystified)?

Contact me via linked in Omid Vahdaty
website: https://amazon-aws-big-data-demystified.ninja/
Join our meetup, FB group and youtube channel
Join our meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
Join our facebook group https://www.facebook.com/groups/amazon.aws.big.data.demystified/
subscribe to our youtube channel https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

——————————————————————————————————————————

https://www.linkedin.com/in/omid-vahdaty/