1. Can we connect from the Jupyter notebook to Hive, SparkSQL, and Presto?
EMR release 5.14.0 is the first to include JupyterHub. You can see all applications available within EMR Release 5.14.0, which include Hive, Spark, and Presto, listed here [1].
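For Spark SQL specifically, the Sparkmagic kernels described in the next answer let you query tables registered in the cluster's metastore straight from a notebook cell. A minimal sketch, assuming the PySpark (Sparkmagic) kernel, which exposes a pre-configured SparkSession named spark; the table name is hypothetical:

```python
# Run inside a notebook cell using the PySpark (Sparkmagic) kernel, which
# provides a ready-made SparkSession called `spark`. On EMR, Spark SQL is
# typically wired to the same Hive metastore used by Hive and Presto.
spark.sql("SHOW DATABASES").show()

# Query a Hive table through Spark SQL (replace with a table that exists
# in your own metastore).
spark.sql("SELECT * FROM default.my_table LIMIT 10").show()
```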
2. Are there any interpreters for Scala and PySpark?
When you create a cluster with JupyterHub on EMR, the default Python 3 kernel for Jupyter, along with the PySpark, SparkR, and Spark kernels for Sparkmagic, is installed on the Docker container. You can use these kernels to run ad-hoc Spark code and interactive SQL queries using Python, R, and Scala. You can also manually install additional kernels, libraries, and packages within the Docker container and then import them in the appropriate shell [2].
3. Is there any option to connect from the Jupyter notebook via a JDBC / secured JDBC connection?
The latest JDBC drivers can be found here [3]. You will also find an example here that uses SQL Workbench/J as a SQL client to connect to a Hive cluster in EMR.
You can download and install the necessary drivers from the links available here [4]. You can add JDBC connectors at cluster launch using configuration classifications; an example of Presto classifications and an example of configuring a cluster with the PostgreSQL JDBC connector can be seen here [5].
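JDBC itself is a Java interface, so from a Python notebook a common alternative is a client library that talks to HiveServer2 and Presto directly. The sketch below assumes you have installed the third-party PyHive package in the notebook's kernel (it is not part of the default EMR setup) and that the services are listening on their usual EMR ports; the hostname is a placeholder:

```python
# Minimal sketch, assuming PyHive is installed in the kernel environment and
# that HiveServer2 (port 10000) and Presto (port 8889 on EMR) are running on
# the master node. Replace "emr-master-dns" with your master node's DNS name.
from pyhive import hive, presto

# Hive via HiveServer2
hive_conn = hive.connect(host="emr-master-dns", port=10000, username="hadoop")
hive_cur = hive_conn.cursor()
hive_cur.execute("SHOW TABLES")
print(hive_cur.fetchall())

# Presto coordinator
presto_conn = presto.connect(host="emr-master-dns", port=8889)
presto_cur = presto_conn.cursor()
presto_cur.execute("SELECT 1")
print(presto_cur.fetchall())
```

For a secured connection, HiveServer2 can additionally be configured with SSL or LDAP authentication; check the driver documentation in [3] and [4] for the options your cluster supports.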
4. What are the steps to bootstrap a cluster with Jupyter notebooks?
A dedicated AWS blog post [6] explains that AWS provides a bootstrap action [7] to install Jupyter, available at the following path:
s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh
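As a hedged sketch, the same bootstrap action can be attached when launching a cluster through the EMR API, here via boto3. The cluster name, key pair, and instance types are placeholders, and the optional arguments accepted by the bootstrap script are documented in the blog post [6]:

```python
# Minimal sketch using boto3 (assumes AWS credentials are configured).
# Cluster name, key pair, and instance types are placeholders; the optional
# arguments accepted by the bootstrap script are documented in [6].
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="jupyter-on-emr",
    ReleaseLabel="emr-5.14.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",           # placeholder
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[{
        "Name": "Install Jupyter",
        "ScriptBootstrapAction": {
            "Path": "s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```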
5. Is there any way to save Jupyter notebooks to persistent storage such as S3 automatically, as in Zeppelin?
By default, this is not available; however, you can create your own script to achieve this.
EMR enables you to run a script at any time during step processing in your cluster. You specify a step that runs a script either when you create your cluster or you can add a step if your cluster is in the WAITING state [8].
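As a hedged sketch of such a script: it walks the directory where JupyterHub on EMR keeps user notebooks (assumed here to be /var/lib/jupyter/home on the master node; verify the location on your release) and copies every file to S3 with boto3. You could run it from cron or as a cluster step [8]; the bucket name is a placeholder.

```python
# Minimal backup sketch, assuming boto3 is available on the master node and
# that user notebooks live under /var/lib/jupyter/home (verify this path on
# your cluster). Schedule with cron or run as an EMR step [8].
import os
import boto3

NOTEBOOK_DIR = "/var/lib/jupyter/home"   # assumed location on the master node
BUCKET = "my-notebook-backups"           # placeholder bucket

s3 = boto3.client("s3")

for root, _, files in os.walk(NOTEBOOK_DIR):
    for name in files:
        local_path = os.path.join(root, name)
        # Mirror the directory layout under a "jupyter-backup/" prefix.
        key = "jupyter-backup/" + os.path.relpath(local_path, NOTEBOOK_DIR)
        s3.upload_file(local_path, BUCKET, key)
        print(f"uploaded {local_path} -> s3://{BUCKET}/{key}")
```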
6. Is there a way to add HTTPS to the Jupyter notebook GUI? If so, how?
By default, JupyterHub on EMR uses a self-signed certificate for SSL encryption using HTTPS. Users are prompted to trust the self-signed certificate when they connect.
You can use a trusted certificate and keys of your own. Replace the default certificate file, server.crt, and key file server.key in the /etc/jupyter/conf/ directory on the master node with certificate and key files of your own. Use the c.JupyterHub.ssl_key and c.JupyterHub.ssl_cert properties in the jupyterhub_config.py file to specify your SSL materials [9].
You can read more about this in the Security Settings section of the JupyterHub documentation [10].
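For reference, a minimal excerpt of what that jupyterhub_config.py change might look like, using the default file names mentioned above (adjust the paths if you install your certificate and key elsewhere):

```python
# Excerpt for /etc/jupyter/conf/jupyterhub_config.py on the master node.
# Points JupyterHub at your own certificate and key instead of the
# self-signed defaults; paths below assume you replaced the default files.
c.JupyterHub.ssl_cert = "/etc/jupyter/conf/server.crt"
c.JupyterHub.ssl_key = "/etc/jupyter/conf/server.key"
```

After editing the file, you will typically need to restart JupyterHub inside the container for the change to take effect.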
7. Is there a way to work with an API and command line for Jupyter?
As is the case with all AWS services, you can create an EMR cluster with JupyterHub using the AWS Management Console, AWS Command Line Interface, or the EMR API [11].
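As a hedged sketch of the API route, a boto3 call can launch a cluster with the built-in JupyterHub application (EMR 5.14.0 and later); the equivalent console and AWS CLI flows are described in [11]. The names, key pair, and instance settings below are placeholders:

```python
# Minimal sketch via the EMR API (boto3); names, key pair, and instance
# settings are placeholders. The AWS CLI and console offer the same options.
import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="jupyterhub-cluster",
    ReleaseLabel="emr-5.14.0",
    Applications=[{"Name": "JupyterHub"}, {"Name": "Spark"}],   # built-in app
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",        # placeholder
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```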
8. Where is the configuration path of the Jupyter notebook?
/etc/jupyter/conf/
You can customize the configuration of JupyterHub on EMR and individual user notebooks by connecting to the cluster master node and editing configuration files [12].
As mentioned above, AWS provides a bootstrap action [7] to install Jupyter, available at the following path:
s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh
9. Any common issues with Jupyter?
There are a number of considerations to keep in mind:
User notebooks and files are saved to the file system on the master node. This is ephemeral storage that does not persist through cluster termination. When a cluster terminates, this data is lost if not backed up. We recommend that you schedule regular backups using cron jobs or another means suitable for your application.
In addition, configuration changes made within the container may not persist if the container restarts. We recommend that you script or otherwise automate container configuration so that you can reproduce customizations more readily [13].
10. What are the orchestration options for Jupyter notebooks, i.e. how do I schedule a notebook to run daily?
JupyterHub and related components run inside a Docker container named jupyterhub that runs the Ubuntu operating system. There are several ways for you to administer components running inside the container [14].
Please note that customizations you perform within the container may not persist if the container restarts. We recommend that you script or otherwise automate container configuration so that you can reproduce customizations more readily.
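One hedged approach is to drive the notebook with the third-party papermill package (not installed by default on EMR) and trigger the script once a day from cron, an EMR step [8], or an external scheduler; the notebook paths and parameters below are placeholders:

```python
# Minimal scheduling sketch, assuming the third-party papermill package has
# been installed alongside the notebook kernels (it is not part of the
# default EMR JupyterHub image). Run this script daily from cron, an EMR
# step [8], or any external scheduler.
import datetime

import papermill as pm

today = datetime.date.today().isoformat()

# Executes daily_report.ipynb top to bottom and writes the executed copy
# (with outputs) to a dated file; paths and parameters are placeholders.
pm.execute_notebook(
    "/var/lib/jupyter/home/hadoop/daily_report.ipynb",
    f"/var/lib/jupyter/home/hadoop/daily_report-{today}.ipynb",
    parameters={"run_date": today},
)
```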
11. User, group, and credentials management in Jupyter notebooks?
You can use one of two methods for users to authenticate to JupyterHub so that they can create notebooks and, optionally, administer JupyterHub.
The easiest method is to use JupyterHub’s pluggable authentication module (PAM). However, JupyterHub on EMR also supports the LDAP Authenticator Plugin for JupyterHub for obtaining user identities from an LDAP server, such as a Microsoft Active Directory server [15].
You can find instructions and examples for adding users with PAM here [16] and LDAP here [17].
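For illustration, a hedged excerpt of what the LDAP side might look like in /etc/jupyter/conf/jupyterhub_config.py; the server address and DN template are placeholders for your own directory, and the exact option names should be confirmed against the steps in [17]:

```python
# Hypothetical jupyterhub_config.py excerpt for the LDAP Authenticator plugin.
c.JupyterHub.authenticator_class = "ldapauthenticator.LDAPAuthenticator"
c.LDAPAuthenticator.server_address = "ldap://ad.example.com"      # placeholder
c.LDAPAuthenticator.use_ssl = False
c.LDAPAuthenticator.bind_dn_template = [
    "uid={username},ou=people,dc=example,dc=com",                 # placeholder
]
# Optionally give specific notebook users JupyterHub admin rights.
c.Authenticator.admin_users = {"admin"}
```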
12. Notebook collaboration features?
TBD.
13. Import/export options?
As stated above, you can manually install additional kernels, libraries, and packages within the Docker container and then import them in the appropriate shell [2].
14. Are there any other connections built into Jupyter?
As stated above, EMR release 5.14.0 is the first to include JupyterHub, and a JupyterHub cluster can include any of the other applications available within EMR Release 5.14.0 [1].
15. Does it work seamlessly with AWS Glue in terms of a shared metastore?
If you are asking, for example, about configuring Hive to use the Glue Data Catalog as its metastore, you can indeed do this with EMR version 5.8.0 or later [18].
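For reference, a sketch of the configuration classification documented in [18], expressed here as the Configurations parameter for a boto3 run_job_flow call (the same JSON can be supplied via the console or AWS CLI):

```python
# hive-site classification that points Hive at the Glue Data Catalog
# (requires EMR 5.8.0 or later); pass it when launching the cluster.
GLUE_METASTORE_CONFIG = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
        },
    },
]

# e.g. emr.run_job_flow(..., Configurations=GLUE_METASTORE_CONFIG, ...)
```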
Finally, I have included the following for your reference:
1. JupyterHub Components
The following diagram depicts the components of JupyterHub on EMR with corresponding authentication methods for notebook users and the administrator [19].
2. SageMaker
As you are more than likely aware, AWS has recently launched an ML notebook service called SageMaker, which uses Jupyter notebooks exclusively. Because SageMaker is integrated with other AWS services, you can achieve greater control; for example, with SageMaker you can use the IAM service to control user access. You can also connect to it from an EMR cluster: EMR version 5.11.0 [20] added the aws-sagemaker-spark-sdk component to Spark, which installs Amazon SageMaker Spark and the associated dependencies for Spark integration with Amazon SageMaker.
You can use Amazon SageMaker Spark to construct Spark machine learning (ML) pipelines using Amazon SageMaker stages. If this is of interest to you, you can read more about it here [21] and on the SageMaker Spark Readme on GitHub [22].
Resources
[1] Amazon EMR 5.x Release Versions – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html
[2] Installing Additional Kernels and Libraries – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-install-kernels-libs.html
[3] Use the Hive JDBC Driver – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/HiveJDBCDriver.html
[4] Use Business Intelligence Tools with Amazon EMR – https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-bi-tools.html
[5] Adding Database Connectors – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/presto-adding-db-connectors.html
[6] Run Jupyter Notebook and JupyterHub on Amazon EMR – https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/
[7] Create Bootstrap Actions to Install Additional Software – https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
[8] Run a Script in a Cluster – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
[9] Connecting to the Master Node and Notebook Servers – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-connect.html
[10] JupyterHub Security Settings – http://jupyterhub.readthedocs.io/en/latest/getting-started/security-basics.html
[11] Create a Cluster With JupyterHub – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-launch.html
[12] Configuring JupyterHub – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-configure.html
[13] Considerations When Using JupyterHub on Amazon EMR – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-considerations.html
[14] JupyterHub Configuration and Administration – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-administer.html
[15] Adding Jupyter Notebook Users and Administrators – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-user-access.html
[16] Using PAM Authentication – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-pam-users.html
[17] Using LDAP Authentication – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-ldap-users.html
[18] Using the AWS Glue Data Catalog as the Metastore for Hive – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
[19] JupyterHub – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub.html
[20] EMR Release 5.11.0 – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew-history.html#emr-5110-whatsnew
[21] Using Apache Spark with Amazon SageMaker – https://docs.aws.amazon.com/sagemaker/latest/dg/apache-spark.html
[22] SageMaker Spark – https://github.com/aws/sagemaker-spark/blob/master/README.md
[*] What Is Amazon SageMaker? – https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
Need to learn more about AWS Big Data (Demystified)?
- Contact me via LinkedIn: Omid Vahdaty
- Website: https://amazon-aws-big-data-demystified.ninja/
- Join our meetup, Facebook group, and YouTube channel:
- Join our meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
- Join our facebook group https://www.facebook.com/groups/amazon.aws.big.data.demystified/
- subscribe to our youtube channel https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
——————————————————————————————————————————
I put a lot of thought into these blogs so that I can share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:
https://www.linkedin.com/in/omid-vahdaty/