All you need to know about AWS EMR Presto
Author: Omid Vahdaty 2.7.2018
A list of good reading materials to help you get started with Presto
Presto is accessible over JDBC, runs queries in memory, and is sometimes faster than AWS Athena.
Presto uses external tables in S3.
Syntax limitations compared with Hive
- INSERT OVERWRITE statements are not supported.
Presto does not currently support INSERT OVERWRITE statements. Delete the table before running INSERT INTO. See the details here.
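One possible way to emulate an overwrite is to drop the target table and recreate it from a query. The sketch below only prints the SQL. The helper name overwrite_sql and the table names are hypothetical, and whether dropping the table is safe depends on your setup (for example, managed versus external tables).

```shell
#!/usr/bin/env bash
# Sketch: emulate INSERT OVERWRITE by dropping and recreating the target table.
# overwrite_sql and the table names below are hypothetical placeholders.

overwrite_sql() {
  local target="$1"   # table to replace
  local query="$2"    # SELECT that produces the new contents
  printf 'DROP TABLE IF EXISTS %s;\n' "$target"
  printf 'CREATE TABLE %s AS %s;\n' "$target" "$query"
}

# Print the statements for review; nothing is sent to the cluster here.
overwrite_sql "daily_report" "SELECT * FROM staging_report"
```

The printed statements can be saved to a file and passed to presto-cli with --file once you have reviewed them.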
- Presto has announced support for cost-based JOIN optimizations, meaning JOINs are automatically reordered based on table size. Unless you are using the latest version, make sure that smaller tables are on the right-hand side of the JOIN and that they fit in memory. Otherwise, out-of-memory exceptions will cause the query to fail.
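To illustrate the JOIN ordering advice, here is a minimal sketch that builds a query with the large table on the left and the small table on the right. Without cost-based reordering, Presto builds its in-memory hash table from the right-hand table, which is why that side should be the small one. The helper join_sql and all table and column names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build a JOIN with the large table on the left and the small one on the right.
# join_sql and all table/column names are hypothetical.

join_sql() {
  local big="$1" small="$2" key="$3"
  cat <<SQL
SELECT count(*)
FROM ${big} b
JOIN ${small} s ON b.${key} = s.${key}
SQL
}

# page_views is assumed to be large, countries small enough to fit in memory.
join_sql "page_views" "countries" "country_code"
```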
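If you are looking for Hive-style dynamic partition syntax (INSERT ... PARTITION), it is not supported in Presto 🙁 Reading partitioned Hive tables does work, as described in the section on Hive partitions further down.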
How to use Presto with AWS EMR:
- Presto with Airpal: Airpal has many helpful features, such as syntax highlighting and exporting results to CSV for download. Airpal lets you find tables, see metadata, browse sample rows, write and edit queries, and then submit queries, all in a web interface. Note that running an extra Airpal server will lead to extra EC2 costs.
- Presto with Hue: You can use Presto with Hue (hue-4.0.1) on EMR (version 5.9.0 or later). Hue provides an SQL editor for running your Presto queries in a web interface similar to Airpal. (The features provided by Hue may differ from those of Airpal.) As far as I understand, Hue is a better option than Airpal, since you can install Hue as part of the EMR installation.
- Presto on EMR CLI: You can run Presto using the command line interface and monitor your queries using the Presto web UI. Open "MASTER_NODE_IP:8889" (the default) to monitor your cluster details. To enable web interfaces for an EMR cluster, refer to https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-ui-console.html
- Use Athena instead of Presto on EMR: You can also use AWS Athena if you want to process data stored in S3. Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon S3. Athena internally uses Presto as its SQL query engine.
- Use Presto when you want to reduce costs on your AWS Athena service.
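For the CLI option above, the following sketch builds the presto-cli invocation and the web UI address for a given master node. The host value is a placeholder, the helper names are hypothetical, and port 8889 is the default mentioned above. The script only prints the strings and does not connect to anything.

```shell
#!/usr/bin/env bash
# Sketch: build the presto-cli command and the web UI address for an EMR master node.
# The host is a placeholder; 8889 is the default port mentioned in the article.

PRESTO_PORT=8889

presto_cli_cmd() {
  local host="$1" schema="${2:-default}"
  echo "presto-cli --server ${host}:${PRESTO_PORT} --catalog hive --schema ${schema}"
}

presto_ui_url() {
  echo "http://$1:${PRESTO_PORT}"
}

presto_cli_cmd "MASTER_NODE_IP"
presto_ui_url "MASTER_NODE_IP"
```

When you run presto-cli on the master node itself, the --server option can usually point at localhost with the same port.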
Presto reading Hive partitions including dynamic partitioning
Presto has full support for Hive partitions including dynamic partitioning.
On EMR, when you install Presto on your cluster, EMR installs Hive as well. Presto uses the Hive metastore to map database tables to their underlying files.
INSERT queries into an external table on S3 are also supported. To query data from Amazon S3, you need to use the Hive connector that ships with the Presto installation.
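As a rough illustration of partitions through the Hive connector, the sketch below prints SQL that creates a partitioned table, inserts into it, and reads one partition. All table and column names are hypothetical. It assumes the Hive connector is exposed as the hive catalog and, to my knowledge, the partition column has to come last in the column list.

```shell
#!/usr/bin/env bash
# Sketch: SQL for a partitioned Hive table written and read through Presto.
# partition_demo_sql and every table/column name are hypothetical.

partition_demo_sql() {
  cat <<'SQL'
CREATE TABLE hive.default.events_by_day (
  user_id bigint,
  event varchar,
  dt varchar
)
WITH (format = 'ORC', partitioned_by = ARRAY['dt']);

INSERT INTO hive.default.events_by_day
SELECT user_id, event, dt FROM hive.default.raw_events;

SELECT count(*) FROM hive.default.events_by_day WHERE dt = '2018-07-02';
SQL
}

# Print the statements; run them through presto-cli after review.
partition_demo_sql
```

The INSERT creates one partition per distinct dt value it encounters, and the final SELECT filters on the partition column so only that partition is scanned.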
Scheduling jobs in Presto
As far as I understand, you can use one of the following methods:
- You can create a shell script and submit it as a step to the cluster. See the AWS EMR documentation for more details on submitting a step to a cluster. For example, the script could contain:
presto-cli --catalog hive --schema default --execute "select count(*) from TABLE_NAME;"
- Use a shell action to schedule an Oozie workflow on the EMR cluster (Oozie needs to be installed as part of the EMR cluster).
This blog explains how to use Oozie workflows.
- You can save your queries in Hue and then run those saved queries in the Hue console.
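For the first method, a script stored in S3 can be submitted as a step with the AWS CLI. The sketch below only prints the command. The cluster id, region, bucket, and script name are hypothetical placeholders, and it assumes the script-runner.jar that EMR provides for running shell scripts as steps.

```shell
#!/usr/bin/env bash
# Sketch: print the AWS CLI command that submits a shell script as an EMR step.
# The cluster id, region, and S3 path are hypothetical placeholders.

add_step_cmd() {
  local cluster_id="$1" region="$2" script_s3="$3"
  echo "aws emr add-steps --cluster-id ${cluster_id}" \
       "--steps Type=CUSTOM_JAR,Name=PrestoQuery,ActionOnFailure=CONTINUE,Jar=s3://${region}.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[${script_s3}]"
}

# The script in S3 would contain the presto-cli line from the first method.
add_step_cmd "j-XXXXXXXXXXXXX" "us-east-1" "s3://my-bucket/scripts/presto_count.sh"
```

To run this on a schedule, the printed command could be triggered from cron or any external scheduler that has AWS CLI credentials.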
Working example with Hive and Presto:
- Create a table via Hive.
- Select via Presto.
presto-cli --catalog hive --execute "select * from t"
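The two steps above could be scripted roughly as follows. This is a sketch under assumptions: the columns of table t are hypothetical, both hive and presto-cli are assumed to be on the PATH of the master node, and the script runs in dry-run mode so it only prints the commands until DRY_RUN is set to 0.

```shell
#!/usr/bin/env bash
# Sketch: create a table with Hive, then read it with Presto.
# Table t comes from the example above; its columns are hypothetical.

DRY_RUN=1   # set to 0 on the cluster to actually execute the commands

run() {
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command instead of running it (argument quoting is not shown).
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run hive -e "CREATE TABLE IF NOT EXISTS t (id INT, name STRING)"
run presto-cli --catalog hive --schema default --execute "select * from t"
```

Because Presto reads table definitions from the Hive metastore, the table created in the first step is visible to the second without any extra registration.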
Good lecture on new features
Cost reduction on Presto:
Try my cost reduction article on AWS Athena; it may have useful tips.