AWS Big Data Blog

Analyze Data with Presto and Airpal on Amazon EMR

Songzhi Liu is a Professional Services Consultant with AWS

You can now launch Presto version 0.119 on Amazon EMR, allowing you to easily spin up a managed EMR cluster with the Presto query engine and run interactive analysis on data stored in Amazon S3. You can use Spot Instances, publish logs to an S3 bucket, and configure Presto through EMR's configuration API. In this post, I’ll show you how to set up a Presto cluster and use Airpal to process data stored in S3.

What is Presto?

Presto is a distributed SQL query engine optimized for ad hoc analysis. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can query multiple data sources, including Amazon S3. Presto’s execution framework is fundamentally different from that of Hive/MapReduce: Presto has a custom query and execution engine in which the stages of execution are pipelined, similar to a directed acyclic graph (DAG), and all processing occurs in memory to reduce disk I/O. This pipelined execution model can run multiple stages in parallel and streams data from one stage to another as the data becomes available. This reduces end-to-end latency and makes Presto a great tool for ad hoc data exploration over large datasets.

What is Airpal?

Airpal is a web-based query execution tool open-sourced by Airbnb that leverages Presto to facilitate data analysis. Airpal has many helpful features. For example, you can highlight syntax, export results to CSV for download, view query history, save queries, use the Table Finder to search for appropriate tables, and use the Table Explorer to visualize the schema of a table. We have created an AWS CloudFormation script that makes it easy to set up Airpal on an Amazon EC2 instance on AWS.

For this blog post, we will use Wikimedia’s page count data, which is publicly available at s3://support.elasticmapreduce/training/datasets/wikistats/. This data is in text file format. We will also convert the table to the Parquet and ORC formats.

Spin up an EMR cluster with Hive and Presto installed

First, log in to the AWS Management Console and navigate to the EMR console. Choose the EMR-4.1.0 release and the Presto-Sandbox application. Make sure you provide an SSH key pair so that you can log in to the cluster.

Note: Write down the DNS name after creation is complete. You’ll need this for the next step.


Use AWS CloudFormation to deploy the Airpal server

  1. Make sure you have a valid Key Pair for the region in which you want to deploy Airpal.
  2. Navigate to AWS CloudFormation, click Create New Stack, name your stack, and choose Specify an Amazon S3 template URL.

Use the template at https://s3-external-1.amazonaws.com/emr.presto.airpal/scripts/deploy_airpal_env.json

  3. Click Next and configure the parameters.

Important parameters you should configure:

PrestoCoordinatorURL  Use the DNS name you noted earlier, in the format http://<DNS name of the cluster>:<Presto port>. The default port for a Presto installation is 8889.

Example: http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8889

AirpalPort  Choose the port on which the Airpal server should run. The default is 8193. Adjust this according to your firewall settings and make sure the port is not blocked.

S3BootstrapBuckets  The name of the S3 bucket that holds the bootstrap scripts. There is no need to change the default value of emr.presto.airpal.

InstallAirpal  The path to the Airpal server installation script. There is no need to change the default value of scripts/install_airpal.sh.

StartAirpal  The path to the Airpal server start script. There is no need to change the default value of scripts/start_airpal.sh.

MyKeyPairName  Select a valid Key Pair that you have in this region. You’ll use this to log in to the Airpal server instance.

  4. Click Next and add a tag to the stack if needed. Select the IAM policy check box and click Create.
  5. Wait 5–10 minutes after the stack status changes to CREATE_COMPLETE. (The server configuration takes longer than the stack creation.)
  6. Navigate to the EC2 console, select the Airpal server instance, and note its public IP address.
  7. Open a browser and go to http://<Public IP>:<Airpal port> to reach Airpal. Make sure that port 8889 is open in the master security group of your EMR cluster so that Airpal can reach the Presto coordinator.

Log in to the master node and run Hive scripts

Presto ships with several connectors. To query data in Amazon S3, you use the Hive connector. Presto uses Hive only to create and manage the metadata; Presto’s execution engine is entirely its own. By default, when you install Presto on your cluster, EMR installs Hive as well. The metadata is stored in a database such as MySQL and is accessed through the Hive metastore service, which is also installed.
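To make the relationship concrete: Presto addresses Hive-metastore tables through the hive catalog using fully qualified names. As a sketch, assuming the wikistats table defined in the next section, a query from Presto looks like this:

```sql
-- The Hive connector supplies only the table metadata; Presto's own
-- engine reads the underlying S3 files directly.
-- Tables are addressed as <catalog>.<schema>.<table>.
SELECT * FROM hive.default.wikistats LIMIT 10;
```

When the session’s default catalog and schema are already set to hive and default (as they are later with the Presto CLI flags), the bare table name wikistats works as well.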

The dataset contains about 7 GB of page hit data for Wikipedia pages. The schema is as follows:

  • Language of the page
  • Title of the page
  • Number of hits
  • Retrieved page size

Define the schema

To define the schema:

  1. Log in to the master node using the following command in the terminal:

ssh -i YourKeyPair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

Replace YourKeyPair.pem with the location and name of your pem file. Replace ec2-xx-xx-xx-xx.compute-1.amazonaws.com with the public DNS name of your EMR cluster.

  2. Type “hive” at the command line to enter Hive interactive mode and run the following commands:
CREATE EXTERNAL TABLE wikistats (
language STRING,
page_title STRING,
hits BIGINT,
retrieved_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';

Now you have created a “wikistats” table over the delimited text files. You can also define this table in the Parquet format using the following command:

CREATE EXTERNAL TABLE wikistats_parq (
language STRING,
page_title STRING,
hits BIGINT,
retrieved_size BIGINT
)
STORED AS PARQUET
LOCATION 's3://emr.presto.airpal/wikistats/parquet';

You can store it in the compressed ORC format using the following command:

CREATE EXTERNAL TABLE wikistats_orc (
language STRING,
page_title STRING,
hits BIGINT,
retrieved_size BIGINT
)
STORED AS ORC
LOCATION 's3://emr.presto.airpal/wikistats/orc';

Now we have three tables holding the same data in three different data formats.
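Note that CREATE EXTERNAL TABLE only registers a schema and a location; it does not convert any data by itself. To actually populate the Parquet and ORC tables from the text-format table, you would run something like the following in Hive. This is a sketch: it assumes the LOCATION clauses of wikistats_parq and wikistats_orc point at an S3 bucket you can write to (the shared emr.presto.airpal bucket already holds converted copies).

```sql
-- Convert the delimited text data into Parquet and ORC.
-- Assumes the target tables were created with S3 locations you own.
INSERT OVERWRITE TABLE wikistats_parq SELECT * FROM wikistats;
INSERT OVERWRITE TABLE wikistats_orc  SELECT * FROM wikistats;
```

These conversions run as Hive jobs on the cluster; once they finish, Presto can query all three tables interchangeably.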

Try Presto in Airpal

Open a browser and go to http://<IP address of the EC2 instance>:8193.

You will use Presto queries to answer the questions below. Paste the following queries into the Airpal query editor.

What is the most frequently viewed page with page_title that contains “Amazon”?

SELECT language, 
    page_title,
    SUM(hits) AS hits 
FROM default.wikistats 
WHERE language = 'en'
AND page_title LIKE '%Amazon%'
GROUP BY language,
    page_title
ORDER BY hits DESC
LIMIT 10;


On average, which page is hit most in English?

SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats
WHERE language = 'en'
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

Try wikistats_orc and wikistats_parq with the same query. Do you see any difference in performance?

Go back to Airpal and view the results. The top records are pages such as Main_Page, Special:, and 404_error, which we don’t really care about. These entries are noise here, so you should filter them out in your query:

SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats
WHERE language = 'en'
AND page_title NOT IN ('Main_Page',  '404_error/')
AND page_title NOT LIKE '%Special%'
AND page_title NOT LIKE '%index%'
AND page_title NOT LIKE '%Search%'
AND NOT regexp_like(page_title, '%20')
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

Using the Presto CLI

You can also use the Presto CLI directly on the EMR cluster to query the data.

  1. Log in to the master node using the following command in the terminal:

ssh -i YourKeyPair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

Replace YourKeyPair.pem with the location and name of your pem file. Replace ec2-xx-xx-xx-xx.compute-1.amazonaws.com with the public DNS name of your EMR cluster.

  2. Assuming you already defined the schema using Hive, start the Presto CLI.
  3. Run the following command:

 $ presto-cli --catalog hive --schema default

  4. Check to see if the tables are still there.

  5. Try the same query you tried earlier.
SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats
WHERE language = 'en'
AND page_title NOT IN ('Main_Page',  '404_error/')
AND page_title NOT LIKE '%Special%'
AND page_title NOT LIKE '%index%'
AND page_title NOT LIKE '%Search%'
AND NOT regexp_like(page_title, '%20')
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

As you can see, you can also execute the query from the Presto CLI.
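To verify from the Presto CLI that the Hive tables are visible before running the larger query, a quick check looks like this (a sketch; table names assume the DDL created earlier):

```sql
-- Run inside the presto-cli session started above.
SHOW TABLES;                      -- should list wikistats, wikistats_parq, wikistats_orc
DESCRIBE wikistats;               -- shows the column names and types from the metastore
SELECT COUNT(*) FROM wikistats;   -- confirms Presto can read the S3 data
```

Because the CLI was started with --catalog hive --schema default, unqualified table names resolve against the Hive metastore automatically.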

Summary

Presto is a distributed SQL query engine optimized for ad hoc analysis and data exploration. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. In this post, I’ve shown you how easy it is to set up an EMR cluster with Presto 0.119, create metadata using Hive, and use either the Presto CLI or Airpal to run interactive queries.

If you have questions or suggestions, please leave a comment below.

——————————————-

Related

Large-Scale Machine Learning with Spark on Amazon EMR

——————————————–

Love to work on open source? Check out EMR’s careers page.