Statistical Analysis with Open-Source R and RStudio on Amazon EMR

Markus Schmidberger is a Senior Big Data Consultant for AWS Professional Services.

Big Data is on every CIO’s mind. It is synonymous with technologies like Hadoop and the ‘NoSQL’ class of databases. Another technology shaking things up in Big Data is R. This blog post describes how to set up R, the RHadoop packages, and RStudio server on Amazon Elastic MapReduce (Amazon EMR). This combination provides a powerful statistical analysis environment, including a user-friendly IDE, on a fully managed Hadoop environment that starts up in minutes and saves time and money for your data-driven analyses. At the end of this post, I’ve added a Big Data analysis using a public data set with daily global weather measurements.

R is an open source programming language and software environment designed for statistical computing, visualization, and data analysis. Thanks to its flexible package system and powerful statistical engine, R provides methods and technologies to manage and process large amounts of data. It is often described as one of the fastest-growing analytics platforms in the world, and it is established in both academia and business due to its robustness, reliability, and accuracy. Nearly every top vendor of advanced analytics has integrated R and can now import R models. This allows data scientists, statisticians, and other sophisticated enterprise users to leverage R within their analytics package.

The open source project RHadoop provides several R packages to work with R and Hadoop interactively. It uses Hadoop Streaming to send jobs from R to Hadoop and works for the Hadoop distributions CDH3 and higher, or Apache 1.0.2 and higher. Furthermore, the RHadoop project provides packages to connect with Apache HBase and to execute functionality from the famous plyr package on Hadoop.

R was not originally designed to handle large amounts of data. In recent years, several packages have been published to address high memory requirements and long computation times. The RHadoop packages combine R with Hadoop and allow you to marry R’s statistical capabilities with the scalable compute power provided by Amazon EMR on top of the Hadoop MapReduce framework. This integration allows you to process large data volumes on Amazon EMR that would otherwise not be possible using R in stand-alone mode.

RStudio is a free and open source integrated development environment for R. It can be used on a desktop computer and as a server version. The RStudio project started in 2011 and is a commonly used IDE for R. Installing RStudio server on the Hadoop master node plus using the RHadoop packages provides a great integration of R with Hadoop.

Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to quickly and cost-effectively process vast amounts of data. Amazon EMR uses Hadoop to distribute your data and processing across a resizable cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances.

Starting an Amazon EMR Cluster with R

Installing RStudio server and RHadoop packages on Amazon EMR requires some bootstrap activity. To learn how to use Bootstrap Actions and other aspects of Amazon EMR, see the Amazon EMR getting started documentation.

First, we start an Amazon EMR cluster in the “us-east-1” region using the AWS Command Line Interface. The required scripts are available at the emr-bootstrap-action GitHub repository. Please copy them to your Amazon Simple Storage Service (Amazon S3) bucket and replace <YOUR_BUCKET> with your bucket name.

NOTE: For EMR 4.x and later, see the “Installing and configuring RStudio for SparkR on EMR” section of Crunching Statistics at Scale with SparkR on Amazon EMR.

aws emr create-cluster --ami-version 3.2.1 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.2xlarge \
    InstanceGroupType=CORE,InstanceCount=5,InstanceType=m3.2xlarge \
  --bootstrap-actions Path=s3://<YOUR_BUCKET>/emR_bootstrap.sh,Name=CustomAction,Args=[--rstudio,--rexamples,--plyrmr,--rhdfs] \
  --steps Name=HDFS_tmp_permission,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://<YOUR_BUCKET>/hdfs_permission.sh \
  --region us-east-1 --ec2-attributes KeyName=<YOUR_SSH_KEY>,AvailabilityZone=us-east-1a \
  --no-auto-terminate --name emR-example

This command starts an Amazon EMR cluster with one master node and five core nodes, all of instance type m3.2xlarge. Depending on the arguments provided, the emR_bootstrap.sh script installs the RHadoop packages on all nodes and RStudio server on the master node. As soon as the cluster is running, a first step named HDFS_tmp_permission runs to fix the Hadoop file system permissions by giving everyone read/write access to the /tmp folder. The command also names the cluster emR-example, disables auto-termination (--no-auto-terminate), and defines the region, the availability zone, and the key pair for accessing the master node. Replace <YOUR_SSH_KEY> with the name of your own key.

The Bootstrap Script emR_bootstrap.sh

The script sets a list of Hadoop system variables and installs some system packages. To install the RHadoop packages, R must be linked to the corresponding Java setup, so R CMD javareconf is executed. Next, the script installs the R packages that the RHadoop packages depend on; afterwards, the RHadoop packages themselves are downloaded and installed. Finally, the user rstudio (--user,rstudio) with the password rstudio (--user-pw,rstudio) is added on all machines. If you set the --rstudio argument, the bootstrap script installs RStudio server on the master node.

To avoid firewall issues, the default RStudio server port is changed to port 80 (--rstudio-port=80). R example scripts are copied to the user’s home directory using the --rexamples argument. The packages plyrmr and rhdfs take considerable time to compile, so they are installed only when you pass the arguments --plyrmr and --rhdfs. For performance reasons, all packages are compiled with the R byte compiler. To update R to the latest CRAN version, you can set the --updater flag. This compiles R from source and adds up to 15 minutes of bootstrapping time.

Changing the security group

By default, port 80 is closed in the ElasticMapReduce-master security group. You can change this using the Amazon EC2 web console. Locate the security group tab and the corresponding security group and add a security rule HTTP for the source “My IP.”
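The same rule can also be added from the AWS CLI instead of the web console. The following is a minimal sketch, not part of the original scripts: the function name open_rstudio_port is hypothetical, and it assumes the security group can be addressed by name (as in a default VPC); otherwise --group-id would be needed in place of --group-name.

```shell
# Sketch: allow inbound HTTP (port 80) to the EMR master security group
# from a single IPv4 address.
# Assumption: the group is addressable by name (default VPC).
open_rstudio_port() {
  local my_ip="$1"   # your public IPv4 address, e.g. 203.0.113.10
  aws ec2 authorize-security-group-ingress \
    --region us-east-1 \
    --group-name ElasticMapReduce-master \
    --protocol tcp \
    --port 80 \
    --cidr "${my_ip}/32"
}

# Usage (hypothetical address):
#   open_rstudio_port 203.0.113.10
```

Restricting the rule to a single /32 address mirrors the “My IP” source in the console and keeps RStudio from being reachable by everyone.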

Interacting with an Amazon EMR Cluster and Submitting R Jobs

Using the master public DNS, you can access RStudio running on the master node of the Amazon EMR cluster via your web browser. If you haven’t worked with RStudio, see the RStudio documentation.
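If you prefer to look up the master public DNS from the command line, a sketch along the following lines may help. The function name rstudio_url is hypothetical, and the cluster ID argument is the j-... identifier printed by aws emr create-cluster.

```shell
# Sketch: print the RStudio URL for a cluster once it is running.
rstudio_url() {
  local cluster_id="$1"   # e.g. the j-... ID printed by create-cluster
  local dns
  # Block until the cluster has reached the running state.
  aws emr wait cluster-running --region us-east-1 --cluster-id "$cluster_id"
  # Ask EMR for the public DNS name of the master node.
  dns="$(aws emr describe-cluster --region us-east-1 --cluster-id "$cluster_id" \
          --query 'Cluster.MasterPublicDnsName' --output text)"
  # RStudio server listens on port 80 in the setup described in this post.
  echo "http://${dns}/"
}

# Usage:
#   rstudio_url <YOUR_CLUSTER_ID>
```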

In your home directory you will find an R example script called rmr2_example.R. Open this file and source it via the R command source() or by using the RStudio buttons. This script provides a very short introduction to using the rmr2 package. It creates a long vector of integers, moves this vector to HDFS, and calculates the square of each vector element using a map task. The output of the commands includes a lot of information created by Hadoop Streaming, including an overview of the progress of the job and the job ID.

As you can see in the output matrix, the script calculated the square values of the input vector. In MapReduce, everything works with key-value pairs. In most cases, the output of an rmr2 function is a list with the elements ‘key’ and ‘val’. You should get used to working with these elements.
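To make the key-value idea concrete without a cluster, here is a small local sketch in the plain-text style of Hadoop Streaming, where a mapper reads records from standard input and writes one key and one value per line, separated by a tab. It is only an illustration: it is not the rmr2_example.R script, rmr2 handles serialization itself, and the function name square_mapper is hypothetical.

```shell
# Mapper: read one integer per line and emit "<key>\t<value>",
# where the key is the input and the value is its square.
square_mapper() {
  awk '{ printf "%d\t%d\n", $1, $1 * $1 }'
}

# Local dry run on the integers 1 to 5 (no Hadoop involved).
seq 1 5 | square_mapper
```

Each output line pairs an input integer (the key) with its square (the value), which corresponds to the ‘key’ and ‘val’ elements you get back from an rmr2 job.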

A Real-world Big Data Analysis

To provide a more useful example, we will run a real-world Big Data analysis. Amazon Web Services provides a repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. The AWS documentation provides a detailed list of available data sets and first steps for working with them.

For this analysis we use the Daily Global Weather Measurements data set. The National Climatic Data Center (NCDC) originally collected the data from 1929 to 2009 as part of the Global Surface Summary of Day (GSOD). The data set provides daily global summaries for 18 surface meteorological elements derived from the synoptic/hourly observations. This data set can only be used within the United States. Therefore, we run all analyses in the us-east-1 region and create the Amazon Elastic Block Store (Amazon EBS) volume in the same availability zone as the Amazon EMR cluster (here us-east-1a).
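Creating the volume in that availability zone and attaching it to the master node can be done in the web console, or scripted roughly as follows. This is a sketch under assumptions: the function name create_and_attach_volume is hypothetical, both IDs are placeholders you must supply, and the device name /dev/sdf is chosen because it typically shows up as /dev/xvdf on the instance.

```shell
# Sketch: create an EBS volume from a snapshot in us-east-1a and attach it
# to an instance (here: the EMR master node).
create_and_attach_volume() {
  local snapshot_id="$1"   # snapshot ID of the public data set
  local instance_id="$2"   # EC2 instance ID of the master node
  local volume_id
  # Create the volume in the same availability zone as the cluster.
  volume_id="$(aws ec2 create-volume --region us-east-1 \
                 --availability-zone us-east-1a \
                 --snapshot-id "$snapshot_id" \
                 --query 'VolumeId' --output text)"
  # Wait until the new volume can be attached.
  aws ec2 wait volume-available --region us-east-1 --volume-ids "$volume_id"
  # Attach it; the instance usually sees /dev/sdf as /dev/xvdf.
  aws ec2 attach-volume --region us-east-1 --volume-id "$volume_id" \
    --instance-id "$instance_id" --device /dev/sdf
}

# Usage:
#   create_and_attach_volume <SNAPSHOT_ID> <MASTER_INSTANCE_ID>
```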

The data is stored in an Amazon EBS snapshot. To access the data we have to create an Amazon EBS volume using the Snapshot ID listed above. Then we can attach this volume to the master node of the Amazon EMR cluster. In the Amazon EC2 web console you can find all machines of your Amazon EMR cluster. To identify the master node, you can filter by the public DNS, the Amazon EMR-master security group, or the hostname. After attaching the Amazon EBS volume, you must log in to the master node via ssh and mount the newly attached volume:

ssh -i <YOUR_SSH_KEY> hadoop@ec2-X-X-X-X.compute-1.amazonaws.com
mkdir data
sudo mount /dev/xvdf data
sudo chown -R hadoop:hadoop data
find data/gsod -maxdepth 2 -type f -exec sed -i '1d' {} \; -print
hadoop fs -mkdir /tmp/data
hadoop fs -v -put data/*.txt /tmp/data/
hadoop fs -put data/gsod /tmp/data/

We also remove the header line of all files to avoid issues in the rmr2 package and move the data to HDFS. Now we can use, for example, the plyrmr package to analyze the weather data set. In your home directory you will find a well-documented R script called biganalyses_example.R, which provides basic steps to analyze the data set. As a result, the figure shows the variation of temperature averaged by month for all 25,000 weather stations in 1957. The red line describes the average over all stations. For most stations, you can see the temperature differences between summer and winter. The highest temperatures are in July and the lowest in January.
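The grouping behind such a monthly average can be tried locally on a handful of records before running it on the full data set. The sketch below is not the biganalyses_example.R script and does not use plyrmr; the three-column record layout, the sample values, and the function names are assumptions made purely for illustration (the real GSOD files contain more columns).

```shell
# Hypothetical, simplified records: station ID, date (YYYYMMDD), mean temperature.
sample_records() {
  printf '%s\n' \
    '010010 19570105 20.0' \
    '010010 19570120 24.0' \
    '010010 19570710 60.0' \
    '020020 19570111 30.0' \
    '020020 19570715 70.0'
}

# "Map": key each reading by its month (characters 5-6 of the date).
# "Reduce": average all readings that share the same month key.
monthly_mean() {
  awk '{ m = substr($2, 5, 2); sum[m] += $3; n[m]++ }
       END { for (m in sum) printf "%s\t%.1f\n", m, sum[m] / n[m] }' | sort
}

sample_records | monthly_mean
```

On the sample records this prints one line per month, with the month as the key and the mean temperature across all stations as the value, which is the same shape of result as the red line in the figure.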

Copying the data from Amazon EBS into HDFS and running several of the commands on the full data set takes a long time (up to 12 hours with the small example Amazon EMR cluster). Scaling the Hadoop cluster up to 15 core nodes of type c3.4xlarge reduces the computation time to about two hours. One great feature of RStudio server is that you can log out and log back in to your R session without interrupting the running calculation.

When you’re done, don’t forget to terminate your Amazon EMR cluster so that you don’t incur additional costs:

aws emr terminate-clusters --cluster-ids XXX
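Here XXX stands for the ID of your cluster. If you no longer have it at hand, one way to look it up by name is sketched below; the function name find_cluster_id is hypothetical, and the sketch assumes the cluster name is unique among your active clusters.

```shell
# Sketch: print the ID of an active cluster with the given name.
find_cluster_id() {
  local name="$1"   # e.g. emR-example
  aws emr list-clusters --region us-east-1 --active \
    --query "Clusters[?Name=='${name}'].Id" --output text
}

# Usage:
#   aws emr terminate-clusters --region us-east-1 \
#     --cluster-ids "$(find_cluster_id emR-example)"
```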

All data and scripts created in RStudio in your home directory will be lost at termination. You can use git in RStudio to back up your scripts on an external version control system or download them to your local machine.

Summary

This blog post provided all the code you need to get your analyses with open-source R up and running on an Amazon EMR cluster. RStudio server provides a user-friendly programming environment for data analyses with R on Hadoop. The RHadoop packages provide a simple and efficient approach to writing MapReduce code with R, as well as high-level functionality to analyze Big Data located in a Hadoop cluster. The installation described in this post moves your data analyses next to your data on Hadoop and avoids the extra workload and latency caused by moving data. Overall, the bootstrap script allows rapid deployment of an advanced analytical platform on Amazon EMR for compute- and data-intensive workloads based on open-source R and Hadoop.

This analysis is a starting point for more detailed big data analyses with R on Hadoop. Several more examples are provided in this tutorial or at these use cases.

If you have questions, comments, or suggestions, please add a Comment below.

——————————

Related:

Running R on AWS
