Derek Graeber is a senior consultant in big data analytics for AWS Professional Services
Working with customers who are running Apache Spark on Amazon EMR, I run into the scenario where data loaded into a SparkContext can and should be shared across multiple use cases. They ask a very valid question: “Once I load the data into Spark, how can I access the data to support ad hoc queries?” Open-source APIs are available that support doing this, such as JobServer, which exposes a RESTful interface to access the context.
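To make that REST interface concrete, here is a sketch of how ad hoc work is typically submitted to JobServer over HTTP. JobServer listens on port 8090 by default and exposes endpoints such as /jars and /jobs; the app name, jar path, class path, and context name below are illustrative placeholders, not part of this post’s sample code:

```shell
# Upload a compiled Spark job jar under an app name ("flights" is illustrative)
curl --data-binary @target/flights-jobs.jar localhost:8090/jars/flights

# Run a job synchronously against a long-lived, shared context
curl -d '' 'localhost:8090/jobs?appName=flights&classPath=com.example.FlightQuery&context=sql-context&sync=true'
```

Because the context stays alive between requests, data cached in it by one job is available to the next ad hoc query.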
“Okay, how do I put JobServer on EMR?” typically follows. There are a couple of ways. Amazon EMR recently started using Apache Bigtop to support a much quicker release window (among other things) for out-of-the-box application components installed on EMR such as Spark, Pig, and Hive.
In this blog post, you will learn how to install JobServer on EMR using a bootstrap action (BA) derived from the JobServer GitHub repository. Then we’ll run JobServer using a sample dataset. To learn how to compile JobServer and install it on your Spark cluster, look at the JobServer readme file for EMR. All referenced code, including the BA, Spark code, and commands, is located in this project’s GitHub repository.
Background and setup
For this approach, we assume that you have a working knowledge of Apache Spark running on EMR and can create a cluster with configurations using either the AWS Management Console or the AWS Command Line Interface (AWS CLI). For this exercise, we will define a cluster size and use the airline flights public dataset available on Amazon S3. This data is in Parquet format. We will create an Amazon EMR 4.7.1 cluster consisting of:
- One r3.xlarge master instance
- Five r3.xlarge core instances
Note that this cluster setup is completely arbitrary; it is simply the one I typically use for proof-of-concept work, and it is not optimized. (Optimization is outside the scope of this post.) You can modify your cluster nodes or Spark job configuration as you wish. For a reference on tuning Spark on EMR, see the related AWS Big Data Blog post.
If you haven’t read the readme file for both JobServer and the EMR-JobServer configurations, do that before proceeding. Then get the project in GitHub and explore. The project is laid out in a typical Maven structure with additional directories for the configurations and BA.
Next, look at the version information in the JobServer readme file and determine the version of JobServer you’d like to use based on the version of Spark you are using. In this example, we are using Amazon EMR 4.7.1, which supports Spark 1.6.1. Thus, based on the readme, we will need version 0.6.2 (v0.6.2 branch) of JobServer. Make a note of this for later.
As you read the readme for EMR-JobServer, you’ll see there are two configuration files to be aware of:
emr.sh – this file defines parameters related to your Spark installation. For our example, these points apply:
- You only need to modify the SPARK_VERSION value.
- We will use the emr_v1.6.1.sh file provided in this blog post’s sample code.
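As a sketch, the relevant change in emr.sh is a single variable assignment (the surrounding variable names follow the JobServer deploy-configuration conventions; check your copy of the file for the exact layout):

```shell
# emr.sh (JobServer deploy configuration for EMR)
# Set this to match the Spark version your EMR release ships;
# Amazon EMR 4.7.1 ships Spark 1.6.1.
SPARK_VERSION=1.6.1
```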
emr.conf – this file defines the Spark run-time settings JobServer uses and the contexts that can be created up front. For our example, these points apply:
- We are creating a pre-defined spark-sql context for Hive.
- Researching the context definitions under the job-server-extras part of the GitHub project for JobServer helps a lot to understand these context factories.
- We will use the emr_contexts.conf file provided in this blog post’s sample code.
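For orientation, a predefined Hive-backed context in emr.conf might look along the lines of the following sketch. The factory class comes from the job-server-extras module mentioned above; the context name and resource values here are illustrative, so refer to the emr_contexts.conf in the sample code for the actual settings:

```
spark {
  contexts {
    # A long-lived Hive-capable SQL context, created when JobServer starts
    sql-context {
      context-factory = "spark.jobserver.context.HiveContextFactory"
      num-cpu-cores = 2
      memory-per-node = 2G
    }
  }
}
```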
You can find copies of these configuration files in the GitHub project under <project-root>/jobserver-configs. Review and become familiar with them. We will be staging them on Amazon S3 for the cluster creation later on.
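Staging the files can be done with the AWS CLI; a minimal sketch, where the bucket name and key prefix are placeholders you would replace with your own:

```shell
# Copy the JobServer configuration files to S3 so the BA can
# pull them down during cluster creation
aws s3 cp jobserver-configs/emr_v1.6.1.sh s3://your-bucket/jobserver/emr_v1.6.1.sh
aws s3 cp jobserver-configs/emr_contexts.conf s3://your-bucket/jobserver/emr_contexts.conf
```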
We also provide two BAs and an EMR configuration sample under <project-root>/BA:
- full_install_jobserver_BA.sh – this BA installs all necessary build components on your cluster, gets the project from GitHub, compiles, and creates a JobServer distribution. When installation is complete, this BA deploys JobServer and also puts the .tar file for the compiled code onto S3 for reuse.
- existing_build_jobserver_BA.sh – this BA looks for a precompiled distribution of JobServer in S3 and deploys that onto the cluster.
- configurations.json – this sample EMR configuration is provided for illustration purposes. Here, we’re setting the Spark cluster to use the maximizeResourceAllocation option.
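An EMR configuration that enables this option looks roughly like the following sketch, using the standard `spark` classification; consult the configurations.json in the project for the exact file used here:

```
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
```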
Why two BAs? When you have determined the version of JobServer you want to use and begin to use it extensively, the overhead of installing the build frameworks and compiling the source code on every new cluster becomes redundant and time-consuming. Because you have already built the distro and it is available on Amazon S3, you can save time by reusing it and streamlining the cluster build process. This approach also means that your EMR cluster is cleaner, because it won’t have SBT and Git installed. However, this is just a matter of preference. We will walk through both approaches.
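Putting the pieces together, the cluster described earlier can be created with JobServer installed via the BA; a sketch, in which the bucket, key pair name, and region are placeholders you would replace with your own:

```shell
# Create a 1-master / 5-core EMR 4.7.1 cluster with Spark and Hive,
# running the full-install JobServer BA staged on S3
aws emr create-cluster \
  --name "spark-jobserver" \
  --release-label emr-4.7.1 \
  --applications Name=Spark Name=Hive \
  --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge \
      InstanceGroupType=CORE,InstanceCount=5,InstanceType=r3.xlarge \
  --bootstrap-actions Path=s3://your-bucket/jobserver/full_install_jobserver_BA.sh \
  --configurations https://s3.amazonaws.com/your-bucket/jobserver/configurations.json \
  --ec2-attributes KeyName=your-key-pair \
  --use-default-roles \
  --region us-east-1
```

To use the faster path on subsequent clusters, you would point --bootstrap-actions at existing_build_jobserver_BA.sh instead, so the precompiled distribution is pulled from S3 rather than rebuilt.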