Deploying Cloudera’s Enterprise Data Hub on AWS

Karthik Krishnan is an AWS Solutions Architect

UPDATE April 6, 2015: The newest quickstart reference guide supports Cloudera Director 1.1.0. To manage your cluster with Cloudera Director 1.1.0, refer to the updated reference guide.

Apache Hadoop is an open-source software framework for storing and processing large-scale data sets. In this post, we discuss deploying a Hadoop cluster via Cloudera’s Enterprise Data Hub (EDH) on AWS. The deployment described below uses several AWS technologies, including AWS CloudFormation, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Virtual Private Cloud (Amazon VPC), along with Cloudera Director software. Cloudera Director delivers an enterprise-class, elastic, self-service experience for the Enterprise Data Hub on cloud infrastructure. With this deployment, customers can build a Hadoop cluster rapidly and on demand, while still having enough options to customize the cluster at a fine granularity.

The flexible architecture allows you to choose the most appropriate network, compute, and storage infrastructure for your environment, while the automation via CloudFormation and Cloudera Director software takes care of building the infrastructure on AWS. The automation works roughly as follows: CloudFormation launches a Launcher instance, and the entire cluster is then constructed from that instance. You customize the cluster on the Launcher instance. Because most of the steps are automated, customers can rapidly construct the cluster on AWS by changing configuration files and parameters during deployment. Let’s begin!

Prerequisites

To deploy EDH, you’ll need to sign up for an AWS account, create a key pair, and request an Amazon EC2 limit increase if applicable. You’ll need to be familiar with AWS technologies such as Amazon VPC, NAT instances, security groups, and IAM roles. The template builds these resources automatically through AWS CloudFormation. You’ll also need to be familiar with Hadoop services such as HDFS, YARN, ZooKeeper, Hive, Hue, and Oozie.

You should also understand the different instance types and your workload type, as outlined in the following table.

Instance	VCPU	Memory (GiB)	Workload Type	Local Storage (GB)	Storage Type
m2.4xlarge	8	68.4	BALANCED	2 x 840 GB	MAGNETIC
c3.8xlarge	32	60.0	COMPUTE	2 x 320 GB	SSD
i2.2xlarge	8	61.0	BALANCED	2 x 800 GB	SSD
cc2.8xlarge	32	60.5	COMPUTE	4 x 840 GB	MAGNETIC
i2.4xlarge	16	122.0	MEMORY	4 x 800 GB	SSD
hs1.8xlarge	16	117.0	BALANCED	24 x 2000 GB	MAGNETIC
i2.8xlarge	32	244.0	MEMORY	8 x 800 GB	SSD
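
As a rough aid for reading the table, the following sketch filters the listed instance types by workload type and a minimum amount of memory. The function names (instance_table, candidates) are hypothetical helpers invented for this illustration and are not part of any AWS or Cloudera tool. The data is copied from the table above, with the hs1.8xlarge workload label written as BALANCED.

```shell
#!/bin/sh
# Columns: instance type, vCPUs, memory (GiB), workload type.
# Values are copied from the table in this post.
instance_table() {
  cat <<'EOF'
m2.4xlarge 8 68.4 BALANCED
c3.8xlarge 32 60.0 COMPUTE
i2.2xlarge 8 61.0 BALANCED
cc2.8xlarge 32 60.5 COMPUTE
i2.4xlarge 16 122.0 MEMORY
hs1.8xlarge 16 117.0 BALANCED
i2.8xlarge 32 244.0 MEMORY
EOF
}

# Print the instance types whose workload type equals $1 and whose
# memory is at least $2 GiB (default 0, meaning no memory filter).
candidates() {
  workload=$1
  min_mem=${2:-0}
  instance_table |
    awk -v w="$workload" -v m="$min_mem" '$4 == w && $3 + 0 >= m + 0 { print $1 }'
}

# Example: candidates MEMORY 200
```

Running candidates with a workload type alone lists every matching instance; adding a memory floor narrows the list before you pick an instance type for the configuration file.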

AWS Cluster Topology

An EDH cluster can be launched either within a public subnet or within a private subnet. A cluster topology based on a public subnet includes an Amazon EC2 instance (referred to as the Cluster Launcher instance), which is launched within the public subnet. An Elastic IP address (EIP) is assigned to the instance, and a security group allowing SSH access to the instance is created. The Cluster Launcher instance then builds the EDH cluster by launching all of the Hadoop-related Amazon EC2 instances within the public subnet. In this topology, all the instances launched have direct access to the Internet and to any other AWS services that may be used subsequently, such as Amazon Simple Storage Service (Amazon S3) and Amazon Relational Database Service (Amazon RDS).

EDH Cluster in Public Subnet

EDH Cluster in Private Subnet

A cluster topology derived from a Private Subnet launches the Cluster Launcher instance within the public subnet. An Elastic IP Address (EIP) is assigned to the instance, and a security group is created allowing SSH access to the instance. All other Hadoop-related Amazon EC2 instances are created within the private subnet. In this topology, the Amazon EC2 instances within the EDH Cluster do not have direct access to the Internet or to other AWS services. Instead, their access is routed via NAT instances residing in the public subnet. This topology is more suitable if the EDH cluster doesn’t require full external bandwidth to the Internet or to other AWS services such as Amazon RDS and Amazon S3. To learn more about high availability for NAT Instances, see High Availability for Amazon VPC NAT Instances.

Step 1: Launch the Virtual Private Cloud and Configure AWS Services for EDH Deployment

In this step, you configure the VPC and build the public and private subnets along with a NAT instance. A Cluster Launcher instance is also created. In the next step, you will connect to the launcher instance, customize the cluster, and then deploy the EDH cluster. The only mandatory input to the template is KeyName; the other parameters are optional and can be customized. Launch the VPC template into your AWS account using the AWS CloudFormation Launch Stack link, then wait until the AWS CloudFormation status is CREATE_COMPLETE.
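
If you prefer to watch the stack from a terminal instead of the console, the wait can be sketched with the AWS CLI as follows. This is a minimal sketch, assuming the AWS CLI is installed and configured; the function names and the stack name passed to them are hypothetical. The AWS CLI also offers aws cloudformation wait stack-create-complete, which blocks in a similar way.

```shell
#!/bin/sh
# Print the current status of the CloudFormation stack named in $1.
stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text
}

# Poll the stack in $1 every $2 seconds (default 30) until it reaches
# CREATE_COMPLETE. Any status other than CREATE_IN_PROGRESS is treated
# as a failure, which covers rollback and failed states.
wait_for_stack() {
  stack=$1
  interval=${2:-30}
  while :; do
    status=$(stack_status "$stack") || return 1
    case $status in
      CREATE_COMPLETE)
        echo "$stack is ready"
        return 0 ;;
      CREATE_IN_PROGRESS)
        sleep "$interval" ;;
      *)
        echo "$stack ended in state $status" >&2
        return 1 ;;
    esac
  done
}

# Example (hypothetical stack name): wait_for_stack my-edh-vpc-stack
```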

Step 2: Configure EDH Cluster and Services

Next, you SSH to the Launcher EC2 instance, configure EDH services and deploy the EDH Cluster. During bootup, the Launcher instance downloads Cloudera Director software and automatically builds a configuration file based on various resources created such as VPC, private subnet, and public subnet. There are two configuration files: aws.simple.conf for simple clusters and aws.reference.conf for complex clusters.

The following steps configure and deploy the cluster:

1. Copy the private key file (.pem) that you used to launch the stack to the Launcher instance. For example, you can copy it from the command line with: scp -i mykey.pem mykey.pem ec2-user@cluster-launcher-public-ip:/home/ec2-user/mykey.pem

2. Connect to the Cluster Launcher instance by clicking the Connect tab under EC2 Instances. You will need your private key to connect to the instance.

3. Modify /home/ec2-user/cloudera/cloudera-director-1.0.0/aws.simple.conf (or aws.reference.conf for complex clusters) to include the path to the private key file. Customize other parameters inside the configuration file, such as the EC2 instance type, the public or private subnet, or the number of EDH nodes.
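
The copy and connect steps above can be sketched as two small shell functions. This is only an illustration: KEY_FILE and LAUNCHER_HOST are hypothetical variables, and their defaults are the placeholder names from the scp example, so replace them with your own key file and the launcher's Elastic IP address.

```shell
#!/bin/sh
# Hypothetical placeholders; override them in the environment.
KEY_FILE=${KEY_FILE:-mykey.pem}
LAUNCHER_HOST=${LAUNCHER_HOST:-cluster-launcher-public-ip}

# Copy the private key to the ec2-user home directory on the launcher.
copy_key() {
  # ssh and scp refuse private keys that other users can read.
  chmod 400 "$KEY_FILE"
  scp -i "$KEY_FILE" "$KEY_FILE" \
    "ec2-user@$LAUNCHER_HOST:/home/ec2-user/$(basename "$KEY_FILE")"
}

# Open an interactive SSH session on the launcher instance.
connect_launcher() {
  ssh -i "$KEY_FILE" "ec2-user@$LAUNCHER_HOST"
}

# Example: copy_key && connect_launcher
```

After the copy, the key is available on the launcher at /home/ec2-user/ under the same file name, which is the path to enter in the configuration file.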

Step 3: Deploy the EDH Cluster

Deploy using the CLI as shown below:

./bin/cloudera-director bootstrap aws.simple.conf (simple cluster)

-OR-

./bin/cloudera-director bootstrap aws.reference.conf (complex cluster)

If you want to spin up and manage multiple clusters, you can instead deploy through Cloudera Director Server (recommended), as follows.

./bin/cloudera-director-server (start the server running on port 7189 first)

./bin/cloudera-director bootstrap-remote aws.simple.conf --lp.remote.hostAndPort=127.0.0.1:7189 --lp.remote.username=admin --lp.remote.password=admin

-OR-

./bin/cloudera-director bootstrap-remote aws.reference.conf --lp.remote.hostAndPort=127.0.0.1:7189 --lp.remote.username=admin --lp.remote.password=admin
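
Because the remote bootstrap command is long and differs only in the configuration file, it can help to wrap it in a small function. The following is a sketch under stated assumptions: the function and variable names are hypothetical, and the defaults are the server address and admin credentials used in the commands in this post. Each option is passed with a double-hyphen prefix.

```shell
#!/bin/sh
# Hypothetical settings; defaults mirror the commands in this post.
DIRECTOR_BIN=${DIRECTOR_BIN:-./bin/cloudera-director}
DIRECTOR_HOST_PORT=${DIRECTOR_HOST_PORT:-127.0.0.1:7189}
DIRECTOR_USER=${DIRECTOR_USER:-admin}
DIRECTOR_PASSWORD=${DIRECTOR_PASSWORD:-admin}

# Bootstrap a cluster through a running Cloudera Director Server.
# $1 is the configuration file, for example aws.simple.conf.
bootstrap_remote() {
  conf=$1
  case $conf in
    *.conf) ;;
    *)
      echo "expected a .conf file, got: $conf" >&2
      return 2 ;;
  esac
  "$DIRECTOR_BIN" bootstrap-remote "$conf" \
    "--lp.remote.hostAndPort=$DIRECTOR_HOST_PORT" \
    "--lp.remote.username=$DIRECTOR_USER" \
    "--lp.remote.password=$DIRECTOR_PASSWORD"
}

# Example: bootstrap_remote aws.reference.conf
```

Keeping the credentials in variables also makes it easier to move away from the default admin password without editing every command.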

That’s it! The cluster will be ready in less than 30 minutes.

From Cloudera Director’s web interface you can clone the cluster you just created, dynamically scale the cluster, or spin up an entirely new cluster. You can also view all your clusters using a centralized dashboard.

Shutting Down the Cluster

You can shut down the cluster from the CLI as shown below:

./bin/cloudera-director terminate aws.simple.conf

For additional information, please refer to the following links.

Conclusion

By following these simple deployment steps, you can set up a cluster running Cloudera’s Enterprise Data Hub. This eliminates the process of procuring systems and installing Hadoop software, and it greatly reduces the time needed to build a test or production cluster.

If you have questions or suggestions, please leave a comment below.

UPDATE (October 24, 2014)

For HTTP failures related to Cloudera Director Server, the correct command (note admin/passwd) is:

./bin/cloudera-director bootstrap-remote aws.simple.conf --lp.remote.hostAndPort=127.0.0.1:7189 --lp.remote.username=admin --lp.remote.password=admin

-OR-

./bin/cloudera-director bootstrap-remote aws.reference.conf --lp.remote.hostAndPort=127.0.0.1:7189 --lp.remote.username=admin --lp.remote.password=admin

On the MySQL issue, we’ve noted the bug. Cloudera plans to fix it in the next version (there is no official date, but expect it within about a week). When the fix is available, we will update our templates to download the new version. In the meantime, there is a workaround: install Cloudera Manager 5.1.3 by overriding the repository in the configuration file:

cloudera-manager {
[…]
repository: "http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/5.1.3/"
}

Give it a try and let us know if you still see any issues.
