AWS Big Data Blog

Getting Started with Elasticsearch and Kibana on Amazon EMR

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.


Hernan Vivani is a Big Data Support Engineer for Amazon Web Services.

This post shows you how to install Elasticsearch and Kibana on an Amazon EMR cluster and provides a few simple ways to confirm it is working. (Please also see “Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch.”)

NOTE: If your goal is to quickly get started with Elasticsearch and Kibana without any underlying maintenance and configuration, consider using Amazon OpenSearch Service.

What is Elasticsearch?

Elasticsearch is a distributed, RESTful search and analytics engine built on top of Apache Lucene. Developed in Java and released as open source under the Apache License, Elasticsearch can index large volumes of log files and text content in near real time and lets users query them with free-form text through RESTful APIs. It integrates with Hadoop and can significantly improve the way large volumes of data are analyzed using Hadoop’s MapReduce model. For more details on Elasticsearch and its capabilities, please see Elasticsearch: The Definitive Guide.

What is Kibana?

Kibana is a web interface for Elasticsearch and provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. You can create bar, line and scatter plots, or pie charts and maps on top of large volumes of data. You can also do comparisons of queries across different time ranges. In short, it provides a friendly interface for analyzing a large volume of data without writing code.

Installing Elasticsearch and Kibana on Amazon EMR

First, install the AWS CLI, which provides the aws emr commands used in this post. Next, install Elasticsearch and Kibana on Amazon EMR by using Amazon EMR’s bootstrap action feature. A bootstrap action script allows you to customize existing applications or install additional software when launching a new cluster. Amazon EMR runs these scripts on each cluster node as it launches the node into the cluster. They run before Hadoop is configured and started, and before the nodes begin processing data.
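As a rough illustration of how a bootstrap action can behave differently on the master node and on the other nodes, here is a minimal sketch in shell. It is not the script used in this post (the scripts referenced here have a .rb extension). It assumes that Amazon EMR writes node metadata, including an "isMaster" field, to /mnt/var/lib/info/instance.json; the function name node_role is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: not the bootstrap action used in this post.
# Assumption: EMR stores node metadata in /mnt/var/lib/info/instance.json
# and that file contains an "isMaster" boolean field.

# Prints "master" if the metadata file marks this node as the master,
# otherwise prints "worker" (also when the file is missing).
node_role() {
  local info_file="${1:-/mnt/var/lib/info/instance.json}"
  if grep -q '"isMaster"[[:space:]]*:[[:space:]]*true' "$info_file" 2>/dev/null; then
    echo master
  else
    echo worker
  fi
}
```

A bootstrap script could then branch on the result, for example running master-only installation steps when "$(node_role)" equals master.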

The following command launches a 3-node Amazon EMR cluster with Elasticsearch 1.3.1 and Kibana 3.1.0 already installed and configured:

aws emr create-cluster --ec2-attributes KeyName="<YOUR_EC2_KEYNAME>" \
--log-uri="<YOUR_LOGGING_BUCKET>" \
--bootstrap-actions \
Name="Install ElasticSearch",Path="s3://beta.elasticmapreduce/bigdatablog/elasticsearch_install.rb" \
Name="Install Kibana and Nginx",Path="s3://beta.elasticmapreduce/bigdatablog/kibananginx_install.rb" \
--ami-version=3.2.1 \
--instance-count=3 \
--instance-type=m1.medium \
--name="TestElasticSearch" \
--use-default-roles 

Along with Elasticsearch, the Elasticsearch-Hadoop 2.0 library is also installed on the cluster to provide easy integration with Hadoop MapReduce, Hive, Pig, and Cascading, among others.

This bootstrap action configures Elasticsearch to listen on port 9200 and Kibana on port 80 on the master node of the cluster. By default, these ports are protected from public access. To access these interfaces on the master node, do one of the following: open an SSH tunnel to the master node and configure your browser to proxy through it, or add a rule that allows incoming TCP traffic on ports 80 and 9200 to the EC2 security group “ElasticMapReduce-master”.

Note: This security group is configured in each region and applies to all the clusters in that region. If you add such a rule, every Amazon EMR cluster in the region allows incoming TCP traffic on ports 80 and 9200. While testing on a production account, use SSH tunnels instead of modifying the security group settings.
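The SSH tunnel option can be sketched as follows. The helper below only prints the ssh command rather than running it, so you can review it first. It assumes dynamic port forwarding (ssh -D) with hadoop as the SSH user on the master node; the function name tunnel_command, the default local port 8157, and the key file and host name in the usage line are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: prints (does not run) an ssh command that opens a SOCKS
# proxy on a local port and routes traffic through the master node.
# Arguments: key file, master node public DNS name, optional local port.
tunnel_command() {
  local key_file="$1" master_dns="$2" socks_port="${3:-8157}"
  printf 'ssh -i %s -N -D %s hadoop@%s\n' "$key_file" "$socks_port" "$master_dns"
}

# Example (placeholders, not real values):
#   tunnel_command mykey.pem <MASTER_PUBLIC_DNS>
```

With the tunnel open, point your browser's SOCKS proxy setting at localhost and the chosen port to reach ports 80 and 9200 on the master node without changing the security group.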

Verifying the installation

Once the cluster is up and running, you can SSH into the master node and perform the following tests to verify the installation.

Performing the Elasticsearch cluster health check

$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
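If you want to use the health check in a script, one possible approach is to extract the "status" field from the response. The sketch below reads the health JSON from standard input and prints the status value (green, yellow, or red). The function name health_status is hypothetical, and the sed-based parsing is a deliberately simple stand-in for a real JSON parser.

```shell
#!/usr/bin/env bash
# Sketch only: prints the value of the "status" field from cluster health
# JSON supplied on stdin. Uses sed instead of a JSON parser, so it assumes
# the field appears as "status" : "<lowercase word>".
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Example, run on the master node:
#   curl -s 'http://localhost:9200/_cluster/health' | health_status
```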

Indexing Content

$ curl -XPUT 'http://localhost:9200/movies/movie/1' -d '{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972
}'
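To index more than one document, the same PUT request can be wrapped in a small helper. The sketch below is one way to do it, assuming the same movies index and movie type as the example above. The function names movie_json and index_movie are hypothetical, and the JSON builder assumes that titles and director names contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Sketch only: builds a movie document as JSON.
# Assumes title and director contain no double quotes or backslashes.
# Arguments: title, director, year.
movie_json() {
  printf '{"title": "%s", "director": "%s", "year": %d}\n' "$1" "$2" "$3"
}

# Indexes one movie under the given ID in the movies index.
# Arguments: id, title, director, year.
index_movie() {
  local id="$1"
  shift
  curl -s -XPUT "http://localhost:9200/movies/movie/${id}" -d "$(movie_json "$@")"
}

# Example, run on the master node:
#   index_movie 2 "The Godfather Part II" "Francis Ford Coppola" 1974
```

You can read a document back by ID with curl -XGET 'http://localhost:9200/movies/movie/1?pretty=true'.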

Querying the Indexed Content

$ curl -XGET 'http://localhost:9200/_search?pretty=true&q=*:*'
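To search for specific text instead of returning everything, you can send a query in the request body. The sketch below builds a match query for a single field and sends it to the movies index. The function names match_query and search_movies are hypothetical, and the builder assumes the field name and search text contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Sketch only: builds a request body containing a match query.
# Assumes field and text contain no double quotes or backslashes.
# Arguments: field name, search text.
match_query() {
  printf '{"query": {"match": {"%s": "%s"}}}\n' "$1" "$2"
}

# Runs the match query against the movies index.
# Arguments: field name, search text.
search_movies() {
  curl -s -XGET 'http://localhost:9200/movies/_search?pretty=true' -d "$(match_query "$1" "$2")"
}

# Example, run on the master node:
#   search_movies title godfather
```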

Using Kibana

Because this bootstrap action configures Kibana to listen on port 80 on the master node, you can point your browser to the master node’s public DNS address (example: http://ec2-54-77-165-138.eu-west-1.compute.amazonaws.com) to open the Kibana console.

Kibana is already configured to point to the Elasticsearch installation on the cluster. Click the Sample Dashboard to see the indexed content.

Conclusion

This post has shown you how Amazon EMR lets you install and configure additional software (in this case Elasticsearch and Kibana) on top of Hadoop. Amazon EMR maintains an open source repository for bootstrap actions. Take a look and see if there is already a bootstrap action that interests you. Let us know in the comments section whether you’d like to see a new bootstrap action added and please feel free to submit your own bootstrap action.

If you’ve enjoyed this post, please also see “Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch.”
