Ben Snively is a Solutions Architect with AWS
With big data, you deal with many different formats and large volumes of data. SQL-style queries have been around for nearly four decades, and many systems support SQL-style syntax on top of their data layers; the Hadoop/Spark ecosystem is no exception. This allows companies to try new technologies quickly without learning a new query syntax for basic retrievals, joins, and aggregations.
Amazon EMR is a managed service for the Hadoop and Spark ecosystem that allows customers to quickly focus on the analytics they want to run, not the heavy lifting of cluster management.
In this post, we demonstrate how you can leverage big data platforms and still write queries using SQL-style syntax over data stored in different formats within a data lake. We first show how you can use Hue within EMR to perform SQL-style queries quickly on top of Apache Hive. Then we show you how to query the same dataset much faster using the Apache Zeppelin web interface on the Spark execution engine. Finally, we show you how to take the result of a Spark SQL query and store it in Amazon DynamoDB.
Hive and Spark SQL history
Through version 1.x, Apache Hive translated queries into native Hadoop MapReduce jobs, and the interpreter often had to chain multiple jobs together in phases to run the analytics. This allowed massive datasets to be queried, but it was slow because of the overhead of launching Hadoop MapReduce jobs.
Spark SQL adds the same SQL-style interface to Spark that Hive added on top of Hadoop MapReduce. Spark SQL is built on Spark Core, whose in-memory computation over RDDs allows it to run much faster than Hadoop MapReduce.
Spark integrates easily with many big data repositories. The following illustration shows some of these integrations.