AWS Big Data Blog

Combine NoSQL and Massively Parallel Analytics Using Apache HBase and Apache Hive on Amazon EMR

Ben Snively is a Solutions Architect with AWS

Jon Fritz, a Senior Product Manager for Amazon EMR, co-authored this post

With today’s launch of Amazon EMR release 4.6, you can now quickly and easily provision a cluster with Apache HBase 1.2. Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. It is an open-source, non-relational, versioned database which runs on top of the Hadoop Distributed File System (HDFS), and it is built for random, strictly consistent, real-time access to tables with billions of rows and millions of columns. It has tight integration with Apache Hadoop, Apache Hive, and Apache Pig, so you can easily combine massively parallel analytics with fast data access. Apache HBase’s data model, throughput, and fault tolerance are a good match for workloads in ad tech, web analytics, financial services, applications using time-series data, and many more.

Table structure in Apache HBase, like many NoSQL technologies, should be directly influenced by the queries and access patterns of the data. Query performance varies drastically based on the way the cluster has to process and return the data. In this post, we’ll demonstrate query performance differences by showing you how to launch an EMR cluster with HBase and restore a table from a snapshot in Amazon S3. The table in the snapshot contains approximately 3.5 million rows, and you’ll perform look-ups and scans using the HBase shell as well as perform SQL queries over the same dataset using Hive with the Hive query editor in the Hue UI.

Introduction to Apache HBase

HBase can be thought of as a persistent, multidimensional, sorted map, where each cell is indexed by a row key and a column key (family and qualifier). Each cell can contain multiple versions of a value, each identified by a long version number that defaults to the write timestamp. The map is considered multidimensional because several coordinates (row key, column family, column qualifier, and version) together locate a cell.
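As an illustration, here is a minimal HBase shell sketch (using a hypothetical 'demo' table that is not part of this walkthrough) that writes two versions of the same cell with explicit version numbers and reads both back:

hbase shell
>> create 'demo', {NAME => 'cf', VERSIONS => 3}
>> put 'demo', 'row1', 'cf:qual', 'value-1', 1000
>> put 'demo', 'row1', 'cf:qual', 'value-2', 2000
>> get 'demo', 'row1', {COLUMN => 'cf:qual', VERSIONS => 3}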

Table Structure

An HBase table is composed of one or more column families that are defined for the table. Each column family defines shared storage and a set of properties for an arbitrary grouping of columns. Column families must be defined in advance (either at table creation time or by modifying an existing table), but the columns themselves can be created and discovered dynamically while adding or updating rows.
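For example, you could add a new column family to an existing table and then write a column that was never declared up front. This is a sketch: the 'preferences' family and the rowkey shown are hypothetical, and depending on your cluster configuration you may need to disable the table before altering it.

hbase shell
>> alter 'customer', {NAME => 'preferences'}
>> put 'customer', 'smith_john_0000000001', 'preferences:language', 'en'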

HBase stores data in HDFS, which spreads the data in a table across the cluster. All the columns in a column family are stored together in a set of HFiles (this layout is also known as column-oriented storage). A row, identified by an immutable rowkey that uniquely defines it, usually spans multiple HFiles (at least one set per column family). Rowkeys are treated as byte arrays (byte[]) and are kept in sorted order in the multi-dimensional sorted map. The cell values in your table are byte arrays too.

Data from a table is served by RegionServers running on nodes in your cluster. Each RegionServer manages a contiguous range of rowkeys (a region), and HBase splits regions as tables grow to keep each region at a manageable size.
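One way to see how a table's rowkey space is divided into regions is to scan the hbase:meta catalog table from the shell (the HBase web UI on the master node shows the same information graphically):

hbase shell
>> scan 'hbase:meta', {COLUMNS => 'info:regioninfo'}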

Query Performance

Lower query latency and higher throughput are achieved when each query scans less data and queries are distributed across all the RegionServers. Querying for a specific row and column allows the cluster to quickly find the RegionServer and underlying HFiles that store the information and return it to the caller. When the exact rowkey is unknown, a partial rowkey prefix can still speed up queries by reducing the rows that need to be scanned. By default, HBase also utilizes row-level bloom filters to reduce the number of disk reads per Get request.

The diagram below shows the relationship between cardinality and query performance:

HBase shell query walkthrough

Using the console or the CLI, launch a new EMR 4.6 cluster that has the HBase, Hive, and Hue applications selected. Next, you restore a table snapshot on your cluster. We created this snapshot from an HBase table for this demo; snapshotting is a convenient way to back up and restore tables for production HBase clusters. Below is the schema of the data stored in the snapshot:

The rowkey is a composite value that combines lastname, firstname, and customerId. There are three column families that group columns containing address, credit card, and contact information.

After the cluster is running, recover the sample HBase snapshot from Amazon S3. HBase uses Hadoop MapReduce and EMRFS under the hood to transfer snapshots quickly from Amazon S3. SSH to the master node of your cluster, and use the HBase shell to create an empty table that will be populated by the restored data.

hbase shell
>> create 'customer', 'address', 'cc', 'contact'

Next, run an HBase command (outside of the shell) to copy the snapshot from Amazon S3 to HDFS on your cluster.

sudo su hbase -s /bin/sh -c 'hbase snapshot export -D hbase.rootdir=s3://us-east-1.elasticmapreduce.samples/hbase-demo-customer-data/snapshot -snapshot customer_snapshot1 -copy-to hdfs://<MASTERDNS>:8020/user/hbase -mappers 2 -chuser hbase -chmod 700'

Finally, use the HBase shell to disable the ‘customer’ table, restore the snapshot, and re-enable the table.

hbase shell
>> disable 'customer'
>> restore_snapshot 'customer_snapshot1'
>> enable 'customer'
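As a quick sanity check, you can scan a couple of rows to confirm that the restored data is visible:

>> scan 'customer', {LIMIT => 2}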

To demonstrate how specifying different portions of the key structure in the multi-dimensional map influences performance, you will perform a variety of queries using the HBase shell.

Get the credit card type for a specific customer

In this case, you specify every parameter in the multidimensional map and quickly return the cell from the RegionServer. The syntax for this ‘get’ request is ‘TABLE_NAME’, ‘ROWKEY’, ‘COLUMN_FAMILY:COLUMN’.

hbase(main):008:0> 
    get 'customer', 'armstrong_zula_8570365786', 'cc:type'

Get all the address data from the address column family for a specific customer

In this case, you return all of the data from the ‘address’ column family for a specific customer. As you can see, there is a time penalty when compared to just returning one specific cell.

hbase(main):011:0> 
    get 'customer', 'armstrong_zula_8570365786', {COLUMN => 'address'}

Get all the data on a specific customer

In this case, you return all of the data for a specific customer. There is a time cost for returning additional information.

hbase(main):004:0> get 'customer', 'armstrong_zula_8570365786'

Get all the cities for each row that has a customer with a last name starting with “armstrong”

In this case, you query with a partial prefix of the rowkey as well as a specific column family and qualifier. The HBase cluster can determine which regions and RegionServers need to be queried; each RegionServer then checks its HFiles to see whether they contain the requested rowkeys and column data. Note that the scan below sets ENDROW to 'armstronh', incrementing the last character of the prefix, so that the scan covers every rowkey beginning with 'armstrong' (the stop row is exclusive).

There are multiple performance improvements that you can enable on a table, such as bloom filters, which allow a RegionServer to quickly skip HFiles that cannot contain the requested row or cell.

hbase(main):014:0> 
scan 'customer', {STARTROW => 'armstrong_', ENDROW => 'armstronh', COLUMN => 'address:city'}
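For example, a ROWCOL bloom filter on the cc column family indexes row and column pairs so that Gets can skip HFiles that cannot match. This is a sketch rather than a required step; note that bloom filters are built only for HFiles written after the change, so a major compaction rewrites existing files with the filter:

hbase shell
>> alter 'customer', {NAME => 'cc', BLOOMFILTER => 'ROWCOL'}
>> major_compact 'customer'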

Get all customers that use a Visa credit card

In this example, HBase has to scan every RegionServer and search all the HFiles that contain the cc column family. The full list of customers that use Visa credit cards is returned.

hbase(main):014:0> 
scan 'customer', { COLUMNS => 'cc:type',  
                   FILTER => "ValueFilter( =, 'binaryprefix:visa' )" }

Running SQL analytics over data in HBase tables using Hive

HBase can organize your dataset for fast look-ups and scans, easily update rows, and bulk update your tables. Though it doesn’t have a native SQL interface for running massively parallel analytics workloads, HBase tightly integrates with Apache Hive, which uses Hadoop MapReduce as an execution engine, allowing you to quickly write SQL queries on your HBase tables or join data in HBase with other datasets.

For this post, you will write your Hive queries and browse tables in the Hive Metastore in the Hue UI, which runs on the master node of your cluster. To connect to Hue, see Launch the Hue Web Interface.

After you have signed into Hue in your browser, explore the UI by choosing Data Browsers, HBase Browser. Select the “customer” table to view your HBase table in the browser. Here you can see a sample of rows in your table and can query, add, and update rows through the UI. To learn more about the functionality available, see HBase Browser in the Hue documentation. Use the search box to scan HBase for 100 records with the rowkey prefix “smith” and return the credit card type and state:

smith* +100 [cc:type,address:state]

Next, go to the Hive editor under Query Editor and create an external Hive table over your data in HBase:

CREATE EXTERNAL TABLE customer
(rowkey STRING,
street STRING,
city STRING,
state STRING,
zip STRING,
cctype STRING,
ccnumber STRING,
ccexpire STRING,
phone STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,address:street,address:city,address:state,address:zip,cc:type,cc:number,cc:expire,contact:phone',
  'hbase.scan.cacheblocks' = 'false',
  'hbase.scan.cache' = '1000')
TBLPROPERTIES ('hbase.table.name' = 'customer');

Hive uses the HBase storage handler to interact with HBase tables, and you specify the column mapping and other settings in the SerDe properties. In this table, you disable caching the blocks requested by Hive in RegionServers to avoid evicting records from the cache, and set the number of batched records returned from HBase to a client to 1,000. Other settings for the storage handler can be found in the Class HBaseSerDe topic in the Hive documentation.
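As a quick, optional check that the column mapping works, you can sample a few rows through Hive:

SELECT rowkey, city, state, cctype
FROM customer
LIMIT 10;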

After creating the Hive table, you can browse the schema and sample the data in the Metastore Tables section under the Data Browsers category:

Now that you have created your table, run some SQL queries. Note that Hive creates one mapper per input split, and calculates one split per region in your HBase table (see the HBase storage handler page for more information). The table in this example has two regions, so the Hive job uses two mappers regardless of the job’s actual memory requirements.

Mappers use the memory settings present on the cluster, and the default mapper sizes for each instance type can be found in the EMR documentation. However, the way Hive calculates the number of mappers (1 per region) may cause too few mappers to be created for the actual data size, which in turn can cause out-of-memory issues for your job. To give each mapper more memory than the defaults, you can set these parameters in your Hive query editor (these stay alive for the duration of your Hive session in Hue):

SET mapreduce.map.memory.mb=2000;
SET mapreduce.map.java.opts=-Xmx1500m;

Execute these statements to increase the mapper memory size. Now, count the rows in the HBase table:

SELECT count(rowkey)
FROM customer;

As you can see from the logs, Hive creates a MapReduce job, and each mapper uses predicate pushdown with the RegionServer scan API to return data efficiently. Now, perform a more advanced query: finding the count of each credit card type for all customers from California.

SELECT cctype, count(rowkey) AS count
FROM customer
WHERE state = 'CA'
GROUP BY cctype;

Choose Chart in the results section to view the output in a graphical representation.
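If you want to confirm that the WHERE predicate is pushed down to HBase rather than filtered in Hive, you can prefix the query with EXPLAIN and look for the HBase filter in the scan operator (the exact plan output varies by Hive version):

EXPLAIN
SELECT cctype, count(rowkey) AS count
FROM customer
WHERE state = 'CA'
GROUP BY cctype;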

Querying, fusing, and aggregating data from S3 and HBase

You can create multiple tables in the Hive Metastore, and have each table backed by different external sources including Amazon S3, HDFS, and HBase. These sources can be used together in a join, enabling you to enrich data across sources and aggregate metrics across datasets.

In this example, you are going to use data stored in Amazon S3 to help enrich the data already stored in HBase. The data in HBase has the abbreviation for each state in each row. However, there is a CSV file in S3 that contains the state abbreviation (as a join key), full state name, governor name, and governor’s political affiliation.

First, you need to create an external table over Amazon S3. Use the Hive query editor in Hue to create the table:

CREATE EXTERNAL TABLE StateInfo(
  StateName STRING,
  Abbr STRING,
  Name STRING,
  Affiliation STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://us-east-1.elasticmapreduce.samples/hbase-demo-customer-data/state-data/';
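Optionally, sample the new table to verify that the CSV columns line up as expected:

SELECT * FROM stateinfo LIMIT 5;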

In the following query, you find the number of customers per credit card type and state name. To do this, query tables in both HBase and S3 as input sources, and join the tables on the state abbreviation:

SELECT cctype, statename, count(rowkey)
FROM customer c
JOIN stateinfo s ON (c.state = s.abbr)
GROUP BY cctype, statename;

Next, you can see the 10 governors with the most constituents who use Visa credit cards:

SELECT name, max(affiliation) AS affiliation, count(rowkey) AS visa_customers
FROM customer c
JOIN stateinfo s ON (c.state = s.abbr)
WHERE cctype = 'visa'
GROUP BY name
ORDER BY visa_customers DESC
LIMIT 10;

Conclusion

In this post, you explored the HBase table structure, restoring table snapshots from S3, and the relative performance of various GET requests. Next, you used Hive to create an external table over data in HBase, and ran SQL queries using the Hive HBase storage handler on your HBase data. Finally, you joined data from an external, relational table in Amazon S3 with data in non-relational format in HBase.

If you have any questions about using HBase 1.2 on Amazon EMR, have interesting use cases leveraging HBase with other applications in the Apache ecosystem, or have other suggestions, please leave a comment below.

———————————-

Related

Analyze Your Data on Amazon DynamoDB with Apache Spark