AWS Big Data Blog

Nasdaq’s Architecture using Amazon EMR and Amazon S3 for Ad Hoc Access to a Massive Data Set

This is a guest post by Nate Sammons, a Principal Architect for Nasdaq.

The Nasdaq group of companies operates financial exchanges around the world and processes large volumes of data every day. We run a wide variety of analytic and surveillance systems, all of which require access to essentially the same data sets.

The Nasdaq Group has been a user of Amazon Redshift since it was released, and we are extremely happy with it. We’ve discussed our usage of that system at re:Invent several times, most recently in FIN401 Seismic Shift: Nasdaq’s Migration to Amazon Redshift. Currently, our system moves an average of 5.5 billion rows into Amazon Redshift every day (14 billion on a peak day in October 2014).

In addition to our Amazon Redshift data warehouse, we have a large historical data footprint that we would like to access as a single, gigantic data set. Currently, this historical archive is spread across a large number of disparate systems, making it difficult to use. Our aim for a new, unified warehouse platform is twofold: to increase the accessibility of this historical data set to a growing number of internal groups at Nasdaq, and to gain cost efficiencies in the process. For this platform, Hadoop is a clear choice: it supports a number of different SQL and other interfaces for accessing data and has an active and growing ecosystem of tools and projects.

Why Amazon S3 and Amazon EMR?

A major architectural tenet of our new warehouse platform is the separation of storage and compute resources. In traditional Hadoop deployments, scaling storage capacity typically requires scaling compute capacity as well, and any change to the ratio of compute to storage involves modifying hardware. For our use case, a long-term historical archive, infrequently accessed data stored in HDFS would still require always-on, attached compute resources on each node in the cluster.

Additionally, the default replication factor for HDFS is 3, meaning that every block of data is present on three nodes in the cluster. This provides some level of durability, but it means you must buy three times the amount of disk you actually need whenever you expand your cluster’s storage capacity. It also introduces hotspots: if a given piece of data is present on only three nodes in your cluster but is accessed very frequently, you must either duplicate it further or accept that those compute nodes become hotspots.

We can avoid these problems by using Amazon S3 and Amazon EMR, allowing us to separate compute and storage for our data warehouse and scale each independently. There are no hotspots because every object in S3 is exactly as accessible as any other object. Amazon S3 has practically infinite scalability, 11 9’s of durability (99.999999999%), automatic cross-data center replication, straightforward cross-region replication, and low cost. With additional integration with IAM policies, S3 provides a storage solution with fine-grained access control across multiple AWS accounts. For these reasons, Netflix and others have proven that it’s a viable platform for data warehousing. Also, new services like AWS Lambda, which can execute arbitrary code in response to S3 events, will open up additional possibilities.

Launching Clusters with EMR

EMR makes it easy to deploy and manage Hadoop clusters. We can grow and shrink clusters as needed, and shut them down over weekends or holidays. Everything runs inside a VPC where we have tight control over network access. IAM role integration makes pervasive access control easy. Critically, EMRFS, an HDFS-compatible filesystem for accessing objects in S3, allows that access control to flow through to our data storage layer in S3.

We are able to run multiple data access layers on top of the same storage system by using multiple EMR clusters, and those access layers can be isolated and do not need to compete for compute resources. If a certain job needs an unusually large amount of RAM on each node, we can simply run a cluster of memory optimized nodes instead of a cluster with a more balanced CPU to memory ratio. We can run experimental clusters without the need to modify the “main” production cluster for a given internal customer. Cost allocation is easy as well, as we can tag EMR clusters to track costs or run them in entirely separate AWS accounts as needed.
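
For illustration, here is a minimal sketch of launching such a memory-optimized cluster with the AWS SDK for Java. The AMI version, instance types, instance count, and role names are placeholder assumptions (we show the EMR default roles), not our production settings.

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class LaunchMemoryOptimizedCluster {
  public static void main(String[] args) {
    AmazonElasticMapReduceClient emr =
        new AmazonElasticMapReduceClient(new DefaultAWSCredentialsProviderChain());

    // A small memory-optimized cluster for a RAM-heavy workload (sizes are placeholders)
    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("experimental-memory-optimized")
        .withAmiVersion("3.3.1")
        .withServiceRole("EMR_DefaultRole")
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withInstances(new JobFlowInstancesConfig()
            .withMasterInstanceType("r3.xlarge")
            .withSlaveInstanceType("r3.xlarge")
            .withInstanceCount(5)
            .withKeepJobFlowAliveWhenNoSteps(true));

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started cluster " + result.getJobFlowId());
  }
}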

Because it’s so easy to manage clusters, we’re able to experiment with new data access layers and run large-scale POCs without an up-front investment in new hardware or costly time spent configuring all the various moving parts involved in running Hadoop clusters. This is similar to our experience with Amazon Redshift: before Amazon Redshift, we would never have even thought about duplicating a 100+ TB production database for an experiment. It would have meant months of delay for hardware ordering, setup, data migration, and so on. Now we can do that over a weekend and “throw it away” when we’re done experimenting. It opens up new possibilities that weren’t really conceivable in the past.

With our new data warehouse, our aim is to build a platform that is useful and accessible to a wide range of users with different skill sets. The primary interface is SQL; we are evaluating Spark SQL, Presto, and Drill, though we are also looking at alternatives and non-SQL capabilities. The Hadoop community is moving very quickly, and because there is no such thing as a one-size-fits-all query tool, our aim is to support whichever Hadoop ecosystem analytics and data access applications our customers want to use. EMR and EMRFS’s connectivity to S3 make this possible.
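
As a concrete example, the sketch below runs a SQL query over Parquet data in S3 through Spark SQL, one of the engines we are evaluating. The bucket, table, and column names are hypothetical, and the exact API varies by Spark version.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class TradeCounts {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("TradeCounts"));
    SQLContext sqlContext = new SQLContext(sc);

    // EMRFS resolves the s3:// scheme, so Spark reads (and decrypts) the data transparently
    DataFrame trades = sqlContext.read().parquet("s3://mybucket/warehouse/trades/");
    trades.registerTempTable("trades");

    sqlContext.sql("SELECT symbol, COUNT(*) AS trade_count FROM trades GROUP BY symbol")
              .show();

    sc.stop();
  }
}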

Security Requirements and Amazon S3 Client-Side Encryption

The Nasdaq Group has a group-wide, highly engaged internal information security team with strict policies and standards regarding data and applications. Because of the sensitivity of this data, it must be encrypted both at rest and in transit. Additionally, the encryption keys must be rooted in a cluster of HSMs (hardware security modules) physically located in Nasdaq facilities. We use the same brand and model of HSM as the AWS CloudHSM service (SafeNet Luna SA), which also allows us to store Amazon Redshift cluster master keys in our own HSMs. For simplicity, we’ll call this the Nasdaq KMS, as its functionality is similar to that of the AWS Key Management Service (AWS KMS).

Recently, EMR launched a feature in EMRFS that allows S3 client-side encryption using customer keys, built on the S3 encryption client’s envelope encryption. EMRFS lets us write a thin adapter by implementing the EncryptionMaterialsProvider interface from the AWS SDK: whenever EMRFS reads or writes an object in S3, we get a callback to supply the encryption key. This is a clean, low-level hook that lets us run most applications without integrating any encryption logic ourselves.

Because EMRFS is an implementation of the HDFS interface (it’s called when you use the scheme “s3://” in EMR), none of the layers above need to be aware of the encryption. It’s also low enough in the stack that it acts as a catch-all, making it nearly impossible to forget to integrate encryption. It’s important to note that seek() functionality also works on encrypted files in S3, which is a critical performance feature for many Hadoop ecosystem file formats.
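
To make that concrete, here is a minimal sketch that opens an object through the standard Hadoop FileSystem API on an EMR cluster and seeks within it. The path and offset are placeholders; note that no encryption-specific code appears anywhere.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // On an EMR cluster, the s3:// scheme resolves to EMRFS
    FileSystem fs = FileSystem.get(URI.create("s3://mybucket/"), conf);
    FSDataInputStream in = fs.open(new Path("s3://mybucket/warehouse/trades/part-00000.parquet"));

    in.seek(1024 * 1024);        // jump into the middle of the encrypted object
    byte[] buffer = new byte[4096];
    int read = in.read(buffer);  // bytes come back decrypted
    System.out.println("Read " + read + " bytes at offset " + in.getPos());
    in.close();
  }
}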

Using a Custom Encryption Materials Provider

For our materials provider, we built an implementation that communicates with the Nasdaq KMS (see the “Code Sample” section below for more information). EMR pulls our implementation .jar file from S3, places it on each node of the cluster, and configures EMRFS to use our materials provider in the emrfs-site.xml configuration file on the cluster. We can do this easily from the AWS CLI by specifying the following arguments when creating a cluster:

--emrfs Encryption=ClientSide,ProviderType=Custom,CustomProviderLocation=s3://mybucket/myfolder/myprovider.jar,CustomProviderClass=providerclassname

An s3get and a configure-hadoop bootstrap action are automatically added to copy the provider jar file and configure emrfs-site.xml, respectively. The s3get bootstrap action uses the Instance Profile assigned to the cluster when making the GET request to S3, so you can limit access to your provider jar. And if your provider class implements Configurable from the Hadoop API, you will also be able to pull configuration values from the Hadoop configuration XML files at runtime.

The “x-amz-matdesc” user metadata field on an S3 object stores the “materials description” so that your provider knows which key to retrieve. This field contains a JSON-serialized Map<String,String>, which is passed to the provider implementation when requesting the key for an existing object. When a key is requested for writing a new object, that same map is supplied through the EncryptionMaterials object, which carries its own Map<String,String> description. We rotate encryption keys often, so we use this map to store a unique identifier for the encryption key used for the S3 object. The key identifier itself is not considered sensitive information, so it can safely be stored in object metadata in S3.
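
As a hypothetical illustration of that flow (the token format below is invented), a writer-side provider might stamp new materials like this, and the S3 encryption client would then record the description as JSON in the object’s “x-amz-matdesc” metadata:

import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import com.amazonaws.services.s3.model.EncryptionMaterials;

public class MaterialsDescriptionExample {
  public static EncryptionMaterials newMaterials() throws Exception {
    // Generate a fresh AES data key; in our system this key would be stored
    // in the Nasdaq KMS under the token below (a made-up identifier here)
    KeyGenerator generator = KeyGenerator.getInstance("AES");
    generator.init(256);
    SecretKey key = generator.generateKey();
    String token = "key-2015-04-01-0001";

    EncryptionMaterials materials = new EncryptionMaterials(key);
    materials.addDescription("token", token);
    // The object is written with metadata roughly like:
    //   x-amz-matdesc: {"token":"key-2015-04-01-0001"}
    // and the same map comes back to getEncryptionMaterials(Map) on read.
    return materials;
  }
}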

Data Ingest Workflow

Nasdaq already has a mature data ingest system that was developed to support our Amazon Redshift efforts. This system is built on a workflow engine that orchestrates approximately 30,000 operations each day and uses MySQL as a persistent store for state information.

Most of those operations execute multiple steps with the following general pattern:

  1. Retrieve data from another system via JDBC, SMB, FTP, SFTP, etc.
  2. Validate data for syntax and correctness to ensure that the schema hasn’t changed, that we aren’t missing data, etc.
  3. Convert the data into Parquet or ORC files.
  4. Upload files to S3 using S3 client-side encryption.

Our data ingest system itself has no knowledge of HDFS. We write data directly to S3 using a similar EncryptionMaterialsProvider implementation that doesn’t use the Hadoop Configuration interface hooks. This is important: there is nothing special about data read or written through EMRFS. It is all just S3, and as long as your materials provider can determine the correct encryption key to use, EMR can operate seamlessly on your encrypted objects in S3.
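
A minimal sketch of that ingest-side write path is below, assuming a hypothetical IngestMaterialsProvider (not shown) that looks up keys in the same KMS without any Hadoop Configuration hooks:

import java.io.File;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.s3.AmazonS3EncryptionClient;
import com.amazonaws.services.s3.model.EncryptionMaterialsProvider;

public class IngestUploader {
  public static void main(String[] args) {
    // Hypothetical provider: same KMS-backed key lookup, no Hadoop dependencies
    EncryptionMaterialsProvider provider = new IngestMaterialsProvider();

    // The encryption client envelope-encrypts the data before it leaves the host
    AmazonS3EncryptionClient s3 = new AmazonS3EncryptionClient(
        new DefaultAWSCredentialsProviderChain(), provider);

    // Upload a converted Parquet file; an EMR cluster using our EMRFS provider
    // can later read it back with the same key token
    s3.putObject("mybucket", "warehouse/trades/2015-04-01/part-00000.parquet",
        new File("/tmp/part-00000.parquet"));
  }
}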

Code Sample

Our encryption materials provider looks something like this:

import java.util.Map;
import javax.crypto.spec.SecretKeySpec;
import com.amazonaws.services.s3.model.EncryptionMaterials;
import com.amazonaws.services.s3.model.EncryptionMaterialsProvider;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

public class NasdaqKMSEncryptionMaterialsProvider
implements EncryptionMaterialsProvider, Configurable
{
  private static final String TOKEN_PROPERTY = "token";
  private Configuration config = null; // Hadoop configuration
  private NasdaqKMSClient kms = null;  // Client for our KMS

  @Override
  public void setConf(Configuration config) {
    this.config = config;
    // read any configuration values needed to initialize
    this.kms = new NasdaqKMSClient(config);
  }

  @Override
  public Configuration getConf() { return this.config; }

  @Override
  public void refresh() { /* nothing to do here */ }

  @Override
  public EncryptionMaterials getEncryptionMaterials(Map<String, String> desc) {
    // get the key token from the materials description
    String token = desc.get(TOKEN_PROPERTY);

    // retrieve the encryption key from the KMS
    byte[] key = kms.retrieveKey(token);

    // create a new encryption materials object with the key
    SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
    EncryptionMaterials materials = new EncryptionMaterials(secretKey);

    // stamp the materials with the key token (identifier)
    materials.addDescription(TOKEN_PROPERTY, token);
    return materials;
  }

  @Override
  public EncryptionMaterials getEncryptionMaterials() {
    // generates a new key, associates a token with it, and
    // returns new EncryptionMaterials with the token stored
    // in the materials description
    return kms.generateNewEncryptionMaterials();
  }
}

Summary

This post has provided a high-level view of our new data warehouse initiative. Because we can now use Amazon S3 client-side encryption with EMRFS, we can meet our security requirements for data at rest in Amazon S3 and enjoy the scalability and ecosystem of applications in Amazon EMR.

If you have questions or suggestions, please leave a comment below.

Nate Sammons is not an Amazon employee and does not represent Amazon.

————————————

Related:

Strategies for Reducing your EMR Costs
