AWS Big Data Blog

Month in Review (January 2016)

January brought plenty for big data enthusiasts on the AWS Big Data Blog. Take a look!

Running an External Zeppelin Instance using S3 Backed Notebooks with Spark on Amazon EMR

Learn how to set up Zeppelin running “off-cluster” on a separate EC2 instance. You’ll be able to submit Spark jobs to an EMR cluster directly from your Zeppelin instance.

Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming

Work with key tools available in the Apache Spark application ecosystem for streaming analytics. This post covers how Spark Streaming, Spark SQL, and HiveServer2 can work together to deliver a data stream as a temporary table that you can query with SQL.

Agile Analytics with Amazon Redshift

What makes outstanding business intelligence (BI)? It needs to be accurate and up to date, but these qualities alone won’t differentiate a solution. Perhaps a better measure is the reaction you get when your latest report or metric is released to the business. Good BI excites. This post shows how your Amazon Redshift data warehouse can be agile.

Turning Amazon EMR into a Massive Amazon S3 Processing Engine with Campanile

Customers have used Campanile to migrate petabytes of data from one account to another, run periodic sync jobs and large Amazon Glacier restores, enable server-side encryption (SSE), create indexes, and sync data before enabling cross-region replication (CRR).

From the Archive (June 11, 2015):

Building a Binary Classification Model with Amazon Machine Learning and Amazon Redshift

Use Amazon Machine Learning and Amazon Redshift to predict the likelihood that a specific user will click on a specific ad.