AWS Big Data Blog

Dispatches from re:Invent – Day 4

Matt Yanchyshyn is a Principal Solutions Architect at AWS

I now have a collection of napkins from customer dinners with various AWS technology solutions sketched on them.  This particular napkin is an Amazon DynamoDB schema design for a customer interested in using the new JSON document support to import a bunch of JSON files into their DynamoDB tables – I had a little trouble reading it the next day :)

Thursday at re:Invent was nothing short of thrilling.  As someone who writes a lot of JavaScript and event-driven programs in Node.js, the AWS Lambda announcement (https://aws.amazon.com/lambda/) is the best thing to happen to me in a while.  That plus Amazon S3 Notification Events (http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html) is going to open up a whole new world of application design for the customers that I work with.
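To give a feel for the event-driven style this enables, here is a minimal sketch of a function that reacts to an S3 notification event.  It is written in Python purely for illustration (not the Node.js I would normally reach for), the function name `handler` is my own choice, and the bucket and key in the usage note are made up.  It only reads the `Records` list that S3 places in the event and URL-decodes each object key; what you then do with each object is up to your application.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Collect (bucket, key) pairs from an S3 notification event.

    `context` is unused here; it is kept so the signature matches the
    usual Lambda handler shape.
    """
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in the event, with spaces as '+'.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects
```

Calling it locally with a hand-built event such as `{"Records": [{"s3": {"bucket": {"name": "example-bucket"}, "object": {"key": "incoming/file+1.json"}}}]}` is an easy way to test the parsing logic before wiring it to a real bucket.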

The big splash of the day was the new Amazon ECS (https://aws.amazon.com/ecs/), a high-performance container management service that supports Docker containers.  Twitter was on fire with hilarious tweets about “All the Dockerz” and you can be sure that we’ll be exploring novel ways to use this service for big data workloads on this blog in the coming weeks.

Tonight, before heading off to check out the Skrillex show at the AWS re:Invent party, I attended an executive briefing session with one of my favorite customers.  They’re using Amazon Redshift at large scale to provide analytics-as-a-service.  They take a “fleet-based approach” to Amazon Redshift cluster provisioning, deploying a mix of dedicated customer clusters and multi-tenant environments, depending on customer tier and the size of the workload.  Queries are often load-balanced across multiple clusters, and the system is elastic, so capacity can be added and removed on demand.  I love this approach to handling a huge number of concurrent queries in a multi-tenant analytics environment: it leverages both the elastic nature of Amazon Redshift and the ability to deploy multiple clusters as needed and balance across them, which provides scale and keeps costs low by matching capacity to real demand.
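The routing idea behind a fleet like this can be sketched in a few lines.  This is my own simplified illustration, not the customer’s implementation: the `Cluster` and `Fleet` names, the “least active queries” rule, and the tenant names are all assumptions, and no real Amazon Redshift API calls are made.  Tenants with a dedicated cluster always go there; everyone else is sent to whichever shared cluster is currently least busy, and shared clusters can be added or removed as demand changes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    name: str
    active_queries: int = 0


@dataclass
class Fleet:
    shared: List[Cluster]                      # multi-tenant clusters
    dedicated: Dict[str, Cluster] = field(default_factory=dict)  # tenant -> cluster

    def route(self, tenant: str) -> Cluster:
        """Pick a cluster for one query and count it as in flight."""
        cluster = self.dedicated.get(tenant)
        if cluster is None:
            if not self.shared:
                raise RuntimeError("no shared clusters available")
            # Hypothetical balancing rule: fewest queries in flight wins.
            cluster = min(self.shared, key=lambda c: c.active_queries)
        cluster.active_queries += 1
        return cluster

    def release(self, cluster: Cluster) -> None:
        """Mark one query on this cluster as finished."""
        cluster.active_queries = max(0, cluster.active_queries - 1)

    def add_shared(self, name: str) -> Cluster:
        """Scale out: add an empty shared cluster to the pool."""
        cluster = Cluster(name)
        self.shared.append(cluster)
        return cluster

    def remove_shared(self, name: str) -> None:
        """Scale in: drop a shared cluster from the pool."""
        self.shared = [c for c in self.shared if c.name != name]
```

In a real system the balancing signal would more likely come from cluster metrics than from a local counter, and removing a cluster would need to wait for in-flight queries to drain; the sketch leaves both out to keep the routing decision visible.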

At Netflix’s popular breakout session today, they discussed how Amazon S3 is a “source of truth” for their analytics data.  I’ve heard this a lot at the conference, including at our big data bootcamp.  Amazon S3 is quickly emerging as a popular place to store all of your data since it can be accessed, directly or indirectly, by a growing number of tools in the analytics toolkit, from Amazon EMR with Hive, Pig, Spark, Presto and others, to Amazon Redshift and more.

So many new products and features, so many interesting customer meetings and breakout sessions.  I can’t believe that it’s all wrapping up tomorrow – it went by so fast.  Be sure to stay on top of this blog for cool ways to use all the new stuff.