Sharpen your Skill Set with Apache Spark on the AWS Big Data Blog

The AWS Big Data Blog has a large community of authors who are passionate about Apache Spark and who regularly publish content that helps customers use Spark to build real-world solutions. You’ll find posts on a variety of topics, including deep dives into Spark’s internals, building Spark Streaming applications, creating machine learning pipelines with MLlib, and applying Spark to real-world use cases. You can learn hands-on by trying code samples from the blog directly against data in Amazon S3, and you can run Spark on Amazon EMR for fast experimentation and quick production deployments.
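To give a flavor of the kind of pipeline those posts walk through, here is a minimal sketch using the DataFrame-based spark.ml Pipeline API available in the Spark 1.6 line. The tiny in-memory training set, column names, and parameter values are purely illustrative assumptions, not taken from any specific blog post.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SparkMLPipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MLPipelineSketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Tiny in-memory training set (id, text, label) for illustration only.
    val training = Seq(
      (0L, "spark on emr is fast", 1.0),
      (1L, "slow legacy batch job", 0.0),
      (2L, "spark streaming and mllib", 1.0),
      (3L, "manual cluster maintenance", 0.0)
    ).toDF("id", "text", "label")

    // A three-stage pipeline: tokenize text -> hash into features -> logistic regression.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fit the whole pipeline as a single model, then score the same data.
    val model = pipeline.fit(training)
    model.transform(training).select("text", "prediction").show()

    sc.stop()
  }
}
```

You would package a program like this with sbt or Maven and submit it to your EMR cluster with spark-submit, swapping the toy data for a real dataset.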

The latest releases of Spark are supported on EMR within a few weeks of Apache general availability (Spark 1.6.1 was included in EMR 4.5 last week). Spark on EMR is configured by default to use dynamic allocation of executors so it makes efficient use of available resources, it can use EMRFS to query data directly in Amazon S3, and it works with interactive notebooks if you also install Apache Zeppelin on your cluster.
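As a quick illustration of reading S3 data from Spark on EMR, here is a minimal word-count sketch in Scala. On EMR, s3:// URIs are handled by EMRFS, and dynamic allocation is already enabled by default; the bucket names and paths below are placeholders, so substitute locations you can access.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object S3WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("S3WordCount")
      // EMR enables dynamic allocation by default; setting it explicitly here
      // only matters if you run this sketch outside EMR.
      .set("spark.dynamicAllocation.enabled", "true")
    val sc = new SparkContext(conf)

    // Placeholder input path; replace with your own bucket and prefix.
    val lines = sc.textFile("s3://your-bucket/logs/*.txt")

    // Classic word count: split lines into words, then sum per word.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    // Write results back to S3 (again, a placeholder location).
    counts.saveAsTextFile("s3://your-bucket/output/word-counts")

    sc.stop()
  }
}
```

The same code can be pasted into the Spark shell or a Zeppelin notebook on your cluster (minus the object wrapper) for interactive exploration.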

Below are recent posts that focus on Spark:

We hope these posts help you learn more about the Spark ecosystem and show ways to use these technologies on AWS to derive value from your data. With new posts coming out every week, stay tuned for more Spark use cases and examples!

Please let us know in the comments below if you’d like us to cover specific Spark-related topics. If you have questions about Spark on EMR, please email us at emr-help@amazon.com and we’ll get back to you right away.