Songzhi Liu is a Professional Services Consultant with AWS
The data lake concept has become increasingly popular among enterprise customers because it collects data from different sources and stores it where it can be easily combined, governed, and accessed.
On the AWS cloud, Amazon S3 is a good candidate for a data lake implementation because it provides durable, large-scale data storage. Amazon EMR provides transparent scalability and seamless compatibility with many big data applications on Hadoop. However, no matter what kind of storage or processing is used, the data must be defined.
In this post, I introduce a simple data ingestion and preparation framework based on AWS Lambda, Amazon DynamoDB, and Apache Hive on EMR for data from different sources landing in S3. Because Hive cannot detect new partitions on its own as data lands, this solution registers them automatically as data is loaded into S3.
Hive is a great choice because it provides a general data interface, thanks to its well-designed Metastore and related projects like HCatalog. Many other Hadoop applications, such as Pig, Spark, and Presto, can leverage the schemas defined in Hive.
Moreover, external tables make Hive a great data definition language to define the data coming from different sources on S3, such as streaming data from Amazon Kinesis, log files from Amazon CloudWatch and AWS CloudTrail, or data ingested using other Hadoop applications like Sqoop or Flume.
To organize data efficiently in Hive, leverage external tables and partitioning. By properly partitioning the data, you can greatly reduce the amount of data that needs to be retrieved and improve efficiency during ETL or other types of analysis.
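As a hypothetical illustration (the table name, columns, and bucket here are invented for the example), an external table over S3 data partitioned by date might look like the following. Hive queries only see the partitions that have been explicitly registered in the Metastore:

```sql
-- Hypothetical example: an external table over S3 data, partitioned by date.
CREATE EXTERNAL TABLE clickstream (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://example-bucket/clickstream/';

-- New data landing under .../year=2017/month=05/day=01/ stays invisible
-- to queries until the partition is registered:
ALTER TABLE clickstream ADD IF NOT EXISTS
  PARTITION (year = '2017', month = '05', day = '01');
```

Running that `ALTER TABLE ... ADD PARTITION` statement for every new batch is exactly the manual step the framework in this post automates.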
Solving the problem with AWS services
For many of the aforementioned services and applications, data is loaded periodically, for example in one batch every 15 minutes. Because Hive external tables don't pick up new partitions automatically, you must add them manually, which is difficult to manage at scale. A framework based on Lambda, DynamoDB, and S3 can assist with this challenge.
As data is ingested from different sources to S3, new partitions are added by this framework and become available in the predefined Hive external tables.
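The core of such a framework can be sketched in a few lines of Python. The snippet below is a simplified illustration, not the post's actual code: it assumes a hypothetical S3 key layout with Hive-style `key=value` path segments, parses an incoming object key into a partition spec, and builds the idempotent `ALTER TABLE` statement a Lambda function could submit to Hive.

```python
# Sketch of the partition-registration logic a Lambda function might run
# when S3 notifies it of a new object. The key layout is a hypothetical
# example: clickstream/year=2017/month=05/day=01/part-0000.gz

def partition_spec_from_key(key):
    """Extract Hive-style key=value path segments from an S3 object key."""
    parts = [seg.split("=", 1) for seg in key.split("/") if "=" in seg]
    return {name: value for name, value in parts}

def add_partition_statement(table, spec):
    """Build an idempotent ALTER TABLE statement for the new partition."""
    cols = ", ".join("{}='{}'".format(k, v) for k, v in spec.items())
    return "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({})".format(table, cols)

spec = partition_spec_from_key("clickstream/year=2017/month=05/day=01/part-0000.gz")
print(add_partition_statement("clickstream", spec))
# ALTER TABLE clickstream ADD IF NOT EXISTS PARTITION (year='2017', month='05', day='01')
```

In the real framework, the table-to-S3-prefix mapping would live in DynamoDB rather than being hard-coded, so one Lambda function can serve many tables.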