AWS Big Data Blog

Moving Big Data Into The Cloud with ExpeDat Gateway for Amazon S3

Matt Yanchyshyn is a Principal Solutions Architect with Amazon Web Services

Introduction

A previous blog post (Moving Big Data Into the Cloud with Tsunami UDP) discussed how Tsunami UDP is a fast and easy way to move large amounts of data to and from AWS. Specifically, we showed how you can use it to move data quickly into Amazon EC2 from another instance in a distant AWS Region. From there, using the multipart upload functionality built into the AWS CLI, we moved the data into Amazon Simple Storage Service (Amazon S3).

Tsunami UDP has no software license fees and is easy to set up, but it has drawbacks. It doesn’t support native encryption—important to know if you’re working with sensitive data. It’s also a single-threaded application that hasn’t been updated since 2009 and commercial support is not available. In addition, due to the lack of an SDK or plugins, Tsunami UDP can be hard to automate for tasks like creating watch folders with complex rules. And Tsunami UDP doesn’t support native Amazon S3 integration, so transfers must first be terminated on an Amazon Elastic Compute Cloud (Amazon EC2) instance and then re-transmitted to Amazon S3 manually using tools like the AWS CLI.

ExpeDat, by Data Expedition Inc., addresses these shortfalls. It also provides features that make moving large amounts of data into Amazon S3 a seamless experience, whether the data comes from on-premises systems or from Amazon EC2 instances in other regions. Unlike Tsunami UDP, ExpeDat is an actively maintained and fully supported product that employs AES encryption and has lightweight, cross-platform clients with GUIs. ExpeDat also has Object Handlers that let you integrate with any external script or program, making automation easy to set up. If lower-level integration is required, SDKs are available. The ExpeDat S3 Gateway product can also automatically stream data into Amazon S3: data never touches Amazon Elastic Block Store (Amazon EBS) or ephemeral storage, but instead lives only in memory as it’s transmitted via the ExpeDat gateway on Amazon EC2 to the bucket of your choice in Amazon S3.

One of the easiest ways to get started with ExpeDat is to install the ExpeDat Gateway for Amazon S3 via the AWS Marketplace. This product can transmit ~300GB per hour and is available as a monthly subscription. The ExpeDat S3 Gateway runs on an Amazon EC2 instance, which can be set up in a couple of minutes.

Getting Started

For this example, we’ll use the same dataset that we used in the earlier blog post: the Wikipedia Traffic Statistics V2 from AWS Public Data Sets. We’ll move this 650GB compressed dataset over the Internet from an Amazon EC2 instance in the AWS Tokyo Region (ap-northeast-1) to an ExpeDat S3 Gateway “free trial” Amazon EC2 instance launched from the AWS Marketplace into the AWS N. Virginia Region (us-east-1). For convenience, we’ve placed a copy of the data in an Amazon S3 bucket in ap-northeast-1: s3://accel-file-tx-demo-tokyo.
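For planning purposes, here is a rough back-of-the-envelope sketch of how long a transfer of this size might take. It assumes the ~300GB per hour figure quoted above holds for your link, which it may not; the function name `estimate_minutes` is ours and is not part of ExpeDat.

```shell
# Ceiling of the transfer time in minutes for SIZE_GB at RATE_GB_PER_HOUR.
# Assumption: throughput stays constant for the whole transfer.
estimate_minutes() {
  size_gb=$1
  rate=$2
  echo $(( (size_gb * 60 + rate - 1) / rate ))
}

estimate_minutes 650 300   # prints 130
```

In other words, at the quoted rate the 650GB dataset would need a little over two hours. Real transfers depend on instance type, network path, and whether encryption is enabled.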

Launch the ExpeDat Gateway for Amazon S3 (server)

  1. Go to the AWS Marketplace page for the ExpeDat Gateway for Amazon S3 trial.

Note: You won’t be charged software fees for 21 days, but the usual AWS infrastructure fees apply.

  1. Launch an instance of the ExpeDat Gateway for Amazon S3 into the US East (Virginia) Region.

Note: We will use an ExpeDat command line client on a separate Amazon EC2 instance for our tests, but feel free to download others such as the graphical clients for Mac or Windows for your own tests.

You should now have a working ExpeDat Gateway for Amazon S3 instance running the servedat server application and pointing to an Amazon S3 bucket.

Set up the ExpeDat Client

  1. Launch an Amazon Linux instance in ap-northeast-1 (Tokyo). For testing purposes, this instance should be the same type as the one you launched from the AWS Marketplace in us-east-1. For convenience, we’ve prepared an AWS CloudFormation template that launches an Amazon EC2 instance running the 64-bit 2014.03.01 Amazon Linux PV AMI on instance types with two or more large ephemeral drives. The bootstrap script in the template creates a RAID0 array of two of the ephemeral volumes and mounts it to /mnt/bigephemeral.
  1. Download the ExpeDat client from the web interface of the ExpeDat Gateway for Amazon S3 instance that you just launched.

If you’d like to try ExpeDat without installing the S3 Gateway from the AWS Marketplace, you can download a trial version of the Linux x86-64 ExpeDat client.

  1. Copy the ExpeDat client that you downloaded to the instance created by the AWS CloudFormation template. You can use the scp utility for this because it uses the same port as SSH, TCP 22, which should already be open in the instance’s Security Group.
  1. SSH onto the instance created by the AWS CloudFormation template.
  1. Tune the operating system by increasing the UDP buffers, which can result in overall faster throughput:
	sudo sysctl -w net.core.wmem_max=4194304
	sudo sysctl -w net.core.rmem_max=4194304

See Data Expedition’s UDP Tuning guidance for notes about this optimization.
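Before changing kernel settings, it can help to see what the instance is currently using. The following is a minimal sketch, assuming a Linux system that exposes these limits under /proc/sys/net/core; the helper name `needs_raise` is hypothetical. It only prints the command you would run and changes nothing itself.

```shell
# Target size in bytes, taken from the sysctl commands in this post.
TARGET=4194304

# Succeeds (exit 0) when the current value is below the target.
needs_raise() {
  [ "$1" -lt "$2" ]
}

# Print the sysctl command for each buffer limit that is still too small.
for key in wmem_max rmem_max; do
  file="/proc/sys/net/core/$key"
  [ -r "$file" ] || continue   # skip on systems without this /proc entry
  current=$(cat "$file")
  if needs_raise "$current" "$TARGET"; then
    echo "sudo sysctl -w net.core.$key=$TARGET"
  else
    echo "net.core.$key is already $current"
  fi
done
```

Values set with sysctl -w do not survive a reboot, so re-check them if the instance is restarted between tests.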

  1. Install the ExpeDat “movedat” file transfer client. To do this, uncompress the ExpeDat archive that you downloaded from your ExpeDat Gateway for Amazon S3 instance and run the install-movedat.sh script.

  1. Use the “fallocate” command to create a test file, replacing 650 with the size in gigabytes that you prefer for testing, and then send the file to the gateway with movedat:
	fallocate -l 650G bigfile.img
	movedat bigfile.img [user]@[ExpeDat S3 Gateway IP]:=S3
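A 650G test file needs a lot of room, so you may want to confirm that the current directory has enough free space before running fallocate. This is a small sketch using the POSIX `df -P` output format; the helper name `has_space` is ours, and it treats the size as GiB, which matches how fallocate interprets the `G` suffix.

```shell
# Succeeds when DIR has at least SIZE_GB GiB available.
has_space() {
  dir=$1
  size_gb=$2
  # df -P prints one header line and one data line; column 4 is available KiB.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  need_kb=$((size_gb * 1024 * 1024))
  [ "$avail_kb" -ge "$need_kb" ]
}

if has_space . 650; then
  echo "enough room for a 650G test file"
else
  echo "not enough free space here for 650G; pick a smaller size"
fi
```

On the instance created by the AWS CloudFormation template, you would typically run this from /mnt/bigephemeral.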

Alternative option: Create a tarball from the dataset files, such as those discussed in the previous post, and pipe the output to the ExpeDat movedat transfer application:

	tar -cf - /mnt/bigephemeral | 
	movedat [user]@[ExpeDat S3 Gateway IP]:=S3

[Image: Big data upload - sample code]

  1. The file(s) should start showing up in your Amazon S3 bucket very shortly after the transfer completes:

[Image: Big data upload into S3 bucket]

Easy! No more watch folders or manual Amazon S3 uploads. Files sent to the ExpeDat Gateway for Amazon S3 from an ExpeDat client (movedat) move straight into your Amazon S3 bucket.

To use AES-128 encryption with ExpeDat, add a -K argument after the movedat command. This addresses Tsunami UDP’s lack of encryption, one of its major shortcomings. Enabling encrypted transfers with ExpeDat increases the CPU load of both the server and the client computers and may cause a reduction in performance on very fast networks or very busy CPUs:

	movedat -K * [user]@[ExpeDat S3 Gateway IP]:=S3

[Image: Big data upload - encryption]
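If you switch between encrypted and unencrypted transfers while benchmarking, a tiny wrapper can keep the two variants consistent. The sketch below is hypothetical: `upload_cmd` and the `ENCRYPT` variable are our own names, and it only prints the movedat command (using the `-K` flag and `:=S3` target shown in this post) rather than running it.

```shell
# Print the movedat upload command for FILE, adding -K when ENCRYPT=1.
# Arguments: file, user, gateway host or IP (all supplied by you).
upload_cmd() {
  file=$1
  user=$2
  host=$3
  if [ "${ENCRYPT:-0}" = "1" ]; then
    printf 'movedat -K %s %s@%s:=S3\n' "$file" "$user" "$host"
  else
    printf 'movedat %s %s@%s:=S3\n' "$file" "$user" "$host"
  fi
}

# 203.0.113.10 is a placeholder documentation address, not a real gateway.
ENCRYPT=1 upload_cmd bigfile.img test 203.0.113.10
```

Printing the command first makes it easy to review before you run it. Note that this simple version does not quote file names, so it is not suitable for names that contain spaces.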

Another great use for ExpeDat Gateway for Amazon S3 is to list, rename, delete and download files in Amazon S3 buckets. Manipulating objects in Amazon S3 via ExpeDat makes automation a lot easier since complex file workflows can be scripted without having to use multiple tools. For example, to list a bucket:

	movedat -o [user]@[server]:=S3

[Image: Big data upload - managing files in S3 bucket]

List a subdirectory in S3:

	movedat -o [user]@[server]:subdir/=S3

Rename:

	movedat -o -m [user]@[server]:pagecounts-2008-1001-000000.gz=S3 newfile.gz

Download the renamed file and give it a new local name:

	movedat -o [user]@[server]:newfile.gz=S3 localfile.gz
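The rename and download steps above can be chained in a script. Here is a minimal sketch that builds the `[user]@[server]:[path]=S3` target string once and prints the two commands in order. The function names are hypothetical, and the movedat options are exactly the ones used in the examples above; nothing is executed against a gateway.

```shell
# Build the "[user]@[server]:[path]=S3" target used in the examples above.
s3_target() {
  printf '%s@%s:%s=S3' "$1" "$2" "$3"
}

# Print (not run) a rename followed by a download of the renamed object.
# Arguments: user, server, old object name, new object name, local file name.
rename_then_fetch() {
  user=$1
  host=$2
  old=$3
  new=$4
  local_name=$5
  echo "movedat -o -m $(s3_target "$user" "$host" "$old") $new"
  echo "movedat -o $(s3_target "$user" "$host" "$new") $local_name"
}

rename_then_fetch test gateway.example.com pagecounts-2008-1001-000000.gz newfile.gz localfile.gz
```

Here gateway.example.com and the user name test are placeholders. Once the printed commands look right, you could pipe the output to a shell, or replace the echo calls with direct movedat invocations and check each exit status before moving on.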

Conclusion

The use cases for the big data world continue to evolve, and in some industries batch processing is giving way to real-time or nearly real-time analytics. Tools like ExpeDat S3 Gateway make it easier to meet your business needs when you must quickly move a ton of data into AWS. Its direct Amazon S3 integration makes this a fast, seamless experience. This is especially true if you need to move large files from on-premises or other AWS Regions into Amazon S3 for analysis with Amazon EMR or Amazon Redshift.

ExpeDat Gateway for Amazon S3 improves on free tools such as Tsunami UDP by providing AES 128-bit encryption and a rich selection of graphical and command line clients. It’s easy to automate complex file workflows thanks to the many options provided by its command line tools. This includes the Amazon S3 object manipulation as demonstrated in this blog post and more advanced techniques like using Object Handlers to trigger actions on the ExpeDat Gateway for Amazon S3 server or using the ExpeDat Client SDK in your own programs. Perhaps most importantly, ExpeDat and the suite of associated products offered by Data Expedition, Inc. are actively maintained and fully supported. It’s easy to install them via the AWS Marketplace and the overall cost per GB is very low – there’s even a free trial, so give it a try today!

If you have questions or suggestions, please leave a comment below.
