Agile Analytics with Amazon Redshift

Nick Corbett is a Big Data Consultant for AWS Professional Services

What makes outstanding business intelligence (BI)? It needs to be accurate and up-to-date, but this alone won’t differentiate a solution. Perhaps a better measure is the reaction you get when your latest report or metric is released to the business. Good BI excites: it prompts new ways of thinking and new ideas that, more often than not, require change to support them. Businesses are constantly looking to evolve, to use their BI to gain insight and competitive advantage. Truly outstanding BI is agile enough to keep pace with this demand.

In this post, I show you how your Amazon Redshift data warehouse can be agile. To do this, you need to adopt a continuous delivery (CD) approach that draws on many of the tools and techniques already successfully used by software engineers. CD focuses on automating the release process, allowing quick and frequent deployments to production whilst ensuring quality is maintained. In return, you will enjoy many benefits; for example, you can identify issues quickly, and those that you do find are likely to be less complex.  By shortening your release cycle, you can respond to requests from your business more quickly.

A simple CD process, or pipeline, is shown below.  This uses a combination of AWS fully-managed services and the extensible, open-source, continuous integration server Jenkins. You can follow the instructions at the AWS Big Data Blog repository on GitHub to build your own sample environment and learn how to configure these components. The repository includes a CloudFormation script to set up and configure a Jenkins server in a virtual private cloud (VPC) and instructions for setting up AWS CodeCommit and AWS CodePipeline.  Note that starting this environment incurs a charge in your account, although all the AWS services used are eligible for the AWS Free Tier.

Build a deployment package

One of the key assumptions of CD is that no one directly interacts with the production system; all deployments and updates are fully automated. The logical conclusion of this assumption is that everything required to build, maintain, update and test your data warehouse should be scripted and under source control. The sample environment uses AWS CodeCommit as a source control repository. AWS CodeCommit is a fully-managed source control service that makes it easy for companies to host secure and highly scalable private Git repositories.
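If you are starting a project from scratch, for example, the repository itself can be created and populated entirely from the command line. The sketch below is illustrative: the repository name and region are assumptions, and the sample environment described in this post creates its own repository for you. The folder layout matches the paths referenced later in this post.

# Create a Git repository in AWS CodeCommit (repository name is illustrative)
aws codecommit create-repository --repository-name my-redshift-warehouse

# Let Git authenticate against AWS CodeCommit using the AWS CLI credential helper
git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true

# Clone the empty repository (the region in the URL is an assumption) and lay out the project
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/my-redshift-warehouse
cd my-redshift-warehouse && mkdir -p cloudformation updates tests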

The first stage of your CD pipeline is to build a deployment package. This is a snapshot of a branch of your source code repository and is immutable. After it’s built, the same package is used for all your testing and finally deployed to your production system.

Each time you build a new package you should aim to answer one question: can I deploy this to my production system? You can answer ‘yes’ when you are satisfied that the new features in the deployment work as expected and that they don’t break anything that’s already part of your production system. The faster you can answer this question, the better. Testing each package costs you, both in infrastructure and, if manual testing is needed, in people’s time. The more testing you do on a package, the more expensive it becomes. You should look to build a process that fails fast by performing simple, automated tests first, and only spend time manually testing builds that you know are of good quality.

The environment uses a Jenkins job to build the deployment package. Jenkins is configured to scan the repository in AWS CodeCommit periodically and trigger a job when changes are detected. The sample pipeline scans the master branch, but you can configure Jenkins to scan any branch, perhaps to test new features as they are being developed. The sample job simply zips all the files in the repository before uploading them to an Amazon S3 bucket that has versioning enabled. With a fail-fast methodology, you could use this opportunity to verify that all files in the package conform to your coding standards; for example, do they have the correct header and have naming standards been followed? At this point, you don’t have an Amazon Redshift database to test things in, but you can run quick and easy checks on the files themselves, such as using grep commands to inspect their contents.
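A minimal sketch of such a build step is shown below. It assumes the job runs in the root of the checked-out repository, that your coding standard requires every .sql file to start with a comment header, and that Jenkins supplies the bucket name in the DEPLOYMENT_BUCKET variable; all three are assumptions for illustration.

# Fail fast: every .sql file must start with a SQL comment header (assumed standard)
for f in $(find . -name '*.sql'); do
  if ! head -n 1 "$f" | grep -q '^--'; then
    echo "Missing header comment: $f"
    exit 1
  fi
done

# Package the repository and upload it to the versioned S3 bucket
zip -r package.zip . -x '*.git*'
aws s3 cp package.zip "s3://$DEPLOYMENT_BUCKET/package.zip"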

Test the deployment package

The remaining stages in the pipeline are managed by AWS CodePipeline, a continuous delivery service for fast and reliable updates. AWS CodePipeline builds, tests, and deploys your code every time there is a change, based on the release process models that you define. This enables you to deliver features and updates rapidly and reliably.

The CD pipeline shown above uses AWS CodePipeline to run tests in two environments before deploying the package to the production system. The tests have been split into a set of functional tests, run against a small set of data, and non-functional tests run against a production-sized Amazon Redshift cluster. You may choose to organize your CD pipeline differently, but your process will probably involve running one or more test stages followed by a deployment to production. Each stage of the pipeline is implemented by a Jenkins job that carries out four main tasks.

1. Build a test environment

To run tests, you need an Amazon Redshift database. The sample job uses the AWS CloudFormation Plugin to create a VPC with a public-facing, single-node Amazon Redshift database. The configuration of the plugin is shown below:

As you can see from the screenshot, the job is configured to build the CloudFormation template found in ./cloudformation/redshiftvpc.json within your deployment package. The definition of your test environment is held in the same source control as the other project assets.

The sample job creates an empty Amazon Redshift data warehouse and creates tables using SQL scripts. However, you can use the SnapshotIdentifier property of the Amazon Redshift cluster object in the CloudFormation template to create a data warehouse from a point-in-time backup of a cluster. You may have a pre-prepared snapshot that has the same structure as your production system but contains test data. Alternatively, you might first use the AWS Command Line Interface (CLI) to call describe-cluster-snapshots and find the ID of the latest production snapshot. Creating a test environment that mirrors production is a good choice for non-functional testing, especially performance tests.
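For example, the following sketch uses the CLI to look up the most recent production snapshot and passes its ID to the stack as a parameter. The cluster name, stack name, and the assumption that your template accepts a SnapshotIdentifier parameter are all illustrative.

# Find the most recent snapshot of the production cluster (cluster name is illustrative)
SNAPSHOT_ID=$(aws redshift describe-cluster-snapshots \
  --cluster-identifier my-production-cluster \
  --query 'sort_by(Snapshots, &SnapshotCreateTime)[-1].SnapshotIdentifier' \
  --output text)

# Build the test environment from that snapshot
aws cloudformation create-stack --stack-name redshiftvpc \
  --template-body file://cloudformation/redshiftvpc.json \
  --parameters ParameterKey=SnapshotIdentifier,ParameterValue="$SNAPSHOT_ID"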

Any outputs that are returned by your CloudFormation stack are automatically added as variables in your build environment and can be used in subsequent steps.  The variables are named <stack name>_<output name>.  In the sample job, you can see that the endpoint of the new Amazon Redshift database is used by subsequent steps.
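For example, if the stack is called redshiftvpc and returns an output named RedshiftEndpoint (the output name, user, and database here are assumptions), a later build step could connect to the new cluster like this:

# Stack outputs are exposed as <stack name>_<output name> environment variables
psql -h "$redshiftvpc_RedshiftEndpoint" -p 5439 -U "$DB_USER" -d "$DB_NAME" -c 'SELECT 1'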

2. Apply the update in the deployment package

The sample job takes the latest folder in the ./updates folder of the deployment package and runs all the .sql files against the data warehouse.  This assumes that you’ve asked your developers to provide scripts for all the changes they want to make to the system. If any of the .sql files fails to run, then the overall Jenkins job fails. If this occurs, you should ask your development team to fix the issue and restart the pipeline with a new deployment package.
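A sketch of this step is shown below. It assumes that the folders in ./updates sort lexically so that the newest release is last, and that the cluster endpoint and credentials (including PGPASSWORD) are set in the build environment by earlier steps.

# Pick the most recent update folder and apply every script in it, in order
LATEST=$(ls -d ./updates/*/ | sort | tail -n 1)
for script in "$LATEST"*.sql; do
  # ON_ERROR_STOP makes psql exit with a non-zero code if any statement fails,
  # which in turn fails the Jenkins job
  psql -h "$REDSHIFT_ENDPOINT" -p 5439 -U "$DB_USER" -d "$DB_NAME" \
       -v ON_ERROR_STOP=1 -f "$script" || exit 1
done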

3. Run the tests

You now have a populated Amazon Redshift data warehouse that you’ve successfully updated with the deployment package. You should run a series of tests to ensure that the new features work as expected and that the deployment hasn’t broken any existing functionality.

For example, imagine that your data warehouse has a customer table with a name column and you’ve asked your developers to refactor this to first_name and last_name. You would expect your developers to provide an ALTER TABLE  statement to change the definition of your customer table and an UPDATE statement to populate the new columns from the existing name column.
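An update script for this change might look something like the sketch below. The folder name, column sizes, and the naive split logic are all illustrative; in fact, that naive split is exactly what the tests described next should probe.

# Hypothetical contents of an update script provided by the developers
mkdir -p ./updates/003_split_name
cat > ./updates/003_split_name/01_customer_name.sql <<'SQL'
ALTER TABLE customer ADD COLUMN first_name VARCHAR(100);
ALTER TABLE customer ADD COLUMN last_name VARCHAR(200);
-- Naive split: the first word becomes first_name, the remainder becomes last_name
UPDATE customer
   SET first_name = SPLIT_PART(name, ' ', 1),
       last_name  = REGEXP_REPLACE(name, '^[^ ]* ?', '');
SQL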

After the update scripts have run, the most basic check would be to confirm that the customer table contains first_name and last_name columns. However, you should also test the script that populates these new columns. What happens if the name column contains three words, such as John Robert Smith? What if it contains initials or suffixes, such as JJ Smith Esq? What if it is null, or contains a single very long name that might be truncated? You should also look at how your customer names are used. Perhaps you have a view used by a report that shows your best customers – does the name of each customer still appear correctly now that it is a concatenation of two fields?
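Each of these checks can be captured as a query paired with its expected result, following the .sql / .result convention used by the sample test runner shown later in this post. The file names, the customer ID, and the pipe-delimited output format (which matches psql run with -t -A) are assumptions.

# A test and its expected result for a customer whose name was 'John Robert Smith'
cat > ./tests/customer_name_three_words.sql <<'SQL'
SELECT first_name, last_name FROM customer WHERE customer_id = 42;
SQL
cat > ./tests/customer_name_three_words.sql.result <<'EOF'
John|Robert Smith
EOF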

Over time, you can build a library of tests that are run whenever you have a new deployment package. This ensures that you not only test new features but don’t regress in any other areas.

You may also choose to run non-functional tests. For example, is the query plan for the view that drives your best customer report still the same? Does the query still execute in the same amount of time? Is access to use the view restricted to the correct groups?
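One way to automate the query-plan check, for example, is to keep a baseline plan in the repository and compare it with the plan produced in the test environment. The view name and baseline path below are illustrative, and because EXPLAIN output includes cost estimates that vary with data volume, you may need to strip those figures before comparing.

# Compare the current query plan for the best-customer view against a stored baseline
psql -h "$REDSHIFT_ENDPOINT" -p 5439 -U "$DB_USER" -d "$DB_NAME" -t -A \
     -c 'EXPLAIN SELECT * FROM v_best_customers;' > current_plan.txt
diff -w ./tests/baselines/v_best_customers.plan current_plan.txt || exit 1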

The sample job contains the code to run a set of SQL statements and check each result against what is expected. A simplified version of that script is shown below; the connection details are assumed to be provided by earlier build steps.

#!/bin/bash
# Assumes REDSHIFT_ENDPOINT, DB_USER, DB_NAME and PGPASSWORD are set in the build environment
for test_file in ./tests/*.sql; do
	result=$(psql -h "$REDSHIFT_ENDPOINT" -p 5439 -U "$DB_USER" -d "$DB_NAME" -t -A -f "$test_file")
	expected=$(cat "${test_file}.result")
	# Compare after stripping whitespace and ignoring case
	if [ "$(echo "$result" | tr -d '[:space:]' | tr '[:upper:]' '[:lower:]')" != \
	     "$(echo "$expected" | tr -d '[:space:]' | tr '[:upper:]' '[:lower:]')" ]; then
		echo "Test failed: $test_file"
		exit 1
	fi
done
exit 0

The test loops over all .sql files in the ./tests folder and executes each one. It then compares the result of the query with the contents of the .result file with the same name. For example, if message_test1.sql is executed, the value returned is compared with the contents of message_test1.sql.result. This comparison is made after removing all whitespace and ignoring case. If the result is not what was expected, the Jenkins job ends in failure. You can also adapt this bash script to record performance statistics by monitoring the execution duration.
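For example, you could wrap each query in a simple timer inside the loop and fail the build if it runs longer than an agreed threshold; the 300-second limit below is purely illustrative.

# Time the query and fail the test if it exceeds the agreed threshold (illustrative)
start=$(date +%s)
result=$(psql -h "$REDSHIFT_ENDPOINT" -p 5439 -U "$DB_USER" -d "$DB_NAME" -t -A -f "$test_file")
duration=$(( $(date +%s) - start ))
if [ "$duration" -gt 300 ]; then
	echo "Performance test failed: $test_file took ${duration}s"
	exit 1
fi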

At the end of your testing, you can decide whether the deployment meets your quality bar. It may be that you still need to do manual testing before promoting to production, but this should only be performed after all automated tests have been passed. You may want to avoid involving your test team and the extra expense unless there’s a high chance that you will take this deployment forward to production.

4. Delete your test environment

After you have finished your testing, you can delete your test environment.  This can be done by your Jenkins server, meaning that the entire process of creating an environment, running the tests, and deleting the environment can be automated. You can save money with a transient environment that’s created only when you need to run tests and then immediately deleted. If you do need to perform manual testing, you can configure your Jenkins job to not delete the cluster.
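If your test environment was created as a CloudFormation stack, tearing it down from the command line is a single call; the stack name below is illustrative and should match the stack created earlier in the pipeline.

# Delete the transient test environment and wait until the deletion has completed
aws cloudformation delete-stack --stack-name redshiftvpc
aws cloudformation wait stack-delete-complete --stack-name redshiftvpc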

Deploy to production

You are now ready to deploy the changes to your production environment. Because the same set of update scripts, in the same deployment package, has already been run and tested successfully in pre-production, you can be confident that this change will be easy to roll out. In AWS CodePipeline, the transition between the PreProduction and Production stages has been disabled:

When you re-enable a disabled transition, the latest revision runs through the remaining stages of the pipeline. At the end of the automated process, the decision to release to production is manual: it can be initiated by your release manager using either the AWS Management Console or the API.
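For example, the release manager could open the transition from the command line; the pipeline name below is illustrative, while the stage name matches the Production stage described above.

# Re-enable the inbound transition into the Production stage
aws codepipeline enable-stage-transition \
  --pipeline-name redshift-cd-pipeline \
  --stage-name Production \
  --transition-type Inbound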

Post-deployment tasks

After you’ve deployed to production, you may want to have a final stage in your pipeline that sets up for the next round of testing. If you are using an environment containing test data for your functional testing, then a new snapshot is required for your next release. You could also implement some of the Top 10 Performance Tuning Techniques to make sure that your Amazon Redshift data warehouse is in good health following the latest update.
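For example, a final pipeline stage could take a fresh snapshot of the cluster that holds your test data, ready to seed the next round of functional testing; the cluster and snapshot names are illustrative.

# Snapshot the test-data cluster so the next release starts from a known state
aws redshift create-cluster-snapshot \
  --cluster-identifier test-data-cluster \
  --snapshot-identifier "test-data-$(date +%Y%m%d)"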

Summary

In this post, I have shown you how to build a CD pipeline for your Amazon Redshift data warehouse. The pipeline presented here is fairly simple but hopefully it’s easy to see how the concept can be expanded with more stages or more sophisticated testing.

However, remember that a simple pipeline is better than no pipeline at all. You should take a holistic view:  your ‘system’ is the Amazon Redshift data warehouse and the CD pipeline needed to maintain it. Start simple, iterate with each development sprint, and build complexity as you go. Work towards an agile data warehouse that can keep pace with your business, leading change rather than reacting to it. 

If you have any questions or suggestions, please leave a comment below.

--------------------------

Related:

Top 10 Performance Tuning Techniques for Amazon Redshift