AWS Big Data Blog

Migrating Metadata when Encrypting an Amazon Redshift Cluster

NOTE: Amazon Redshift now supports enabling and disabling encryption with 1-click. For more information, please review this “What’s New” post.

—————————————

John Loughlin is a Solutions Architect with Amazon Web Services.

A customer came to us asking for help expanding and modifying their Amazon Redshift cluster. In the course of responding to their request, we made use of several tools available in the AWSLabs GitHub repository. What follows is an account of how you can use some of these tools as we did (this is not intended to be an exhaustive description of the content of that library).

The customer is acquiring another manufacturing company that is only slightly smaller than they are. Each has a BI infrastructure and they believe consolidating platforms would lower expenses and simplify operations. They want to move the acquired organization’s warehouse into the existing Amazon Redshift cluster, but with a new requirement. Because of the nature of some of the projects the acquired company has, they have a contractual obligation to encrypt data.

Amazon Redshift supports the encryption of data at rest, in the database and the associated snapshots. To enable this, encryption must be selected when the database is created. To encrypt a database after it has been created, it is necessary to stand up a new database and move the content from the unencrypted cluster to the new cluster where it will be encrypted.

Moving the contents of your application’s data tables is straightforward, as Amazon Redshift provides an UNLOAD feature for this purpose.

To determine the tables to UNLOAD, consider running a query such as the following:

	SELECT tablename FROM pg_tables WHERE schemaname = 'public';

Note that the list of schema names should be extended to reflect where you have created objects in your cluster. Running UNLOAD from the source cluster and COPY in the new one migrates application data. Simple enough.

	UNLOAD ('SELECT * FROM sample_table') TO 's3://mybucket/sample/sample_Table_' credentials  'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>' manifest;

This command splits the results of a SELECT statement across a set of files, one or more files per node slice, to simplify parallel reloading of the data. It also creates a manifest file that can be used to ensure that the COPY command loads all of the required files, and only the required files, into the encrypted cluster. Using a manifest file on COPY is a recommended practice.
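
On the encrypted cluster, the matching load is a COPY that points at the manifest. The following is a minimal sketch, assuming the target table has already been created there (UNLOAD with the manifest option writes the manifest file to the same prefix, with a "manifest" suffix):

	COPY sample_table FROM 's3://mybucket/sample/sample_Table_manifest' credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>' manifest;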

You can make this process simpler still by using the Amazon Redshift Unload/Copy utility. This tool exports data from a source cluster to a location on Amazon S3 and encrypts the data with the AWS Key Management Service (AWS KMS). It can also import the data into another Amazon Redshift cluster and clean up Amazon S3.

For many applications, the Amazon Redshift cluster contains more than just application data. Amazon Redshift supports the creation of database users, creation of groups, and assignment of privileges to both groups and users. Re-creating these accurately could be error-prone unless everything was created from scripts and every script was kept under source control. Fortunately, it is easy to create scripts that migrate this information directly from your source cluster and that can be run in the encrypted cluster to replicate the data that you require.

The best place to start before actually creating scripts is the AWSLabs GitHub repository. In the AdminViews directory, there are already several useful scripts. You can generate the DDL for schemas, tables, and views. You can also get lists of schema, view, and table privileges by user and see the groups that a user belongs to. All this is useful information, but you want to generate SQL statements in your source database to run in your new, encrypted database.

You can pull a list of users from the pg_user table as follows:

SELECT 'CREATE USER '|| usename || ';' FROM pg_user WHERE usename <> 'rdsdb';

Produces:

CREATE USER acctowner;
CREATE USER mrexample;
CREATE USER counterexample;
CREATE USER mrspremise;
CREATE USER mrsconclusion;

You should assign passwords to the accounts you create. There is no way to extract the existing passwords from the source database, so it is necessary to create new ones.
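
Amazon Redshift also requires a password (or PASSWORD DISABLE) at the time a user is created, so one option is to have the generating query emit a placeholder password that you rotate as soon as the accounts exist. The password literal below is only an illustration:

SELECT 'CREATE USER '|| usename || ' PASSWORD ''Change1tNow'';' FROM pg_user WHERE usename <> 'rdsdb';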

Download the code from GitHub, expand the src directory, and find the scripts in the AdminViews directory. Create a schema called admin in your Amazon Redshift cluster and run each of the scripts starting with v_ to create the views. The resulting views can then be accessed in a SQL statement as follows:

SELECT * FROM admin.v_generate_schema_ddl;

Schema name: Admin
ddl: Create schema admin

Run the v_generate_group_DDL.SQL script to create the groups in the new database:

SELECT 'CREATE GROUP '|| groname ||';' FROM pg_group;

Produces:

CREATE GROUP chosen;
CREATE GROUP readonly;

Users belong to groups, and you can capture these associations using the v_get_users_in_group script:

SELECT 'ALTER GROUP ' ||groname||' ADD USER '||usename||';' FROM admin.v_get_users_in_group;

Produces:

ALTER GROUP chosen ADD USER mrsconclusion;
ALTER GROUP readonly ADD USER mrexample;
ALTER GROUP readonly ADD USER mrspremise;

Schema, view, and table DDL can be generated directly from the appropriate scripts:

	v_generate_schema_DDL.SQL, v_generate_table_DDL.SQL, v_generate_view_DDL.SQL
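
For example, once the views are installed, a query along the following lines pulls the table DDL for the public schema. This is a sketch: the view and column names used here (admin.v_generate_tbl_ddl with schemaname, tablename, seq, and ddl columns) are assumptions, so check the script you installed for the exact names:

SELECT ddl FROM admin.v_generate_tbl_ddl WHERE schemaname = 'public' ORDER BY tablename, seq;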

You need to set appropriate privileges on the schemas in the new database and there is a script that you can use to capture the relevant information in the existing database:

SELECT * FROM admin.v_get_schema_priv_by_user
WHERE schemaname
NOT LIKE 'pg%'
AND schemaname <> 'information_schema'
AND usename <> 'johnlou'
AND usename <> 'rdsdb';

Here you see multiple different permissions granted to each user who has been granted privileges on a schema. To generate SQL to run against the new database, you can use a user-defined function (UDF) to create a string of privileges for each row in the result set. One way of building this function is as follows:

create function f_schema_priv_granted(cre boolean, usg boolean) returns varchar
STABLE
AS $$
   # Collect the granted privileges, then join them so the result
   # never starts with a stray comma when CREATE is not granted
   privs = []
   if cre:
       privs.append('create')
   if usg:
       privs.append('usage')
   return ', '.join(privs)
$$ LANGUAGE plpythonu;
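
A quick way to sanity-check the function before using it in a larger query (the expected results assume the list-building logic shown above):

SELECT f_schema_priv_granted(true, false);  -- returns 'create'
SELECT f_schema_priv_granted(true, true);   -- returns 'create, usage'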

The f_schema_priv_granted function returns a string of concatenated permissions. Run this in a query to generate SQL containing GRANT statements:

SELECT 'GRANT '|| f_schema_priv_granted(cre, usg) ||' ON schema '|| schemaname || ' TO ' || usename || ';' 
FROM admin.v_get_schema_priv_by_user 
WHERE schemaname NOT LIKE 'pg%' 
AND schemaname <> 'information_schema' 
AND usename <> 'rdsdb';

Produces:

GRANT CREATE, USAGE ON schema public TO mrexample;
GRANT CREATE, USAGE ON schema public TO counterexample;
GRANT CREATE, USAGE ON schema public TO mrspremise;
GRANT CREATE, USAGE ON schema public TO mrsconclusion;

Alternatively, if you prefer CASE statements to UDFs or are not comfortable with python, you can write something similar to the following:

SELECT 'GRANT '|| concat(CASE WHEN cre IS TRUE THEN 'create' ELSE '' END,
CASE WHEN cre IS TRUE AND usg IS TRUE THEN ', usage' WHEN usg IS TRUE THEN 'usage' ELSE '' END)
|| ' ON schema '|| schemaname || ' TO ' || usename || ';'
FROM admin.v_get_schema_priv_by_user
WHERE schemaname NOT LIKE 'pg%'
AND schemaname <> 'information_schema'
AND schemaname <> 'public'
AND usg = 'true';

Similarly, a UDF can be used to create a string of permissions used in a GRANT statement on each view and table. There is a wider range of privileges: SELECT, INSERT, UPDATE, DELETE, and REFERENCES. The UDF looks like the following:

create function f_table_priv_granted(sel boolean, ins boolean, upd boolean, delc boolean, ref boolean) returns varchar
STABLE
AS $$
   # Collect the granted privileges, then join them so the result
   # never starts with a stray comma when SELECT is not granted
   privs = []
   if sel:
       privs.append('select')
   if ins:
       privs.append('insert')
   if upd:
       privs.append('update')
   if delc:
       privs.append('delete')
   if ref:
       privs.append('references')
   return ', '.join(privs)
$$ LANGUAGE plpythonu;

Note that the fourth argument in the function (delc) does not match the corresponding column name (del) in the views: Python objects to the use of 'del' because it is a reserved word. Also note that, if you prefer not to use the UDF, you can construct the same SQL statement with CASE expressions.

You can generate privileges for tables with the following query:

SELECT 'GRANT '|| f_table_priv_granted(sel, ins, upd, del, ref) || ' ON '|| 
schemaname||'.'||tablename ||' TO '|| usename || ';' FROM admin.v_get_tbl_priv_by_user 
WHERE schemaname NOT LIKE 'pg%' 
AND schemaname <> 'information_schema'
AND usename <> 'rdsdb';

Produces:

GRANT SELECT ON public.old_sample TO mrexample;
GRANT SELECT ON public.old_sample TO mrspremise;
GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES ON public.old_sample TO mrsconclusion;
GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES ON public.sample TO mrexample;
GRANT SELECT ON public.sample TO mrspremise;

Similarly, run the following query for privileges on views:

SELECT 'GRANT '|| f_table_priv_granted(sel, ins, upd, del, ref) || ' ON '|| 
schemaname||'.'||tablename ||' TO '|| usename || ';' FROM admin.v_get_view_priv_by_user 
WHERE schemaname NOT LIKE 'pg%' 
AND schemaname <> 'information_schema'
AND usename <> 'rdsdb';

Produces:

GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES ON public.loadview TO johnlou;
GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES ON public.my_sample_view TO johnlou;

The scripts from the repository make migrating metadata to the new, encrypted cluster easier. After moving the tables from the acquired company’s warehouse into a separate schema in Amazon Redshift, we found several other scripts useful as well:

  • The table_info.sql script shows the pct_unsorted and pct_stats_off columns to indicate the degree of urgency for running the vacuum and analyze processes (see the example after this list).
  • The table_inspector.sql script is useful in validating that the distribution keys chosen for the migrated tables are likely to be effective. The results include pct_skew_across_slices, the percentage of data distribution skew, and pct_slices_populated. Problematic tables are those with either a large value in the pct_skew_across_slices column or a low value in the pct_slices_populated column.
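
For example, if table_info.sql reports a high pct_unsorted or pct_stats_off value for one of the migrated tables, the usual follow-up is to vacuum and analyze it (the table name here is only an illustration):

VACUUM public.sample;
ANALYZE public.sample;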

Summary

In this post, you saw examples of extending existing scripts to generate SQL that defines users and groups in a new database, two examples of using the UDF feature to build lists of privileges for GRANT statements on various objects, sample queries that generate that SQL, and two scripts that help validate that the migrated tables perform well.

Hopefully, these scripts simplify work in your environment and suggest ways to extend the existing scripts for custom processing appropriate to your own Amazon Redshift clusters.

If you have questions or suggestions, please leave a comment below.

—————————–

Related:

Best Practices for Micro-Batch Loading on Amazon Redshift