Uploading Archives to Amazon Glacier from PHP

You can upload your data archives to Amazon Glacier by using the Glacier client included in the AWS SDK for PHP. Similar to the Amazon S3 service, Amazon Glacier has an API for both single and multipart uploads. With the UploadArchive operation, you can upload archives of up to 4 GB in a single request; with the multipart operations, you can upload archives of up to 40,000 GB. We recommend using the multipart operations for any archive larger than 100 MB.

Before we look at how to use the specific operations, let's create a client object to work with Amazon Glacier.

use Aws\Glacier\GlacierClient;

$client = GlacierClient::factory(array(
    'key'    => '[aws access key]',
    'secret' => '[aws secret key]',
    'region' => '[aws region]', // (e.g., us-west-2)
));

Uploading an archive in a single request

Now let's upload some data to your Amazon Glacier vault. For the sake of this and other code samples in this blog post, I will assume that you have already created a vault and have stored the vault name in a variable called $vaultName. I'll also assume that the archive data you are uploading is stored in a file and that the path to that file is stored in a variable called $filename. The following code demonstrates how to use the UploadArchive operation to upload an archive in a single request.

$result = $client->uploadArchive(array(
    'vaultName' => $vaultName,
    'body'      => fopen($filename, 'r'),
));
$archiveId = $result->get('archiveId');

In this case, the SDK does some additional work for you behind the scenes. In addition to the vault name and upload body, Amazon Glacier requires that you provide the account ID of the vault owner, a SHA-256 tree hash of the upload body, and a SHA-256 content hash of the entire payload. You can manually specify these parameters if needed, but the SDK will calculate them for you if you do not explicitly provide them.

For more details about the SHA-256 tree hash and SHA-256 content hash, see the Computing Checksums section in the Amazon Glacier Developer Guide. See the GlacierClient::uploadArchive API documentation for a list of all the parameters to the UploadArchive operation.
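
To make the tree hash less abstract, here is a minimal sketch of the calculation in plain PHP: hash each 1 MB chunk of the data, then repeatedly hash adjacent pairs of digests until one root digest is left. The function name treeHashSketch() is hypothetical and not part of the SDK, and the sketch holds the whole payload in memory, so treat it as an illustration rather than a replacement for the SDK's own calculation.

```php
<?php
/**
 * Compute a SHA-256 tree hash of a string and return it as lowercase hex.
 * treeHashSketch() is a hypothetical helper name, not an SDK function.
 */
function treeHashSketch($data)
{
    $chunkSize = 1024 * 1024; // tree-hash leaves are 1 MB chunks

    // Leaf level: the binary SHA-256 digest of each 1 MB chunk.
    // An empty payload is treated as a single empty chunk.
    $hashes = array();
    $length = strlen($data);
    for ($offset = 0; $offset < max($length, 1); $offset += $chunkSize) {
        $chunk = (string) substr($data, $offset, $chunkSize);
        $hashes[] = hash('sha256', $chunk, true);
    }

    // Combine adjacent pairs of digests until a single root digest remains.
    while (count($hashes) > 1) {
        $next = array();
        foreach (array_chunk($hashes, 2) as $pair) {
            // A digest without a partner is promoted to the next level unchanged.
            $next[] = count($pair) === 2
                ? hash('sha256', $pair[0] . $pair[1], true)
                : $pair[0];
        }
        $hashes = $next;
    }

    return bin2hex($hashes[0]);
}
```

For a payload of 1 MB or less there is only one leaf, so the tree hash is the same as the ordinary SHA-256 hash of the data.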

Uploading an archive in parts

Amazon Glacier also allows you to upload archives in parts, which you can do using the multipart operations: InitiateMultipartUpload, UploadMultipartPart, CompleteMultipartUpload, and AbortMultipartUpload. The multipart operations allow you to upload parts of your archive in any order and in parallel. Also, if one part of your archive fails to upload, you only need to reupload that one part, not the entire archive.
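
Because a failed part can be sent again on its own, one way to take advantage of this is a small retry wrapper around the call that uploads a single part. The sketch below is plain PHP and makes no SDK calls; retrySketch() is a hypothetical helper name, and the default of three attempts is an arbitrary assumption. The SDK may already retry some failures for you, so this only illustrates the idea.

```php
<?php
/**
 * Call $operation until it succeeds or $maxAttempts is reached.
 * retrySketch() is a hypothetical helper, not part of the SDK.
 *
 * @param callable $operation   Receives the 1-based attempt number.
 * @param int      $maxAttempts Total number of tries before giving up.
 */
function retrySketch($operation, $maxAttempts = 3)
{
    $attempt = 0;
    while (true) {
        $attempt++;
        try {
            return call_user_func($operation, $attempt);
        } catch (Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // out of attempts: let the caller decide what to do
            }
            // Otherwise fall through and try the same operation again.
        }
    }
}
```

In practice, the callable would wrap the upload of one part, so that only that part is sent again when it fails.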

The AWS SDK for PHP provides two techniques for doing multipart uploads with Amazon Glacier. First, you can use the multipart operations manually, which provides the most flexibility. Second, you can use the multipart upload abstraction, which allows you to configure and create a transfer object that encapsulates the multipart operations. Let's look at the multipart abstraction first.

Using the multipart upload abstraction

The easiest way to perform a multipart upload is to use the classes provided in the Aws\Glacier\Model\MultipartUpload namespace. The classes provide an abstraction of the multipart uploading process. The main class you interact with is UploadBuilder. The following code uses the UploadBuilder to configure a multipart upload using a part size of 4 MB. The upload() method executes the uploads and returns the result of the CompleteMultipartUpload operation at the end of the upload process.

use Aws\Glacier\Model\MultipartUpload\UploadBuilder;

$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($filename)
    ->setVaultName($vaultName)
    ->setPartSize(4 * 1024 * 1024)
    ->build();

$result = $uploader->upload();

$archiveId = $result->get('archiveId');
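
The 4 MB part size used here is only one possible choice. Amazon Glacier expects the part size to be 1 MB multiplied by a power of two, up to 4 GB, and allows at most 10,000 parts per upload. The following sketch picks the smallest part size that satisfies those limits for a given archive size. The function name choosePartSizeSketch() is hypothetical, and the code assumes a 64-bit PHP build so that sizes above 2 GB fit in an integer.

```php
<?php
/**
 * Return the smallest valid part size (in bytes) for an archive of
 * $archiveSize bytes. choosePartSizeSketch() is a hypothetical helper.
 */
function choosePartSizeSketch($archiveSize)
{
    $megabyte    = 1024 * 1024;
    $maxPartSize = 4096 * $megabyte; // 4 GB upper limit for one part
    $maxParts    = 10000;            // limit on parts per multipart upload

    // Try 1 MB, 2 MB, 4 MB, ... and stop at the first size that fits.
    for ($partSize = $megabyte; $partSize <= $maxPartSize; $partSize *= 2) {
        if (ceil($archiveSize / $partSize) <= $maxParts) {
            return $partSize;
        }
    }

    throw new InvalidArgumentException(
        'Archive is too large for a single multipart upload.'
    );
}
```

The result could be passed to setPartSize(). A larger part size than the minimum is often reasonable too, since it means fewer requests per archive.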

Using the UploadBuilder class, you can also configure the parts to be uploaded in parallel by using the setConcurrency() method.

$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($filename)
    ->setVaultName($vaultName)
    ->setPartSize(4 * 1024 * 1024)
    ->setConcurrency(3) // Upload 3 at a time in parallel
    ->build();

If a problem occurs during the upload process, an Aws\Common\Exception\MultipartUploadException is thrown, which has access to a TransferState object that represents the state of the upload.

try {
    $result = $uploader->upload();
    $archiveId = $result->get('archiveId');
} catch (\Aws\Common\Exception\MultipartUploadException $e) {
    // If the upload fails, get the state of the upload
    $state = $e->getState();
}

The TransferState object can be serialized so that the upload can be completed in a separate request if needed. To resume an upload using a TransferState object, you must use the resumeFrom() method of the UploadBuilder.

$resumedUploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($filename)
    ->setVaultName($vaultName)
    ->resumeFrom($state)
    ->build();

$result = $resumedUploader->upload();
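
To finish the upload in a separate request, the state has to be stored somewhere in between. The sketch below writes it to a file with PHP's native serialize() and reads it back with unserialize(). Both function names are hypothetical, and the code assumes that the TransferState object round-trips cleanly through serialize(); it has only been exercised here with plain PHP values. Only unserialize files that your own application wrote, and make sure the SDK classes can be autoloaded before loading the state.

```php
<?php
/**
 * Write an upload state to $path. Hypothetical helper, not part of the SDK.
 */
function saveUploadStateSketch($state, $path)
{
    if (file_put_contents($path, serialize($state), LOCK_EX) === false) {
        throw new RuntimeException("Could not write upload state to $path");
    }
}

/**
 * Read an upload state back from $path, or return null if none was saved.
 * Hypothetical helper, not part of the SDK.
 */
function loadUploadStateSketch($path)
{
    if (!is_readable($path)) {
        return null; // nothing has been saved yet
    }

    $state = unserialize(file_get_contents($path));

    // unserialize() returns false when the data cannot be decoded.
    return $state === false ? null : $state;
}
```

A later request could call loadUploadStateSketch() and, if the result is not null, pass it to resumeFrom().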

Using the multipart operations

For the most flexibility, you can manage all of the upload process yourself using the individual multipart operations. The following code sample shows how to initialize an upload, upload each of the parts one by one, and then complete the upload. It also uses the UploadPartGenerator class to help calculate the information about each part. UploadPartGenerator is not required to work with the multipart operations, but it does make it much easier, especially for calculating the checksums for each of the parts and the archive as a whole.

use Aws\Glacier\Model\MultipartUpload\UploadPartGenerator;

// Use helpers in the SDK to get information about each of the parts
$archiveData = fopen($filename, 'r');
$partSize = 4 * 1024 * 1024; // (i.e., 4 MB)
$parts = UploadPartGenerator::factory($archiveData, $partSize);

// Initiate the upload and get the upload ID
$result = $client->initiateMultipartUpload(array(
    'vaultName' => $vaultName,
    'partSize'  => $partSize,
));
$uploadId = $result->get('uploadId');

// Upload each part individually using data from the part generator
foreach ($parts as $part) {
    fseek($archiveData, $part->getOffset());
    $client->uploadMultipartPart(array(
        'vaultName'     => $vaultName,
        'uploadId'      => $uploadId,
        'body'          => fread($archiveData, $part->getSize()),
        'range'         => $part->getFormattedRange(),
        'checksum'      => $part->getChecksum(),
        'ContentSHA256' => $part->getContentHash(),
    ));
}

// Complete the upload by using data aggregated by the part generator
$result = $client->completeMultipartUpload(array(
    'vaultName'   => $vaultName,
    'uploadId'    => $uploadId,
    'archiveSize' => $parts->getArchiveSize(),
    'checksum'    => $parts->getRootChecksum(),
));
$archiveId = $result->get('archiveId');

fclose($archiveData);
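
The sample above does not use AbortMultipartUpload. If you decide to give up on an upload rather than resend the failed parts, one option is to abort it so that Amazon Glacier discards the parts it has already received. The sketch below wraps the part-uploading step in a callable and aborts when that callable throws. runOrAbortSketch() is a hypothetical helper name, and $client is assumed to be the GlacierClient created earlier.

```php
<?php
/**
 * Run $uploadParts and abort the multipart upload if it throws.
 * runOrAbortSketch() is a hypothetical helper, not part of the SDK.
 *
 * @param object   $client      Assumed to be a GlacierClient.
 * @param string   $vaultName   Vault that the upload was initiated in.
 * @param string   $uploadId    ID returned by initiateMultipartUpload().
 * @param callable $uploadParts Uploads the parts and completes the upload.
 */
function runOrAbortSketch($client, $vaultName, $uploadId, $uploadParts)
{
    try {
        return call_user_func($uploadParts);
    } catch (Exception $e) {
        // Ask Glacier to discard the parts uploaded so far, then rethrow
        // so that the caller still sees the original failure.
        $client->abortMultipartUpload(array(
            'vaultName' => $vaultName,
            'uploadId'  => $uploadId,
        ));
        throw $e;
    }
}
```

Aborting is final for that upload ID, so this pattern suits cases where you plan to start over, not cases where you intend to resume later.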

For more information about the various multipart operations, see the API documentation for GlacierClient. You should also take a look at the API docs for the classes in the MultipartUpload namespace to become more familiar with the multipart abstraction. We hope that this post helps you work better with Amazon Glacier and take advantage of the low-cost, long-term storage it provides.
