Google Cloud Platform: Storing Data on Google Cloud Storage

Ted Neward

Feb 12, 2014

CPOL

Google Cloud Platform - Part 7: Google Cloud Storage

Google Cloud Platform: Getting Started with Google App Engine (Part 1)
Google Cloud Platform: Deploying with Google App Engine (Part 2)
Google Cloud Platform: User Management (Part 3)
Google Cloud Platform: Mobile Endpoints (Part 4)
Google Cloud Platform - Google Cloud SQL (Part 5)
Google Cloud Platform - Google Cloud Datastore (Part 6)
Google Cloud Platform - Google Cloud Storage (Part 7)

Part 7: Cloud Storage

Welcome back! If this is the first article on the Google Cloud Platform you’re reading up here, you may want to check out the intro to this series, available here, but if you’re more of a "watch the Super Bowl at halftime because nothing exciting happens in the first two quarters" kind of person, feel free to jump right in. You can catch the highlights of the safety-on-the-first-play, the pick-6 and the rest on the sports channels later.

As we mentioned last time, Google offers several different ways of dealing with data storage: Google Cloud SQL is for those applications that want to store data in the time-honored fashion of the relational database and relational model. There’s also Google Cloud Datastore, a non-relational "NoSQL" data storage approach, which we covered last time. And lastly, there’s Google Cloud Storage, which is geared more for "large binaries," like images and videos, and which we’ll get into here in just a moment.

Before we do so, however, a very serious point bears repeating: Much as the various technical pundits and evangelists might want to disagree, none of these is "superior" to the others. Those individuals who prefer to slavishly follow whatever "best practice" industry pundits are going on about will hate to hear me say this, but the fact is, each one solves a different kind of problem, and sometimes the best approach is to use all of them simultaneously, a technique sometimes called "polyglot persistence" or "poly-store persistence". Or, as a famous writer once put it, "From each database, according to its abilities, to each project, according to its needs."

Overview

First off, let’s make the point clear that Google Cloud Storage is not really something that programmers will use to store the traditional "object" or "customer data" that most other storage options (including Google Cloud SQL and Google Cloud Datastore) will be storing. Google Cloud Storage is more about storing larger, atomic (meaning we, as programmers, don’t look inside of them, but treat them as entirely opaque blobs of storage) binary data, like videos, sounds, large images, and what-not. It’s not a programmatic API for being able to read parts of the file or for picking apart these files into constituent parts. It’s more of a cloud-scale storage area network, with APIs to upload, download, enumerate and remove. This is further reinforced by the fact that for the duration of a given object’s lifetime on Google Cloud Storage, it is entirely immutable: it cannot be appended to, trimmed, or modified in any other way.

That doesn’t make it less useful than the other two forms of storage, however. If, for example, the application under construction needs to store anything larger than a few K in size, particularly if that facility needs to be made available to the end-users of said application (such as allowing users to attach arbitrary files as part of a bug report, or as part of a blog comment, or as part of a forum post, or any of a dozen different possibilities), then pretty clearly this is going to be a vastly superior solution, particularly since it’s an "out of the box" solution, as opposed to having to write something custom. (While the other two data storage options are certainly capable of holding large data files, Google Cloud SQL has the usual drawbacks every relational database has when trying to store large BLOBs, and Google Cloud Datastore is going to run into similar problems for similar reasons.)

And yes, if this facility sounds similar to Amazon’s S3 system, that’s because it is; not surprisingly, Google offers a helpful guide for migrating data out of S3 and into Google Cloud Storage at https://developers.google.com/storage/docs/migrating. I’ll leave reasons why one might look to migrate to the imagination of the reader, but there’s also a reasonable discussion around using both systems as a form of redundancy and backup—the chances of both major cloud platforms failing are roughly equivalent to winning the lottery. In three states. On the same day. While being struck by lightning. You get the idea.

Secondly, in marked contrast to the other Google Cloud Platform discussions in this series, the Google Cloud Storage API is entirely language-agnostic: it is a "RESTful" API, using HTTP and either XML or JSON (the JSON flavor is still experimental as of this writing) as the means by which a developer interacts with Google Cloud Storage. This has both benefits and drawbacks. It’s easier for Google to maintain, document, and improve the developer API, because they have only one entry point to worry about. On top of that, an HTTP API means that the Web tooling natively knows how to "speak" Google Cloud Storage; we’ll get into this in a moment. But it’s sometimes harder on the individual developer, because now there isn’t a language-friendly API to lean on, and (in the case of Java, at least) as a result we lose any compile-time type safety to save us from stupid mistakes.

(For those whose HTTP skills are less than stellar, Google does have some experimental client libraries that wrap the HTTP API calls, but frankly, the benefit gained seems negligible, and developers will eventually want to see the "raw" HTTP calls anyway, for debugging purposes if nothing else. So this article is going to proceed under the assumption that the HTTP APIs are the preferred option.)

APIs, URLs, HTTP, oh my!

Fundamentally, Google Cloud Storage recognizes a few core concepts. First, all data within a given Google Cloud Storage system is scoped to a "project", which corresponds to the projects that we’ve been examining all along as part of this series. As far as developers are concerned, the project is essentially the administrative unit for Google Cloud Storage, so in order to start working with Google Cloud Storage, a developer needs to go into the console (the same one we’ve been using for everything else) and "turn on" Google Cloud Storage. This also implies that this is the project that will be billed for the storage consumed.

Secondly, all data stored within Google Cloud Storage will be stored within "buckets", which, as the name implies, are named containers that will contain named "objects" that contain the actual data. In other words, "buckets" are directories, "objects" are files. Pretty straightforward, no? Buckets can be geographically located, so as to make the data live as close to the target user as possible (to reduce the latency of download when the objects are pulled down), and a given project has no limit to the number of buckets that it can contain, so for projects that span worldwide boundaries, feel free to create buckets with unique names that are essentially duplicates of one another except for the geographic location. (In essence, this would be duplicating the effects of a CDN, up to a point.) Buckets cannot nest, but bucket names are only lightly restricted, so it’s not unusual to create naming schemes that mimic a directory path ("images-accident-12232013-seattle", for example), as long as the name stays within 63 characters.
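
To make the naming idea concrete, here is a minimal Python sketch of a path-like bucket naming scheme with a sanity check. The check is deliberately simplified (3 to 63 characters, lowercase letters, digits, dashes, underscores and dots, starting and ending with a letter or digit); the real service applies further rules that are not modeled here, and the function names and the "base-region" scheme are hypothetical, invented purely for illustration.

```python
import re

# Simplified check only: 3 to 63 characters drawn from lowercase letters,
# digits, dashes, underscores and dots, starting and ending with a letter
# or digit. The real service enforces additional rules not modeled here.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified checks above."""
    return _BUCKET_NAME.fullmatch(name) is not None


def regional_bucket_name(base: str, region: str) -> str:
    """Build a path-like name such as 'card-images-europe' (hypothetical scheme)."""
    name = f"{base}-{region}".lower()
    if not is_plausible_bucket_name(name):
        raise ValueError(f"not a plausible bucket name: {name!r}")
    return name
```

Calling regional_bucket_name("card-images", "europe") and regional_bucket_name("card-images", "asia") would give two bucket names that differ only in their geographic suffix, which is the duplication-by-region idea described above.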

Objects, like files, have both data and metadata associated with them. The metadata is similar (at least conceptually) to the metadata associated with files on the filesystem, and takes the form of name-value pairs, much like HTTP headers.
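
In the XML flavor of the API, custom metadata travels as request headers carrying the x-goog-meta- prefix. The following Python sketch shows one way to turn a dictionary of name-value pairs into such headers; the function name and the sample keys ("artist", "rarity") are assumptions made up for this example.

```python
def metadata_headers(metadata: dict) -> dict:
    """Map custom name-value pairs to x-goog-meta-* request headers."""
    headers = {}
    for name, value in metadata.items():
        # Header names are case-insensitive; lowercase them for consistency.
        headers[f"x-goog-meta-{name.lower()}"] = str(value)
    return headers
```

For instance, metadata_headers({"artist": "Jane Doe", "rarity": "common"}) produces headers that could be sent alongside an upload, so that the pairs are stored with the object.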

The API itself is pretty straightforward; https://developers.google.com/storage/docs/reference-methods has the complete list of all the APIs, but a quick summary is as follows:

GET / (no bucket in the Host HTTP header): Return a list of all the buckets this authenticated user can see
PUT /: Create a bucket of the name specified in the Host HTTP header
GET / (bucket in the Host HTTP header): Get the items in a bucket of the name specified in the Host HTTP header
GET (bucket): Get the items in a bucket of the name specified in the URL request line
GET (object): Get the object of the name specified in the URL request line, from a bucket of the name specified in the Host HTTP header
PUT (object): Upload an object to the bucket specified in the Host HTTP header, giving it the name specified in the URL request line

Note that most of these have a series of HTTP headers that must also be included as part of the request, including an Authorization string obtained prior to the Google Cloud Storage request, often done by carrying out an OAuth request against the Google Cloud Platform.
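
The pattern behind the list above (bucket in the Host header, object in the request line, plus an Authorization header) can be sketched in Python as a small request builder. This is only an illustration: StorageRequest and build_request are hypothetical names, nothing is sent over the network, and the "OAuth" scheme name simply mirrors the examples in this article.

```python
from dataclasses import dataclass, field
from urllib.parse import quote

STORAGE_HOST = "storage.googleapis.com"


@dataclass
class StorageRequest:
    """A plain description of one HTTP request; nothing is sent."""
    method: str
    host: str
    path: str
    headers: dict = field(default_factory=dict)


def build_request(method, bucket=None, object_name=None, token=None):
    """Describe a request using the bucket-in-Host addressing style."""
    # No bucket means the service-level endpoint (e.g. listing buckets).
    host = f"{bucket}.{STORAGE_HOST}" if bucket else STORAGE_HOST
    # The object name, if any, goes on the request line.
    path = "/" + (quote(object_name) if object_name else "")
    headers = {"Host": host}
    if token:
        # Scheme name copied from this article's sample requests.
        headers["Authorization"] = f"OAuth {token}"
    return StorageRequest(method, host, path, headers)
```

With this helper, build_request("GET") describes the bucket listing, build_request("PUT", bucket="coolcardgame") describes bucket creation, and build_request("GET", bucket="coolcardgame", object_name="meatshieldfighter.jpg") describes an object download.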

Example

Imagine, for a moment, that the application being developed is a trading card game, similar to Magic: The Gathering, except it’s entirely online and Web-based. Trading card games are often characterized (and bought) because of the artwork created for each card; downloading all of these images ahead of time can be prohibitively expensive, and if the game wants to allow players to upload custom-created cards (which would be a cool feature, you have to admit), the images will need to be stored online somewhere and referenced as part of the game. That somewhere, of course, is Google Cloud Storage.

Uploading a new image (whether by the developers or by a player) would be a PUT operation to the "coolcardgame" bucket, like so:

    PUT /meatshieldfighter.jpg HTTP/1.1
    Host: coolcardgame.storage.googleapis.com
    Date: Sat, 20 Feb 2010 16:31:08 GMT
    Content-Type: image/jpeg
    Content-MD5: iB94gawbwUSiZy5FuruIOQ==
    Content-Length: 552346
    Authorization: OAuth 1/zVNpoQNsOSxZKqOZgckhpQ
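
For readers who would rather see that request assembled in code, here is a rough Python sketch using only the standard library. The function names put_headers and upload are hypothetical, the token is assumed to have been obtained beforehand, and the "OAuth" scheme name is copied from the request above rather than taken from any current documentation. The one detail worth noticing is that Content-MD5 is the base64 encoding of the raw MD5 digest, not of its hexadecimal form.

```python
import base64
import hashlib
import http.client
from email.utils import formatdate


def put_headers(data: bytes, content_type: str, token: str) -> dict:
    """Build the headers shown in the PUT example for an in-memory payload."""
    # Base64 of the raw 16-byte MD5 digest, not of the hex string.
    digest = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    return {
        "Date": formatdate(usegmt=True),
        "Content-Type": content_type,
        "Content-MD5": digest,
        "Content-Length": str(len(data)),
        # Scheme name copied from the article's example request.
        "Authorization": f"OAuth {token}",
    }


def upload(bucket: str, object_name: str, data: bytes,
           content_type: str, token: str) -> int:
    """Send the PUT and return the HTTP status code (performs network I/O)."""
    conn = http.client.HTTPSConnection(f"{bucket}.storage.googleapis.com")
    try:
        conn.request("PUT", f"/{object_name}", body=data,
                     headers=put_headers(data, content_type, token))
        return conn.getresponse().status
    finally:
        conn.close()
```

A call such as upload("coolcardgame", "meatshieldfighter.jpg", image_bytes, "image/jpeg", token) would then correspond to the raw request shown above, with the Host header filled in by the HTTP library from the connection's host name.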

Notice the Host header of the HTTP request; "coolcardgame" is the name of the bucket in which the object "meatshieldfighter.jpg" (obviously the artwork for the Meat Shield Fighter card) will be stored. Then, the game can later simply issue a GET for the same URL, perhaps directly as part of an "img" tag, like so:

    GET /meatshieldfighter.jpg HTTP/1.1
    Host: coolcardgame.storage.googleapis.com
    Content-Length: 0
    Authorization: OAuth 1/zVNpoQNsOSxZKqOZgckhpQ

This is part of the power of using the HTTP API directly: by using direct URLs into the Google Cloud Storage system, developers can let users’ browsers take full advantage of the rest of the Web infrastructure, including any and all caching servers in between their browser and the Google Cloud Platform servers. For this same reason, Google Cloud Storage supports a POST form of the PUT command to upload an object, specifically built to allow for Web forms to POST images (in this case) directly into the cloud.
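
As a small illustration of the direct-URL idea, the Python sketch below derives an object's URL from the bucket and object names used in this article and wraps it in an "img" tag. The helper names are hypothetical, and since a browser will not attach an Authorization header to an image request on its own, this assumes the object has been made publicly readable.

```python
from html import escape
from urllib.parse import quote


def object_url(bucket: str, object_name: str) -> str:
    """Return the bucket-in-host URL for an object."""
    # Percent-encode the object name so spaces and the like stay valid.
    return f"https://{bucket}.storage.googleapis.com/{quote(object_name)}"


def img_tag(bucket: str, object_name: str, alt: str) -> str:
    """Render an HTML img tag pointing straight at the stored object."""
    src = escape(object_url(bucket, object_name))
    return f'<img src="{src}" alt="{escape(alt)}">'
```

The game's page templates could call img_tag("coolcardgame", "meatshieldfighter.jpg", "Meat Shield Fighter") and let the browser, and any caches along the way, take care of the download.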

There’s more, of course, including the ability to use HEAD to obtain the metadata for an object and DELETE to delete objects and/or empty buckets, but the HTTP requests for those are easy to infer from the examples above. There’s also an access-control system that can be set on individual objects and buckets, as well as a mechanism to mark certain objects as "public" (meaning no Authorization header is necessary to retrieve them), all of which is described in the Google Cloud Storage documentation online.
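
As a hedged sketch of the HEAD case, the Python below issues the request with the standard library and pulls the custom metadata out of the response headers, assuming it comes back under the x-goog-meta- prefix as in the XML API. The names custom_metadata and head_object are invented for this example, and the "OAuth" scheme name again just mirrors the sample requests in this article.

```python
import http.client

META_PREFIX = "x-goog-meta-"


def custom_metadata(header_pairs) -> dict:
    """Pick the custom metadata out of (name, value) response header pairs."""
    return {
        name[len(META_PREFIX):].lower(): value
        for name, value in header_pairs
        if name.lower().startswith(META_PREFIX)
    }


def head_object(bucket: str, object_name: str, token: str):
    """Issue HEAD and return (status, custom metadata); performs network I/O."""
    conn = http.client.HTTPSConnection(f"{bucket}.storage.googleapis.com")
    try:
        conn.request("HEAD", f"/{object_name}",
                     headers={"Authorization": f"OAuth {token}"})
        response = conn.getresponse()
        response.read()  # a HEAD response has no body; this just finishes it
        return response.status, custom_metadata(response.getheaders())
    finally:
        conn.close()
```

Swapping "HEAD" for "DELETE" in the conn.request call gives the corresponding delete, which is about as far as the inference needs to go.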

Summary

Realistically, there’s not much more to say about Google Cloud Storage: it supports upload, download, enumeration and deletion of large binary objects. By using an HTTP-based API (what developer enthusiasts are more and more coming to call "web APIs"), Google makes it pretty trivial to do all of these things from within the browser as well as from an application server if necessary.

More importantly, the Google Cloud Platform, as we’ve seen, is a pretty full-featured, "batteries included" cloud platform, with all the core capabilities that developers have come to expect from a cloud API environment, as well as a few features that other cloud platforms lack. Overall, it’s a powerful platform, and one that developers need to examine carefully as an option for the next Web-based or mobile-based application they’re asked to build.

As the Google Cloud Platform evolves further, we’ll take a look at the various parts and pieces that emerge, but for now, we’ve covered the core parts, so good luck, get clouding, and happy coding!