At Movable Ink we use Amazon S3 heavily for storing millions of files and serving them to hundreds of millions of users. It has a number of compelling qualities: great performance characteristics and a durability guarantee of a blistering eleven 9’s, meaning Amazon replicates our data such that, in theory, 99.999999999% of objects are retained.
However, durability and uptime are not one and the same, as many S3 customers found out when an internal configuration issue impacted services on Monday morning. The problem affected buckets in the US Standard S3 region, the most commonly used US S3 region.
We’re pretty conscious of potential single points of failure, and tend to build in redundancy at multiple tiers: each layer is spread across multiple hosts, which are interconnected at multiple points to the layers above and below. This manifests as multiple load balancers, app servers, and availability zones, with the entire setup replicated across geographically separate datacenters thousands of miles apart. With all of that redundancy, of course we want our S3 serving to be redundant as well.
S3 buckets are tied to a geographical location, and most correspond to a single Amazon datacenter. US Standard, however, stores data on both the east coast and the west coast. Given that it can be accessed from either coast, my first concern was consistency: what would happen if you wrote data on one side and then immediately tried to read it from the other? We tested it and the reads were surprisingly consistent, which seemed strange for data supposedly being served from two different regions.
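A check along these lines might look like the following boto3 sketch. This isn’t the exact test we ran, and the bucket name is made up; it just writes objects and immediately reads them back to see whether a stale or missing read ever shows up:

```python
# Hypothetical write-then-read check. The bucket name is made up; AWS
# credentials are assumed to be configured in the environment.
import uuid
import boto3

# us-east-1 is the US Standard region.
s3 = boto3.client("s3", region_name="us-east-1")
BUCKET = "movableink-consistency-test"  # hypothetical

for i in range(100):
    key = f"consistency-check/{uuid.uuid4()}"
    body = str(i).encode()
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    # Read the object back immediately; a missing or stale response here
    # would suggest eventual consistency between facilities.
    resp = s3.get_object(Bucket=BUCKET, Key=key)
    assert resp["Body"].read() == body, f"stale read for {key}"

print("every read returned the freshly written data")
```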
It turns out there is no replication happening at all. S3 only writes to the facility that receives your request:
> Amazon S3 automatically routes requests to facilities in Northern Virginia or the Pacific Northwest using network maps. Amazon S3 stores object data only in the facility that received the request.
Given this, we should really be treating US Standard as a single point of failure. So how can we make it redundant?
The strategy we take is to store data in two different S3 regions, then come up with a way to point users and our backend services at whichever region is currently active. AWS has a couple of tools to facilitate the former: S3 supports file creation notifications to SNS or SQS, and you could set up AWS Lambda to automatically copy files to a different region. But even better, a few months ago Amazon released Cross-Region Replication to do exactly what we want. Setup is simple (a boto3 sketch of these steps follows the list):
- Turn on versioning on the source bucket. This comes at an extra cost, since you pay to store every previous version of your files, but since we’ve already decided that this data is very important, it’s worth it; after all, we’re already committing to doubling our storage costs by keeping a second copy.
- Turn on cross-region replication. As part of the setup, you’ll create another versioned bucket in the destination region and an IAM role that allows S3 to replicate data between the two.
- Do a one-time manual copy of all of your files from the source bucket to the destination bucket. Replication only copies files that are added or changed after replication is enabled. Make sure the permissions are identical.
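Here is a rough boto3 sketch of those three steps. The bucket names, regions, and IAM role ARN are all hypothetical, and the role (with its permissions) has to exist before the replication call succeeds:

```python
# Hypothetical setup sketch: bucket names, regions, and the role ARN are made up.
import boto3

SOURCE = "movableink-assets-east"   # hypothetical source bucket (US Standard)
DEST = "movableink-assets-west"     # hypothetical destination bucket
ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication"  # hypothetical

s3_east = boto3.client("s3", region_name="us-east-1")
s3_west = boto3.client("s3", region_name="us-west-2")

# 1. Versioning must be enabled on both buckets before replication can be configured.
s3_east.put_bucket_versioning(
    Bucket=SOURCE, VersioningConfiguration={"Status": "Enabled"}
)
s3_west.put_bucket_versioning(
    Bucket=DEST, VersioningConfiguration={"Status": "Enabled"}
)

# 2. Replicate every object added or changed in the source bucket from now on.
s3_east.put_bucket_replication(
    Bucket=SOURCE,
    ReplicationConfiguration={
        "Role": ROLE_ARN,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",  # empty prefix matches all keys
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{DEST}"},
            }
        ],
    },
)

# 3. One-time backfill of objects that existed before replication was enabled.
#    copy_object is a server-side copy; objects over 5 GB need a multipart copy,
#    and permissions/ACLs are not carried over automatically.
paginator = s3_east.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE):
    for obj in page.get("Contents", []):
        s3_west.copy_object(
            Bucket=DEST,
            Key=obj["Key"],
            CopySource={"Bucket": SOURCE, "Key": obj["Key"]},
        )
```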
Now every time we add a file to the source bucket, it is (eventually) replicated to the destination bucket. If all of our access is through our backend services, this may be good enough since failing over is a simple configuration change. But many of the references to our S3 buckets are buried in HTML documents or managed by third parties. How can we make it easy to switch between the buckets?
Our initial idea was to set up a subdomain on a domain we control as a CNAME to our S3 bucket, then handle failover with DNS. S3 allows this, with one big caveat: the bucket must be named exactly the same as the domain. If you want to reference your S3 bucket as foo.example.com, the bucket itself must be named foo.example.com, with the CNAME pointing at foo.example.com.s3.amazonaws.com. Combined with the fact that bucket names are globally unique across all regions, only one bucket can ever be referenced from foo.example.com, so this doesn’t work.
Amazon has a CDN service, CloudFront, which allows us to set an S3 bucket as the origin for a CDN distribution. We can then CNAME our subdomain to our CloudFront distribution’s endpoint. In the event of a regional S3 failure, we update CloudFront to point at our backup S3 bucket. You can either turn on caching and reap some latency benefits, or set the time-to-live to zero so the distribution acts as a pass-through.
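The failover itself can be scripted. Below is a hypothetical boto3 sketch that repoints a distribution’s S3 origin at the backup bucket; the distribution ID and bucket domain are made up:

```python
# Hypothetical failover script: the distribution ID and backup bucket domain are
# made up. CloudFront rolls the updated config out to its edge locations over
# the next several minutes.
import boto3

cloudfront = boto3.client("cloudfront")
DISTRIBUTION_ID = "E1EXAMPLE2FAKE"                         # hypothetical
BACKUP_ORIGIN = "movableink-assets-west.s3.amazonaws.com"  # hypothetical

# The current config comes with an ETag that must be echoed back on update.
resp = cloudfront.get_distribution_config(Id=DISTRIBUTION_ID)
config, etag = resp["DistributionConfig"], resp["ETag"]

# Repoint every origin at the backup bucket.
for origin in config["Origins"]["Items"]:
    origin["DomainName"] = BACKUP_ORIGIN

cloudfront.update_distribution(
    Id=DISTRIBUTION_ID, IfMatch=etag, DistributionConfig=config
)
print("failover submitted; CloudFront will propagate the change to its edges")
```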
We would have preferred to set up two CloudFront distributions and switch between them with DNS, but Amazon has a similar restriction: no two distributions can share the same CNAME. Even so, this setup lets us respond to an S3 outage in minutes, routing traffic to an unaffected region. In our tests, the failover fully completes in 5 to 10 minutes.
Building applications in the cloud means expecting failure, but it’s not always straightforward, especially when using third-party services like S3. Even with our final setup, it’s not completely clear what CloudFront’s dependencies and failure modes are. But importantly, we control the DNS, so we can implement our own fixes rather than waiting for Amazon.
If you’re interested in working on challenging problems like this, check out Movable Ink’s careers page.