Movable Ink Engineering

Editing Git Hunks for Fun and Profit
“Fun and Profit” — Photo credit to Tim Gouw

If your development process is anything like mine, you tend to make lots of changes and worry about organizing the commits afterwards. This can lead to some frustration in separating the changes into neat, logical increments.

Add changes by patch

Adding changes with the --patch flag makes telling a clear commit “story” easy. For example, after making a few minor changes to my dotfiles readme, running git add --patch outputs the following hunk:

$ git add . --patch
diff --git a/README.md b/README.md
index 741178e..447cdb7 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
## Custom files for various environments

-Miner change.
+Minor change.

Some context.

-Major change.
+Take a look at this major change.
Stage this hunk [y,n,q,a,d,s,e,?]?

The important part of this output is the last line —
Stage this hunk [y,n,q,a,d,s,e,?]?
These options let you decide what to do with each hunk of changes.

  • y — stage this hunk
  • n — do not stage this hunk
  • q — do not stage this hunk or any of the remaining ones
  • a — stage this hunk and all later hunks in the file
  • d — do not stage this hunk or any of the later hunks in the file
  • s — split the current hunk into smaller hunks
  • e — manually edit the current hunk
  • ? — print help (displays this list)

The s option is only present when git knows how to split up the hunk. Often, this is a good option to use before doing any manual editing. If we were to split up our above example, the first hunk would look like this:

diff --git a/README.md b/README.md
index 741178e..447cdb7 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
## Custom files for various environments

-Miner change.
+Minor change.
Stage this hunk [y,n,q,a,d,e,?]?

Manually editing a hunk

Once we’ve decided on a hunk to edit, we can enter the e option and we’re dropped into our default editor (as specified by $EDITOR).  The file will contain the hunk as well as a few comments describing how to edit the hunk. For this example, I’ve elected not to split the hunk prior to editing.

# Manual hunk edit mode -- see bottom for a quick guide.
@@ -1,5 +1,5 @@
## Custom files for various environments

-Miner change.
+Minor change.

Some context.

-Major change.
+Take a look at this major change.
# ---
# To remove '-' lines, make them ' ' lines (context).
# To remove '+' lines, delete them.
# Lines starting with # will be removed.
# 
# If the patch applies cleanly, the edited hunk will immediately be
# marked for staging.
# If it does not apply cleanly, you will be given an opportunity to
# edit again.  If all lines of the hunk are removed, then the edit is
# aborted and the hunk is left unchanged.

From here we can negate line additions or deletions. If we make any other edits, we should ensure that the edited line is prepended with a ‘+’.  For example, we can change our example like so:

@@ -1,5 +1,5 @@
## Custom files for various environments

Miner change.
# Here we've negated the line deletion
# Notice that the line is prepended with a ' '

Some context.

-Major change.
+Take a look at this enormous change!
# Here we've changed the added line to say something different
# In my experience, this is the most useful type of edit
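
After saving and exiting the editor, it’s worth double-checking what actually got staged. Two standard git commands make this easy:

$ git diff --cached   # shows only the changes that are staged
$ git diff            # shows the changes that remain unstaged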

Conclusion

I use git’s patch adding every day and honestly can’t believe how long I’d been writing code without it. The intent of this article was to provide a simple walkthrough of git add --patch along with a brief explanation of its features.

The best way to learn about this git feature is, of course, using it directly, but hopefully this article cleared up any points of confusion or made the process more approachable. If not, check out these posts on the same subject by Markus Wein and Kenny Ballou.

Crowdsourced Password Security

About a month ago, my roommate came home from work and told me that her bank account was hacked and drained of a huge amount of money. She has been saving for business school for about 8 years so the amount stolen was in the mid-five-figures. While we will probably never definitively know how hackers gained access to drain her account of that much money, it’s my guess that it was because she was reusing passwords and her bank account’s reused password was compromised in a leak.

Upon learning about this, I started thinking about all of the people in my life that could easily fall victim to this. I’ve seen everything from people keeping passwords written down in a note in their phone to people who use the same 4-5 passwords on every website they log into. Did I mention I used to be one of these people? These days, I use a password manager that my fellow engineers at Movable Ink helped me set up. So far, I’m up to 360 accounts that are saved in my vault. Did you catch that - three hundred and sixty accounts! Nobody could possibly remember 360 strong passwords.

After asking a few of my close friends if they would be interested in learning more, and hearing that they would, I decided to create a 1-pager. The 1-pager is a simple guide of what to do to make your accounts more secure by getting a password manager, setting up MFA, migrating accounts into the password manager, and changing your passwords.

After I was finished writing up everything I knew myself, I decided that our engineering team was the best resource I had to solicit feedback on my 1-pager. I wrote a little blurb on what inspired it and explained that I wanted to be sure it was at a level that even someone like my parents could understand, and I slacked it out and asked for any and all feedback. I was really pleased with all of the helpful feedback I received. Feedback ranged from places where I could add a link, to other topics I could touch on, like suggesting a credit report freeze. It felt great that our engineering team was so supportive and genuinely wanted to help—some even said they would send the 1-pager to their parents as well.

Around the same time I was writing up the one-pager, our women’s Employee Resource Group, Movable Pink, was in talks with Techfest Club about hosting a talk in our office. The Techfest founder had seen my 1-pager and asked that I give a talk on this. Because I had already written the 1-pager and gotten such great feedback on it, I felt completely comfortable and prepared for this.

The talk went really well and I even had someone at Movable Ink ask to sit down so I could do an even more in-depth overview for her. There’s strength in numbers, and it can be really helpful to have someone that has been through this process serve as a resource for you when you’re getting started. Here is a copy of my 1-pager. Feel free to share it with friends or family, or use it yourself!

Pre-Mortems

Winter holidays, starting with Thanksgiving, are the busiest time of the year at Movable Ink. We generate and serve billions of images a day, and ramping up to the holidays we often see our daily traffic double or triple. We need to be able to respond to events as well as anticipate what might go wrong.

In preparation, we hold a pre-mortem: assume that a month from now everything is on fire, how did we get to this point? What can we do to prevent it?

Pre-mortems ideally involve as much of the engineering organization as possible because when we're dealing with unknown unknowns, many different perspectives are helpful. We gather in a room and each engineer gets a set of sticky notes and a marker. On each sticky note, they write down issues that they can imagine might arise, focusing on high probability or high impact. These can range from "a newly-rolled-out service runs out of resources due to insufficient capacity planning" to "Slack is down and we can't coordinate disaster recovery".

After 10 or 15 minutes, we place our stickies on the whiteboard, a volunteer facilitator arranges them by category, and we discuss each one. Many times, the discussion leads to new stickies being produced. As we discuss each potential issue, we decide on the severity and determine if and how we might prevent it. In some cases there will be a straightforward prevention, such as adding redundancy or more error handling. In other cases, the issue might be vague or unlikely enough that we don't address it directly but instead decide to add extra logging or alerting. (Side note: while dealing with the aftermath of an incident, I almost always look back and wish we had more logging in place.) Or we might explicitly decide to accept a risk and take no action; for example, we decide not to run a redundant database cluster for some less-important data because recovering from backups is good enough and we have identified that the cost is not worth it.

One interesting thing we've found is that it's often not enough to solve the immediate problem, we also have to think about the effects of our solutions. Two simplified examples:

  • We expect to see a large increase in request throughput, so we have autoscaling horizontally scale out our application servers. As a result, we hit the maximum number of connections on our databases. Action item: scale out database replicas, ensure proper monitoring and alerting for connection limits.
  • A caching server for a particular service represents a single point of failure. We add additional caching servers, then notice a decrease in cache hit rate which has downstream effects. Action items: configure application servers to choose cache server priority order based on cache key, ensure monitoring and alerting based on cache hit ratio, scale cache servers vertically where feasible.

After we've created action items, we prioritize them and they go into regular sprint development queues. It's important to start early, as the action items may need to evolve as we implement them.

Neither we nor the development community invented the ideas behind pre-mortems; they're very well-understood in the business and operations world under the umbrella of Risk Assessments. Where our priorities are loosely based around likelihood and impact, something like a Risk Assessment Matrix can help you categorize what you should work on:

Typical (simplified) example of a Risk Assessment Matrix

You can use a Risk Assessment Matrix to determine how important the risks are and what you should do about them: you can redesign your system to eliminate risks, change your processes and procedures to mitigate them, put in metrics to monitor them, or even accept them (or any combination of the above). For instance, high-likelihood, low-impact risks may be worth trying to eliminate if you can do so cheaply, whereas moderate-impact but very low-likelihood risks might be accepted and monitored.

The pre-mortem is just one tool we use for keeping things running smoothly; if things do go wrong a post-mortem can illuminate events or failure modes we didn't expect. And our weekly kaizens, meetings where we reflect on what went well and where we can improve, are a useful tool for consistently ratcheting up the quality of our processes and systems.

Scaling Past a Billion - HAProxy, Varnish, and Connection Tracking

Of IPTables, Netfilter Connection Tracking, and Fanout Servers

We use HAProxy for load balancing and Varnish as a caching HTTP reverse proxy in quite a few places.  While these are two different technologies that serve two different purposes, they’re both types of “fanout” servers, meaning a client connects on the frontend and their request is routed to one of many backends.

In our environment, we are routinely handling 20,000 connections per second and frequently bursting past 30,000 connections per second.  One of our most common use cases is serving up an image and then tearing down the connection.

DDoS or Gremlins?

A while back, we had an issue where we were seeing mysterious timeouts and dropped connections from the Internet to the HAProxy layer. Our initial working theory was that we were experiencing a DDoS attack, because this issue was occurring against a piece of infrastructure that had been rock solid for a significant amount of time. However, analysis of our logs argued firmly against the DDoS theory. We then formulated and investigated multiple hypotheses for what could be causing these issues, ranging from application configuration to operating system tuning to network bandwidth limitations.

Definitely Gremlins

As with a lost set of keys, the thing you’re looking for is always in the last place you look. After chasing down several rabbit holes, we finally noticed that dmesg on the HAProxy machines logged quite a few entries with nf_conntrack: table full, dropping packet. In addition to access control at the edge of our network, we run local firewalls on every machine to ensure that only the intended ports are exposed. IPTables has a modular architecture that makes it flexible. One of the modules in use by default is the Netfilter Connection Tracking module.

Connection Tracking

Netfilter’s Connection Tracking System is a module that provides stateful packet inspection for iptables. The key word in the last sentence is stateful. Any stateful system must maintain its state somewhere. In the case of nf_conntrack, that state is tracked in a hash table to provide efficient lookups. Like almost anything else in a Linux system, the hash table is configurable via two sysctl values, nf_conntrack_buckets and nf_conntrack_max. Per the kernel documentation, nf_conntrack_buckets determines the number of buckets in the hash table, and nf_conntrack_max determines the maximum number of nodes in the hash table. These two values directly impact the amount of memory required to maintain the table and the performance of the table. The memory required for the table is nf_conntrack_max * 304 bytes (each conntrack node is 304 bytes), and each bucket will hold, on average, nf_conntrack_max / nf_conntrack_buckets nodes.
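
If you suspect you’re hitting the same limit, the counts are easy to check. A rough sketch (exact sysctl names can vary by kernel version, and on older kernels the bucket count is read-only via sysctl and set instead through the nf_conntrack hashsize module parameter):

# how close the table is to its ceiling
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
sysctl net.netfilter.nf_conntrack_buckets

# temporarily raise the ceiling, e.g. to ~1M entries (roughly 300MB at 304 bytes each)
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576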

Homing in on a solution

After quite a bit of googling, tweaking, load testing, and trial and error, we arrived at a set of configuration changes that significantly increased the capacity of each of our HAProxy and Varnish systems. As it turns out, our priority is heavily weighted toward maximizing the number of connections each “fanout” server is able to handle. This meant that nf_conntrack in any form was a major bottleneck for us. In the end, we opted to forgo stateful packet inspection at the load balancer layer and disable nf_conntrack entirely. We blacklist the nf_conntrack and nf_nat modules via /etc/modprobe.d/blacklist-conntrack.conf:

blacklist nf_conntrack
install nf_conntrack /bin/true
blacklist nf_nat
install nf_nat /bin/true
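
Depending on the distribution, the blacklist may not take effect for modules loaded early in boot until the initramfs is rebuilt and the machine rebooted. A quick way to confirm the modules are gone afterwards (Debian/Ubuntu-style commands shown; adjust for your distro):

# Debian/Ubuntu: rebuild the initramfs so the blacklist also applies at boot
sudo update-initramfs -u

# after a reboot, this should print nothing
lsmod | grep -E 'nf_conntrack|nf_nat'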

Rinse and repeat

When you have a new favorite hammer, everything starts to look like a nail.  Once we resolved the bottleneck at the top of our stack, the issue shifted further down to the Varnish layer. Fortunately, we were able to use the experiences from our HAProxy layer to identify and resolve the issue far more quickly.

Doorman Authenticating Proxy

At Movable Ink, we have a number of different backend services, many of which have administrative web interfaces. We want to give employees access to these services without opening the services up to the world, so we need to authenticate user access. Some of the services have built-in authentication, but few support delegated authentication and we didn’t want to have to manage users in many different places.

All of our services live firewalled off inside VPCs in Amazon’s EC2. One method for allowing access might be to let our employees set up VPN connections into our datacenters. With the VPN connected, employees could make internal requests via direct access to our internal network.

The problem with that approach is that we have different levels of authorization within our organization. Someone on our devops team may need direct access into our internal network, but many of our employees may just need access to a handful of web services. It’s possible to implement fine-grained access control using networking, but it is complicated and error-prone. Google recently blogged about the same drawbacks.

We developed Doorman to act as an authenticating proxy between the world and our internal services. It’s written in node.js, and leans heavily on node-http-proxy for proxying and everyauth for service authentication. We use express’s middleware pattern to chain everything together. An annotated excerpt from Doorman:

var app = express();

// Log the request to stdout
app.use(logMiddleware);

// Redirect http requests to https
app.use(tls);

// If the user is already authenticated, read their signed cookie
app.use(cookieParser(conf.sessionSecret));

// Create the user's session from the cookie
app.use(doormanSession);

// Display flash messages when appropriate
app.use(flash());

// Ensure the user is valid and allowed to access the service, this
// also halts the chain and proxies to the internal service if successful
app.use(checkUser);

// Parse POST bodies
app.use(bodyParser.urlencoded({extended: false}));

// Register paths for oAuth redirects and callbacks
app.use(everyauth.middleware());

// `/_doorman` serves assets for doorman login form
app.use(express.static(__dirname + "/public", {maxAge: 0 }));

// display the login page if the user isn't authenticated
app.use(loginPage);

Each middleware runs successively and has the opportunity to modify the request/response, halt it and return, or continue down the chain. For example, the tls middleware tells the response to send a 301 Moved and halts the chain, while cookieParser tacks some information onto the request object and continues down to the next middleware. In the event none of the middlewares halt the chain and finish the response, we serve a 404 Not Found.
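
As an illustration of that halt-or-continue pattern, a redirect middleware might look something like the following. This is a sketch, not Doorman’s actual tls middleware, and it assumes TLS terminates at a proxy that sets the x-forwarded-proto header:

// sketch of a middleware that either halts the chain or passes control on
function tls(req, res, next) {
  if (req.headers['x-forwarded-proto'] === 'https') {
    // already secure: continue to the next middleware in the chain
    return next();
  }
  // otherwise halt the chain by finishing the response with a redirect
  res.redirect(301, 'https://' + req.headers.host + req.url);
}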

One particular challenge we ran into was how to use Doorman for multiple internal services at once. A simple way would be to run a separate Doorman app for each internal service, but it’s often overkill to have a dedicated Doorman for each service. We’re planning on introducing multiple domain support, which can route to different internal services based on the hostname, and have different access control for each service. Due to SSL limitations, it will likely require that all of the internal services be on different subdomains and have Doorman configured to use a wildcard certificate. Another possibility would be to use LetsEncrypt with SNI support.

If you’re interested in using or contributing to Doorman, you can clone it on GitHub.

Building a Rich Text Editor

Building a rich text editor might as well be a rite of passage for frontend developers. Most people doing any kind of digital work have used this type of editor before, and most content-based websites have some kind of unique text editor in their interface. A textarea might be good enough for 90% of use cases, but that last 10% is where you’ll find some of the more recognizable text editors on the Internet.

A rich text editor is a type of WYSIWYG (“What You See Is What You Get”) editor: a catch-all term for a text editing interface where what the user sees in the editor closely resembles what the end-user will see. Whether you’re writing a blog post, an email, or a tweet, you’re probably using a custom-built WYSIWYG text editor. And it was probably a pain.

Before we talk about how these editors can be built, let’s take a quick look at what’s out there.

Google’s Gmail service is obviously massively popular. Countless emails are drafted up in this editor every day. Most emails fall into two camps: all-text for basic communications, or lots of images stitched together, more commonly used for marketing. Gmail’s editor is clearly more focused on text, giving you controls for bolding and italicizing your text, as well as adding header styles and properties like text alignment. You can also add images into the email, with an easy drag and drop interface. The important thing to take note of is that the email appears the same to you in the editor as it will in the recipient’s inbox. If you select text and click “bold”, that information becomes visually apparent immediately. If you add an image, you won’t see <img src='images/your-image-here.jpg' />; you’ll see the image as soon as it’s done uploading. There’s no guesswork involved in what your users will see.

Another editor you might be familiar with is Twitter’s new tweet box.

At first glance, the box is pretty straightforward. You have a faded placeholder message prompting you to type into it. You can spread your tweet out across multiple lines, but there’s no option for bold or italic text. You can’t add inline images like Gmail lets you. However, Twitter’s doing some interesting find-and-replace work in this box as you type. If you type a hashtag with the # symbol, or tag another user with an @, Twitter will immediately swap out the boring, black text, with their branded blue color, making those pieces of text look more like a hyperlink. Again, you’re seeing what the end-user (in this case, your followers) will see.

Twitter knows to remove the hyperlink as soon as the # or @ are removed, and for some of the bigger branded hashtags, they’ll even add extra content into your tweet for you in the form of custom emojis.

Pretty slick.

If you’re a developer reading this, you’re probably familiar with GitHub’s interface. GitHub takes a more functional approach to text editing; instead of giving live visual updates as you type up an issue, pull request or comment, they give users a tabbed editing box, allowing you to toggle between “Write” and “Preview” views of your content easily.

GitHub uses their own flavor of Markdown syntax, a common formatting language that transforms contextual symbols into HTML. Surrounding text with **asterisks** results in text wrapped in strong tags. Using _underscores_ gives you italicized text wrapped in em tags.

This editor doesn’t give us the kind of immediate visual feedback that other editors do, but it does make the serialization layer very apparent. Markdown is an easy syntax to recognize, and once you’re familiar with it, it’s easy to translate into what the approximate HTML output would be. The format is portable, lightweight and simple enough for non-technical users to grasp, even if they’re not seeing exactly what the end-user is going to see.

Since GitHub’s not directly storing user-submitted HTML, they’re also protecting themselves against some script injection threats. Beyond that, simplifying the data that they’re working with helps to make the logic for the editor itself simpler.

The challenges of building an editor

Let’s think about how we could implement this with an Ember component.

Ember is our preferred front-end framework at Movable Ink, because it gives us a very structured system to manage the complexity of our applications. The “What is Ember?” section of the official Ember guides describes it as “a JavaScript front-end framework designed to help you build websites with rich and complex user interactions.” This editor definitely qualifies as rich and complex, making Ember a great choice for such a project. If you’re not already familiar, you may want to read into the Ember guides a bit further.

Your first instinct might be to start with an <input type='text' /> tag, but then we lose the ability to show line breaks in the text. Using a textarea tag fixes that problem for us.

So we’ve got our textarea, and if we know we’re going to try to handle a find and replace for those hashtags, we’ll probably be using some kind of RegExp match and an event handler. If we look very closely, you can see that the blue highlighting for a hashtag doesn’t get added until after at least one character appears next to the # sign, so Twitter probably isn’t using a keydown or input handler. If they were, we wouldn’t see that quick flash as the black text gets wrapped in their link styling.

We tend to use DockYard’s excellent ember-one-way-controls addon for our form fields, as they help enforce the Data Down, Actions Up (DDAU) philosophy which Ember has embraced.

When a user types into the textarea, the one-way-textarea component logic reacts to an input event, in which it pulls the value property off of the textarea, and passes that value to whatever update action has been defined on it. The update action is expected to change the value attribute that’s passed into the component, triggering a re-render and displaying the new, updated value to the user inside the textarea.

Something like this should serve us:

// controllers/application.js
export default Ember.Controller.extend({
  text: 'I am a #Jedi, like my father before me.',

  actions: {
    updateText(newValue) {
      this.set('text', newValue);
    }
  }
});
// components/tweet-box.js
export default Ember.Component.extend({
  hashtagPattern: /(^|[^>])(#[\w\d]+)/g,

  html: Ember.computed('text', function() {
    const text = this.get('text');
    const pattern = this.get('hashtagPattern');

    return text.replace(pattern, "$1<span class='hashtag'>$2</span>");
  })
});
// templates/application.hbs
{{tweet-box text=text onchange=(action 'updateText')}}

<style>
  .hashtag { color: blue; }
</style>
// templates/components/tweet-box.hbs
{{one-way-textarea value=html update=(action onchange)}}

This looks decent. Our RegExp will match any text prefixed with a # that isn’t already wrapped in a span tag, and surround it with opening and closing tags. We’re checking that the hashtag isn’t already wrapped pretty naively, but for a quick demo, it’ll do. If you load up this example in Ember Twiddle, you’ll quickly realize that we’re not getting what we want anyway.

The immediate problem is that we’re not seeing the visual feedback we wanted. The link should be colored blue, and we shouldn’t be seeing the HTML that’s being generated. There’s also some nasty cursor behavior going on: whenever our find-and-replace runs and wraps a new hashtag, the cursor immediately jumps to the end of the textarea.

textarea tags and normal input fields can’t render HTML inside of them. Fortunately, there’s a widely available native solution in the contenteditable HTML attribute. Setting nearly any HTML tag to use contenteditable="true" will let you type inside the element as if it was a textarea tag. However, since you’re editing the HTML directly, and not actually transforming it into a form field, it will still render the way you want, respecting whatever CSS you’ve applied to the page.

If we update our component to use a <div contenteditable="true"> tag instead of a textarea, things start to look a little better. We don’t have access to most of the event handlers that input and textarea tags have, but contenteditable elements do fire an input event. We can pull the innerHTML property of our editable div inside that event, and use that in place of our nice update action in the DockYard {{one-way-textarea}} component.

Here’s our updated Twiddle.
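
The shape of that change is roughly the following (a simplified sketch, not the exact code from the Twiddle):

// components/tweet-box.js (sketch)
export default Ember.Component.extend({
  // render the component's own element as an editable div
  attributeBindings: ['contenteditable'],
  contenteditable: 'true',

  // contenteditable elements still fire `input`, which Ember's event
  // dispatcher forwards here; send the raw markup up to the controller
  input() {
    this.get('onchange')(this.element.innerHTML);
  }
});

The component’s template renders the computed html with triple curly braces ({{{html}}}) so the generated span tags are treated as markup rather than escaped text.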

It’s looking better, but interacting with the editable div is even worse than before. Typing anything into the div moves our cursor to the front of the text. Unless you’re typing in a language read right-to-left, this is less than ideal.

Further investigation will reveal more problems, some that come from Glimmer (Ember’s rendering engine) throwing up its hands when trying to understand what you’re doing inside the editable div. Clearing the contents of the div throws an error, because you’ve unexpectedly removed HTML nodes that are invisible to you.

The contenteditable attribute is a double-edged sword. It gives you some really nice behavior out of the box, including the ability to toggle bold and italicized text with the expected shortcuts (Command+B or I in macOS). Pasting HTML inside the editable element will render it as you would expect, with full styling.

Aside from that, the contenteditable attribute is not going to get you to the finish line of building a rich text editor without putting in more of the heavy-lifting yourself, and the more custom behavior you add into the editor, the more you’ll have to wrestle with the browser’s native behavior.

If you’ve been tasked with building one of these editors, the contenteditable attribute might seem like a quick solution, but it’s a building block at best. A good solution will probably involve using the attribute in some way, but you’re still going to need to get your hands dirty to give your users a better experience.

The bottom line is that there are no easy solutions for building a text editor with a great user experience. It’s a common problem to solve, but oftentimes, your particular needs are going to require something bespoke.

Our Experience

At Movable Ink, we recently released a new WYSIWYG editor inside our web platform called Studio. Prior to the release, most of the content created in our platform was built by extending some shared logic on top of a custom template, which would be built out using HTML, JavaScript and CSS. Our goal was to lower the barrier to entry for making your own content, while still delivering reliable, intelligent content to our clients, and our clients’ customers.

Studio lets you create those templates using an interface that would be familiar to anyone that has used programs like Sketch, Photoshop or PowerPoint; elements on the page are created by clicking and dragging in a workspace, using simple tools. However, the more interesting parts of our platform are centered on delivering content that isn’t fully determined until an email is opened. A dynamic image might be built with our platform that includes your customer’s name and some account details, for example, but representing that template inside Studio in a way that clearly communicates how it will change is challenging. Studio has made it much easier to create the visual template that your content will be presented inside, but we faced a fairly unique challenge of making sure that users understood that the text that they see inside their workspace is not what the end-user will see.

We use a token system inside fields that hold “dynamic text” to show that parts of the text will change at time-of-open. A personalized message might be shown in our interface as “Hello, [mi_name]!”, but a user unfamiliar with our system might read that and be confused as to what “mi_name” represents. We decided to build out our own rich text editor that would perform a similar kind of find-and-replace logic to what other editors use: if a user types something matching our token format, it should immediately be visually distinguished from the rest of the static text surrounding it. We also wanted our text editor to support the same kind of standard controls that others use to apply bold or italic styles to text.

We’re still in the process of expanding our text editor to cover all of our needs, but we’ve come to a few conclusions in our development process. Here’s some advice that could help guide your own plans.

It’s dangerous to go alone!

Be invisible

Editing text is such a core part of working on a computer that most users don’t think twice about the things the editor manages for them. There are lots of things the editor is keeping track of that you might take for granted. A good editor shouldn’t sacrifice any of those things. Users probably won’t be very impressed when you get those things right, but they’ll immediately realize when something doesn’t work as they expect. Your implementation should feel like it’s native to the browser. This includes things like copying and pasting text, using the undo and redo shortcuts, cursor movement, being able to make text bold or italic, indenting text, and others. All of these minor interactions add up to something familiar to most users. Stepping outside of that familiar territory will make your editor feel half-baked. Your goal should always be to make your work as invisible to the end-user as possible.

Use the tools that you’re given

The contenteditable property will get you close to a rich text editor, but as you add your own abstractions on top of that API, it’s important to make sure you’re not stepping in front of the browser unnecessarily. The browser handles all of those small features we just talked about, requiring nothing from your end. You’ll probably find yourself needing to recreate some of that behavior yourself, but use as light a touch as you can there. For example, our tweet-box component that we built earlier had an issue where it would reset the user’s cursor position anytime its find-and-replace logic ran on a hashtag. That’s unacceptable, but your solution should be as minimal as possible; if you need to save and restore the cursor’s position at certain moments, that’s fine, but taking it a step further and controlling all of the cursor-movement logic yourself could quickly become a tangled nightmare. Modern, major browsers are built and maintained by large teams, who are paid to build and test efficient, clean platforms for you to build on top of. Leveraging the work those teams have done for you as much as possible frees you and your team to build new functionality rather than reinventing the wheel.

Understand the tools that you’re given

If you’re going to be doing any kind of programmatic editing inside the field (like toggling markup or wrapping text in tags as it’s typed), there are a few browser APIs you’re going to want to become an expert in. The Selection and Range objects are good tools that are available to you to manage cursor positions and apply changes to a section of a tag, or to a group of tags. You’re going to want to brush up on DOM traversal in general. The Node and Element documentation is a good place to start.
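
For instance, capturing and restoring the caret can look roughly like this (a minimal sketch; if your find-and-replace swaps out the DOM nodes the range points at, you’ll need to re-resolve the saved offsets against the new nodes):

// sketch: remember where the caret is before mutating the editor's contents
function saveSelection() {
  const selection = window.getSelection();
  if (selection.rangeCount === 0) { return null; }
  return selection.getRangeAt(0).cloneRange();
}

// ...and put it back afterwards
function restoreSelection(range) {
  if (!range) { return; }
  const selection = window.getSelection();
  selection.removeAllRanges();
  selection.addRange(range);
}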

Assume malicious users will find your interface

Anytime you’re working with user-submitted HTML, security should be on your mind. Script injection attacks are a real threat, and even if you believe your users are trustworthy, it’s important not to give them the benefit of the doubt when it comes to security. If anyone outside of your team has access to your platform, you should expect to face some form of malicious user at some point. This part of the development process is going to be largely dependent on your specific use case, but generally, never take user input directly. Ideally, you should have a serialization layer between what the user types and what HTML gets generated. Allowing users to type or paste HTML into your field is dangerous, so you should do what you can to take input and turn it into HTML yourself. At the very least, you should whitelist which tags are allowed in your HTML, and escape anything that doesn’t make the list.
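
The simplest version of that serialization layer is to escape the user’s raw text before you wrap any of it in your own markup, something along these lines (a sketch; a real implementation should also whitelist allowed tags and attributes and handle pasted content):

// sketch: escape raw user text so it can't smuggle markup into your editor
function escapeHTML(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}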

Rely on the community whenever possible

It’s tempting to look at your specific challenges as completely unique, but chances are, other teams and other developers have faced the same challenges at some point, and hopefully, have shared their solutions. There are already multiple well-designed editors out there that will probably fill your needs. You might normally prefer to roll your own solutions to cut down on vendor bloat, but in this case, it’s usually better to fall back on a community-provided solution for a number of reasons. If you find an open source package with even a small user base, you’ve already got more people beyond your own team invested in making sure that the tool is a good one.

Some of the more fully-featured editors we’re fans of are Quill and Bustle Labs’ Mobiledoc-Kit.

Ultimately, your primary goal should be to create a seamless experience for your users. With the right amount of preparation, you can deliver that experience. But if you try to dive into building a rich editor without doing the due diligence of building acceptance criteria, investigating your available options, and testing your work thoroughly, you’re going to have a bad time.

Rollup Analytics Counters using PostgreSQL

When we power email content, our customers want to see analytics about how it performed. We process some reports nightly as batch jobs, but other metrics benefit from the ability for customers to see them in real time. We use counters to store metrics such as the total number of email opens, clicks, and conversions, each split by device, location, and a few other dimensions. We also segment by time, recording hourly, daily, and total metrics. Any time a user opens an email, we end up incrementing between 20 and 40 counter values. At 30,000 opens per second, we can see many hundreds of thousands of counter increments per second.

While two previous iterations of a system to hold this data used MongoDB and Cassandra, respectively, our current system uses PostgreSQL (a.k.a. “Postgres”). We selected Postgres for a couple of reasons: first, we already use Postgres as our primary data store, so we had a good understanding of its production performance characteristics, how its replication works, and what to look for in the event of issues. We also found the SQL interface to be a natural fit with the ORM that our dashboard uses (Rails’ ActiveRecord).

Storing counters can happen in a couple of different ways. A simple way to implement it is as a read/write operation where you do a read to see the current value of the counter, lock the row, increment it, write the new value, and unlock the row. This has the benefit of requiring a small amount of storage space (a single int) but has the drawback that only one writer can be active at any given time due to the row lock. An alternate approach is to write every increment as its own row and either do periodic rollups or just sum up the values on request. The first approach, despite its limitations, won out for us due to its simplicity and small storage size.

-- simplified Postgres function for incrementing a counter

CREATE FUNCTION oincr(incr_key character varying, incr_value bigint)
RETURNS void AS $$
  BEGIN
    UPDATE counters
      SET value = value + incr_value
      WHERE key = incr_key;
    IF found THEN
      RETURN;
    END IF;

    INSERT INTO counters(key, value)
      VALUES (incr_key, incr_value);
    RETURN;
  END;
$$ LANGUAGE plpgsql;

While this is approximately how we have implemented counters since Postgres 9.2, Postgres 9.5 has added upsert support which simplifies this. It’s typically more performant and free of race conditions:

-- Postgres 9.5+ increment function

CREATE FUNCTION incr(incr_key character varying, incr_value bigint)
RETURNS void as $$
  BEGIN
    INSERT INTO counters(key, value)
      VALUES (incr_key, incr_value)
      ON CONFLICT (key) DO UPDATE
        SET value = counters.value + incr_value;
  END;
$$ LANGUAGE plpgsql;

The single most important thing we did along the way was to decouple our app servers from our counter store and place a queue in between to ensure that we could reconcile our data in the event of Postgres unavailability. We selected nsq, a decentralized queueing system. Initially, our app servers would publish data onto the queue and a worker would pull it off the queue and increment the counter in Postgres. Over time as we served more and more content, we saw that this was unsustainable. Postgres has many great performance characteristics and is getting faster with each release, but there was no way we were going to be able to make hundreds of thousands of reads and writes per second to a single instance.

One of the scaling challenges unique to email is that traffic is very spiky: an email will often go out to hundreds of thousands of people almost simultaneously, which results in a big traffic spike and then dies down. In looking at our data, we realized that at any given time the vast majority of our counter increments were dominated by a couple of recently launched campaigns. We were sending identical messages like “increment the number of iPhone users who opened campaign X between 9am and 10am by 1” thousands of times.

We decided we could take a lot of the load off of Postgres by combining many of these messages together. Rather than incr(x, 1); incr(x, 1); incr(x, 1) we could just do incr(x, 3). We built a worker to sit on the queue between the app servers and the insert worker. It would read individual increments and output aggregate increments. The aggregation worker is a golang app that uses a hashmap to store increment sums for each counter. When the worker receives a message consisting of a counter name and value, it looks for the counter name as a key in the hashmap, and increments the value. Periodically, the aggregates are “flushed”: the hashmap is emptied and its values are sent to the insert worker.
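
Stripped of the nsq plumbing, the heart of the aggregation worker is just a mutex-protected map and a ticker. A rough sketch in Go (not our production code; publish stands in for whatever emits aggregate messages to the insert worker):

package main

import (
    "fmt"
    "sync"
    "time"
)

// aggregator sums per-counter increments in memory between flushes.
type aggregator struct {
    mu     sync.Mutex
    counts map[string]int64
}

// incr is called once per message read off the queue.
func (a *aggregator) incr(key string, value int64) {
    a.mu.Lock()
    a.counts[key] += value
    a.mu.Unlock()
}

// flushLoop periodically swaps the map out and hands the batch to publish.
func (a *aggregator) flushLoop(interval time.Duration, publish func(map[string]int64)) {
    for range time.Tick(interval) {
        a.mu.Lock()
        batch := a.counts
        a.counts = make(map[string]int64)
        a.mu.Unlock()

        if len(batch) > 0 {
            publish(batch)
        }
    }
}

func main() {
    agg := &aggregator{counts: make(map[string]int64)}
    go agg.flushLoop(time.Second, func(batch map[string]int64) {
        fmt.Println("flushing", len(batch), "aggregated counters")
    })

    agg.incr("campaign-x:opens:iphone", 1)
    agg.incr("campaign-x:opens:iphone", 1) // becomes a single incr(key, 2) downstream
    time.Sleep(2 * time.Second)
}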

One nice aspect of the aggregate flush interval is that we’re able to use it to tune how many operations go to Postgres: even a modest 1-second aggregate flush interval combines up to 70% of the insert operations while keeping things relatively real-time, but in the event that Postgres can’t keep up we’re able to increase the interval up to a few minutes to further reduce the number of operations.

We operate all of this on a single Postgres instance on an EC2 c4.4xlarge, with streaming replication to another instance to use as a hot standby. With tuning, we expect to get to an order of magnitude more request volume before we have to rethink our approach. In the unlikely event that our aggregate workers become a bottleneck, we can continue to add more and use a ‘fan in’ approach: doubling the number of aggregate workers means that more incoming increments can be processed, but also increases the number of aggregate messages emitted, so we can add a second tier to aggregate the messages further. The more likely future bottleneck is Postgres, but by aggregating the most common operations we’re spreading the reads and writes out evenly, so we could shard the data across multiple Postgres instances.

Computed Property Setters

Computed properties are my favorite feature of Ember, hands-down. The more ways I discover to leverage computed properties, the less I need to override lifecycle hooks or define custom functions within a component. This makes them a powerful option in the Ember toolkit for writing declarative, concise code.

Out of the box, computed properties allow a property to be defined as the return value of the function passed to it, which will then be cached until any of its dependent properties are updated. If you are new to Ember, the Guides are pretty essential, and have a lot to say about computed properties.

When a computed property is declared in the usual way by being passed a function, it corresponds with the property’s getter. However, a computed property’s setter can also be defined if the property is passed an object instead:

import Ember from 'ember';
const { Component, computed, get } = Ember;

export default Component.extend({
  foo: computed('bar', {
    get() {
      return get(this, 'bar');
    },
    // key is 'foo' here
    set(key, value) {
      // 'foo' is set to the return value of this function
      return value;
    }
  })
});

This expanded syntax has come in handy often, and has allowed us to use computed properties in ways I didn’t initially anticipate.

Here are some problems we’ve solved through computed properties with a custom set:

Non-Primitive Default Values

A common “gotcha” within Ember is declaring a property default as a non-primitive value:

import Ember from 'ember';

export default Ember.Object.extend({
  names: ['Klaus', 'Michael']
});

An Ember Object declared like this will cause a developer grief, as the names property within every instance of that object / component will reference the same array. This is because extending Ember.Object defines properties on the resulting class’s prototype:

function BugCollector () {};
BugCollector.prototype.bugs = [ 'ant', 'ladybug' ];
BugCollector.prototype.addBug = function (bug) {
  this.bugs.push(bug);
  return this.bugs;
}

const Terry = new BugCollector();
Terry.addBug('caterpillar'); // => [ 'ant', 'ladybug', 'caterpillar' ]
const Joe = new BugCollector();
Joe.addBug('weevil');  // [ 'ant', 'ladybug', 'caterpillar', 'weevil' ] :(

ES6 classes and their constructors make it so we don’t have to think about this too much anymore when we’re writing vanilla JavaScript. But until Ember classes can be declared with this syntax, it will remain a common concern for those of us in Ember-Land.

To get around this, I used to set up property defaults inside of the init hook if they were Objects or Arrays:

import Ember from 'ember';
const { Component, set } = Ember;

export default Component.extend({
  names: null,
  init() {
    set(this, 'names', [ 'Klaus', 'Michael' ]);
    this._super(...arguments);
  }
});

This works, but it has ugly boilerplate (_super ugly). If the component becomes large and has to do this for multiple properties, it will become less readable, as it will force the developer to jump back and forth between the property declarations and the init hook where their defaults are actually set up. I prefer a computed-property-based solution:

import Ember from 'ember';
const { Component, computed } = Ember;

export default Component.extend({
  names: computed({
    get() {
      return [ 'Klaus', 'Michael' ];
    },
    set(_, names) {
      return names;
    }
  })
});

A computed property can observe any number of values, but I was surprised to find that it can also observe none at all.

Computed properties cache their return values until one of their dependent keys updates. When a computed property doesn’t have any dependent keys, this never happens and it caches its last return value until explicitly set.

If the above component is invoked with names in its attrs hash, this will occur automatically on init, overriding the default value as expected.

However, if names is not explicitly passed into the component, its getter will lazily return a newly-constructed array of default values.

Writing computed properties this way is not only great for setting defaults; it’s also nice for writing a self-documenting API:

import Ember from 'ember';
const { Component, computed, assert, get } = Ember;

export default Component.extend({
  name: computed({
    get() {
      return assert("This component requires a 'name' property!");
    },
    set(_, name) {
      if (typeof name !== 'string') {
        assert("'name' must be set to a string!");
      }
      return name;
    }
  })
});

Refactoring Observers

Using this pattern also helps eliminate many common use-cases for Observers. If you’ve visited the Ember Guide page for Observers, you’ve probably read this line:

Observers are often over-used by new Ember developers. Observers are used heavily within the Ember framework itself, but for most problems Ember app developers face, computed properties are the appropriate solution.

And if you’re like me, you probably ignored that advice and shamefully added a bunch of observers into your app anyway.

Maybe it was a simple observer like this:

import Ember from 'ember';
const { Component, observer, get } = Ember;

export default Component.extend({
  name: 'Klaus',
  onNameChange: observer('name', function () {
    get(this, 'onnamechange')();
  })
});

As the guides claim, computed properties are a better implementation for the above, but they don’t make it clear how one would go about it. I’ve been slowly refactoring many of our observers into something like this:

import Ember from 'ember';
const { Component, computed, get } = Ember;

export default Component.extend({
  name: computed({
    get() {
      return 'Klaus';
    },
    set(_, name) {
      get(this, 'onnamechange')();
      return name;
    }
  })
});

You might need to add a little bit extra to this example if you don’t want onnamechange to fire in the init hook, though.

Real-World Example!

Our team was implementing a date-time picker component, using Ember Power Calendar with an extra input for time of day.

The ember-power-calendar component expects a selected (the currently selected date), and a center (the currently selected month). Whenever selected updates, center should read from it, so that the calendar re-centers on the correct month.

import Ember from 'ember';
const { Component, computed: { reads }, get } = Ember;

export default Component.extend({
  selected: null,
  center: reads('selected'),
  onupdate() {},

  actions: {
    update(newSelectedValue) {
      get(this, 'onupdate')(newSelectedValue);
    }
  }
});

However, users should also be able to browse months on the calendar directly without changing the selected date. Internally, this involves setting center directly without updating selected.

This is impossible to do with the above implementation. Once a user changes months, center will no longer continue to update when a user selects a new date.

Unfortunately, calling set on a computed property defined this way will simply overwrite the computed property and remove its binding to its dependent property. If the dependent property changes later, our computed property will not recompute itself as expected.

Because of this, we considered moving away from computed properties altogether:

import Ember from 'ember';
const { Component, get, set } = Ember;

export default Component.extend({
  selected: null,
  center: null,
  onupdate() {},

  init() {
    this._super(...arguments);
    set(this, 'center', get(this, 'selected'));
  },

  actions: {
    update(newSelected) {
      get(this, 'onupdate')(newSelected);
      set(this, 'center', newSelected);
    }
  }
});

This solves our immediate problem, but I’m not a huge fan. Once again, we’re overriding init just to set up a property, which just doesn’t feel right.

Also, I mentioned that this was a date-time picker, which had an extra input for the time of day. If a user has selected 11:59PM on the last day of a given month and bumps the time forward by one minute, how can this change propagate into the calendar component to correctly set the center to the next month?

You could extract the logic for center into the container component that manages the calendar and the date input, but that’s not ideal as it is really just display logic for the calendar.

What we want is a computed property that reads from another property most of the time, but can still be set manually from within the component.

import Ember from 'ember';
const { Component, computed, get } = Ember;

export default Component.extend({
  selected: null,
  onupdate() {},

  center: computed('selected', {
    get() {
      return get(this, 'selected');
    },
    set(_, value) {
      return value;
    }
  }),

  actions: {
    update(newSelectedValue) {
      get(this, 'onupdate')(newSelectedValue);
    }
  }
});

Defining center as a computed property with a custom setter gives us exactly what we need. If it is not explicitly set, center will simply read from and return selected. It can still be set manually, but a change in selected will break its cache, causing it to resynchronize with selected again.

We can update center directly in the template with the mut helper, but if selected is updated anywhere - inside or outside of the calendar component - center will adhere to the change.
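
In the template, that wiring looks something like this (a sketch; the exact action names and the value="date" unpacking come from ember-power-calendar’s API, so check the addon’s docs for your version):

{{#power-calendar
  selected=selected
  center=center
  onCenterChange=(action (mut center) value="date")
  onSelect=(action 'update' value="date")
  as |calendar|}}
  {{calendar.nav}}
  {{calendar.days}}
{{/power-calendar}}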

What Do You Think?

Although we still write most of our computed properties in the usual style, it feels great to have access to a hidden “power mode” when I need it.

I haven’t seen much discussion in the Ember community about this pattern, and although it has worked out well for us I am curious to hear what you think!

Fishing for Insight in a Sea of Logs

Like any reasonably sized engineering organization, we have systems and applications generating thousands of log entries every second. To make sense of the noise, we aggregate those logs to a central location for search and analysis. We call the stack we use for this purpose “FluEK”. This stack allows us to examine traffic going through a complex system with multiple moving parts, as well as diagnose issues quickly and effectively, correlating data across multiple systems and applications.

Give someone a fish

FluEK, pronounced like fluke, is an acronym I was surprised to find didn’t exist yet. It stands for Fluentd, Elasticsearch, and Kibana.

Fluentd

Fluentd is the mechanism we use to get log entries from the applications to the database. The fluentd client runs on all of our hosts to read and parse system and application logs. Individual fluentd clients ship logs to a cluster of servers running fluentd in its aggregator role, collecting the parsed entries and publishing them into Elasticsearch.

Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine. While we once ran our own Elasticsearch cluster, we currently use Amazon’s Elasticsearch Service.

Kibana

Kibana sits in front of Elasticsearch to explore and visualize our log data.

Teach someone to fish

Once we’ve collected logs from the various components in our environment, we’re able to search for specific events, visualize activity, and investigate anomalies. FluEK enables us to pinpoint which campaigns are experiencing timeouts,

identify which specific URLs within those campaigns are causing the most timeouts,

or find a specific request to answer a client’s question.

We’re gonna need a bigger boat

When we moved to the Elasticsearch Service, we initially implemented a version 2.3 cluster - the latest available at the time. Eventually, the lure of the latest and greatest became too great to resist and we got caught up in the excitement of version 5.1. Despite all the great improvements from 2.3 to 5.1, the killer feature for us was the resolution of a particularly annoying UI bug between the older version of Kibana and Chrome.

Sortable Lists in Ember

Recently, we completed an overhaul of our platform’s interface. One of the improvements made was introducing the ability to sort list items.

Our goal was to optimize both aspects of the sorting: the typical drag-n-drop interaction and the persistence of the newly sorted list.

For the remainder of this post, we will refer to both of these behaviors as either reordering or ranking, depending on the context. Reordering will refer to sorting elements in the UI, while ranking will refer to the act of assigning a position to the elements in a collection so the order persists between page loads.

Why reordering / ranking is a necessity

Reordering / ranking is required when the order of the models within a platform has significance and users need the ability to freely sort them. Maintaining a sortable list is a common challenge with many use cases. For example, one list in our platform is sortable for purely cosmetic reasons, so the user can tidy up their workspace. In contrast, the order of another list governs a crucial part of our business logic: it determines which content is shown to recipients.

Mapping out the workspace

The workspace that our users spend most of their time in is structured as a typical master-detail pattern. Users navigate between items in the left pane by selecting them; further information and actions for the selected item appear on the right. Both panes contain lists with sortable items. Items can be moved in any direction within their own list, but cannot be dragged across different lists.

Implementation

ember-sortable

We use Ember.js, a JavaScript web framework, to build our platform. Ember features a strong, community-backed addon ecosystem. The addon we use for the sorting component is ember-sortable: an easy-to-integrate solution that exposes a simple API and made development of the feature very straightforward.

However, ember-sortable only handles the DOM interaction. We still needed a way to update the underlying models behind the elements on the page, so that whenever those records are reloaded from the server they appear in the correct order.

Our initial implementations of this updating logic had a couple of shortcomings, which offered a great learning experience.

Initial shortcomings

Whenever an item is moved within an ember-sortable group, the addon exposes a reference to the list of items in their current order. Our first pass at the reordering feature was a simple forEach that set each model’s position to its index in the iteration and then saved the model, sending off a PUT request for every item. This was easy enough, and we were able to get the feature into users’ hands for feedback.
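
A rough sketch of that first pass (simplified, not the exact production code) looks something like this:

actions: {
  reorder(models) {
    models.forEach((model, index) => {
      model.set('position', index);
      model.save(); // one PUT per model, whether or not it actually moved
    });
  }
}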

The feature was well received and the implementation worked well for a couple of months, until we began to receive reports of slow performance in the application from users. The worst performance was observed in workspaces that had very long lists. Whenever items were sorted or duplicated, the application would become unresponsive for a short amount of time, proportional to the length of the list.

Astute readers have probably spotted the obvious flaw in this initial implementation: we were firing off a network request for every element in the list, regardless of whether it had actually moved. This is very inefficient, and fixing it properly required a more clever solution. We started with a couple of easy-win optimizations:

  1. When swapping two adjacent elements, simply swap their position values
  2. Only make a network request for models with dirty attributes, i.e. models whose position values actually changed (a rough sketch of this follows the list)
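
Here is a minimal sketch of the second optimization, assuming Ember Data’s hasDirtyAttributes flag; it is illustrative rather than our exact code:

actions: {
  reorder(models) {
    models.forEach((model, index) => {
      // only touch models whose position actually changed
      if (model.get('position') !== index) {
        model.set('position', index);
      }
    });

    // Ember Data tracks dirty attributes, so unchanged records are skipped
    models
      .filter((model) => model.get('hasDirtyAttributes'))
      .forEach((model) => model.save());
  }
}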

Another condition we needed to consider: whenever a user duplicates an element, we insert the newly duplicated element directly below (or sometimes above) the original. The duplicate has the same position value as the original, creating ambiguity in how the elements should be ordered. We can break the tie with a second metric, such as a record’s createdAt timestamp. That works as long as a user always chooses to duplicate the most recently duplicated element; since we cannot guarantee this behavior, there will be instances where duplicated elements end up two or more positions away from the original. This means that resolving position collisions is a requirement, and in the worst case scenario we still need to do a full list reorder (back to our original problem).
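
A minimal sketch of that tie-break sort, assuming position and createdAt attributes on each model:

// sort by position first, falling back to creation time when positions collide
const sorted = models.slice().sort((a, b) => {
  return a.get('position') - b.get('position') ||
         a.get('createdAt') - b.get('createdAt');
});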

ember-ranked-model

We needed a solution that could reduce the number of network calls. Preferably we would only tell the server to update the records that actually need to be updated. The implementation that we chose is heavily inspired by ranked-model. For those unfamiliar with it, it is a Ruby on Rails library that efficiently ranks database rows using an integer column (e.g. position).

Our ember-ranked-model module functions similarly. For more information on how it works, you can visit its GitHub page.

Given that all of our use cases involve a single element being dragged in the list to a new location, in the best case scenario we only ever have to update the dragged item.

A basic example:

actions: {
  // ember-sortable provides these two arguments
  reorder(models, dragged) {
    Ranker.create({models}).update(dragged).save();
  }
}

Our Ranker object takes an array of models that is assumed to be the list in its new order. We are also given the dragged item by ember-sortable, which helps us determine which items need to be updated. There is also support for options such as ranking on a key other than position, descending and ascending lists, and setting min and max caps on the rank values.
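
To give a feel for those options, here is a hypothetical invocation; the option names below are placeholders for illustration, so check the ember-ranked-model README for the real API:

// option names are hypothetical -- consult ember-ranked-model's docs for the real ones
Ranker.create({
  models,
  key: 'rank',        // rank on a key other than 'position'
  direction: 'desc',  // descending instead of ascending
  min: 0,             // lower bound on rank values
  max: 4194304        // upper bound on rank values
}).update(dragged).save();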

Alternative

We try to design our API to be as RESTful as possible, which means avoiding custom endpoints that exist for one particular action, such as a bulk ranking endpoint. That said, a bulk ranking endpoint that takes a collection of ids in the correct order would also have been a viable solution. It would require no ranking logic on the front end beyond the ember-sortable integration and only one network request, and it could even use the actual ranked-model gem on the backend, since we are using Rails!

Lessons

Iteration is important

Despite its significant value, the feature was surprisingly simple in scope and quick to develop. Though it wasn’t intended to be the centerpiece of the redesign, it rapidly became one of the most commonly used features.

Test the action under similar conditions

Our initial implementation’s inattention to performance taught us more about how our platform was being used in some unusual ways. As mentioned before, users were creating irregular campaigns with extremely long lists. Although few in number, these campaigns were almost impossible to work on while the performance issues were present. The problem was compounded by our development habits: we build smaller campaigns with shorter lists to test behavior, and we do so on faster, more predictable connections. The performance issue was barely noticeable in development, but in production it was exacerbated by factors such as network latency and differences in hardware.

Now that we know these campaigns exist, we are more conscientious about platform performance going forward. For example, we now use Chrome DevTools’ network throttling feature to simulate the experience of users with slower connections. By testing under similar conditions, we get closer to reproducing our users’ behavior.

Conclusion

We use ember-ranked-model in multiple places in our platform. Our users depend on this behavior so they can organize their workspaces and configure their campaigns properly. We found the library very useful for expanding the sorting feature into multiple areas of the application, and by open sourcing it, we hope others working on front end systems can easily do the same.

]]>
<![CDATA[Serving Files: S3 and High Availability]]>At Movable Ink we heavily use Amazon S3 for storing millions of files and serving them to hundreds of millions of users. It has a number of very compelling qualities: it has great performance characteristics and durability guarantees of a blistering eleven 9’s—they replicate our data

]]>
https://movable-ink-engineering.ghost.io/serving-files-s3-and-high-availability/5bbd0401ce141100bf202691Fri, 01 Jan 2016 20:39:00 GMT

At Movable Ink we heavily use Amazon S3 for storing millions of files and serving them to hundreds of millions of users. It has a number of very compelling qualities: it has great performance characteristics and durability guarantees of a blistering eleven 9’s—they replicate our data in such a way that in theory there is 99.999999999% object retention.

However, durability and uptime are not one and the same, as many S3 customers found out when an internal configuration issue impacted services on Monday morning. The problem affected buckets in the US Standard S3 region, the most commonly used US S3 region.

We’re pretty conscious of potential single points of failure, and tend to have redundancy at multiple tiers: each layer is spread across multiple hosts which are interconnected at multiple points to the layers above and below it. This manifests as multiple load balancers, app servers, and availability zones, with the entire setup replicated across geographically separate datacenters thousands of miles apart. With all of that redundancy, of course we want our S3 serving to also be redundant.

S3 buckets are tied to a geographical location, and most correspond to one of Amazon’s datacenters. However, US Standard stores data on both the east coast and west coast. Given that it can be accessed from either coast, my first concern was around consistency: what would happen if you were to write data on one side and then immediately try to read it from the other? We tested it and it was oddly consistent, which seemed strange since it was serving from two different regions.

It turns out there is no replication happening. Data is only written to the region of the endpoint you used for the write:

Amazon S3 automatically routes requests to facilities in Northern Virginia or the Pacific Northwest using network maps. Amazon S3 stores object data only in the facility that received the request.

Given this, we should really be treating US Standard as a single point of failure. So how can we make it redundant?

The strategy we take is to store data in different S3 regions, then come up with a way to point users and our backend services at whichever region is currently active. AWS actually has a couple of tools to facilitate the former. S3 supports file creation notifications to SNS or SQS, and you could set up AWS Lambda to automatically copy files to a different region. But even better than that, a few months ago Amazon released Cross-Region Replication to do exactly what we want. Setup is simple:

  • Turn on versioning on the source bucket. This comes at an extra cost since you pay for all previous versions of your files, but since we’ve already decided that this data is very important it’s worth it. After all, we’re talking about doubling our storage costs here.
  • Turn on cross-region replication. As part of the setup, you’ll create another versioned bucket in the destination datacenter and an IAM policy to allow data transfer between the two (a code sketch of these steps follows this list).
  • Do a one-time manual copy of all of your files from the source bucket to the destination bucket. Replication only copies files that are added or changed after replication is enabled. Make sure the permissions are identical.
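
For illustration, here is a hedged sketch of the first two steps using the AWS SDK for JavaScript (v2); the bucket names, region, and IAM role ARN are placeholders, and the one-time copy of existing objects still has to be done separately (for example with aws s3 sync):

const AWS = require('aws-sdk');
const s3 = new AWS.S3({ region: 'us-east-1' });

async function enableReplication() {
  // Step 1: versioning must be enabled on the source bucket
  await s3.putBucketVersioning({
    Bucket: 'example-source-bucket', // placeholder name
    VersioningConfiguration: { Status: 'Enabled' }
  }).promise();

  // Step 2: replicate everything to a versioned bucket in another region,
  // using an IAM role that allows S3 to copy between the two buckets
  await s3.putBucketReplication({
    Bucket: 'example-source-bucket',
    ReplicationConfiguration: {
      Role: 'arn:aws:iam::123456789012:role/example-replication-role', // placeholder ARN
      Rules: [{
        Status: 'Enabled',
        Prefix: '',
        Destination: { Bucket: 'arn:aws:s3:::example-destination-bucket' }
      }]
    }
  }).promise();
}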

Now every time we add a file to the source bucket, it is (eventually) replicated to the destination bucket. If all of our access is through our backend services, this may be good enough since failing over is a simple configuration change. But many of the references to our S3 buckets are buried in HTML documents or managed by third parties. How can we make it easy to switch between the buckets?

Our initial idea was to just set up a subdomain entry on a domain we control as a CNAME to our S3 bucket, then do failover with DNS. S3 allows this, with one big caveat: your bucket must be named exactly the same as the hostname. If you want to reference your S3 bucket as foo.example.com, the bucket itself must be named foo.example.com and the CNAME must point to foo.example.com.s3.amazonaws.com. Combined with S3’s restriction that bucket names must be globally unique, only one bucket can ever be referenced from foo.example.com, so this doesn’t work.

Amazon has a CDN service, Cloudfront, which allows us to set an S3 bucket as the origin for a CDN distribution. We can then CNAME our subdomain to our Cloudfront distribution’s endpoint. In the event of a regional S3 failure, we can update Cloudfront to point to our backup S3 bucket. We can either turn on caching and reap some latency benefits, or set the time-to-live to zero so the distribution acts as a pass-through.
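
The failover step could look roughly like the following with the AWS SDK for JavaScript (v2); the distribution ID and the replica bucket hostname are placeholders:

const AWS = require('aws-sdk');
const cloudfront = new AWS.CloudFront();

async function failoverToBackupBucket(distributionId) {
  // CloudFront updates are optimistic-locked: fetch the current config and its ETag
  const { DistributionConfig, ETag } = await cloudfront
    .getDistributionConfig({ Id: distributionId })
    .promise();

  // swap the S3 origin to the replica bucket in the unaffected region (placeholder hostname)
  DistributionConfig.Origins.Items[0].DomainName =
    'example-destination-bucket.s3-us-west-2.amazonaws.com';

  // write the config back, passing the ETag as IfMatch
  await cloudfront.updateDistribution({
    Id: distributionId,
    IfMatch: ETag,
    DistributionConfig
  }).promise();
}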

We would have preferred to set up two Cloudfront distributions and switch between them with DNS, but Amazon has similar restrictions disallowing two distributions from having the same CNAME. Still, this setup lets us respond to an S3 outage in minutes, routing traffic to an unaffected region. In our tests, the failover fully completes in 5-10 minutes.

Building applications in the cloud means expecting failure, but it’s not always straightforward, especially when using third-party services like S3. Even with our final setup, it’s not completely clear what Cloudfront’s dependencies and failure modes are. But importantly, we control the DNS so we can implement our own fixes rather than waiting for Amazon.

If you’re interested in working on challenging problems like this, check out Movable Ink’s careers page.

]]>