High Availability, Replication, and Failover Explained with Stuffed Animals

My 2-year-old daughter’s favorite stuffed animal, “Meow-Meow,” is a critical service in my house, so my wife and I run a highly available three-node Meow-Meow cluster.

Despite being simple, this Meow-Meow cluster can help explain the essential parts of highly available systems:

  • High availability
  • Replication
  • Observability
  • Consensus
  • Failover
  • The split-brain problem
  • Active-active replication

I’m going to work through examples of each of these topics using Meow-Meow to demonstrate. First, I’ll give you an overview of how the Meow-Meow cluster works, and then I’ll review each piece in more detail.

NOTE: This post is an extended exploration of a topic I wrote about in a Twitter thread. If you enjoy this topic, you should check out that thread for the excellent jokes, photos, and hard-won wisdom in the replies.

Architecture of the Meow-Meow Cluster

Here’s an overview of how the system works.

  1. Three observer processes monitor the current Meow-Meow primary. The observers are my wife, oldest daughter, and me.

  2. Observers use a gossip protocol to reach a consensus when the primary is unavailable (“Where is Meow-Meow?”)

  3. If the observers reach a consensus that the primary is unavailable, one of the observers initiates failover

  4. During failover, an observer process promotes one of two replica Meow-Meows to become the current primary

  5. If failover succeeds, all is well – but success is not guaranteed!

  6. A replica might be out of sync with the leader when we initiate failover. We keep two replicas for this reason. If failover fails, usually due to replication lag, the observer restarts failover using the second replica.

  7. If we initiated failover because we could not find the primary, the primary might show up again unexpectedly. So now there are two Meow-Meows! Unintentionally running two primaries in a system is called the split-brain problem, and you are in deep doodoo if it happens.

Now that I’ve established how the system works, let’s dive into the specific topics that this example illustrates.

High Availability

What does it mean that the Meow-Meow cluster is highly available? Here is my paraphrasing of Wikipedia:

A service with higher than normal uptime.

NOTE: Uptime refers to time when users can use a system, while downtime refers to time when users can’t use the system.

So, if “highly available” (HA) means higher than normal uptime, what is “normal” uptime?

Generally speaking, all systems require downtime for maintenance. For example, databases go offline for upgrades, servers restart to apply updates, and so on. Systems also experience unplanned downtime due to hardware failures and software errors.

A highly available system is more resilient to these situations. HA systems don’t go down for planned maintenance from the client’s perspective. And they can recover from the failure of a single component, or even more than one. When is this important?

Imagine my family is traveling, and we accidentally drop Meow-Meow in an airport on the far side of the country. We only discover that Meow-Meow is missing when we arrive home and start getting our daughter ready for bed. What do we do?

A photo of a fantastical castle in Disneyland intended to prompt memories of your family vacations.

If we have only one Meow-Meow – in other words, a non-HA Meow-Meow system – we would need to buy a new Meow-Meow, and until it arrives, our system has unavoidable downtime. Unfortunately, this is a critical service: my daughter can’t sleep without Meow-Meow.

True high availability comes from running multiple copies of an individual node within a system. Databases have referred to these “copies” by many names over the years. The two names in vogue today are followers and replicas. I use those terms interchangeably in this post.

So, to run an HA system, you need replicas. An HA Meow-Meow system needs backup Meow-Meows. But can you just buy an extra Meow-Meow, hide it on a closet shelf, and rest easy at night? Unfortunately, no.

Once you have multiple nodes, you’ll need to keep them in sync. For that, you use replication.

Replication

Meow-Meow is stateful. Over time, the leader Meow-Meow accrues writes, just like a database: we remove a tag, the stuffing shifts from use, and she starts to smell like my daughter.

If the first challenge of any HA system is running backup nodes, the second is keeping the nodes in sync. Having a backup Meow-Meow doesn’t matter if when we finally need it, our daughter rejects the pristine Meow-Meow from the shelf because “it smells funny.”

Synchronizing state between multiple nodes is called replication. The most straightforward replication setup for databases and other stateful systems is to use one leader node with multiple followers. Clients can only write to the leader. Follower nodes subscribe to the leader, applying any changes it tells them about and staying up to date.

A photo of a Meow-Meow replica that still has a tag due to replication lag.

The secret sauce of replication consists of two ingredients:

  • Snapshots of the data at a point in time
  • A log of all changes

When you add a new replica to the system, it downloads a snapshot of the current data and then subscribes to changes from the log to stay up to date.
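A minimal sketch of that bootstrap sequence might look like the following Python. The snapshot and log structures here are hypothetical, not any particular database's format:

```python
# Hypothetical sketch: a new replica bootstraps from a snapshot,
# then replays the change log from the snapshot's offset onward.

def bootstrap_replica(snapshot, change_log):
    # Start from a point-in-time copy of the primary's state.
    state = dict(snapshot["data"])
    applied_offset = snapshot["offset"]

    # Apply every change recorded after the snapshot was taken.
    for entry in change_log:
        if entry["offset"] <= applied_offset:
            continue  # already included in the snapshot
        state[entry["key"]] = entry["value"]
        applied_offset = entry["offset"]

    return state, applied_offset

snapshot = {"offset": 1, "data": {"tag": "attached", "smell": "factory"}}
log = [
    {"offset": 1, "key": "tag", "value": "attached"},
    {"offset": 2, "key": "tag", "value": "removed"},
    {"offset": 3, "key": "smell", "value": "toddler"},
]

state, offset = bootstrap_replica(snapshot, log)
# state is now {"tag": "removed", "smell": "toddler"}, offset is 3
```

Once the replica has caught up to the end of the log, it simply keeps consuming new entries as they arrive.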

Meow-Meow is hard to replicate because there is no log of changes, but the principles are the same:

  • New Meow-Meows start from a snapshot (Meow-Meow from the factory)
  • There is only ever one Meow-Meow primary – the others are on a shelf
  • When a change happens to the Meow-Meow primary, like us removing a tag, we replicate that change to the other Meow-Meows

In HA systems, replicas must not fall out of sync. How would you know if that happened, though?

You need to monitor the state of replication – and that’s not all. Observability and monitoring are essential to HA systems, so we’ll talk about them next.

Observability

Terrible problems arise in all HA systems if no one can see what happens inside them. The ability to see inside a system is usually called observability. Good observability involves monitoring.

With an HA service like Meow-Meow, we need to monitor at least two facts:

  1. Availability of the primary Meow-Meow
  2. Replication lag

Monitoring Primary Availability

The primary Meow-Meow becoming unavailable is a catastrophe. This is especially true during a critical moment like nap time.

So, my family needs to know when the primary Meow-Meow becomes unavailable. If she isn’t available, we need to initiate failover, which we’ll discuss later in this post.

Ideally, we’ll know that the primary is unavailable before the client starts to cry about it. So, each of the observer processes looks around occasionally and checks with the others. Can anyone see Meow-Meow right now?
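That "look around occasionally" check is essentially a heartbeat with a timeout. Here's a hedged sketch of the idea – the timeout value and function names are made up for illustration:

```python
import time

# Hypothetical sketch: flag the primary as unavailable when no observer
# has seen her within a timeout window.

def primary_seems_down(last_sighting, timeout_seconds, now=None):
    # last_sighting: timestamp of the most recent confirmed sighting.
    now = now if now is not None else time.time()
    return (now - last_sighting) > timeout_seconds

# Last confirmed sighting was 45 minutes ago; our timeout is 30 minutes.
print(primary_seems_down(last_sighting=0, timeout_seconds=1800, now=2700))  # True
```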

Monitoring Replication Lag

Replication lag is when a replica falls out of sync with the primary. Lag can happen in our Meow-Meow system if, for example, we snip the tag off the primary but not the replicas.

The consequences of broken replication can be severe. Imagine my daughter rejecting Meow-Meow because “she has a tag on her booty.” Or a bank losing your paycheck because it failed to replicate that change and the primary lost power.
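In database terms, you typically measure lag by comparing how far the replica has gotten through the change log against the primary's latest position. A toy version, with made-up offsets:

```python
# Hypothetical sketch: measure replication lag as the number of
# changes the replica hasn't applied yet.

def replication_lag(primary_offset, replica_offset):
    return primary_offset - replica_offset

# The primary is at change #5 (tag removed); the replica stopped at #4.
lag = replication_lag(primary_offset=5, replica_offset=4)
if lag > 0:
    print(f"replica is {lag} change(s) behind")  # time to snip that tag
```

A monitoring system would alert when this number stays above zero for too long.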

Consensus

So, what happens when one of the observer processes in my house can’t find Meow-Meow? Or when one of us discovers that Meow-Meow was the victim of a bio-hazard event?

In this situation, we need to begin failover to replace the primary. But before we do, we need to decide on two critical questions:

  1. Is the Meow-Meow primary really unavailable?

  2. Which Meow-Meow replica will be the new primary?

NOTE: In distributed systems, deciding on the second question is called leader election.

We need a consensus protocol to make these decisions. Our consensus protocol works like this:

  • If at least two observers think that Meow-Meow is unavailable, then Meow-Meow is unavailable.
  • The observer who first finds a replica to promote “wins” the leader election.
  • In the case of a tie, the observer process with the highest uptime wins.
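The rules above translate into a few lines of Python. This is a sketch of our house protocol only – the names and numbers (a 2-of-3 quorum, uptime measured in years) are specific to the Meow-Meow cluster:

```python
# Hypothetical sketch of the house consensus protocol: declare the
# primary unavailable on a 2-of-3 vote, then break election ties by
# observer uptime (age, in our cluster).

def primary_unavailable(votes, quorum=2):
    # votes maps observer -> True if that observer thinks she's gone.
    return sum(votes.values()) >= quorum

def elect_leader(candidates, uptime):
    # candidates maps each observer who found a replica to the time they found it.
    first_find = min(candidates.values())
    finders = [obs for obs, t in candidates.items() if t == first_find]
    # Tie: the observer with the highest uptime wins.
    return max(finders, key=lambda obs: uptime[obs])

votes = {"dad": True, "mom": True, "big_sister": False}
print(primary_unavailable(votes))  # True: two of three observers agree

candidates = {"dad": 12.0, "mom": 12.0}           # both found a replica at t=12
uptime = {"dad": 38, "mom": 36, "big_sister": 7}  # years of uptime
print(elect_leader(candidates, uptime))  # dad
```

Real consensus protocols (Raft, Paxos) are far more involved, but they answer the same two questions: is the leader gone, and who takes over?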

Once the observers elect a new leader, we initiate failover to promote the replica. Let’s talk about failover next.

Failover

Once the observer processes elect a new leader (AKA primary) Meow-Meow, it’s time to begin failover.

Our failover process is simple. A grown-up takes the current primary to “go get cleaned up,” walks to the supply closet, and makes the swap. We then present the replica Meow-Meow to our daughter as “cleaned-up Meow-Meow,” and the crisis is averted.
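Sketched as code, failover is a loop with a fallback – if the first replica fails acceptance (usually due to replication lag), we retry with the next one. The function and field names here are invented for illustration:

```python
# Hypothetical sketch of failover: promote a replica, and if the
# client rejects it (replication lag!), retry with the next one.

def fail_over(replicas, client_accepts):
    for replica in replicas:
        # Present the replica as "cleaned-up Meow-Meow".
        if client_accepts(replica):
            return replica  # promotion succeeded; this is the new primary
    raise RuntimeError("all replicas rejected -- emergency store run")

def accepts(replica):
    # The client rejects any replica that still has its tag.
    return not replica["has_tag"]

replicas = [
    {"name": "replica-1", "has_tag": True},   # lagging: tag never snipped
    {"name": "replica-2", "has_tag": False},  # fully in sync
]
new_primary = fail_over(replicas, accepts)
print(new_primary["name"])  # replica-2
```

This is why we keep two replicas: one attempt at promotion may fail, and you want a second candidate on the shelf.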

TIP: You can rotate the replicas occasionally to get an even distribution of toddler smell and stuffing placement.

WARNING: Eventually, kids get too smart for these tricks. Once the illusion of a single primary ends, your active-passive system becomes an active-active system. In an active-active system, you run multiple primaries, each accepting writes. Later in this post, we’ll explore possible active-active architectures.

Returning to failover: if it succeeds, then the system is stable again. However, success is not guaranteed! Let’s talk about what happens if a former primary unexpectedly shows up again.

The Split-Brain Problem

What happens if we promote a new Meow-Meow replica to the primary spot, and then our daughter discovers the former Meow-Meow primary under the couch?

When two nodes think they’re primary in an active-passive system like Meow-Meow, the system has a split-brain.

Most active-passive systems depend on writes going to a single primary at a time. The primary is the “active” node because it allows writes. Replicas only copy updates from the primary, but don’t allow writes, which makes the replicas “passive.”

TIP: Meow-Meow uses replication for disaster recovery, so the client only sees a replica after we initiate failover. However, other systems allow reading from replicas to distribute read load across the cluster. Doing so can boost read performance and maximize the cost effectiveness of replication. If a system allows reading from replicas, we call those read-only replicas.

Why is running multiple primaries in an active-passive system a “problem,” though? It’s a problem because clients might write data to both primaries, but only one primary synchronizes correctly with its replicas.

So Client A might write data to the “true” primary, Primary 1, while client B writes to the “false” primary, Primary 2. Primary 1 is correctly replicating to the replicas, but Primary 2 is not replicating. Because no replicas are following Primary 2, data written to it will probably be lost.

Data loss is a problem, except we’re just talking about stuffed animals. Surely, in a Meow-Meow cluster, data loss is acceptable. So, why don’t we skip pretending that a single Meow-Meow exists and run multiple primaries?

Imagine that, instead of pretending there was a single Meow-Meow, we give our child three Meow-Meows to use simultaneously. That is, we run a cluster with three primary nodes. A system that has multiple primaries like this is called an active-active system.

What could go wrong? Well, active-active systems open a box of new problems that we need to think about. So let’s explore this topic in more detail.

Active-Active Replication and CRDTs

WARNING: We are now going to stretch the Meow-Meow example to its breaking point.

The term active-active can mean different things depending on the context. For example, an active-active database cluster is usually designed to allow writing to multiple primaries.

There are a few variants of this design. Here are two:

  1. Each primary stores different data. Clients connect to one primary to write one kind of data, a second primary to write other data, and so on.

  2. Each primary stores the same data. Thus, clients perceive a single logical entity while, in reality, writing to multiple primaries.

The first variant shows up in descriptions of active-active replication, but I don’t consider it truly “active-active,” and you’ll see why when we discuss it.

The second is how I would describe a proper active-active system.

Let’s take a closer look at these two types of systems.

Primaries Store Different Data

Storing parts of your application’s data in multiple databases is conceptually more straightforward than the other active-active configuration we’ll discuss.

Imagine again that we give our child three Meow-Meows. We don’t try to keep them in sync because that would be futile.

Instead of pretending all three Meow-Meows are the same Meow-Meow, we let our child use each stuffed animal for different purposes. Now, the nodes in our Meow-Meow cluster are “downstairs-Meow-Meow,” “sleepy-time-Meow-Meow,” and so on.

The problem with this is that if sleepy-time-Meow-Meow becomes the only Meow-Meow our child can use for sleep time, we’ve lost redundancy and are back to square one.

Maintaining high availability with this setup would require that we have more replicas: one for each critical special-purpose Meow-Meow. Ugh!

Primaries Store the Same Data

Now, let’s consider the second type of active-active configuration. In this system, multiple primaries exist, storing the same data, and clients can write to any of them.

You might think systems like this are exotic or obscure, but in fact, you’ve probably used them extensively.

Collaborative editing in Google Docs is an active-active system: you and someone else write simultaneously in the same document, and Google Docs synchronizes the changes.

If we could allow our daughter to play with all three Meow-Meows while keeping them perfectly in sync, we could use any Meow-Meow for sleep time.

(Obviously, all kids are different. Some have strong preferences, some don’t. We’re dealing with one who does.)

Sounds great, right? The problem is that synchronizing complex state between primaries is devilishly hard. What if a write on one primary conflicts with a write on another primary?

How to resolve conflicts between multiple primaries is an area of active research. Here are a few real conflict resolution strategies you’re likely to have seen:

  • If there’s a conflict, give up and ask the client to figure it out. Examples: Dropbox, git, and many iOS apps that use iCloud to sync (“Two conflicting copies of this file exist…”). Ugh! As we all know, this strategy is a pain in the ass.

  • If two writes conflict, accept the “last” one. I put “last” in quotes because the definition of “last” gets complicated in a distributed system. This strategy can result in data loss.

  • Resolve the conflict automatically. Doing this well is an area of active research. One of the more promising approaches is Conflict-Free Replicated Data Types or CRDTs. Automatic conflict resolution is the strategy that systems like Google Docs use.
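To make the CRDT idea concrete, here’s a minimal grow-only counter (a G-Counter, one of the simplest CRDTs). Each primary increments only its own slot, and merging takes the per-node maximum, so replicas converge regardless of the order in which they sync. This is an illustrative sketch, not any particular library’s API:

```python
# Minimal G-Counter CRDT sketch: each node increments its own entry;
# merge takes the per-node maximum, so concurrent updates never conflict.

def increment(counter, node):
    counter = dict(counter)
    counter[node] = counter.get(node, 0) + 1
    return counter

def merge(a, b):
    # Per-node max is commutative, associative, and idempotent --
    # the properties that let CRDT replicas converge.
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in a.keys() | b.keys()}

def value(counter):
    return sum(counter.values())

# Two primaries count hugs independently, then sync.
a = increment(increment({}, "downstairs"), "downstairs")  # 2 hugs downstairs
b = increment({}, "sleepy-time")                          # 1 hug at nap time
synced = merge(a, b)
print(value(synced))  # 3
```

No matter which primary merges first, both end up with the same total – that convergence guarantee is the whole point of a CRDT.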

Summary

It’s safe to say we’ve genuinely exhausted the use of my family’s Meow-Meow cluster to talk about high availability, replication, and more.

I hope you learned something new!


Photo by Trevor Vannoy.