Cassandra’s “Repair” Should Be Called “Required Maintenance”

25 September, 2015. It was a Friday.

One of the bigger challenges when you go Eventually Consistent is how to reconcile data not being replicated. This happens if your using Oracle and multi-data centers with tech like Golden Gate and it happens if you’re using async replicas with MySQL and one of your replicas got out of whack. You need a way to “repair” the lost data.

Since Cassandra fully embraces eventually consistency repair is actually a important mechanism for making sure copies of data are shipped around the cluster to meet your specified replication factor. About now several of you have alarm bells going off and think I’m insane. Let’s step through the common objections I hear.

What about Consistency Level?

Consistency Level (from here on out CL) does effectively define your contract for how many replicas you are requiring to be ‘successful’ but that’s at write time and unless you’re doing something silly like CL ALL, you can’t be certain that you’ve got RF (replication factor) copies around the cluster. Say you’re using CL ONE for all reads and writes then that means you’re only set on having one copy of the data. This means you can lose data if that node goes down.

So either write at CL ALL or use repair to make sure your cluster is spending most of it’s time at your specified RF.

What about hinted handoffs? {#37e7}

They’re great, except until Cassandra 3.0 they’re not really that great. Hinted Handoffs for those of you that don’t know are written on a failed write by a coordinator to a remote node. These hints are replayed later on in a separate process.

This helps eliminate the need for repair in theory. In practice they only last a relatively short window (3 hours by default) and generating a lot of them can be a huge resource hog for the system (3.0 should help greatly with this). I’ve worked on clusters with extended outages accross data centers are very high TPS rates resulted in terabytes of just hints in the cluster.

In summary, hints are at best a temporary fix. If you have any extended outage then repair is your friend.

Beware Deletes If You’re Too Cool For Repair

Another important aspect of repair is when you think about tombstones. Say I issue a delete to two nodes but it only succeeds on one. My data will look like so:

However a few hours later compaction on that node removes the first write on node 2. Fear now when I query ttl comparison will give me the right answer and partition 1 has no last_name value.

However, when I go past gc_grace_seconds (default 10 days) that tombstone will be removed. Since I’ve never run repair (and I’m assuming I don’t have read repair to save me), I now only have partition 1 on node 1 and my deleted data comes back from the dead.

Summary Run Repair {#3e3c}

So for most use cases you’ll see, you’ll really want to run repair on each node within gc_grace_seconds and I advise really running repair more frequently than that if you use CL one and use deletes or lots of updates to the same value.

I’ll add the one use case where repair has less importance. If you just do single writes with no updates, local_quorum consistency level, have a single data center and rely on TTLs to delete your data, you can probably not run repair until you find a need to do so such as a lost node.

← Scale is a Dish Best Served Eventually Consistent

Synthetic Sharding with Cassandra. Or How To Deal With Large Partitions. →