Lessons from a year of Golang


    I'm hoping to share, in a non-negative way, what I learned so I can help others avoid the pitfalls I ran into in my most recent work building infrastructure software on top of Kubernetes using Go. It sounded like an awesome job at first, but I ran into a lot of problems getting productive.

    This isn't meant to evaluate whether you should pick up Go or to tell you what you should think of it. This is strictly meant to help people who are new to the language but experienced in Java, Python, Ruby, C#, etc. and have read some basic Go getting-started guide.

    Dependency management

    This is probably the feature most frequently talked about by newcomers to Go, and with some justification, as dependency management has been a rapidly shifting area that's nothing like what experienced Java, C#, Ruby or Python developers are used to.

    I'll cut to the chase: the default tool now is Dep. All the other tools I've used, such as Glide or Godep, are now deprecated in favor of Dep, and while Dep has advanced rapidly there are some problems you'll eventually run into (or at least I did):

    1. Dep hangs randomly and is slow, which is supposedly due to network traffic, but it happens to everyone I know even with tons of bandwidth. Regardless, I'd like an option to supply a timeout and report an error.
    2. Versions and transitive dependency conflicts can still be a real breaking issue in Go. Without shading or its equivalent, two packages depending on different versions of a given package can break your build. There are a number of proposals to fix this, but we're not there yet.
    3. Dep has some goofy ways it resolves transitive dependencies and you may have to add explicit references to them in your Gopkg.toml file. You can see an example here under Updating dependencies – golang/dep.

    My advice

    • Avoid hangs by checking your dependencies directly into your source repository and just using the dependency tool (dep, godep, glide, it doesn't matter) for downloading dependencies.
    • Minimize transitive dependencies by keeping things small and using patterns like microservices when your dependency tree conflicts.

    GOPATH

    Something that takes some adjustment is that you check out all your source code in one directory with one path (by default ~/go/src), and the import path mirrors where in that tree you checked the code out. Example:

    1. I want to use a package I found on github called jim/awesomeness
    2. I have to go to ~/go/src and mkdir -p github.com/jim
    3. cd into that and clone the package.
    4. When I reference the package in my source file it’ll be literally importing github.com/jim/awesomeness

    A better guide to GOPATH and packages is here.

    My advice

    Don’t fight it, it’s actually not so bad once you embrace it.

    Code structure

    This is a hot topic and there are a few competing standards for the right way to structure your code, from projects that do "file per class" to giant files with general concept names (think types.go and net.go). Also, if you're used to using a lot of sub-packages, you're going to have issues with code not compiling if, for example, two sub-packages reference one another.

    My Advice

    In the end I was reasonably ok with something like the following:

    • myproject/bin for generated executables
    • myproject/cmd for command line code
    • myproject/pkg for code related to the package

    Whatever you do is fine; this was just a common idiom I saw, but it wasn't remotely universal across projects. I also had some luck with just jamming everything into the top level of the package and keeping packages small (and making new packages for common code that is used in several places in the code base). If I ever return to using Go for any reason I will probably just jam everything into the top level directory.

    Debugging

    No debugger! There are some projects attempting to add one but Rob Pike finds them a crutch.

    My Advice

    Lots of unit tests and print statements.

    No generics

    Sorta self-explanatory: the lack of generics causes you a lot of pain when you're used to reaching for them.

    My advice

    Look at the code generation support, which uses pragmas. This is not exactly the same as having generics, but if you have some code that has a lot of boilerplate without them, it is a valid alternative. See this official Go Blog post for more details.

    If you don't want to use generation, you really only have reflection left as a valid tool, which comes with all of its loss of speed and type safety.

    Cross compiling

    If you have certain features or dependencies, you may find you cannot take advantage of one of Go's better features: cross-compilation.

    I ran into this when using the confluent-go/kafka library, which depends on the C librdkafka library. It basically meant I had to do all my development in a Linux VM because almost all our packages relied on it.

    My Advice

    Avoid C dependencies at all costs.

    Error handling

    Go error handling is not exception-based but return-based, and it's got a lot of common idioms around it:

    myValue, err := doThing()
    if err != nil {
    	return -1, fmt.Errorf("unable to doThing %v", err)
    }
    

    Needless to say, this can get very wordy when dealing with deeply nested calls or when you're interacting a lot with external systems. It is definitely a mind shift if you're used to throwing exceptions wherever and having one single place to catch them all and handle them appropriately.

    My Advice

    I'll be honest, I never totally made my peace with this. I had good training from experienced open-source contributors to major Go projects, read all the right blog posts, and definitely felt like I'd heard enough from the community on why the current state of Go error handling was great in their opinion, but the lack of stack traces was a deal breaker for me.

    On the positive side, I found Dave Cheney's advice on error handling to be the most practical, and he wrote a package containing a lot of that advice. We found it invaluable, as it provided those stack traces we all missed, but you had to remember to use it.

    Summary

    A lot of people really love Go and are very productive with it; I just was never one of those people, and that's ok. However, I think the advice in this post is reasonably sound and uncontroversial. So, if you find yourself needing to write some code in Go, give this guide a quick perusal and you'll waste a lot less time than I did getting productive developing applications in Go.

    Cassandra: Batch Loading Without the Batch — The Nuanced Edition


    My previous post on this subject has proven extraordinarily popular and I get commentary on it all the time, most of it quite good. It has, however, gotten a decent number of comments from people quibbling with the nuance of the post and pointing out its failings, which is fair because I didn't explicitly spell it out as a "framework of thinking" blog post or a series of principles to consider. This is the tension between making something approachable and understandable to the new user while still being technically correct for the advanced one. Because of where Cassandra was at the time and the user base I was encountering day to day, I took the approach of simplification for the sake of understanding.

    However, now I think is a good time to write this up with all the complexity and detail of a production implementation and the tradeoffs to consider.

    TLDR

    1. Find the ideal write size; it can make a 10x difference in perf (10k–100k is common).
    2. Limit threads in flight when writing.
    3. Use token-aware unlogged batches if you need to get to your ideal size.

    Details on all this below.

    You cannot escape physics

    Ok it’s not really physics, but it’s a good word to use to get people to understand you have to consider your hard constraints and that wishing them away or pretending they’re not there will not save you from their factual nature.

    So now let's talk about some of the constraints that will influence your optimum throughput strategy.

    large writes hurt

    Now that the clarification is out of the way: something the previous batch post neglected to mention was that the size of your writes can change these tradeoffs a lot. Back in 2.0 I worked with Brian Hess of DataStax on figuring this out in detail (he did most of the work, I took most of the credit), and what we found with the hardware we had at the time was that total THROUGHPUT in megabytes changed by an order of magnitude depending on the size of each write. In the case of 10mb writes, total throughput plummeted to an embarrassingly slow level, nearly 10x lower.

    small writes hurt

    Using the same benchmarks, tiny writes still managed to hurt a lot, and 1k writes had 9x less throughput than 10–100k writes. This number will vary depending on A LOT of factors in your cluster, but it tells you the type of things you should be looking at if you're wanting to optimize throughput.

    bottlenecks

    Can your network handle 100 of your 100mb writes? Can your disk? Can your memory? Can your client? Pay attention to whatever gets pegged during load and that should drive the size of your writes and your strategy for using unlogged batches or not.

    RECORD COUNT BARELY MATTERS

    I’m dovetailing here with the previous statement a bit and starting with something inflammatory to get your attention. But this is often the metric I see people use and it’s at best accidentally accurate.

    I've had some very smart people tell me "100 row unlogged batches that are token aware are the best." Are those 100 rows all 1GB apiece? Or are they all 1 byte apiece?

    Your blog was wrong!

    I had many people smarter than I tell me my “post was wrong” because they were getting awesome performance by ignoring it. None of this is surprising. In some of those cases they were still missing the key component of what matters, in other cases rightly poking holes in the lack of nuance in my previous blog post on this subject (I never clarified it was intentionally simplistic).

    wrong - client death

    I intentionally left out code to limit threads in flight. Many people new to Cassandra are often new to distributed programming and, consequently, multi-threaded programming. I was in this world myself for a long time. Your database, your web framework and all of your client code do everything they can to hide threads from you and often will give you nice fat thread-safe objects to work with. Not everyone in computer science has the same background and experience, and the folks in traditional batch-job programming, which is common in insurance and finance where I started, do not have the same experience as folks working with distributed real-time systems.

    In retrospect I think this was a mistake, and I ran into a large number of users who did copy and paste my example code from the previous blog post and then hit issues with clients exhausting resources. So here is an updated example using the DataStax Java driver that will allow you to adjust thread levels easily and tweak the number of threads in flight:

    Both examples below target driver versions 2.1–3.0.

    Using classic manual manipulation of futures
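
    As a minimal sketch of this approach (the contact point, keyspace, table, and permit count here are placeholders, not from the original post): a semaphore bounds how many writes are in flight, and each ResultSetFuture releases a permit when it completes.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;

    import java.util.concurrent.Semaphore;

    public class BoundedAsyncWriter {
        public static void main(String[] args) throws Exception {
            final int maxInFlight = 128;                 // tune to your cluster
            final Semaphore permits = new Semaphore(maxInFlight);

            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("my_ks")) {

                // hypothetical table: CREATE TABLE my_table (id int PRIMARY KEY, value text)
                PreparedStatement insert =
                        session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)");

                for (int i = 0; i < 1_000_000; i++) {
                    permits.acquire();                   // block when too many writes are in flight
                    ResultSetFuture future = session.executeAsync(insert.bind(i, "value-" + i));
                    // release the permit when the write finishes, success or failure
                    future.addListener(permits::release, Runnable::run);
                }
                permits.acquire(maxInFlight);            // drain: wait for the outstanding writes
            }
        }
    }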

    This is with a fixed thread pool and callbacks
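
    A sketch of that flavor, using Guava's Futures.addCallback with a small fixed pool for the callbacks (the cluster details and statement are again placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    public class CallbackWriter {
        public static void main(String[] args) throws Exception {
            final int maxInFlight = 128;
            final Semaphore permits = new Semaphore(maxInFlight);
            // fixed pool that runs the callbacks, keeping them off the driver's I/O threads
            final ExecutorService callbackPool = Executors.newFixedThreadPool(4);

            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("my_ks")) {

                // same hypothetical my_table as above
                PreparedStatement insert =
                        session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)");

                for (int i = 0; i < 1_000_000; i++) {
                    permits.acquire();
                    ResultSetFuture future = session.executeAsync(insert.bind(i, "value-" + i));
                    Futures.addCallback(future, new FutureCallback<ResultSet>() {
                        public void onSuccess(ResultSet rs) {
                            permits.release();
                        }
                        public void onFailure(Throwable t) {
                            permits.release();
                            System.err.println("write failed: " + t);
                        }
                    }, callbackPool);
                }
                permits.acquire(maxInFlight);            // wait for outstanding writes
            }
            callbackPool.shutdown();
        }
    }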

    Either approach works as well as the other. Your preference should fit your background.

    wrong - token aware batches

    Sure, token-aware unlogged batches that act like big writes really do match up with my advice on unlogged batches and partition keys. If you can find a way to have all of your batch writes go to the correct node, and use a routing key to do so, then yes, of course you can get some nice perf boosts. If you are lucky enough to make it work well, you have my blessing.

    Anyway it is advanced code and I did not think it appropriate for a beginner level blog post. I will later dedicate an entire blog post just to this.

    Conclusion

    I hope you found the additional nuance useful and maybe this can help some of you push even faster once all of these tradeoffs are taken into account. But I hope you take away my primary lesson:

    Distributed databases often require you to apply distributed principles to get the best experience out of them. Doing things the way we’re used to may lead to surprising results.

    CASSANDRA LOCAL_QUORUM SHOULD STAY LOCAL


    A couple of times a week I get a question where someone wants to know how to "fail over" to a remote DC in the driver if the local Cassandra DC fails, or even if there are only a couple of nodes in the local data center that are down.

    Frequently the customer is using LOCAL_QUORUM or LOCAL_ONE at our suggestion to pin all reads and writes to the local data center, often at their own request, to help with p99 latencies. But now they want the driver to "do the right thing" and start using replicas in another data center when there is a local outage.

    Here is how (but you DO NOT WANT TO DO THIS)
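
    Roughly, with the DataStax Java driver's DCAwareRoundRobinPolicy builder it looks like the sketch below (driver 3.x style; the DC name, contact point, and host count are placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class RemoteFailoverCluster {
        public static Cluster build() {
            DCAwareRoundRobinPolicy dcPolicy = DCAwareRoundRobinPolicy.builder()
                    .withLocalDc("DC1")
                    // allow up to 2 hosts per remote DC to be used when local hosts are down...
                    .withUsedHostsPerRemoteDc(2)
                    // ...even for LOCAL_ONE / LOCAL_QUORUM queries, which is the dangerous part
                    .allowRemoteDCsForLocalConsistencyLevel()
                    .build();

            return Cluster.builder()
                    .addContactPoint("10.0.0.1")
                    .withLoadBalancingPolicy(new TokenAwarePolicy(dcPolicy))
                    .build();
        }
    }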

    However you may end up with more than you bargained for.


    Here is why you DO NOT WANT TO DO THIS

    Intended case is stupidly rare

    The common case I hear is "my local Cassandra DC has failed." Ok, stop... so your app servers in the SAME DATA CENTER are up, but your Cassandra nodes are all down?

    I’ve got news for you, if all your local Cassandra nodes are down one of the following has happened:

    1. The entire datacenter is down including your app servers that are supposed to be handling this failover via the LoadBalancingPolicy
    2. There is something seriously wrong in your data model that is knocking all the nodes offline; if you send this traffic to another data center it will also go offline (assuming you have the bandwidth to get the queries through)
    3. You've created a SPOF in your deployment for Cassandra using SAN or some other infrastructure mistake... don't do this

    So on the off chance that by dumb luck or just neglect you hit this scenario, then sure, go for it, but it's not going to be free (I cover this later on, so keep reading).

    Common failure cases aren't helped

    A badly thought out and unproven data model is the most common reason outside of SAN that I see widespread issues with a cluster or data center; failover at a data center level will not help you in either case.

    Failure is often transient

    More often you may have a brief problem where a couple of nodes are having trouble. Say one expensive query happened right before, or you restarted a node, or you made a simple configuration change. Do you want your queries bleeding out to the other data center because of a very short-term hiccup? Instead, you could have just retried the query one more time; the local nodes would have responded before the remote data center even had time to receive the query.

    TLDR More often than not the correct behavior is just to retry in the local data center.

    Application SLAs are not considered

    Electrons only move so fast, and if you have 300 ms latency between your London and Texas data centers, while your application has a 100ms SLA for all queries, you’re still basically “down”.

    Available bandwidth is not considered

    Now let's say the default latency is fine; how fat is that pipe? Often on long links companies go cheap and stay in the sub-100Mb range. This can barely keep up with replication traffic from Cassandra for most use cases, let alone shifting all queries over. That 300ms latency will soon climb to 1 second, or eventually just outright failure as the pipe totally jams full. So you're still down.

    Operational overhead of the "failed to" data center is not considered

    I've mentioned this in passing a few times, but let's say you have fast enough pipes: can your secondary data center handle not only its local traffic but also the new 'failover' traffic coming in? If you've not allowed for that operational overhead, you may be taking down all your data centers one at a time until none remain.

    Intended consistency is not met

    Now let's say you have the operational capacity, latency and bandwidth to connect to that remote data center. HUZZAH, RIGHT?

    Say your app is using LOCAL_QUORUM. This implies something about your consistency needs AKA you need your model to be relatively consistent. Now if you recall above I’ve mentioned that most failures in practice are transient and not permanent.

    So what happens in the following scenario?

    • Node A and B are down restarting because a new admin got overzealous.
    • A write logging a user's purchase, destined for A and B in DC1, instead goes to their partners in DC2, and the write quietly completes successfully in the remote DC.
    • Node A and B and DC1 start responding but do not yet have the write from DC2 propagated back to them.
    • The user goes back to read his purchase history, which goes to nodes A and B, but they do not yet have it.

    Now all of this happened without an error, and the application server had no idea it should retry. I've now defeated the point of LOCAL_QUORUM and I'm effectively using a consistency level of TWO without realizing it.

    What should you do?

    Same thing you were doing before. Use technology like Eureka, F5, or whatever custom solution you have or want to use.

    I think there is some confusion in terminology here, Cassandra data centers are boundaries to lock in queries and by extension provide some consistency guarantee (some of you will be confused by this statement but it’s a blog post unto itself so save that for another time) and potentially to pin some administrative operations to a set of nodes. However, new users expect the Cassandra data center to be a driver level failover zone.

    If you need queries to go beyond a data center, then use the classic QUORUM consistency level or something like SERIAL; it's globally honest and provides consistency at the cluster level in exchange for a latency hit, but you'll get the right answer more often than not that way (SERIAL especially, as it takes race conditions into account).
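
    As a minimal sketch of what that looks like with the DataStax Java driver (the keyspace, table, and key below are made up), you can set the consistency level per statement:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class QuorumReadExample {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
                 Session session = cluster.connect("my_ks")) {

                // QUORUM spans the whole cluster, so the query can still succeed when the
                // local DC is degraded, at the cost of cross-DC latency.
                SimpleStatement read = new SimpleStatement(
                        "SELECT last_name FROM users WHERE user_id = ?", "user-42");
                read.setConsistencyLevel(ConsistencyLevel.QUORUM);

                for (Row row : session.execute(read)) {
                    System.out.println(row.getString("last_name"));
                }
            }
        }
    }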

    So lets step through the above scenarios:

    • Transient failure. Retry is probably faster.
    • SLAs are met or the query can just fail at least on the Cassandra side. If you swap at the load balancer level to a distant data center the customer may not have a great experience still, but dependent queries won’t stack up.
    • Inter DC bandwidth is largely unaffected.
    • Application consistency is maintained.

    One scenario that still isn't handled is having adequate capacity in the downstream data centers, but that's something you have to think about either way. On the whole, uptime and customer experience are way better when failing over with a dedicated load balancer than when trying to do multi-DC failover with the Cassandra client.

    So is there any case I can use allowRemoteHosts?

    Of course but you have to have the following conditions:

    1. Enough operational overhead to handle application servers from multiple data centers talking to a single Cassandra data center
    2. Enough inter DC bandwidth to support all those app servers talking across the pipe
    3. High enough SLAs, or data centers close enough together, to still hit your SLA
    4. Weak consistency needs but a preference for more consistency than what ONE would give you
    5. Prefer local nodes to remote ones still even though you have enough load, bandwidth and low enough latency to handle all of these queries from any data center.

    Basically the use of this feature is EXTRAORDINARILY minimal and at best only gives you a minor benefit over just using a consistency level of TWO.

    Connection to Oracle From Spark


    For some silly reason there has been a fair amount of difficulty in reading from and writing to Oracle from Spark when using DataFrames.

    SPARK-10648 — Spark-SQL JDBC fails to set a default precision and scale when they are not defined in an oracle schema.

    This issue manifests itself when you have numbers in your schema with no precision or scale defined and attempt to read them. They added the following dialect here

    This fixes the issue for 1.4.2, 1.5.3 and 1.6.0 (and DataStax Enterprise 4.8.3). However recently there is another issue.

    SPARK-12941 — Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype

    And this led me to this SO issue. Unfortunately, one of the answers has a fix they claim only works for 1.5.x; however, I had no issue porting it to 1.4.1. The solution looked something like the following, which is just a port of the SO answer above. (This is not under warranty and it may destroy your server, but it should allow you to write to Oracle.)
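
    A rough sketch of that style of fix in Java, using Spark's developer API for JDBC dialects (the VARCHAR2(255) and NUMBER(1) mappings below are illustrative choices, not the exact ones from the answer):

    import org.apache.spark.sql.jdbc.JdbcDialect;
    import org.apache.spark.sql.jdbc.JdbcType;
    import org.apache.spark.sql.types.DataType;
    import org.apache.spark.sql.types.DataTypes;
    import scala.Option;

    import java.sql.Types;

    // Map Spark SQL types onto concrete Oracle column types before writing.
    public class OracleWriteDialect extends JdbcDialect {

        @Override
        public boolean canHandle(String url) {
            return url.startsWith("jdbc:oracle");
        }

        @Override
        public Option<JdbcType> getJDBCType(DataType dt) {
            if (DataTypes.StringType.equals(dt)) {
                // Oracle has no TEXT type; pick a VARCHAR2 size that fits your data.
                return Option.apply(new JdbcType("VARCHAR2(255)", Types.VARCHAR));
            }
            if (DataTypes.BooleanType.equals(dt)) {
                return Option.apply(new JdbcType("NUMBER(1)", Types.NUMERIC));
            }
            return Option.empty();
        }
    }

    // Register it once before writing the DataFrame:
    //   org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(new OracleWriteDialect());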

    In the future keep an eye out for more official support in SPARK-12941, and then you can forget the hacky workaround above.

    Reflection, Scala 2.10, and Spark: weird errors when saving to Cassandra


    This originally started with this SO question, and I’ll be honest I was flummoxed for a couple of days looking at this (in no small part because the code was doing a lot). But at some point I was able to isolate all issues down to dataFrame.saveToCassandra. Every 5 or so runs I’d get one or two errors:

    or

    I could write this data to a file with no issue, but when writing to Cassandra these errors would come flying out at me. With the help of Russ Spitzer (by help I mean him explaining to my thick skull what was going on) I was pointed to the fact that reflection wasn't thread safe in Scala 2.10.

    No miracle here: objects have to be read via reflection, down to even the type checking (type checking being something that writing out as text avoids), and then objects have to be populated on flush.

    Ok, so what are the fixes?

  • Run with a version of Spark compiled against Scala 2.11
  • Use foreachPartition to write directly using the Cassandra java driver API and avoid reflection. You’ll end up giving up some performance tweaks with token aware batching if you go this route, however.
  • Accept it! Spark is fault tolerant; in these cases the job is not failing, the error is rare enough that Spark just quietly retries the task, and there is no data loss and no pain.

    Logging The Generated CQL from the Spark Cassandra Connector


    This has come up some in the last few days so I thought I’d share the available options and the tradeoffs.

    Option 1: Turn ON ALL THE TRACING! nodetool settraceprobability 1.0

    Probabilistic tracing is a handy feature for finding expensive queries in use cases where there is little control over who has access to the cluster, i.e. most enterprises in my experience (ironic considering all the process). It's too expensive to turn it up very high in production, but in development it's a good way to get an idea of what a query turns into. Read more about probabilistic tracing here:

    and the command syntax:

    Option 2: Trace at the driver level

    Set TRACE logging level on the java-driver request handler on the spark nodes you’re curious about.

    Say I have a typical join query:

    On the Spark nodes, now configure the DataStax Java driver RequestHandler.

    In my case, using the tarball, this is dse-4.8.4/resources/spark/conf/logback-spark-executor.xml. In that file I just added the following inside the configuration element:

    On the Spark nodes you'll now see the generated CQL in the executor logs. In my case that's /var/lib/spark/worker/worker-0/app-20160203094945–0003/0/stdout, where app-20160203094945–0003 is the job name.

    You'll note this is a dumb table scan that is only limited to the tokens the node owns, and that the tokens involved are not visible; I leave it to the reader to repeat this exercise with pushdown like 2i and partitions.

    Don’t use TextField for your unique key in Solr


    This seems immediately obvious when you think about it: TextField is what you use for fuzzy searches in Solr, and why would a person want a fuzzy search on a unique value? While I can come up with some oddball use cases, making use of copy fields would seem to be the more valid approach and more fitting with the typical use of Solr, i.e. you filter on strings and query on text.

    However, people have done this a few times and it throws me for a loop, and in the case of DataStax Enterprise Search (built on Solr) it creates an interesting split between the index and the data.

    Given a Cassandra schema of

    A Solr Schema of (important bits in bold):

    Initial records never get indexed

    I'm assuming this is because the part of indexing that checks whether a record has been visited or not is thrown off by the tokens:

    First fill up a table

    Then turn on indexing

    Add one more record

    Then query via Solr and…no ‘1235’ or ‘1234’

    But Cassandra knows all about them

    To recap: we never indexed '1234' and '1235', for some reason '123' indexes, and later on when I add '9999' it indexes fine. Later testing showed that as soon as '1234' was re-added it joined the search results, so this only appears to happen to records that were there beforehand.

    Deletes can greedily remove LOTS

    I delete id ‘1234’

    But when I query Solr I find only:

    So where did '1234 4566', '1235', and '1230' go? If I query Cassandra directly they're safe and sound; Solr just has no idea about them anymore.

    To recap, this is just nasty and the only fix I’ve found is either reindexing or just adding the records again.

    Summary

    Just use a StrField type for your key and everything is happy. Special thanks to J.B. Langston (twitter https://twitter.com/jblang3) of DataStax for finding the nooks and crannies and then letting me take credit by posting about it.

    Spark job that writes to Cassandra just hangs when one node goes down?


    So this was hyper obvious once I saw the executor logs and the database schema, but this had me befuddled at first and the change in behavior with one node should have made it obvious.

    The code was simple: read a bunch of information, do some minor transformations, and flush to Cassandra. Nothing crazy. But during the user's fault tolerance testing, the job would just seemingly hang indefinitely when a node was down.

    If one node takes down your app, do you have any replicas?

    That was it, in fact that's always it: if something mysteriously "just stops," usually you have a small cluster and no replicas (RF 1). Now one may ask why anyone would ever have one replica with Cassandra, and while I concede it is a very fair question, that was the case here.

    For example, if I have RF 1 and three nodes, when I write a row it's only going to go to one of those nodes. If that node dies, then how would I retrieve the data? Wouldn't the job just fail? Ok, yeah, wait a minute, why is the job not failing?

    It’s the defaults!

    This is a bit of misdirection: the other nodes were timing out (probably a slow IO layer). If we read the connector docs, we get query.retry.count, which gives us 10 retries, and the default read.timeout_ms is 120000 (which, confusingly, is also the write timeout). With 10 retries that's 1.2 million milliseconds, or 20 minutes, to fail a task that is timing out. If you then retry that task 4 times (which is the Spark default), it could take you 80 minutes to fail the job, assuming all the writes time out.

    The Fix

    Short term

  • Don't let any nodes fall over.
  • Drop the retry timeout down to 10 seconds; this will at least let you fail fast.
  • Drop output.batch.size.bytes down. The default is 1mb; halve it until you stop having issues (see the config sketch after these lists).

    Long term

  • Use a proper RF of 3
  • I still think the default timeout of 120 seconds is way too high. Try 30 seconds at least.
  • Get better IO; usually local SSDs will get you where you need to go.
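
    As a sketch, these knobs are all spark.cassandra.* properties on the SparkConf; the host and values below are placeholders to tune for your own cluster:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FailFastCassandraJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("fail-fast-writer")
                    .set("spark.cassandra.connection.host", "10.0.0.1")
                    // fail a timing-out task in 10s instead of the 120s default
                    .set("spark.cassandra.read.timeout_ms", "10000")
                    // fewer retries means a bad node surfaces as an error sooner
                    .set("spark.cassandra.query.retry.count", "3")
                    // halve the batch size until the timeouts stop
                    .set("spark.cassandra.output.batch.size.bytes", "524288");

            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... build your RDD / DataFrame and save to Cassandra here ...
            sc.stop();
        }
    }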

    Synthetic Sharding with Cassandra. Or How To Deal With Large Partitions.


    It's extremely overdue that I write this down, as it's a common problem that really applies to any database that needs to scale horizontally, not just Cassandra.

    Problem Statement

    Good partition keys are not always obvious, and it’s easy to create a bad one.

    Defining A Bad Partition

    1. Uneven Access. Read OR write count is more than 2x different from one partition to another. This is a purist view, and some of you using time series are screaming at me right now; set that aside, I'll have another blog post for you. If you're new to Cassandra, consider it a good principle and goal to aim for.
    2. Uneven Size. Same as above, really: run cfhistograms, and if you see really tiny or empty partitions next to really large ones, or ones with really high cell counts, you are at least uneven on ingest. I would shoot for within an order of magnitude or two. If you're smart enough to tell me why I'm wrong here that's fine, I'm not gonna care, but if you're new to Cassandra this is a good goal.
    3. Too Many Cells. The cell count for a given partition is over 100k (run cfhistograms to get these numbers). This is entirely a rule of thumb and varies amazingly with hardware, tuning and column names (length matters). You may find you can add more and not hit a problem, and you may find you can't get near this. If you want to be exacting, you should test.
    4. Too Large. Your partition size is over 32mb (also in cfhistograms). This varies like cell count does. Some people tell me this matters less now (as of 2.1), and they run a lot larger. However, I've seen it cause problems on a number of clusters. I repeat: as a new user this is a good number to shoot for; once you're advanced enough to tell me why I'm wrong, you may ignore this rule. Again, you should test your cluster to find the number where things get problematic.

    Options if you have a bad partition

    1. Pick a better partition key (read http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling).
    2. Give up and use Synthetic Sharding.
    3. Pretend it's not a problem and find out the hard way that it really is; usually this is at 3 am.

    Synthetic Sharding Strategy: Shard Table

    Pros

    • Always works.
    • Easy to parallelize (can be writing to shards in parallel).
    • Very very common and therefore battle tested.

    Cons

    • May have to do shards of shards for particularly large partitions.
    • Hard for new users to understand.
    • Hard to use in low latency use cases (but so are REALLY large partitions, so it’s a problem either way).

    Example Idea
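
    A minimal sketch of the idea with the Java driver (the keyspace, table and column names are made up): a small lookup table records how many shards each natural key has, and the data table folds the shard id into the partition key.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class ShardTableSchema {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("my_ks")) {

                // lookup table: how many shards does each natural key have?
                session.execute(
                    "CREATE TABLE IF NOT EXISTS shard_lookup ("
                    + " natural_key text PRIMARY KEY,"
                    + " shard_count int)");

                // data table: the shard id is part of the partition key, so one huge
                // logical partition becomes shard_count smaller physical partitions
                session.execute(
                    "CREATE TABLE IF NOT EXISTS events_sharded ("
                    + " natural_key text,"
                    + " shard_id int,"
                    + " event_time timeuuid,"
                    + " payload text,"
                    + " PRIMARY KEY ((natural_key, shard_id), event_time))");

                // reads look up shard_count first, then query shards 0..shard_count-1
                // (in parallel if you like) and merge the results client-side
            }
        }
    }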

    Synthetic Sharding Strategy: Shard Count Static Column

    Pros

    • No separate shard table.
    • No shards of shards problem
    • 2x faster to read than the shard table option when there is a single shard.
    • Still faster than the shard table option even when there is more than a single shard.

    Cons

    • Maybe even harder for new users since it’s a little bit of a surprise.
    • Harder to load concurrently.
    • I don’t see this in wide use.

    Example Idea

    Synthetic Sharding Strategy: Known Shard Count

    Pros

    • No separate shard table.
    • No shards of shards problem
    • Not as fast as static column shard count option when only a single shard.
    • Easy to grasp once the rule is explained.
    • Can easily abstract the shards away (if you always query for example 5 shards, then this can be a series of queries added to a library).
    • Useful when you just want to shrink the overall size of the partitions by a set order of magnitude, but don’t care so much about making sure the shards are even.
    • Can use random shard selection and probably call it ‘good enough’
    • Can even use a for loop on ingest (and on read).

    Cons

    • I don’t see this in wide use.
    • Shard selection has to be somewhat thoughtful.

    Example Idea

    Appendix: Java Example For Async

    A lot of folks seem to struggle with async queries. So, for example, using an integer style of shards, it would just be something very simple like:
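
    A sketch of that loop with the Java driver, reusing the hypothetical events_sharded layout from the shard table sketch above but with a hard-coded shard count: fire one executeAsync per shard, then walk the futures.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    import java.util.ArrayList;
    import java.util.List;

    public class ShardedReader {
        private static final int SHARD_COUNT = 5;        // the "known shard count"

        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("my_ks")) {

                PreparedStatement select = session.prepare(
                        "SELECT payload FROM events_sharded WHERE natural_key = ? AND shard_id = ?");

                // fire one async query per shard, then collect the results
                List<ResultSetFuture> futures = new ArrayList<>();
                for (int shard = 0; shard < SHARD_COUNT; shard++) {
                    futures.add(session.executeAsync(select.bind("user-42", shard)));
                }
                for (ResultSetFuture future : futures) {
                    for (Row row : future.getUninterruptibly()) {
                        System.out.println(row.getString("payload"));
                    }
                }
            }
        }
    }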

    Cassandra’s “Repair” Should Be Called “Required Maintenance”


    One of the bigger challenges when you go eventually consistent is how to reconcile data not being replicated. This happens if you're using Oracle and multiple data centers with tech like Golden Gate, and it happens if you're using async replicas with MySQL and one of your replicas gets out of whack. You need a way to "repair" the lost data.

    Since Cassandra fully embraces eventual consistency, repair is actually an important mechanism for making sure copies of data are shipped around the cluster to meet your specified replication factor. About now several of you have alarm bells going off and think I'm insane. Let's step through the common objections I hear.

    What about Consistency Level?

    Consistency Level (from here on out CL) does effectively define your contract for how many replicas are required for a write to be 'successful', but that's only at write time, and unless you're doing something silly like CL ALL, you can't be certain that you've got RF (replication factor) copies around the cluster. Say you're using CL ONE for all reads and writes; that means you're only counting on having one copy of the data. This means you can lose data if that node goes down.

    So either write at CL ALL or use repair to make sure your cluster is spending most of its time at your specified RF.

    What about hinted handoffs?

    They're great, except that until Cassandra 3.0 they're not really that great. Hinted handoffs, for those of you who don't know, are written by the coordinator when a write to a remote node fails; these hints are replayed later on in a separate process.

    This helps eliminate the need for repair in theory. In practice, hints only last a relatively short window (3 hours by default), and generating a lot of them can be a huge resource hog for the system (3.0 should help greatly with this). I've worked on clusters where extended outages across data centers at very high TPS rates resulted in terabytes of nothing but hints in the cluster.

    In summary, hints are at best a temporary fix. If you have any extended outage then repair is your friend.

    Beware Deletes If You’re Too Cool For Repair

    Another important aspect of repair is when you think about tombstones. Say I issue a delete to two nodes but it only succeeds on one. My data will look like so:

    However, a few hours later, compaction on node 2 removes the first write. For now, when I query, timestamp comparison will give me the right answer and partition 1 has no last_name value.

    However, when I go past gc_grace_seconds (default 10 days) that tombstone will be removed. Since I’ve never run repair (and I’m assuming I don’t have read repair to save me), I now only have partition 1 on node 1 and my deleted data comes back from the dead.

     

    Summary: Run Repair

    So for most use cases you'll see, you'll really want to run repair on each node within gc_grace_seconds, and I'd advise running repair more frequently than that if you use CL ONE and do deletes or lots of updates to the same values.

    I'll add the one use case where repair has less importance: if you just do single writes with no updates, use LOCAL_QUORUM consistency level, have a single data center, and rely on TTLs to delete your data, you can probably skip repair until you find a need for it, such as a lost node.
