Data Density! Destroyer of Scalability


UPDATE: I’d incorrectly attributed a practice to Netflix about scaling clusters down daily. I cannot find any reference to them doing this today, and I’ve been unable to find the previous reference to it, so I’ve just removed the point. I’ll cover cluster scaling strategies in a later blog post. TL;DR: smaller nodes are easier to scale.

Over the past couple of years I’ve seen a growing, and to my mind unfortunate, trend towards ever larger Cassandra nodes running more expensive and more specialized hardware: teams wanting 17 TB nodes in a quest to get ever more data onto a single unit (TL;DR: that rarely works). While lowering one’s rack footprint sounds hyper appealing, let me be the first to tell you, it comes at a great cost.

The Tale of Two Cassandra Clusters

Low Density Cluster

Say 300 GB per node, it has the following properties:

  • Can be very affordable hardware which is easy to replace and procure. Say 4–8 cores, 16–32 GB of RAM (more if it’s cheap) and a single big SSD or 2 SSDs in RAID 0.
  • Can be very power efficient.
  • Can more easily run in 1U rack units.
  • Can be quickly scaled up and down in a matter of hours, and by multiples. Because of the available overhead, scaling down is not painful.
  • More of the dataset can fit into RAM.
  • The cluster has more CPU and RAM relative to its data.
  • Losing a node means little; the higher your node count, the more readily you can lose hardware.
  • With a smaller data set per node you end up with a higher node count, which is what you need for rack awareness rather than having all your nodes literally sitting on one rack (see the rough sketch after this list).
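
As a rough illustration of the node-count and RAM points above, here’s a small sizing sketch. Every number in it is hypothetical (10 TB of raw data, RF=3, the RAM figures suggested above); plug in your own.

```python
import math

# Hypothetical sizing sketch: none of these numbers come from the post.
# 10 TB of raw data at RF=3, with the RAM figures suggested above, just to
# show how node count and RAM coverage fall out of the density choice.
RAW_DATA_TB = 10
RF = 3
profiles = {
    "low density":  {"tb_per_node": 0.3, "ram_gb": 32},
    "high density": {"tb_per_node": 17,  "ram_gb": 256},
}

total_tb = RAW_DATA_TB * RF  # data including replicas
for name, p in profiles.items():
    nodes = max(math.ceil(total_tb / p["tb_per_node"]), RF)  # need at least RF nodes
    ram_fraction = p["ram_gb"] / (p["tb_per_node"] * 1024)
    print(f"{name}: {nodes} nodes, ~{ram_fraction:.1%} of each node's data fits in RAM")

# low density: 100 nodes, ~10.4% of each node's data fits in RAM
# high density: 3 nodes, ~1.5% of each node's data fits in RAM
```
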
High Density Cluster

Say 17 TB per node, it has the following properties:

  • High-end CPU with 20+ cores, 256 GB of RAM or more, high-end disk controllers with 10+ disks, probably all SSD, all enterprise grade.
  • Power hog.
  • Very expensive to get stuffed into a small rack; you will pay a premium for this.
  • Will take weeks to scale up, as bootstrapping individual nodes will take eons (see the rough sketch after this list). Hard, almost by design, to scale down.
  • Unfortunately, again almost by design, most of the dataset will not fit into RAM.
  • You have little horsepower relative to data. This can be fine, but it does mean certain operations like repair will be agonizing.
  • Losing a node means you’re losing a large chunk of storage and capacity compared to an equivalent low density cluster. Your unit of work is basically bigger.
  • You’ll often end up with small (5 node) clusters for a lot of use cases. This makes rack awareness impractical. In fact you may literally end up with all your nodes in a single rack and you’re one network switch away from being down.
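
To put the scale-up point in perspective, here’s a back-of-the-envelope sketch of how long streaming a single node’s worth of data might take at each density. The 50 MB/s effective streaming rate is purely an assumption; real numbers vary wildly with hardware, stream throttling, and compaction.

```python
# Rough back-of-the-envelope sketch: how long does it take to stream a new
# node's worth of data at each density? The 50 MB/s effective rate is a
# hypothetical, conservative assumption, not a measurement.
DENSITIES_GB = {"low density": 300, "high density": 17_000}
EFFECTIVE_STREAM_MB_PER_S = 50  # assumption, tune for your environment

for name, density_gb in DENSITIES_GB.items():
    seconds = density_gb * 1024 / EFFECTIVE_STREAM_MB_PER_S
    hours = seconds / 3600
    print(f"{name}: ~{hours:.1f} hours to bootstrap one node ({hours / 24:.1f} days)")

# low density: ~1.7 hours to bootstrap one node (0.1 days)
# high density: ~96.7 hours to bootstrap one node (4.0 days)
```
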
Cost Exercise

Some of you may be saying “our rack space is limited and expensive.” I’d suggest looking at platforms like Moonshot if that’s the case; though it’ll suffer from some of the same flaws as the dense cluster, it’s at least dense in a rack sense. But for most of you, you’ll enjoy many factors of cost savings by going with cheaper commodity hardware and get a better bang for your buck when all is said and done.

On Premise Scenario

This one actually happened to me, and I was shocked at the cost increase.

High Density: 5 TB per node

  • 70k per node (not a typo)
  • Node count: 5
  • Total cost = 70k * 5 = 350k

Low Density: 300 GB per node

  • 2k per node
  • Node count: 84
  • Total cost = 2k * 84 = 168k

So I could even spring for double the hardware, double the node count, or just pocket my savings to pay for more colo space. This gets even more ridiculous when you start trying to push 17 TB per node.
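
Written out as a quick script, using only the numbers quoted above (the per-TB figure is just those same numbers divided out):

```python
# Cost arithmetic from the on-premise scenario above ("k" = thousands of
# dollars; per-node prices and counts are the ones quoted in the post).
scenarios = {
    "high density (5 TB/node)":  {"cost_k": 70, "nodes": 5,  "tb_per_node": 5.0},
    "low density (300 GB/node)": {"cost_k": 2,  "nodes": 84, "tb_per_node": 0.3},
}

for name, s in scenarios.items():
    total_k = s["cost_k"] * s["nodes"]
    total_tb = s["tb_per_node"] * s["nodes"]
    print(f"{name}: total {total_k}k for {total_tb:.1f} TB "
          f"-> {total_k / total_tb:.1f}k per TB")

# high density (5 TB/node): total 350k for 25.0 TB -> 14.0k per TB
# low density (300 GB/node): total 168k for 25.2 TB -> 6.7k per TB
```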

Cloud Scenario

Since the cloud eliminates a lot of the moving pieces, let’s use it for the comparison.

High Density Instance

i2.8xlarge: $6.82 an hour, which gets you about 8 * 0.8 TB / 2, or 3.2 TB of density.

Low Density Instance

m4.2xlarge: $0.504 an hour plus EBS SSD storage (cheap, but add a penny an hour for the sake of argument). Say 300 GB of density.

This ends up being a style choice. Amazon does this really well and you’re getting a good rate on both. The low density option is slightly cheaper if you never scale down, and far better hardware for the cost, and the normal tradeoffs I mention above apply. I’ll also note the m4.2xlarge could easily go ‘denser’; I’ve just chosen not to here. One could even look into cheaper node options and see how they perform for your workload.
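
To make the ‘slightly cheaper’ claim concrete, here’s the per-TB-per-hour arithmetic using the prices above. The extra penny an hour for EBS is the same hand-wave as above, not a real EBS quote.

```python
# Rough hourly cost-per-TB comparison for the two instance choices above.
# Prices and densities are the ones quoted in the post; the +$0.01/hour
# stands in for EBS storage, as hand-waved above.
instances = {
    "i2.8xlarge (high density)": {"hourly": 6.82,         "tb": 3.2},
    "m4.2xlarge (low density)":  {"hourly": 0.504 + 0.01, "tb": 0.3},
}

for name, i in instances.items():
    print(f"{name}: ${i['hourly'] / i['tb']:.2f} per TB per hour")

# i2.8xlarge (high density): $2.13 per TB per hour
# m4.2xlarge (low density): $1.71 per TB per hour
```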

Summary

I hope this at least helps start a conversation about the tradeoffs of data density. I don’t pretend to know all the answers for every scenario, but I do think people have often failed to run even the most basic numbers.

Often they believe the cheap hardware is too unreliable, but in fact 18 more nodes is more redundant than any single-node configuration, and often at half the cost.

So I suggest leveraging distributed architecture to its fullest and considering smaller nodes with less data. The resulting agility in your ops will let you scale up and down with the day-to-day demands of your business, and let you optimize savings further, possibly exceeding any rack-space savings of the high density approach.
