Data Density! Destroyer of Scalability
UPDATE: I’d incorrectly attributed a practice of scaling down daily to Netflix. I cannot find any reference to them using it today, and I’ve been unable to find my previous reference to it, so I’ve removed the point. I’ll cover cluster scaling strategies in a later blog post. TLDR: smaller nodes are easier to scale.
Over the past couple of years I’ve seen a growing and, to my mind, unfortunate trend towards ever larger Cassandra nodes running more expensive and more specialized hardware: teams wanting 17 TB nodes in a quest to cram ever more data onto a single unit (TLDR: that rarely works). While lowering your rack footprint sounds hyper appealing, let me be the first to tell you it comes at a great cost.
The Tale of Two Cassandra Clusters
Low Density Cluster
Say 300 GB per node; it has the following properties:
High Density Cluster
Say 17 TB per node; it has the following properties:
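To put rough numbers on why density hurts scalability: a replacement or new node has to stream its data set before it can fully pull its weight, and that time scales directly with density. Here is a minimal back-of-envelope sketch; the 50 MB/s sustained streaming rate is my assumption, not a measured figure, so plug in whatever your cluster actually achieves.

```python
def hours_to_stream(data_tb, throughput_mb_s=50):
    """Hours to stream a node's data set at a sustained rate.

    throughput_mb_s is an assumed sustained streaming rate in MB/s,
    not a measured Cassandra figure -- substitute what you observe.
    """
    data_mb = data_tb * 1024 * 1024          # TB -> MB
    return data_mb / throughput_mb_s / 3600  # seconds -> hours

for density_tb in (0.3, 5, 17):
    print(f"{density_tb:>4} TB per node ~ {hours_to_stream(density_tb):5.1f} hours to rebuild")
```

At 300 GB that is under two hours; at 17 TB it is roughly four days per node, and every bootstrap, replacement, and rebuild pays that bill.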
Cost Exercise
Some of you may be saying “our rack space is limited and expensive”. I’d suggest looking at platforms like Moonshot if that’s the case; though it suffers from some of the same flaws as the dense cluster, it’s at least dense in a rack sense. But most of you will enjoy many factors of cost savings by going with cheaper commodity hardware, and you’ll get a better bang for your buck when it’s all said and done.
On-Premise Scenario
This one actually happened to me and I was shocked at the cost increase.
High Density: 5 TB per node
- $70k per node (not a typo)
- Node count: 5
- Total cost = $70k * 5 = $350k
Low Density: 300 GB per node
- $2k per node
- Node count: 84
- Total cost = $2k * 84 = $168k
So I could even spring for double the hardware or double the node count, or just pocket the savings to pay for more colo space. This gets even more ridiculous when you start trying to push 17 TB per node.
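If you want to sanity-check quotes like these yourself, the math is trivial to script. A minimal sketch using the numbers above; the prices are from my scenario, so swap in your own:

```python
def cluster_cost(node_price, node_count, density_tb):
    """Total spend and cost per usable TB for a cluster quote."""
    total = node_price * node_count
    return total, total / (node_count * density_tb)

scenarios = {
    "high density (5 TB/node)":  cluster_cost(70_000, 5, 5),
    "low density (300 GB/node)": cluster_cost(2_000, 84, 0.3),
}
for name, (total, per_tb) in scenarios.items():
    print(f"{name}: ${total:,} total, ${per_tb:,.0f} per usable TB")
```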
Cloud Scenario
Since the cloud eliminates a lot of the moving pieces, let’s use it for this comparison.
High Density Instance
i2.8xlarge: $6.82 an hour, which gets you about 8 × 0.8 TB / 2, or roughly 3.2 TB of density.
Low Density Instance
m4.2xlarge: $0.504 an hour plus EBS SSD storage (cheap, but add a penny an hour for the sake of argument). Say 300 GB of density.
This ends up being a style choice. Amazon prices this really well and you’re getting a good rate on both. The low density option is slightly cheaper if you never scale down, and it’s far better hardware for the cost; the normal tradeoffs I mention above still apply. I’ll also note the m4.2xlarge could easily go ‘denser’; I’ve just chosen not to here. You could even look into cheaper node options and see how they perform for your workload.
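To compare the two on equal footing, normalize to cost per stored TB per month (roughly 730 hours). A quick sketch using the rates above, with the same penny-an-hour hand-wave for EBS:

```python
HOURS_PER_MONTH = 730  # rough average

def cost_per_tb_month(hourly_rate, density_tb):
    """Instance cost per stored TB per month at a given data density."""
    return hourly_rate * HOURS_PER_MONTH / density_tb

# i2.8xlarge: $6.82/hr, ~3.2 TB of usable density (half the ephemeral SSD)
print(f"i2.8xlarge: ${cost_per_tb_month(6.82, 3.2):,.0f} per TB-month")

# m4.2xlarge: $0.504/hr plus ~$0.01/hr of EBS (hand-waved), 300 GB density
print(f"m4.2xlarge: ${cost_per_tb_month(0.504 + 0.01, 0.3):,.0f} per TB-month")
```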
Summary
I hope this helps at least start a conversation about the tradeoffs of data density. I don’t pretend to know all the answers for every scenario, but I do think people have often failed to run even the most basic numbers.
Often they believe the cheaper hardware is too unreliable, but in fact 18 more nodes are more redundant than any single-node configuration, and often at half the cost.
So I suggest leveraging your distributed architecture to its fullest and considering smaller nodes with less data. The resulting operational agility will let you scale up and down with the day-to-day demands of your business, and let you optimize savings further, possibly exceeding any rack-space savings of the high density approach.