So this was hyper obvious once I saw the executor logs and the database schema, but this had me befuddled at first and the change in behavior with one node should have made it obvious.
The code was simple, read a bunch of information, do some minor transformations and flush to Cassandra. This was nothing crazy. But during the users fault tolerance testing, the job would just seamingly hang indefinitely when a node was down.
If one node takes down your app, do you have any replicas?
That was it, in fact that’s always it, if something myseriously “just stops” usually you have a small cluster and no replicas (RF 1). Now one may ask why anyone would ever have 1 replica with Cassandra and while I concede it is a very fair question, this was the case here.
Example if I have RF1 and three nodes, when I wrote a row it’s only going to go to 1 of those nodes. If it dies, then how would I retrieve the data? Wouldn’t it just fail? Ok yeah wait a minute why is the job not failing?
It’s the defaults!
This is a bit of misdirection, the other nodes were timing out (probably slow IO layer). If we read the connector docs we get query.retry.count which gives us 10 retries, and the default read.timeout_ms is 120000 (which confusingly is also the write timeout), so 1.2 million milliseconds or 20 minutes to fail a task that is timing out. If you retry that task 4 times (which is the Spark default) it could take you 80 minutes to fail the job, this is of course assuming all the writes timeout.
- Don’t let any nodes fall over
- drop retry down to 10 seconds, this will at least let you fail fast
- drop output.batch.size.bytes down. Default is 1mb, half until you stop having issues.
- Use a proper RF of 3
- I still think the default retry of 120 seconds is way too high. Try 30 seconds at least.
- Get better IO usually local SSDs will get you where you need to go.