This originally started with an SO question, and I'll be honest: I was flummoxed for a couple of days looking at this (in no small part because the code was doing a lot). At some point, though, I was able to isolate every issue down to dataFrame.saveToCassandra. Every five or so runs I'd get one or two errors:
I could write this data to a file with no issue, but when writing to Cassandra these errors would come flying out at me. With the help of Russ Spitzer (by "help" I mean he patiently explained to my thick skull what was going on), I was pointed to the fact that reflection wasn't thread safe in Scala 2.10.
No miracle here: objects have to be read via reflection, down to even the type checking (type checking is something that writing out as text avoids), and then objects have to be populated on flush.
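To make that concrete, here's a minimal sketch (not the original job, and `Row` is a made-up case class) of the kind of concurrent runtime reflection a multi-threaded write path ends up doing. On Scala 2.10 the runtime universe's symbol table was shared mutable state, so several threads hitting it at once could sporadically fail; on 2.11+ it is synchronized and this runs cleanly:

```scala
import scala.reflect.runtime.universe._

// Stand-in for a row type that a mapper would inspect reflectively.
case class Row(id: Int, payload: String)

object ReflectionRace {
  def main(args: Array[String]): Unit = {
    // Several threads all ask the runtime universe for type information
    // at the same time, roughly what a reflective object mapper does
    // once per row batch. On 2.10 this occasionally corrupted shared
    // reflection state and threw confusing, intermittent errors.
    val threads = (1 to 8).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          for (_ <- 1 to 1000) {
            val tpe = typeOf[Row]
            // Walk the members, as a mapper would to find accessors.
            tpe.members.foreach(_.typeSignature)
          }
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}
```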
Ok, so what are the fixes?
- Run with a version of Spark compiled against Scala 2.11
- Use foreachPartition to write directly using the Cassandra Java driver API and avoid reflection. You'll end up giving up some of the connector's performance tweaks, like token-aware batching, if you go this route, however.
- Accept it! Spark is fault tolerant, and in none of these cases does the job actually fail. The errors are rare enough that Spark just quietly retries the task; there is no data loss and no pain.
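For the second option, here's a hedged sketch of what the foreachPartition route might look like with the DataStax Java driver. The case class, keyspace, table, and column names are all placeholders for your schema. Because the prepared statement is bound explicitly, no Scala reflection touches the write path:

```scala
import com.datastax.driver.core.Cluster
import org.apache.spark.rdd.RDD

// Hypothetical row type standing in for whatever your RDD holds.
case class Event(id: String, value: Long)

def saveWithoutReflection(rdd: RDD[Event]): Unit = {
  rdd.foreachPartition { rows =>
    // The session is built inside the closure, i.e. on the executor,
    // so no driver-side connection objects need to be serialized.
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace")
    try {
      val insert = session.prepare(
        "INSERT INTO events (id, value) VALUES (?, ?)")
      rows.foreach { row =>
        // Bind values explicitly -- no reflective mapping of the case
        // class, which is what tripped over Scala 2.10's reflection.
        session.execute(
          insert.bind(row.id, java.lang.Long.valueOf(row.value)))
      }
    } finally {
      session.close()
      cluster.close()
    }
  }
}
```

One caveat with this sketch: opening a Cluster per partition is expensive, and in a real job you'd want to share or pool sessions per executor (connection reuse being one of the things the connector normally handles for you).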