Reflection Scala-2.10 and Spark weird errors when saving to Cassandra

This originally started with this SO question, and I’ll be honest I was flummoxed for a couple of days looking at this (in no small part because the code was doing a lot). But at some point I was able to isolate all issues down to dataFrame.saveToCassandra. Every 5 or so runs I’d get one or two errors:

WARN 2016–02–05 13:30:42 org.apache.spark.scheduler.TaskSetManager: 
  Lost task 16.0 in stage 9.0 (TID 226, 
  org.apache.spark.sql.types.TimestampType$; unable to create instance


WARN 2016–02–05 13:18:38 org.apache.spark.scheduler.TaskSetManager:
   Lost task 67.0 in stage 17.0 (TID 483, 
   java.lang.UnsupportedOperationException: tail of empty list

I could write this data to a file with no issue, but when writing to a Cassandra these errors would come flying out at me. With the help of Russ Spitzer (by help I mean explaining to my thick skull what was going on) I was pointed to the fact that reflection wasn’t thread safe in Scala 2.10.

No miracle here objects have to be read via reflection down to the even the type checking (type checking being something writing via text that is avoided) and then objects have to be populated on flush.

Ok so what are the fixes

  1. Run with a version of Spark compiled against Scala 2.11
  2. Use foreachPartition to write directly using the Cassandra java driver API and avoid reflection. You’ll end up giving up some performance tweaks with token aware batching if you go this route, however.
  3. Accept it! Spark is fault tolerant, and in any of these cases the job is not failing and it’s so rare that Spark just quietly retries the task and there is no data loss and no pain.

About Ryan Svihla

I consider myself a full stack polyglot, and I have been writing a lot of JS and Ruby as of late. Currently, I'm a solutions architect at DataStax
This entry was posted in Cassandra, Spark and tagged , , . Bookmark the permalink. Follow any comments here with the RSS feed for this post.