Logging The Generated CQL from the Spark Cassandra Connector

This has come up a few times in the last few days, so I thought I’d share the available options and their tradeoffs.

Option 1: Turn ON ALL THE TRACING! nodetool settraceprobability 1.0

Probabilistic tracing is a handy feature for finding expensive queries in situations where you have little control over who has access to the cluster, i.e. most enterprises in my experience (ironic considering all the process). It’s too expensive to turn up very high in production, but in development it’s a good way to get an idea of what a query turns into. Read more about probabilistic tracing here:

and the command syntax:
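A quick sketch of the relevant nodetool commands (run against the local node; `gettraceprobability` lets you confirm the current setting):

```shell
# Trace every request (1.0 = 100%); use a small value like 0.01 in production.
nodetool settraceprobability 1.0

# Check the current trace probability.
nodetool gettraceprobability

# Turn it back off when you're done -- tracing at 1.0 is expensive.
nodetool settraceprobability 0.0
```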

Option 2: Trace at the driver level

Set TRACE logging level on the java-driver request handler on the spark nodes you’re curious about.

Say I have a typical join query:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

// Assumes an existing SparkContext named sc
HiveContext hiveContext = new HiveContext(sc);
DataFrame df = hiveContext.sql("SELECT t1.value as t1value, t2.value as t2value FROM test.my_table_2 as t1 " +
        "JOIN test.my_table_3 as t2 " +
        "ON t1.id = t2.id " +
        "WHERE t1.value = 'MY DATA'");
df.show();

Now, on the Spark nodes, configure logging for the DataStax Java driver RequestHandler.

In my case using the tarball this is dse-4.8.4/resources/spark/conf/logback-spark-executor.xml. In that file I just added the following inside the <configuration> element:

<logger name="com.datastax.driver.core.RequestHandler" level="TRACE"/>

On the Spark nodes you’ll now see the generated CQL in the executor logs. In my case that’s /var/lib/spark/worker/worker-0/app-20160203094945-0003/0/stdout, where app-20160203094945-0003 is the job name.
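If the executor log is noisy, a grep on the RequestHandler logger pulls out just the generated CQL. The pattern below is my own; the sample line stands in for a real executor stdout file, so point LOG at your actual path (like the one above) on a real cluster:

```shell
# Stand-in for a real executor log so this sketch is self-contained.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
TRACE 2016-02-03 09:49:52 com.datastax.driver.core.RequestHandler: [1660052691] SELECT "id","value" FROM "test"."my_table_2" WHERE token("id") > ? AND token("id") <= ?
INFO  2016-02-03 09:49:52 some.other.Logger: unrelated noise
EOF

# Keep only the TRACE lines that carry generated CQL.
grep 'RequestHandler.*SELECT' "$LOG"
```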

TRACE 2016-02-03 09:49:52 com.datastax.driver.core.RequestHandler:
   [1660052691] SELECT "id","solr_query","value" FROM "test"."my_table_2" WHERE token("id") > ? AND token("id") <= ?
TRACE 2016-02-03 09:49:52 com.datastax.driver.core.RequestHandler:
    [1416550571] SELECT "id","value" FROM "test"."my_table_3" WHERE token("id") > ? AND token("id") <= ?
TRACE 2016-02-03 09:49:52 com.datastax.driver.core.RequestHandler:
    [1551498375] SELECT "id","value" FROM "test"."my_table_3" WHERE token("id") > ? AND token("id") <= ?
TRACE 2016-02-03 09:49:52 com.datastax.driver.core.RequestHandler:
    [632641584] SELECT "id","solr_query","value" FROM "test"."my_table_2" WHERE token("id") > ? AND token("id") <= ?

You’ll note this is a dumb table scan, limited only to the token ranges the node owns. The actual token values are bound as parameters and so aren’t visible here. I leave it to the reader to repeat this exercise with pushdown, such as secondary indexes (2i) and partition restrictions.

About Ryan Svihla

I consider myself a full stack polyglot, and I have been writing a lot of JS and Ruby as of late. Currently, I’m a solutions architect at DataStax.
This entry was posted in Cassandra, Spark.