Cassandra Query Patterns: Not using the “in” query for multiple partitions.

So let's say you're doing your best to data model everything around one partition. You've done your homework and all of your queries look like this:

SELECT * FROM my_keyspace.users where id = 1

Over time, however, as features are added, you make some tradeoffs and need to start running queries across partitions. At first there are only a few queries like this:

SELECT * FROM my_keyspace.users where id in (1,2,3,4)

Your cluster is well tuned, so you have no problems, but as time goes on your dataset grows and users are doing bigger searches across more users.

SELECT * FROM my_keyspace.users where id in
(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23)

Now you start seeing GC pauses and heap pressure that lead to overall slower performance, and your queries are coming back slower and slower. What happened?

Imagine a contrived scenario where we have a partition key with the values A, B, and C, a 9-node cluster, and a replication factor of 3. When I send in a query that looks like SELECT * FROM my_keyspace.my_table WHERE id IN ('A', 'B', 'C'), the coordinator has to do something like this:

[Diagram: in query. The coordinator fans the request out to the replicas for each of the three partitions and gathers every response itself.]
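To make the bookkeeping concrete, here's a toy model in plain Java of what the coordinator is on the hook for. This is a sketch of the idea, not actual Cassandra internals; queryReplica stands in for the internal read path to each partition's replicas.

// Toy model of scatter-gather for IN ('A','B','C').
// Needs java.util.* and java.util.function.Function.
<R> List<R> coordinate(List<String> keys, Function<String, List<R>> queryReplica) {
    List<R> buffered = new ArrayList<>();
    for (String key : keys) {
        // every partial result sits on the coordinator's heap
        buffered.addAll(queryReplica.apply(key));
    }
    return buffered; // the client sees nothing until the slowest replica answers
}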

In practical terms this means you're waiting on this single coordinator node to give you a response, it's keeping all of those queries and their responses on its heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing.

[Diagram: failed in query. One failed sub-query, or a failed coordinator, forces the client to retry the entire request.]

In other words, a dead coordinator node means the entire query result is gone. Instead of approaching this like a relational database, let's embrace the distributed nature of Cassandra and send queries directly to the nodes that can answer them fastest.

In Java

// Uses the DataStax Java driver (com.datastax.driver.core.*): Session,
// PreparedStatement, ResultSetFuture, ResultSet, Row.
PreparedStatement statement = session.prepare(
    "SELECT * FROM tester.users where id = ?");

// Fire off one small single-partition query per id without blocking
List<ResultSetFuture> futures = new ArrayList<>();
for (int i = 1; i < 4; i++) {
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(i));
    futures.add(resultSetFuture);
}

// Gather the results; each future completes independently
List<String> results = new ArrayList<>();
for (ResultSetFuture future : futures) {
    ResultSet rows = future.getUninterruptibly();
    Row row = rows.one();
    results.add(row.getString("name"));
}
return results;
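One note on how this "sends queries directly to the nodes": with token-aware routing enabled in the driver (later versions of the DataStax drivers default to a token-aware load balancing policy), each of these single-partition statements can be routed straight to a replica that owns that partition, so the replica itself acts as coordinator and there's no extra hop.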

In C#

// Uses the DataStax C# driver; also needs System.Linq,
// System.Collections.Concurrent, and System.Threading.Tasks.
PreparedStatement statement = session.Prepare("SELECT * FROM tester.users where id = ?");
var names = new ConcurrentBag<String>(); // continuations can run concurrently
var tasks = new List<Task>();
for (int i = 1; i < 4; i++)
{
    // Fire off each single-partition query without blocking
    var task = session.ExecuteAsync(statement.Bind(i))
                      .ContinueWith(t =>
                      {
                          var result = t.Result.First();
                          names.Add(result.GetValue<String>("name"));
                      });
    tasks.Add(task);
}
// Wait on the continuations (not the raw futures) so every name is in the bag
Task.WaitAll(tasks.ToArray());
return names.ToList();

Now a retry requires only one small, fast query, and you've eliminated the single point of failure.
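For example, here's a minimal sketch of what that per-partition retry could look like, reusing the prepared statement from the Java example above (the retry loop and attempt cap are illustrative, not something the post or driver prescribes):

// Retry a single partition's query; every other partition's
// result is already in hand, so only this id is re-executed.
ResultSet fetchWithRetry(Session session, PreparedStatement stmt, int id, int maxAttempts) {
    for (int attempt = 1; ; attempt++) {
        try {
            // a new execution is free to pick a different coordinator
            return session.execute(stmt.bind(id));
        } catch (RuntimeException e) {
            if (attempt >= maxAttempts) throw e;
        }
    }
}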


With separate queries you get no single point of failure, faster reads, less pressure on the coordinator node, and better performance semantics when you have nodes failing. It truly embraces the distributed nature of Cassandra.

The “in” keyword has its place, such as when querying INSIDE of a partition, but by and large it's something I wish weren't doable across partitions. I've fixed a good dozen performance problems with it so far, and I've yet to see it be faster than separate queries plus async.
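For instance, an “in” on a clustering column stays inside one partition and is served by a single replica set (user_events, user_id, and event_type here are a made-up schema for illustration):

SELECT * FROM my_keyspace.user_events where user_id = 1 and event_type in ('login', 'logout')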

A note about distributed thinking

This post and my “no batch” blog post really drive at a bigger discussion about distributed thinking. Most of the things that don't work as well in a distributed database as people think they should (bulk loading via batch, “in” queries, and “rollbacks”) are leftover vestiges of single-machine thinking.

This highly consistent single-machine world is easy to reason about, but it doesn't scale easily and it has single points of failure. When you do make the tradeoffs needed to scale, you find that features like “in” queries don't scale unless the data happens to all be on the same machine (just as in Cassandra, where “in” is fine inside a single partition). You'll find that when you try to scale highly consistent single-machine technologies via sharding, you run into the same problem set we have with distributed databases, only without the appropriate tools.

So embrace continuous availability, multiple replicas, and leave behind yesterday’s approaches.

About Ryan Svihla

I consider myself a full stack polyglot, and I have been writing a lot of JS and Ruby as of late. Currently, I'm a solutions architect at DataStax.
  • John Douglass

    Great article! This concept is counterintuitive for old SQL hands like me. Who would think having a lot of statements returning one row each would be better than one statement returning a lot of rows?

    In my tests on a 14-node cluster, doing selects with 10,000 UUIDs in an IN clause tended to have throughput of about 1 record/msec, while launching a bunch of executeAsyncs with equals had performance of 40-60 records/msec!

    Do you have a feel for a good number of maximum asyncs to allow at once without the driver having problems? I tried submitting ~400,000 queries while limiting simultaneous asyncs to 10,000 and had no trouble. I’m wondering whether I should cap the number of requests or just let them fly.

    • rssvihla

      John,

      Most of this is tuning for your cluster, but the general rule of thumb is the larger your cluster, the more you can have in flight. I'd say if you're doing 400k queries you may be doing something wrong in your data model, or there may be a "better way".
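      [For readers wondering how to cap in-flight queries as John describes, here's one minimal sketch using a Semaphore, reusing the tester.users statement from the post. The cap of 10,000 just mirrors John's number, not a recommendation, and "ids" stands for whatever collection of partition keys you're fetching.]

      // Needs java.util.concurrent.Semaphore and Guava's MoreExecutors
      // (bundled with the DataStax Java driver).
      Semaphore inFlight = new Semaphore(10000);
      PreparedStatement stmt = session.prepare("SELECT * FROM tester.users where id = ?");
      List<ResultSetFuture> futures = new ArrayList<>();
      for (int id : ids) {
          inFlight.acquireUninterruptibly(); // blocks once 10,000 queries are outstanding
          ResultSetFuture future = session.executeAsync(stmt.bind(id));
          future.addListener(inFlight::release, MoreExecutors.directExecutor());
          futures.add(future);
      }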

  • Michael Oczkowski

    Is there a reason that CQL doesn't just compile it this way underneath when the IN clause is across partitions?

    • rssvihla

      CQL happens on the server and not in the drivers, and there are several drivers to maintain, so it's not the correct layer.
