Cassandra: Batch loading without the Batch keyword

ATTENTION:

This post is intentionally simplistic to help explain the tradeoffs that need to be made. If you are looking for some production-level nuance, go read this afterwards.

Batches in Cassandra are often mistaken for a performance optimization. They can be, but only in rare cases. First we need to discuss the different types of batches:

Unlogged Batch

A good example of an unlogged batch follows; it assumes a partition key of date. The batch below is effectively one insert because every statement shares the same partition key: it resolves to a single write internally, no matter how many statements it contains, as long as they all have the same date value. This is the primary use case of an unlogged batch:

BEGIN UNLOGGED BATCH
INSERT INTO weather_readings (date, timestamp, temp) VALUES (20140822, '2014-08-22T11:00:00.00+0000', 98.2);
INSERT INTO weather_readings (date, timestamp, temp) VALUES (20140822, '2014-08-22T11:00:15.00+0000', 99.2);
APPLY BATCH;

A common anti-pattern I see is:

-- NEVER EVER EVER DO
BEGIN UNLOGGED BATCH
INSERT INTO tester.users (userID, firstName, lastName) VALUES (1, 'Jim', 'James');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (2, 'Ernie', 'Orosco');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (3, 'Jose', 'Garza');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (4, 'Sammy', 'Mason');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (5, 'Larry', 'Bird');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (6, 'Jim', 'Smith');
APPLY BATCH;

Unlogged batches like this one require the coordinator to do all the work of managing the inserts, which makes a single node do more work. Worse, if the partition keys are owned by other nodes, the coordinator has an extra network hop to manage as well. The data is not delivered by the most efficient path.
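
One way to stay clear of the anti-pattern is to group statements by partition key before any batch is built, so that no unlogged batch ever spans partitions. The following is only a minimal sketch of that grouping step in Python; the function name `group_by_partition` and the dict-shaped rows are assumptions made for illustration and are not part of any driver API.

```python
from collections import defaultdict


def group_by_partition(rows, partition_key):
    """Group rows (plain dicts) by the value of their partition key column.

    Hypothetical helper for illustration: each resulting group touches a
    single partition, so it is the only kind of group that is a candidate
    for one unlogged batch.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)
    return dict(groups)


readings = [
    {"date": 20140822, "timestamp": "2014-08-22T11:00:00.00+0000", "temp": 98.2},
    {"date": 20140823, "timestamp": "2014-08-23T11:00:00.00+0000", "temp": 97.0},
    {"date": 20140822, "timestamp": "2014-08-22T11:00:15.00+0000", "temp": 99.2},
]
by_date = group_by_partition(readings, "date")
```

Here the two readings for 20140822 land in one group and could share an unlogged batch, while the single reading for 20140823 is better sent as a plain insert.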

Logged Batch (aka atomic)

A good example of a logged batch looks something like:

BEGIN BATCH
UPDATE users SET name='Jim' WHERE id = 1;
UPDATE users_by_ssn SET name='Jim' WHERE ssn='888-99-9999';
APPLY BATCH;

This keeps the tables in sync, but at the cost of performance. A common anti-pattern I see is:

-- NEVER EVER EVER DO
BEGIN BATCH
INSERT INTO tester.users (userID, firstName, lastName) VALUES (1, 'Jim', 'James');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (2, 'Ernie', 'Orosco');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (3, 'Jose', 'Garza');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (4, 'Sammy', 'Mason');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (5, 'Larry', 'Bird');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (6, 'Jim', 'Smith');
APPLY BATCH;

This ends up being expensive for the following reasons. Logged batches add a fair amount of work to the coordinator, although they play an important role in maintaining consistency between tables. When a batch is sent to a coordinator node, two other nodes are sent batch logs, so that if the coordinator fails, the batch will be retried by both nodes.

This obviously puts a fair amount of work on the coordinator node and on the cluster as a whole. Therefore the primary use case of a logged batch is keeping tables in sync with one another, and NOT performance.

Fastest option no batch

I’m sure about now you’re wondering what the fastest way to load data is. Allow the distributed nature of Cassandra to work for you and distribute the writes to their optimal destinations: send each write as its own asynchronous statement. This leads not only to the fastest loads (assuming different partitions are being updated), but also to the least load on the cluster. If you add retry logic, you’ll only retry the one mutation that failed while the rest are fired off.

For code samples go read the article I mentioned above.
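
In the meantime, here is a minimal sketch of the idea in Python, under stated assumptions. `write_row` is a hypothetical callable standing in for whatever executes a single INSERT and raises on failure, and the thread pool is only a stand-in for a driver's asynchronous execute call; neither is taken from a real driver API. What it illustrates is that every row is sent on its own and that a failure retries just that one row.

```python
from concurrent.futures import ThreadPoolExecutor


def load_rows(rows, write_row, max_retries=3, max_workers=8):
    """Send every row as its own write and return the rows that never succeeded.

    write_row is a hypothetical stand-in for a single-statement write;
    it must raise an exception when the write fails.
    """

    def write_with_retry(row):
        # Retry only this row; the other rows keep going independently.
        for attempt in range(1, max_retries + 1):
            try:
                write_row(row)
                return
            except Exception:
                if attempt == max_retries:
                    raise

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Fire off all writes first, then collect the outcomes in order.
        pending = [(row, pool.submit(write_with_retry, row)) for row in rows]
        return [row for row, future in pending if future.exception() is not None]
```

A call such as `load_rows(rows, write_row)` returns an empty list when every write eventually succeeds. Rows that still fail after `max_retries` attempts come back to the caller, who can decide whether to log them or try again later.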

About Ryan Svihla

I consider myself a full-stack polyglot, and I have been writing a lot of JS and Ruby as of late. Currently, I'm a solutions architect at DataStax.
  • Thanks for the article, Ryan. It’s very interesting and has got me rethinking how we do writes in our system. I’m a bit unclear, though, on how the ‘Fastest option no batch’ method is optimal. By default, the RoundRobinPolicy is used, which will still send every other statement to the wrong node. Would a better strategy be to explicitly group statements with the same partition into unlogged batches, and then to use the TokenAwarePolicy while setting the routing key of each batch? This would have the benefit of larger writes and a reduced number of Tasks, and it would send each batch to the correct node. A very interesting and complex topic with, unfortunately, not a lot of documentation. Thanks.

    • rssvihla

      TokenAwarePolicy has been the default for a while and was a “best practice” before that (don’t get me started on why it wasn’t always the default). This blog post is in no small part predicated on TokenAwarePolicy being enabled.

      As far as ‘optimal’ goes: since 1.2, Batch has been a logged batch by default, so it’s been expensive as a bulk loading process. I’m not sure how it’d be more optimal unless you’re referring to unlogged batch.

      Furthermore, a batch operation at the coordinator level will almost certainly require extra network hops while waiting for responses, even with token awareness, since that coordinator is unlikely to own all token ranges in the batch (unless you have the rare situation where nodes = RF and you are using CL ONE).

      Other than the case I mentioned in the article, namely an unlogged batch with all statements applied to the same partition, it’s not hard to reason why the no-batch option is optimal in most cases, and this becomes truer and truer as you add nodes.

      • Hmm, strange. I just spent some time looking through the code and in the latest C# driver, the DefaultLoadBalancingPolicy is clearly set to RoundRobinPolicy. Also, setting breakpoints inside TokenAwarePolicy show that it’s never touched unless explicitly set when building the Cluster. I then had a look at the latest Java driver, and sure enough the default policy is a TokenAwarePolicy wrapping a DCAwareRoundRobinPolicy. Funny that they’re different.

        What I had referred to in my previous comment was having unlogged batches, with each batch consisting of statements belonging to just a single partition. I guess the issue on my part was exactly how they get routed correctly.

        Thanks.

        • rssvihla

          Actually, that sounds like a bug or an unintended consequence. Submitting to Jira now. In the meantime, _please_ use TokenAware.

          As for the unlogged batches, I stated in my post that they’re a performance boost in the scenario you’ve described: if the statements all go to the same partition, the batch is effectively 1 mutation. You’ll still want to send those batches in an executeasync fashion and not have unlogged batches that span partitions.

        • rssvihla

          Submitted issue https://datastax-oss.atlassian.net/browse/CSHARP-188. Thanks for catching that; it's pretty problematic for a default setting.

