Cassandra: Batch loading without the Batch keyword

Batches in Cassandra are often mistaken for a performance optimization. They can be one, but only in rare cases. First we need to discuss the different types of batches:

Unlogged Batch

Unlogged batches pay off only when every statement targets the same partition. Assuming a partition key of date, the following batch is effectively one insert: no matter how many statements it contains, it resolves to a single write internally as long as they all share the same date value. This is the primary use case of an unlogged batch:

BEGIN UNLOGGED BATCH;
INSERT INTO weather_readings (date, timestamp, temp) values (20140822,'2014-08-22T11:00:00.00+0000', 98.2); 
INSERT INTO weather_readings (date, timestamp, temp) values (20140822,'2014-08-22T11:00:15.00+0000', 99.2); 
APPLY BATCH;

A common anti-pattern I see is:

-- NEVER EVER EVER DO THIS
BEGIN UNLOGGED BATCH;
INSERT INTO tester.users (userID, firstName, lastName) VALUES (1, 'Jim', 'James');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (2, 'Ernie', 'Orosco');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (3, 'Jose', 'Garza');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (4, 'Sammy', 'Mason');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (5, 'Larry', 'Bird');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (6, 'Jim', 'Smith');
APPLY BATCH;

Unlogged batches require the coordinator to do all the work of managing these inserts, concentrating load on a single node. Worse, if the partition keys are owned by other nodes, the coordinator has extra network hops to manage as well. The data is not delivered along the most efficient path.
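The practical takeaway is that statements belong in the same unlogged batch only when they share a partition key. A minimal sketch of that grouping step in plain Java (no driver calls; the Reading record and its fields are hypothetical stand-ins for the weather_readings rows above):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchGrouping {
    // Hypothetical row: date is the partition key, as in weather_readings.
    record Reading(int date, String timestamp, double temp) {}

    // Group rows by partition key; each resulting list is safe to send as one
    // unlogged batch, because it resolves to a single write on one replica set.
    static Map<Integer, List<Reading>> groupByPartition(List<Reading> rows) {
        Map<Integer, List<Reading>> byDate = new HashMap<>();
        for (Reading r : rows) {
            byDate.computeIfAbsent(r.date(), d -> new ArrayList<>()).add(r);
        }
        return byDate;
    }

    public static void main(String[] args) {
        List<Reading> rows = List.of(
            new Reading(20140822, "2014-08-22T11:00:00.00+0000", 98.2),
            new Reading(20140822, "2014-08-22T11:00:15.00+0000", 99.2),
            new Reading(20140823, "2014-08-23T11:00:00.00+0000", 97.1));
        Map<Integer, List<Reading>> grouped = groupByPartition(rows);
        System.out.println(grouped.get(20140822).size()); // two rows share a partition
        System.out.println(grouped.size());               // two distinct partitions
    }
}
```

Anything that lands in a different group belongs in its own statement (or its own batch), never lumped together the way the anti-pattern above does.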

Logged Batch (aka atomic)

A good example of a logged batch looks something like:

BEGIN BATCH;
UPDATE users SET name='Jim' where id =1;
UPDATE users_by_ssn set name='Jim' where ssn='888-99-9999';
APPLY BATCH;

This keeps tables in sync, but at the cost of performance. A common anti-pattern I see is:

-- NEVER EVER EVER DO THIS
BEGIN BATCH;
INSERT INTO tester.users (userID, firstName, lastName) VALUES (1, 'Jim', 'James');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (2, 'Ernie', 'Orosco');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (3, 'Jose', 'Garza');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (4, 'Sammy', 'Mason');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (5, 'Larry', 'Bird');
INSERT INTO tester.users (userID, firstName, lastName) VALUES (6, 'Jim', 'Smith');
APPLY BATCH;

This ends up being expensive for the following reasons. Logged batches add a fair amount of work to the coordinator, but they have an important role in maintaining consistency between tables. When a batch is sent to a coordinator node, a copy of the batch log is first written to two other nodes, so that if the coordinator fails the batch will be replayed by those nodes.

This obviously puts a fair amount of work on the coordinator node and the cluster as a whole. The primary use case of a logged batch is therefore keeping tables in sync with one another, NOT performance.

Fastest option: no batch

I’m sure by now you’re wondering what the fastest way to load data is: let the distributed nature of Cassandra work for you and route each write to its optimal destination. The following code gives not only the fastest loads (assuming different partitions are being updated), but also the least load on the cluster. If you add retry logic, you’ll only retry the one failed mutation while the rest are fired off independently.

In Java:

import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .build();
Session session = cluster.newSession();

//Save off the prepared statement you're going to use
PreparedStatement statement = session.prepare("INSERT INTO tester.users (userID, firstName, lastName) VALUES (?,?,?)");

List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
for (int i = 0; i < 1000; i++) {
    //please bind with whatever actually useful data you're importing
    BoundStatement bind = statement.bind(i, "John", "Tester");
    ResultSetFuture resultSetFuture = session.executeAsync(bind);
    futures.add(resultSetFuture);
}

//not returning anything useful, but makes sure everything has completed before you exit the thread
for (ResultSetFuture future : futures) {
    future.getUninterruptibly();
}
cluster.close();
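One caveat with this pattern: firing thousands of executeAsync calls with no back-pressure can exhaust client memory or overwhelm the cluster. A common remedy is to cap the number of in-flight requests with a semaphore. The sketch below is self-contained and simulates the write with a thread pool rather than a real session; runLoad, maxInFlight, and the simulated write are my own illustrative names, not part of the driver API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

public class ThrottledLoader {

    // Cap the number of outstanding async "writes"; the semaphore blocks the
    // submitting loop once maxInFlight tasks are in progress.
    static int runLoad(int totalWrites, int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < totalWrites; i++) {
            permits.acquireUninterruptibly(); // back-pressure point
            futures.add(pool.submit(() -> {
                // stand-in for the real session.executeAsync(bind) call
                permits.release();
            }));
        }
        int completed = 0;
        for (Future<?> f : futures) {
            try {
                f.get();
                completed++;
            } catch (InterruptedException | ExecutionException e) {
                // a real loader would retry just this one mutation here
            }
        }
        pool.shutdown();
        return completed;
    }

    public static void main(String[] args) {
        System.out.println(runLoad(1000, 128)); // prints 1000
    }
}
```

With a real session you would acquire a permit before each executeAsync and release it in a completion callback; the structure is the same.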

In C#:

var cluster = Cluster.Builder()
    .AddContactPoint("127.0.0.1")
    .Build();
var session = cluster.Connect();

//Save off the prepared statement you're going to use
var statement = session.Prepare("INSERT INTO tester.users (userID, firstName, lastName) VALUES (?,?,?)");

var tasks = new List<Task>();
for (int i = 0; i < 1000; i++)
{
    //please bind with whatever actually useful data you're importing
    var bind = statement.Bind(i, "John", "Tester");
    var resultSetFuture = session.ExecuteAsync(bind);
    tasks.Add(resultSetFuture);
}

Task.WaitAll(tasks.ToArray());
cluster.Shutdown();


About Ryan Svihla

I consider myself a full stack polyglot, and I have been writing a lot of JS and Ruby as of late. Currently, I'm a solutions architect at DataStax.