Bulk processing with NHibernate


On a recent project, much of the application integration was done through bulk batch processing: lots of FTP shuffling, moving files around, and processing large CSV or XML files.  Everything worked great with our normal NHibernate usage, until recently when we had to process historical transactional data.

The basic problem was that we had to calculate historical customer order totals.  Normally, this would be rather easy to do with a SQL-level bulk update.  However, in our case, we had to process the items one by one, as each transaction could potentially trigger a “reward earned” event.  And naturally, the rules for this trigger are far too complex to attempt in straight T-SQL.

This meant that we still had to run our historical feed through the domain model.

A normal daily drop of customer orders was processed fairly easily.  However, historical data over just the past year totaled close to 5 million transactions, each transaction being a single line in a CSV file.

Since a normal daily file took about 2 hours to process, we simply could not wait the projected time to push all of these transactions through.  However, with some handy tips from the twitterverse, we were able to speed things up considerably.

Bulk import and export

The first thing we found is that if you have to do bulk, set-based processing, it's important to determine whether the work is a straight bulk import/export or a true bulk process.  Bulk import is extremely quick with tools like SQL Bulk Copy, so if you need to do a bulk import or export, use the right tool.  A bulk import of these rows into a single table takes about 2.5 minutes, versus days and days one transaction at a time.
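For the import side, here's a minimal sketch using ADO.NET's SqlBulkCopy (the staging table name and the DataTable construction are assumptions for illustration, not our actual code):

// Streams rows into a single table in one bulk operation.
// Requires System.Data and System.Data.SqlClient.
public static void BulkImport(string connectionString, DataTable rows)
{
    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "dbo.OrderImport"; // hypothetical staging table
        bulkCopy.BatchSize = 10000;   // rows sent per roundtrip
        bulkCopy.WriteToServer(rows); // the whole DataTable goes in at once
    }
}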

In another process we needed to bulk load customer data.  The incoming data matched our existing customer table fairly closely, so we crafted a process that basically followed these steps (sketched in code after the list):

  • Create table to match shape of CSV file
  • Load CSV into table
  • Issue single UPDATE statement to target table with the WHERE clause joining to the CSV-imported table
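The final step is a single set-based statement; here's a rough sketch through plain ADO.NET, with made-up table and column names:

// One UPDATE joined to the staging table replaces millions of row-by-row writes.
// Requires System.Data.SqlClient.
using (var connection = new SqlConnection(connectionString))
using (var command = connection.CreateCommand())
{
    command.CommandText =
        @"UPDATE c
          SET    c.Name  = i.Name,
                 c.Email = i.Email
          FROM   dbo.Customer c
          JOIN   dbo.CustomerImport i ON i.CustomerNumber = c.CustomerNumber";
    connection.Open();
    command.ExecuteNonQuery(); // single set-based roundtrip
}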

In this manner we were able to import millions of customer records very quickly.  However, if the work isn't a straight bulk import or export, we have to go through other channels.


Optimizing NHibernate for bulk processing

It was a lot of work tinkering with several different ideas, but ultimately the churn was worth it.  Here are a few tips I picked up along the way:

Utilize the Unit of Work pattern and explicit transactions

When I first started, all processing was done with implicit transactions and copious use of the Flush() method on ISession.  This meant that every. single. save. was in its own transaction.  Given the number of roundtrips this entailed, our database was getting completely hammered.

Additionally, an ISession instance was disposed after every write to the database.  This meant that we could not take advantage of any of the first-level cache (aka identity map) support inherent in ISession.
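Reconstructed from memory rather than lifted from the codebase, the shape of the problem looked something like this:

// The anti-pattern: a throwaway session, an implicit transaction
// and a Flush() for every single row.
foreach (var line in csvLines)
{
    using (var session = sessionFactory.OpenSession())
    {
        var order = MapToOrder(line); // hypothetical CSV-to-entity mapping
        session.Save(order);
        session.Flush(); // one roundtrip and one implicit transaction per row
    }
}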

Instead, I switched the codebase to use an actual Unit of Work, where I created a class that controlled the begin, commit and rollback of a unit of work:

public interface IUnitOfWork : IDisposable
{
    ISession CurrentSession { get; }
    void Begin();
    void Commit();
    void RollBack();
    bool IsActive { get; }
}

Before, we really had no control or understanding of the lifecycle of the ISession object.  With this pattern, its lifecycle is tied to the IUnitOfWork, allowing me to take advantage of other NHibernate features.
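One way to implement it, sketched here against NHibernate's ISessionFactory and ITransaction (the real class had more to it, but this is the shape):

public class UnitOfWork : IUnitOfWork
{
    private readonly ISession _session;
    private ITransaction _transaction;

    public UnitOfWork(ISessionFactory sessionFactory)
    {
        // The session lives exactly as long as the unit of work
        _session = sessionFactory.OpenSession();
    }

    public ISession CurrentSession { get { return _session; } }

    public bool IsActive
    {
        get { return _transaction != null && _transaction.IsActive; }
    }

    public void Begin()
    {
        _transaction = _session.BeginTransaction();
    }

    public void Commit()
    {
        _transaction.Commit();
    }

    public void RollBack()
    {
        _transaction.Rollback();
    }

    public void Dispose()
    {
        if (_transaction != null)
            _transaction.Dispose();
        _session.Dispose();
    }
}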

Use MultiCriteria/MultiQuery to batch SELECTs

MultiCriteria and MultiQuery allow you to send multiple SELECTs down the wire in a single roundtrip.  For each row in our file, we had to issue a SELECT, as processing a single transaction meant potentially touching the customer record as well as any previous order transaction records.  Doing this one SELECT at a time can be quite chatty, so I batched several together using MultiCriteria.

Just going from 1 at a time to 10 at a time, while insignificant for <100 records, can really add up once you get into the millions.
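As a rough sketch (the Customer entity and its "Id" property are illustrative), batching ten lookups into one roundtrip with MultiCriteria looks something like this:

// Requires NHibernate.Criterion (and System.Collections for IList).
var multiCriteria = session.CreateMultiCriteria();
foreach (var customerId in customerIds) // e.g. the 10 ids in the current batch
{
    multiCriteria.Add(DetachedCriteria.For<Customer>()
        .Add(Restrictions.Eq("Id", customerId)));
}
IList results = multiCriteria.List(); // one roundtrip for all ten SELECTs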

Use statement batching

In addition to SELECTs, we can also batch together INSERTs and UPDATEs.  In our case, we parameterized the processing of the file to a certain batch size (say, 250), then enabled NHibernate's statement batching in the hibernate.cfg.xml file:

<property name="adonet.batch_size">250</property>

And now instead of one INSERT being sent down the line at a time, we send a whole messload at once.  Profiling showed us that statement batching alone dropped the time by 50%.

NHibernate is very, very smart about knowing when and in what order to save things, so as long as the items are persistent in the session, we only really need to commit the transaction for the bulk processing to go through.
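With batching enabled, the processing loop just saves and commits; a sketch (MapToOrder and the collection names are assumptions):

unitOfWork.Begin();
foreach (var line in batchOfLines) // e.g. 250 parsed CSV lines
{
    var order = MapToOrder(line);          // hypothetical mapping helper
    unitOfWork.CurrentSession.Save(order); // queued in the session, not sent yet
}
unitOfWork.Commit(); // flush happens here; the INSERTs go out in batches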

Process bulk updates in batches

Finally, once we had a proper Unit of Work implementation in place, we could process the giant file as if it were many smaller files.  We split the incoming work into batches of 250, then created a Unit of Work per batch.  This meant that an entire set of 250 was processed in a single transaction, instead of 5 million individual transactions.

Without a proper Unit of Work in place, we would not be able to do this.
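Putting it together, the outer loop looked roughly like this (Batch() is a hypothetical helper that splits the file into chunks of 250):

foreach (var batch in csvLines.Batch(250))
{
    using (var unitOfWork = new UnitOfWork(sessionFactory))
    {
        unitOfWork.Begin();
        foreach (var line in batch)
            ProcessLine(unitOfWork.CurrentSession, line); // domain logic per row
        unitOfWork.Commit(); // 250 rows per transaction instead of 1
    }
}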

Profiling is your friend

Finally, we needed a way to test our processing and the resulting speed improvements.  A simple automated test with stopwatches in place let us tinker with the internals and observe the results.  With tools like NHProf, we could also observe what extra fetching we might need to do.  Its suggestions also keyed us in to the various improvements we added along the way.
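The harness itself can be as simple as this sketch (ProcessFile stands in for the real entry point, and the file name is made up):

// A bare-bones timing test using System.Diagnostics.Stopwatch.
var stopwatch = Stopwatch.StartNew();
ProcessFile("historical-orders.csv");
stopwatch.Stop();
Console.WriteLine("Processed in {0}", stopwatch.Elapsed);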

Wrapping it up

Bottom line: if you can reduce the operation to a bulk import or export, the SQL tools will be orders of magnitude faster than processing one transaction at a time.  But if you're forced to go through your domain model and NHibernate, just be aware that your normal NHibernate usage will not scale.  Instead, you'll need to lean on some of the built-in features you don't normally use to squeeze out as much performance as you can.
