Reducing NServiceBus Saga load

When you run into concurrency issues with NServiceBus sagas, you’re generally presented with two options:

  • Relax the transaction isolation level
  • Reduce worker thread count, forcing serialized processing of messages

Neither is a great solution, because neither actually tries to solve the underlying problem: concurrent access to our shared resource (the saga entity). The process manager pattern can be quite powerful for solving asynchronous workflow problems, but it does come with a cost: shared state.

Suppose we had a process that received a batch of operations to perform, and needed to notify a third party when the batch of operations is done. It looks like we need something to keep track of what’s “done” or not, and something to perform the work. Keeping track of work to be done sounds like a good fit for a saga, so our first attempt might look something like this:

image

Our process will be:

  1. Send message to start batch saga
  2. Send messages to workers for each item of work to be done
  3. Listen for work done messages, check if all work is done
  4. If all work is done, send batch done message

The problem with this approach is that we’re creating a shared resource for our work to be done, even if we do something completely naïve for tracking work:

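// Sketch only: saga data and message handler are collapsed into one class for brevity.
public class BatchSaga : IContainSagaData {
  public int TotalWork { get; set; }
  public int WorkCompleted { get; set; }

  // Invoked once per "work done" message; every call updates the same saga entity.
  public void Handle(WorkCompleted message) {
    WorkCompleted++;

    if (WorkCompleted == TotalWork) {
      Bus.Send(new BatchDone());
      MarkAsComplete();
    }
  }
}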

Even if we’re only tracking the count of work completed (or decrementing a counter, doesn’t matter), the problem is that only one “work done” message can be processed at a time. Our actual work might be isolated, letting us scale out our workers to N nodes, but the notification of done still has to get back into a single file line for our saga. Even if we up the worker count on the saga side, modifications to our saga entity must be serialized, done only one at a time. Upping the number of workers on the saga side is only going to lead to concurrency violations, exceptions, and an overall much slower process.
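To make the contention concrete, here is a minimal simulation in plain C#, with no NServiceBus involved. It assumes the saga entity is protected by optimistic concurrency: a write is rejected if someone else wrote since we read. `VersionedCounter` and `SagaContentionDemo` are hypothetical names invented for this sketch, and a real persister will behave differently in the details.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a persisted saga entity guarded by optimistic
// concurrency: a write succeeds only if nobody else wrote since we read.
public sealed class VersionedCounter
{
    private readonly object _gate = new object();
    private int _version;
    private int _workCompleted;

    public (int WorkCompleted, int Version) Read()
    {
        lock (_gate) { return (_workCompleted, _version); }
    }

    // Returns false (a "concurrency violation") when the expected version is stale.
    public bool TryWrite(int workCompleted, int expectedVersion)
    {
        lock (_gate)
        {
            if (_version != expectedVersion) return false;
            _workCompleted = workCompleted;
            _version++;
            return true;
        }
    }
}

public static class SagaContentionDemo
{
    // Handles `messages` "work done" messages on up to `handlerThreads` threads.
    // Returns the final count and the number of rejected writes that were retried.
    public static (int Completed, int Conflicts) Run(int messages, int handlerThreads)
    {
        var saga = new VersionedCounter();
        int conflicts = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = handlerThreads };

        Parallel.For(0, messages, options, _ =>
        {
            while (true)
            {
                var snapshot = saga.Read();
                if (saga.TryWrite(snapshot.WorkCompleted + 1, snapshot.Version)) break;
                Interlocked.Increment(ref conflicts); // stale read: retry the message
            }
        });

        return (saga.Read().WorkCompleted, conflicts);
    }
}
```

With one handler thread there is nothing to conflict with. With more threads the final count is still correct, but only because rejected writes are retried, and every retry is wasted work on top of updates that were serialized anyway.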

Reduction through elimination

I picture this as a manufacturing facility supervisor. A batch of work comes in, and the supervisor hands out work to workers. Can you imagine if after each item was completed, the worker sends an email to the supervisor, with their checklist, to notify they were done? The supervisor would quickly become overwhelmed by the sheer volume of email, to be processed one-by-one.

We need to eliminate our bottleneck in the supervisor by separating out responsibilities. Currently, our supervisor/saga has two responsibilities:

  1. Keep track of work done
  2. Check if work complete

But doesn’t a worker already know if work is done or not? Why does the worker need to notify the supervisor that work is done? What if this responsibility was the worker’s job?

Let’s see if we can modify our saga to be a little bit more reasonable. What if we were able to easily update each item of work individually, separate from the others? I imagine in my head a tally sheet, where each worker can go up to a big whiteboard and check their item off the list. No worker interferes with each other, as they’re only concerned about their own little box. The saga is responsible for creating the initial board of work, but workers can update themselves.
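As a rough sketch of the whiteboard idea, the tally sheet can be modeled as one independent slot per work item. This is an in-memory stand-in built on `ConcurrentDictionary`; in a real system it would more likely be something like a database table with one row per item, so that each worker only touches its own row. The `TallySheet` type and its members are hypothetical names for illustration.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for the "whiteboard": one entry per work item.
public sealed class TallySheet
{
    private readonly ConcurrentDictionary<int, bool> _done;

    // The saga creates the board up front, with every item unchecked.
    public TallySheet(IEnumerable<int> workItemIds)
    {
        _done = new ConcurrentDictionary<int, bool>(
            workItemIds.Select(id => new KeyValuePair<int, bool>(id, false)));
    }

    // Called by a worker for its own item only. Returns false if the item is
    // unknown or was already checked off, so a redelivered message is harmless.
    public bool MarkDone(int workItemId)
    {
        return _done.TryUpdate(workItemId, true, false);
    }

    // Called by the saga. Reads the board and never writes to it.
    public bool IsBatchDone()
    {
        return _done.Values.All(done => done);
    }
}
```

The point of the design is that `MarkDone` for item 1 and `MarkDone` for item 2 never compete for the same piece of state, while the saga side is reduced to a read.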

At this point, our saga starts to look like:

image

Our saga now only checks the sheet, and checking doesn’t block a worker that is updating it. The saga only reads; it no longer writes. In this picture, though, we still get a notification from every single worker, and those notifications still have to go through a single queue. We can modify our saga slightly: instead of getting a notification from every worker, we register a timeout message. Does the “batch done” message need to go out immediately after the last worker is done, or can it go out some time later? If we only need to notify that the batch is done, and not the instant it happens, we can use timeouts instead and simply poll every so often to check for done.

With timeouts, we’re greatly reducing network traffic and, potentially, reducing the time between when workers are actually done and when we notify that we’re done. Suppose we have 100K items to send to our workers. That means we’ll have 100K “Work Done” messages needing to be processed by our saga. How long will that take to process? Instead, a timeout message can just periodically check done-ness:

image

We can even relax our constraints, and allow dirty reads on checking the work. This is now possible since recording the work and checking the work are two different steps. We’ve also greatly reduced our network load, and provided predictability into our SLA for notifying on work done.
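The polling loop itself is small. The sketch below is plain C# rather than NServiceBus code: `Task.Delay` stands in for requesting another saga timeout, and `BatchPoller`, `isBatchDone` and `notifyBatchDone` are hypothetical names. The done check is assumed to be a read-only query against the tally sheet.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BatchPoller
{
    // isBatchDone:     read-only check against the tally sheet (dirty reads are fine).
    // notifyBatchDone: sends the "batch done" message, exactly once.
    // Returns how many checks were needed before the batch was seen as done.
    public static async Task<int> PollUntilDoneAsync(
        Func<bool> isBatchDone,
        Action notifyBatchDone,
        TimeSpan interval,
        CancellationToken cancellationToken = default)
    {
        int checks = 0;
        while (true)
        {
            checks++;
            if (isBatchDone())
            {
                notifyBatchDone();
                return checks;
            }

            // Not done yet: the equivalent of registering another timeout.
            await Task.Delay(interval, cancellationToken);
        }
    }
}
```

The number of messages the saga handles now depends on how long the batch runs and how often we poll, not on how many items are in it. As a purely illustrative example, a batch that takes ten minutes and is polled every 30 seconds costs the saga about 20 timeout messages instead of 100K “Work Done” messages, and the notification goes out at most one polling interval after the last worker finishes.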

Reducing load

To reduce load in our saga, we needed to clearly delineate the responsibilities. It’s easy to build a chatty system and not see the pain when you have small datasets or no network load. Once we start imagining how the real world tackles problems like these, the realities of network computing become much more obvious, and a clear solution presents itself. In this case, a supervisor receiving notifications for everything and keeping track of a giant list just wouldn’t scale.

By going with a less intensive option and trading off immediate notification for predictability, we’ve actually increased the accuracy of our system. It’s important to keep in mind the nature of queues as FIFO systems with limited concurrency and of sagas as a shared resource, and what this implies for the workflows and business processes you model.


About Jimmy Bogard

I'm a technical architect with Headspring in Austin, TX. I focus on DDD, distributed systems, and any other acronym-centric design/architecture/methodology. I created AutoMapper and am a co-author of the ASP.NET MVC in Action books.
  • Charlie Barker

    I have not yet found a satisfactory approach to batch processing using NServiceBus, my feeling is square peg / round hole. We try and avoid batches where possible or avoid using NSB for processing them.

    • jbogard

      So what does the square hole look like? We had used cron jobs with manual processing, but then had to add on staging tables, statuses, retry flags, error flags (for dealing with poison messages) and so on. Eventually we got to a DB-backed queue. The *completely* async nature is what is annoying, though. I’m interested in hearing what others are doing, though.

      • Charlie Barker

        We have a fairly low-tech solution: SQL Server Jobs that populate a batch table and a process that polls the batch table looking for messages to dispatch. It’s not fancy but it does the job. We handle batches of up to 100k messages this way. I’m pretty sure this would not scale; the overhead of messaging would start to be a factor if your processing window was limited and you had millions of messages to process.

  • John Oswalt

    Great post! I have just recently implemented a solution for a batch job just like your last option. I have my process broken out into multiple projects. I have the first project watching for batch files to process to create my batch update sheet. When a file comes into the system, the process stores the batch information in a SQL backend to be used as a batch job data sheet. The process then publishes a BatchStarted event that triggers the starting of my saga.

    The saga project then reads from the data sheet and pushes out the individual worker commands to the worker project to process the batch information for each row in the batch. I then set a Timeout in the saga to check if the data sheet is complete.

    The worker project then handles the command and processes the information from the batch row into our back end systems. It then updates the data sheet to state the row has been processed.

    Meanwhile, the saga project is waking up from a Timeout. It checks to see if all of the rows in the batch have been processed. If not, the saga starts another Timeout. If all rows have been processed, the saga project marks the saga as complete.

    This pattern that Jimmy described in this article appears to be working well for me. I also tested scaling out my worker project. I had 6 workers processing at the same time. I didn’t have any concurrency issues with the saga. I was able to process over 100K rows in a batch much faster because I was able to scale out the process.

    • jbogard

      Well now you know where I got the inspiration for this post ;)

  • jrv

    Would it make sense to have a shared “counter” that each worker is responsible for updating? Rather than having a (mostly sleeping) supervisor check for overall completion, have the last finisher notify the appropriate party that the batch is complete.

    • jbogard

      I think the problem with a single shared counter is you then introduce a bottleneck in updating that counter. Unless of course your counter is a CRDT (PN counter, for example)

  • http://www.neilbarnwell.co.uk Neil Barnwell

    For the avoidance of doubt, are you saying in your solution that the “work sheet” is no longer the state inside the saga, but is something else (that something else being left up to the reader)? I.e. perhaps literally just a document? That would still suggest some blocking. Would each work *item* be a separate document (with some common key/correlation ID linking them together) and the saga queries for all documents and publishes its completion only if they are all marked as completed?

    • jbogard

      Nah, I was thinking of something like a DB table, where you only lock one row, or a tally list.

      • http://www.neilbarnwell.co.uk Neil Barnwell

        Gotcha. I was thinking of a raven document because I’m thinking there’ll be a distributed transaction around the queues and raven storage already.