Sidekiq Iterable Jobs: With Great Power....


Jon Sully

@jon-sully

We’ve written about Sidekiq several times over the years — from planning your queue setup, to designing your architecture for the best possible scaling, to running jobs on repeat forever! That’s both because we’re fans of Sidekiq (thanks Mike!) and because we believe in its capabilities and potential. Sidekiq is an extremely powerful tool for Ruby applications.

So when Sidekiq announced that version 7.3 would include a new feature, “Iterable Jobs”, our interest was piqued! As a group of senior Rails developers with, collectively, forty+ years of experience, we already maintain some design ideas about how to break down large chunks of work with background jobs. Adding a new pattern for work-breakdown is interesting stuff!

We’d like to spend this article looking at this new feature, figuring out how it compares and contrasts to traditional computing designs, and maybe even give a few (early) recommendations about when to (and when not to!) use this new tool. Let’s dive in!

The Old Way: Parallel Division

For many years now we’ve built and suggested a design pattern we’ll describe here as “parallel division” — where we take a large heap of work and break it down to many ‘sub-jobs’ that can run in parallel. Rephrased, when we have a project or feature that needs to do some chunk of work on, or with, each record of a large set, we can break down that project to run an individual job for each record.

This is easier understood with an example and some tangible code. Let’s consider a library that wants to use an LLM to generate a textual summary of each book in their collection. In its simplest implementation, that job could look like this:

class GenerateBookSummariesJob < ApplicationJob # < ActiveJob
  def perform
    Book.all.each do |book|
      book.update! summary: FancyGpt.summarize_text(book.full_transcript)
    end
  end
end

And, while you ⚠️should not⚠️ call .each directly on your model classes (use .find_each instead!), this simple implementation would work and accomplish your project.
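For reference, that swap is a one-line change (same hypothetical FancyGpt service as above); find_each walks the table in batches of 1,000 records by default instead of loading every Book into memory at once:

class GenerateBookSummariesJob < ApplicationJob # < ActiveJob
  def perform
    # find_each loads books in batches rather than instantiating the whole table
    Book.find_each do |book|
      book.update! summary: FancyGpt.summarize_text(book.full_transcript)
    end
  end
end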

The issue here is that this single job is going to take a long time to run. Worse still, if it gets halted in the middle (be it an error, a dyno restart, or any number of other events), it’ll restart from the beginning when you rerun it. Ouch.

The better route here is to break down the total work of this project into sub-jobs that each do one little piece of that work. The simple way to think about this is to have each record in your set run its own job, just for that record. This gives you two separate jobs — a parent ‘Batch’ job and a child sub-job:

class GenerateBookSummariesBatchJob < ApplicationJob # < ActiveJob
  # Simply kicks off a child job for each id in the set
  def perform
    sub_jobs = Book.without_summaries.ids.map { |bid| GenerateBookSummaryJob.new(bid) }

    ActiveJob.perform_all_later(sub_jobs)
  end
end

#

class GenerateBookSummaryJob < ApplicationJob # < ActiveJob
  # Does the work required on each record individually
  def perform(book_id)
    book = Book.find(book_id)

    book.update! summary: FancyGpt.summarize_text(book.full_transcript)
  end
end

Aside from decomposing into two separate jobs, we made a few extra subtle tweaks here. First, we’re using ActiveJob’s new(ish) perform_all_later API to bulk-enqueue all of the sub-jobs at once, which speeds up the batch job. Second, we’re now using a scope, .without_summaries, to make our batch job reentrant: should we need to re-run it in the future, it won’t regenerate summaries for books that already have them.
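For clarity, .without_summaries isn’t an ActiveRecord built-in; it’s just a scope we’d define ourselves. A minimal sketch, assuming Book stores its summary in a nullable summary column:

class Book < ApplicationRecord
  # Books that haven't had a summary generated yet
  scope :without_summaries, -> { where(summary: nil) }
end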

But the major point here is the split of responsibilities and the flexibility we gain. The batch job’s entire role is to identify candidates and kick off a child-job for each — any books that don’t have a summary yet, in our case. The child-job’s role is simple too! It simply receives a book ID and generates a summary for that record.

The first advantage of this breakdown is that we’re completely impervious to dyno/system restarts. Should our systems crash, Heroku decide that right now is time for our daily restart, or we push a new deploy, our jobs are now ready for that Sidekiq shutdown-command. While the batch job should now run much faster (it’s doing far less work per iteration than before), it’s possible that the shutdown command may arrive while the batch job is still in progress. Since we added the aforementioned scope, this is fine. When Sidekiq boots back up, it’ll re-run the batch job and only grab records that don’t have a summary. This is far better than our initial implementation.

Things are even better for the sub-job when a Sidekiq shutdown/restart occurs. By default, Sidekiq waits up to 25 seconds for each thread to complete any work in-progress before outright killing it when shutting down. That’s good news for us since each sub-job is a very small chunk of work — an API call and a DB save! This essentially means that Sidekiq will have time to finish any in-progress sub-jobs when it shuts down and can continue chugging through the back-log of sub-jobs as soon as it boots back up!

Actually, the story gets even better when we start thinking about infrastructure and autoscaling with respect to these large batches of work. With a tool like Judoscale and our custom Sidekiq integration, we can automatically scale up to many more dynos/containers once we kick off a large number of jobs. This makes total sense visually, so allow me to steal an image from our Ultimate Guide to Autoscaling Heroku:

Background job dynos scaling up as more background jobs are kicked off

Put into context, that means our batch job can kick off a sub-job for each of the 150,000 books in our library collection, we can scale up to 100 dynos/containers to accommodate the massive backlog of jobs, and we can safely churn through all those jobs as fast as our LLM service’s API will let us (😉) without worrying about restarts or crashes. We can now execute the entire workload of our project in minutes instead of hours or days, as it would’ve been with the initial (single job) implementation!

Having a simplified mental model for the job responsibilities, a safer setup for restartability and idempotency, and an ultra-scalable job structure are the primary reasons we employ and encourage this pattern. It’s hard to beat!

So what’s the new pattern that Sidekiq 7.3 introduces?

The New Feature: Iterable Jobs

Mike Perham, the author of Sidekiq, announced iterable jobs with a dedicated post on his own blog, and he opened with a good question:

What happens if you have a job which processes a large amount of data serially, the infamous long-running job?

He then introduces the idea of Sidekiq iterable jobs: a setup where Sidekiq now understands that you have a large sequence of jobs to work through and Sidekiq itself maintains a cursor to keep track of what’s done and what isn’t as it breaks down the sequence into sub-jobs for you. That’s a mouthful! Let’s see the code samples:

class ProductImageChecker
  include Sidekiq::IterableJob

  def build_enumerator(*args, cursor:)
    active_record_records_enumerator(Product.all, cursor: cursor)
  end

  def each_iteration(item, *args)
    item.check_image
  end

  def on_complete
    logger.info { "Finished checking product images!" }
  end
end

Let’s look at each of these methods, but let’s go from the bottom up. At the end we see that we have support for lifecycle-style callback methods — on_complete is shown here, and on_start, on_resume, and on_stop are also supported! Seems like a handy feature! Our Parallel Division approach above doesn’t actually have the means to do this since the batch job doesn’t maintain any awareness of the sub-jobs once they’re fired off. Neat!
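As a quick sketch, those callbacks are just methods you define on the job class (the log messages here are our own, not Sidekiq defaults):

class ProductImageChecker
  include Sidekiq::IterableJob

  # Lifecycle callbacks are plain methods on the job class
  def on_start
    logger.info { "Starting the product image check" }
  end

  def on_resume
    logger.info { "Resuming after a shutdown or interruption" }
  end

  def on_stop
    logger.info { "Stopping; Sidekiq is pausing or shutting down" }
  end

  # ...on_complete, build_enumerator, and each_iteration as shown above
end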

👀 Note

Just FYI, you can get lifecycle-style callbacks with the batch decomposition / parallel division pattern if you use Sidekiq Pro and implement its Batch feature. All of the Pro features are great if you have the budget for it, and supporting the long history of Sidekiq’s open-source contributions is also awesome.
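For the curious, here’s a rough sketch of what that might look like with Sidekiq Pro’s Batch API, assuming GenerateBookSummaryJob is a native Sidekiq::Job (so it can be enqueued with perform_async) and using a callback class name of our own invention:

# Kick off the batch (e.g. from the parent/batch job)
batch = Sidekiq::Batch.new
batch.description = "Generate book summaries"
batch.on(:complete, SummariesBatchCallback)
batch.jobs do
  Book.without_summaries.ids.each { |bid| GenerateBookSummaryJob.perform_async(bid) }
end

# Hypothetical callback class; Sidekiq Pro calls this once every job in the batch has run
class SummariesBatchCallback
  def on_complete(status, _options)
    Rails.logger.info "Batch done: #{status.failures} failures out of #{status.total} jobs"
  end
end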

Alright, moving up the class we find each_iteration(item, *args). This will be where we write the code for the actual work that happens on/with each record in the set. If we continue to think about our library example, this would be where we execute item.update! summary: FancyGpt.summarize_text(item.full_transcript). The simplicity here is pretty nice. Having everything else hidden away behind the scenes makes this a friendly API!

Last, we have build_enumerator. Building custom enumerators isn’t exactly something we think of as a fun time, but we’re in luck: Sidekiq ships with several enumerator-helpers out of the box. We can see one in use here: active_record_records_enumerator. Really the only thing about this method that we’d change for our library example is the Product.all argument being passed into the enumerator helper. We’d simply change that to Book.without_summaries as so:

  def build_enumerator(*args, cursor:)
    active_record_records_enumerator(Book.without_summaries, cursor: cursor)
  end

So, even though everything around it looks a little complicated and unfamiliar, the only piece of this method that really matters to us is the first argument of the helper. That’s where we define the set of records that needs work done. Everything else is fine to copy and paste!

See the simplicity here? All we have to do is set up a Sidekiq::IterableJob, copy the build_enumerator method and change the first argument, then set up an each_iteration method where we tell it what to do for each record. With that we’re granted shutdown-safety, item-by-item processing, and less responsibility for maintaining custom code! Very cool. And it’s all in a single job, not two!
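Pulled together, our library example as a single iterable job might look something like this (same hypothetical FancyGpt service and without_summaries scope as before):

class GenerateBookSummariesJob
  include Sidekiq::IterableJob

  # Enumerate only the books still missing a summary; Sidekiq tracks the
  # cursor so a restart picks up wherever the last run left off
  def build_enumerator(*args, cursor:)
    active_record_records_enumerator(Book.without_summaries, cursor: cursor)
  end

  # The per-record work: one API call and one save, just like our sub-job above
  def each_iteration(book, *args)
    book.update! summary: FancyGpt.summarize_text(book.full_transcript)
  end

  def on_complete
    logger.info { "Finished generating book summaries!" }
  end
end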

An Example

There are a couple of key details with iterable jobs that we want to tease out here. There are definitely times when this new feature is the right answer for the work that needs to be done, and definitely times when it’s the wrong answer too.

The first piece of this puzzle is a detail Mike wrote in his initial post:

Iteration allows you to decompose some work into a sequence of steps, but which still execute serially as a single job.

And this quote is probably the most pivotal thought about the whole feature. Iterable jobs allow you to do some work as a serial execution (as in, strictly one-after-the-other), safely. This is wholly different from parallel division, where we want to do all the work at once (in parallel, all at the same time).

The best way to describe the difference is an example. While Mike’s blog post used an example of checking an image on each product in a system, that might be confusing — checking product images is more likely an example of work that should be decomposed with parallel division! There’s likely no reason we can’t check all of the product images in parallel.

To find a fitting example for iterable jobs, we need to think about a system that really does need to execute one record/object at-a-time; where the sequence is part of the requirement.

To illustrate this, let’s consider a bank. More specifically, let’s think about a single account at that bank. In its simplest abstraction, each account is a ledger, and each ledger essentially looks like this:

Date         Description          Amount    Balance
07/03/2024   Opened Account       +500.00   500.00
07/08/2024   Paid John for XyZ    -55.00    445.00
07/10/2024   Paycheck from work   +500.00   945.00
07/17/2024   Kroger grocery       -122.88   PENDING
07/17/2024   Shell Gasoline       -42.22    PENDING
etc.

The subtle detail about a bank ledger is that each transaction needs to be processed in sequence for historical correctness. This is true for many background processes at the bank, but in the most plain sense, it’s true because we need to show what the “Balance” was at the point in time that the transaction was processed.

We can see a couple of pending transactions in our example ledger — these haven’t been cleared, finalized, and processed yet. Let’s assume we’ll process those overnight with a job we’ll write here.

Considering all of the discussion around job patterns above, let’s think about how we might write this job. We’ll have an Account record that should be processed (this ledger), and we’ll need to process each pending charge on the account, but we have some options on how we might do it.

Let’s say that we begin by using the parallel division method — maybe something like this:

class AccountBatchJob < ApplicationJob # < ActiveJob
  def perform(account_id)
    account = Account.find(account_id)

    account.pending_transactions.find_each do |tns|
      PendingTransactionJob.perform_later(tns.id)
    end
  end
end

#

class PendingTransactionJob < ApplicationJob # < ActiveJob
  def perform(transaction_id)
    transaction = Transaction.find(transaction_id)

    # assume Transaction has_one :prior_transaction
    previous_balance = transaction.prior_transaction.balance_after

    transaction.update!(
      cleared: true,
      balance_after: previous_balance + transaction.amount
    )
  end
end

At first glance, this may seem fine. We kick off a sub-job for each pending transaction on the account and each sub-job determines the balance-after for that transaction.

Put into context, that means the batch job would kick off a sub-job for the “Kroger grocery” transaction and the “Shell Gasoline” transaction. Our job processor would then pick up the sub-job for the “Kroger grocery” transaction, determine its balance-after (from the already-cleared “Paycheck from work” transaction), save the data, and move on. It would then grab the sub-job for the “Shell Gasoline” transaction, determine its balance-after (from the just-finished “Kroger grocery” transaction), save the data, and move on. Everything works! Everything’s processed. Both records were cleared and had their balance-after set correctly.

Except that’s not how it would happen.

Our batch job would indeed kick off two sub-jobs, one for each pending transaction. But we can’t guarantee the order in which those sub-jobs will actually execute. What happens if the sub-job for the “Shell Gasoline” transaction runs first — before the “Kroger grocery” transaction is processed? How will it determine the balance-after when the prior_transaction.balance_after isn’t set yet? It can’t! We’ve got a problem.

Pure parallel-division decomposition doesn’t work in systems that require sequential execution. Parallel decomposition has no guarantees around the order in which sub-jobs are executed, and actually intends for them to be executed in parallel, with no knowledge of each other. As such, parallel division is the wrong solution for our bank-ledger problem.

This, however, is exactly where iterable jobs do solve the problem. This goes back to the “serially” bit of Mike’s blog post:

Iteration allows you to decompose some work into a sequence of steps, but which still execute serially as a single job.

In this case, we might draft up our iterable job in this way:

class BalanceAccountJob
  include Sidekiq::IterableJob

  def build_enumerator(account_id, *args, cursor:)
    account = Account.find(account_id)

    active_record_records_enumerator(account.pending_transactions, cursor: cursor)
  end

  def each_iteration(transaction, *args)
    # assume Transaction has_one :prior_transaction
    previous_balance = transaction.prior_transaction.balance_after

    transaction.update!(
      cleared: true,
      balance_after: previous_balance + transaction.amount
    )
  end

  def on_complete
    logger.info { "Account ledger is processed!" }
  end
end

While the decomposition of work looks similar here, it’s Sidekiq’s guarantee around iterable jobs being executed sequentially (one iteration at-a-time, in order) that provides the power here. Because of that promise, we can be certain that when each_iteration runs, the prior record has completed its own each_iteration pass. So we can be sure that transaction.prior_transaction.balance_after actually exists! This is the right solution for this problem.

Some Differences

What we hoped to highlight with that example is the difference between breaking down a large chunk of work into small chunks that can be executed in parallel (they don’t depend on each other) and a large chunk of work that can be broken down but where the small chunks need to execute in order (they do depend on one another). That is the key difference between batching and iteration.

And, it’s worth noting explicitly: to use one when you really ought to use the other is likely going to be unpleasant! So it’s very much worth taking the time to deliberately determine which pattern you need to implement for your given problem.

It’s also worth saying that, in the grand scheme of web computing, the need for sequence-dependent work is fairly rare! Iterable jobs are certainly the right solution when you genuinely need one-by-one, sequential processing, but those cases tend to be uncommon. We recommend being cautious when determining whether an iterable job is right for your need!

Another contrast to keep in mind is how both of these patterns can scale to accommodate the work. This might seem obvious, but we want to highlight it anyway — when you need to do work in a one-by-one fashion, you can’t scale up more dynos/containers and expect the work to get done faster. It doesn’t matter how much extra capacity you have; capacity isn’t the bottleneck. When it’s sequential, the work will only execute as fast as a single processor can handle. This is the opposite of parallel division: scale up to the moon and finish everything in record time! Again, this may seem obvious, but it’s important to keep in mind. The only way to ‘speed up’ the overall runtime of sequential / iterable jobs is to use a faster CPU for the single core that’s executing the sequence of work (and often these gains are marginal — modern CPUs are all very fast). There’s really no way to leverage autoscaling for enhanced performance in the iterable/sequential job world.

Why Not Both?


There may be interesting use-cases where we can actually combine both patterns to accomplish a greater system goal. Our bank example actually illustrates this concept very well. It goes like this: while processing one single account may need to be sequential, and thus an iterable job, the bank needs to process all accounts every evening. Since processing one account doesn’t depend on any other account being already-processed, we can use a batch job and parallel division to kick off an iterable job for each account. The iterable jobs will still operate in their correct, sequential way, but now we’ve parallelized the processing across accounts!
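Here’s a sketch of what that nightly kickoff could look like, reusing the BalanceAccountJob from above (the NightlyAccountsBatchJob name is ours):

class NightlyAccountsBatchJob < ApplicationJob # < ActiveJob
  # Parallel division at the account level: each account gets its own
  # iterable job, and each of those jobs processes its transactions
  # sequentially, as shown above
  def perform
    Account.ids.each do |account_id|
      BalanceAccountJob.perform_async(account_id)
    end
  end
end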

This can be extremely powerful — we gain the benefits of autoscaling thanks to the parallel division and the promises of sequential processing thanks to the iterable sub-jobs.

This pattern (combining parallel division and sequential iteration) requires a pretty specific use-case and likely isn’t applicable to most businesses and organizations, but it’s a neat idea that would be fun to implement.

Wrap Up

Okay, let’s circle back on a few things.

First, if you simply have a large chunk of work to get done and you want to break it down into background jobs to accomplish it quicker (and with stability), use parallel division to give each record/object its own job. Then, after you’ve designed and set that up, prepare your autoscaling to spin up several more dynos/containers so that your workers crunch through all those jobs quickly!

Second, if you discover along the way that your parallel division actually breaks down since processing one object/record depends on another in the set having already been processed (an intra-set dependency), consider thoughtfully if there may be another way around. If not, you may need to reach for iterable jobs. There’s a speed cost here (a sequential job setup is almost certainly going to take longer, end-to-end, than a parallel division approach), but if that’s what your system requires to compute its data, so be it!

Third, take a moment to determine if you can leverage both patterns to get some of the best of both worlds. That could mean the example described above, where each record/object has a requirement for sequential processing internally but each record/object can be processed in parallel to other records. That could mean cleverly determining how much of the processing needs to be sequential and what other chunks could be parallel-divided. Only you can say! It’s worth taking a bit of time to determine the maximum amount of work you can do in a parallel-division style due to its performance benefits.

Follow those steps and you’ll be on your way!

Ultimately, we wanted to write this article to shed some light on this new feature, iterable jobs, and the various cases it may be helpful in. And, of course, the various cases it won’t be helpful in. Iterable jobs are a powerful new tool in our tool-belt and should be wielded with care — like any tool, they’re not the right solution for every job… but you’ll be thankful to have them when the particular need arises. We hope this article gave you some clarity around this new feature and sparked some ideas about how to leverage it!

👀 Note

Interested in other use-cases for Sidekiq Iterable Jobs? We published a follow-up to this post which introduces a novel idea: infinitely iterable jobs. Check it out here!