Planning your Sidekiq queues

Adam McCrea

@adamlogic

A Judoscale customer recently wrote in asking for feedback and advice on their Sidekiq queue structure, and it’s a question we see often enough to warrant a full post of its own. Here are the specific questions we want to answer in this post:

  • “How should our team structure our Sidekiq queues?”
  • “How should I think about spreading queues across Sidekiq processes?”
  • “How do I avoid queue back-ups or over-provisioned resources?”

Being intentional about your Sidekiq setup can help you avoid an explosion of queues and unnecessary complexity, and efficiently utilize your resources (and cost 💰). We’re going to give you some simple strategies that will yield big gains!

Some Sidekiq basics

Before we answer the questions above, we need to understand how Sidekiq works and the basic terminology.

  • A job is a single unit of work in Sidekiq
    • E.g. send_an_email_job
  • A job gets pushed onto a queue that lives in Redis
    • E.g. email_jobs_queue
  • A Sidekiq thread pops jobs from one or more queues and executes the work.
  • A Sidekiq process manages the threads
    • E.g. a sidekiq-worker process defined in your Procfile
  • The process runs on a container, which you might also call a “service” or “worker dyno” (Heroku’s terminology)
    • E.g. the actual dyno Heroku spins up to implement that Procfile process

For a typical Heroku app, this might look like the following:
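(A minimal sketch; the class, queue, and process names below are illustrative, not prescriptive.)

  # app/sidekiq/send_an_email_job.rb
  # A "job" is a single unit of work; enqueuing it pushes an entry onto a queue in Redis.
  class SendAnEmailJob
    include Sidekiq::Job # Sidekiq::Worker on older Sidekiq versions

    sidekiq_options queue: "email_jobs_queue"

    def perform(user_id)
      # ... do the actual work, e.g. send the email ...
    end
  end

And the matching Procfile entry, i.e. the Sidekiq process that Heroku runs on a worker dyno:

  sidekiq-worker: bundle exec sidekiq -q email_jobs_queue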


You, the app developer, decide how many queues exist, what you call those queues, and which jobs go into which queues. You also decide which processes monitor which queues (called queue assignment), how many threads to run per process (called concurrency), and how many containers to run in parallel (called horizontal scaling)...


That’s a lot of decisions! And we haven’t even touched on configuring the compute characteristics of the containers themselves (say hello to vertical scaling).

But have no fear—it doesn’t have to be complicated! With a clear mental model for how these concepts interact and a sensible starting point, you (yes, you!) can easily scale your Sidekiq system.

How Sidekiq queues get out of hand

Out of the box, Sidekiq gives you a single queue called “default.” This means:

  • A job class that doesn’t specify a queue will enqueue jobs to the “default” queue
  • Running bundle exec sidekiq without specifying a queue will fetch jobs from the “default” queue
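Put differently, the zero-configuration path looks something like this (the job class here is hypothetical):

  # No queue specified, so jobs go to the "default" queue
  class CleanupJob
    include Sidekiq::Job

    def perform
      # ...
    end
  end

  CleanupJob.perform_async   # enqueued to "default"

  # And with no -q flag, Sidekiq only fetches from "default":
  # $ bundle exec sidekiq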


Could you just roll with this default setup? Sure!

Should you? Absolutely not! ❌

The reality is that for a brand-new app, this simple setup would work just fine… for a little while. But eventually, you’re going to want another queue. One of the following scenarios will pop up, and probably sooner than you think:

  • You have a job that needs to be run quickly, so you want it to have a higher priority than your other jobs
  • You have a job that takes a while to run, so you want it to have lower priority so it doesn’t block other jobs

When you encounter these scenarios, you might create an “urgent” queue and a “low” queue. You update your Sidekiq command to look like this:

bundle exec sidekiq -q urgent -q default -q low

This ensures that Sidekiq picks up jobs from the urgent queue first, then default, then low. And it’ll probably work great… for a little while.
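Individual job classes opt into those queues via sidekiq_options; something like this, with hypothetical job classes just to show the mechanism:

  class SendReceiptJob
    include Sidekiq::Job
    sidekiq_options queue: "urgent"

    def perform(order_id)
      # ...
    end
  end

  class GenerateReportJob
    include Sidekiq::Job
    sidekiq_options queue: "low"

    def perform(report_id)
      # ...
    end
  end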


The problem with this setup is that the queue names are ambiguous. How urgent is “urgent”? What does “low” even mean?

Inevitably, you’ll have many degrees of urgency, and this ambiguity will result in an explosion of Sidekiq queues. Now you’ve got very_urgent and most_urgent and seven other ambiguous queues. You might also end up with queues for specific business functions like csv_exports and forgot_password_emails.


This is a mess! And there is a much better way. Let’s level up.

Latency is everything

Let’s take a step back and talk about Sidekiq metrics. We need to figure out how to measure our queues — how healthy they are, how fast they’re running, and when we’ve hit a bad point. Like most queueing systems, we’re typically talking about two key metrics:

  • Queue depth: how many jobs are in a queue waiting to be processed
  • Queue latency: how long any given job waits in the queue before it’s processed
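Both numbers are easy to inspect from Sidekiq’s API, e.g. in a Rails console (the queue name here is illustrative):

  require "sidekiq/api"

  queue = Sidekiq::Queue.new("default")
  queue.size     # queue depth: how many jobs are waiting in the queue
  queue.latency  # queue latency: seconds the oldest job has been waiting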


Queue depth is easier to visualize, so it’s often mistaken as the most important metric. Unfortunately, queue depth is a lie, and I’ll show you why.

Imagine two queues, both single-threaded:

  • Queue A has 10 jobs enqueued. Each job takes one second to run
  • Queue B has 10,000 jobs enqueued. Each job takes one millisecond to run


One of these queues might appear to be “backed up” because it has a high queue depth (10,000 jobs), but in reality the queue “health” is the same—they will both clear their backlog in 10 seconds. That’s queue latency, and it’s the number that matters.

So, is a 10-second queue latency good or bad?

Of course, it depends. That’s a business decision based on the kind of work those jobs are doing. Here we find ourselves back at the concept of “urgency”, but now we can quantify it.

The clarity of latency-based queues

Now that we’ve established queue latency as the metric we care about, we can fix the ambiguity of our previous queues:

  • “urgent” becomes “within_5_seconds”
  • “default” becomes “within_5_minutes”
  • “low” becomes “within_5_hours”

That is, if I push a job to the within_5_seconds queue, I should expect that job to begin processing within five seconds!
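For example, a hypothetical receipt-email job would simply declare the queue that matches its latency expectation:

  class SendReceiptJob
    include Sidekiq::Job
    sidekiq_options queue: "within_5_seconds"

    def perform(order_id)
      # ...
    end
  end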

Of course, you can change those numbers to whatever you want, and you can have more or fewer queues. The specifics don’t matter. What matters is being explicit.

We’ve encoded explicit latency expectations directly in our queue names, and by doing so we’ve avoided several problems:

  • When implementing a new job, choosing a queue is a business decision around when the logic of that job ought to start running, not an arbitrary technical decision
  • We avoid adding unnecessary new queues because every new job will fit into an existing queue
  • We have clear performance expectations for our queues, which will guide our scaling and auto-scaling plan

Each of these benefits is hard to overstate. As a developer writing a new job for your application, it’s so much easier to choose a queue when the queue names communicate the latency expectation. If we’re writing a job to email someone their password-reset link, it’s as simple as, “well, it should run faster than within_5_minutes, so we’ll put it in the within_5_seconds queue!” Developing jobs is a significantly nicer experience when our queue names encode our latency expectations. But, of course, this only works when the jobs in the queue actually do start within five seconds! And to ensure they do, we need to introduce scaling.

Scaling Sidekiq queues the easy way

Scaling Sidekiq is typically all about avoiding a queue backlog, but as we’ve discussed, we won’t be distracted by queue depth. Latency is what matters, and now that we have our latency expectations encoded in our queue names, we can quantify our scaling goals:

Each queue’s latency should stay within the target encoded in its name, using as few resources as possible.

You can’t do this manually. Job throughput fluctuates too much to know what resources you need at any given time.

Autoscaling makes this trivial as long as your autoscaler supports queue latency (sometimes called ‘job queue time’) as the trigger metric. This is the reason we built Judoscale, so check it out if you’re not already autoscaling Sidekiq based on queue latency.


But your autoscale settings will depend on your queue assignment, and that’s often the trickiest part to get right.

Assigning queues to processes

It’s easiest to think of queue assignment as a spectrum. On one end is a single process that’s watching all of your queues, and on the other end each queue has a dedicated process. In between are many multi-process setups, where some processes watch multiple queues.


The benefit of a single process watching all of your queues is that it’s the simplest to set up, and it’s typically the most resource-efficient. The downsides are that you risk long-running jobs blocking other high-priority jobs, and it’s harder to autoscale. Let’s dig into each of these.
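In Procfile terms, the single-process end of the spectrum might look like this (one process, strict priority ordering, highest-urgency queue first; queue names from earlier):

  worker: bundle exec sidekiq -q within_5_seconds -q within_5_minutes -q within_5_hours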

Let’s say we have a single process that’s watching three queues. We obviously want our highest-urgency queue to have the highest priority when fetching jobs. When there’s nothing in the high-urgency queues, the process will fetch jobs from the lowest-priority queue. Now let’s say a bunch of those low-priority jobs are also high-effort, maybe taking several minutes to run (I would call any job taking more than a few seconds a long-running job).

Our process picks up those jobs and starts processing. Meanwhile, some high-urgency jobs start to enqueue. Even though that queue has highest priority, all of our worker threads are busy with other jobs. The long-running jobs have effectively blocked all of the queues. 😱


Whether you use strict priorities or weighted sampling to decide how a process pulls from multiple queues, nothing can help you once that process is actually busy with long-running jobs! It simply can’t pull another job — it’s already busy!

In the dedicated process setup, this could never happen. Long-running jobs would only block other jobs in their own queue, which, in this case, is a low-priority queue and not a problem. High-urgency jobs would only block other high-urgency jobs (and hopefully only briefly)!
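In Procfile terms, the dedicated-process-per-queue setup is one Sidekiq process per queue (process names are illustrative):

  worker_5sec: bundle exec sidekiq -q within_5_seconds
  worker_5min: bundle exec sidekiq -q within_5_minutes
  worker_5hr: bundle exec sidekiq -q within_5_hours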


That’s both simple and clear!

But aside from how workers chew through the jobs in their queues, what about autoscaling the worker processes? Let’s start with the dedicated worker-per-queue setup. With this structure, it’s actually super simple: you autoscale each process based on the latency of the one queue assigned to it. The latency expectation is codified in the queue name, so set that as your latency threshold, and you’re set. The within_5_seconds queue is worked only by its dedicated process, which is autoscaled so that it always keeps latency within five seconds. Everything is aligned.


Alternatively, with a single process watching all the queues, you’re forced to autoscale based on your highest-urgency queue. Using our example queues from earlier, this would be five seconds. But we don’t want to scale up if our within_5_hours queue has a latency over five seconds, and this is where things get muddy with a single process. You’ll typically have autoscaling monitor only the highest-urgency queue in a process, but then you might not scale up when your other queues need it.

That can be dangerous, so let’s walk through it a little bit. Say we have one process watching all three queues: within_5_seconds, within_5_minutes, and within_5_hours. Since we only have one process, we set up autoscaling to watch only the within_5_seconds queue and scale up if that queue gets slow. Some time later, our app fires off a few thousand jobs in the within_5_minutes queue, each taking several seconds to run. Ten minutes later, there are still lots of jobs in the queue! We’ve broken the expectation of the within_5_minutes queue, and our autoscaling didn’t scale up to accommodate the latency because it’s only watching the within_5_seconds queue! This is the inevitable trade-off of having a single process watch more than one queue: there will always be an opportunity to miss a latency expectation.

What if we run a mixed setup, where some processes are dedicated to a single queue and others watch a few queues? This setup can work, but we don’t recommend it overall. As mentioned above, any time a single process watches more than one queue, there’s an opportunity for expectation failure. It’s just not worth it!

In reality, the best answer here is to run dedicated processes per queue. Aside from making the mental model simpler and clearer, this setup makes autoscaling a breeze. Of course, everything has trade-offs — what’s the downside to running dedicated processes? Cost... kind of.

At face value, the idea that we’ll run dedicated dynos for each queue does mean we’ll incur the cost of all those extra dynos, but again, autoscaling to the rescue! Judoscale can actually scale your worker process down to zero dynos when there aren’t any jobs to run! No cost incurred when you’re not running any dynos, right? Then, once a job comes in, Judoscale will scale your process back up to handle that job. Judoscale does this 24/7, so, while you’ll have a bit of extra cost when running dedicated dynos, it’s nothing close to actually running extra dynos 24/7 if you’re not processing jobs 24/7.

And with the cost being an almost-non-issue, that means there really are no downsides to running dedicated processes per queue!


Special queues for special jobs

Occasionally, you’ll run into a job that has unique performance characteristics:

  • A job that requires much more memory than other jobs. Import and Export jobs sometimes fall into this bucket
  • A job that requires much higher CPU than other jobs. Doing some LLM work, perhaps?
  • A job that depends on a slow or unreliable external API

These jobs still have some kind of latency requirement, but they might cause problems lumped into our latency-based queues. In cases like these, it can make sense to create a special queue that’s not tied to latency.

I still recommend that you name the queue based on its unique performance characteristics rather than a business function. For example, “high_memory” is better than “exports”, because future high-memory jobs can utilize the same queue instead of creating a new queue.

In terms of queue assignment, non-latency queues should almost always have a dedicated process. High-memory and high-CPU jobs live in their own queue so that you can control the compute resources you dedicate to them—there’s no reason to upgrade your entire worker fleet just because a few jobs are memory hogs. You might also want these dedicated queues to be single-threaded to constrain resource usage.
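A sketch of what that could look like, with a hypothetical export job on its own queue:

  class ExportAccountDataJob
    include Sidekiq::Job
    sidekiq_options queue: "high_memory"

    def perform(account_id)
      # ... memory-hungry export work ...
    end
  end

And in the Procfile, a dedicated process with concurrency dialed down to a single thread:

  high_memory_worker: bundle exec sidekiq -q high_memory -c 1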

It’s also worth noting that scaling these queues’ processes down to zero instances can go a long way toward curbing the cost of these additional dedicated processes, especially if (like many export jobs) they only run a handful of times per day. Don’t pay for compute you’re not using!

Your recipe for Sidekiq bliss (TL;DR)

Now that we’ve gone deep into the why, let’s come back and answer the questions from the beginning:

How should our team structure our Sidekiq queues?

Your queues should be named for the latency requirements of the jobs in those queues. Three latency-based queues (within_5_seconds, within_5_minutes, and within_5_hours) make a good starting point. Over time you might need a small number of special queues based on unique performance needs, but do your best to stick to your latency-named queues.

How should I think about spreading queues across Sidekiq processes?

The easiest and most efficient answer here is to run a dedicated process per queue. The mental model is simpler, autoscaling is much cleaner, and, with the right tool (like Judoscale), you can scale each process down to zero dynos/instances, meaning you’re not spending much extra for the added clarity of separate processes.

How do I avoid queue back-ups or over-provisioned resources?

Set up autoscaling for each process according to its queue-time expectation (the one in its name). E.g. configure autoscaling for the process handling your within_5_seconds queue so that it scales up any time job latency rises above five seconds. This aligns your queue-time expectation with your scaling behavior! It should keep that expectation true essentially all the time: the moment it breaks, another dyno spins up to bring things back into spec.

On the other hand, you can also set up autoscaling to scale processes down to zero dynos when there are no jobs to run. This is essentially free savings 🤑! Don’t spend money on dynos you don’t need to be running. This is a core feature of Judoscale, so check it out if you’re using another setup.

One note: we wouldn’t recommend scaling your within_5_seconds queue/process down to zero. Given dyno boot time, jobs that land in that queue while no dynos are running could take 30-45 seconds to finally start, which is not “within five seconds”!

Sidekiq Setup Can Be Easy

We hope this guide gives you several new ideas to consider, a plan for implementing Sidekiq in your app, and some cost savings you can put in place right away. Just remember the TL;DR above and you’ll avoid most of the background-job headaches that teams wade through year after year.

If you follow these steps, you’ll have a Sidekiq setup that’s simple, reliable, and scalable! And if you have questions along the way, feel free to reach out. We’ve seen the good, the bad, and the ugly, and we’re happy to share our time and assistance 🙂.


P.S. Container Size

While this guide is mostly about application strategy and autoscaling for Sidekiq, we’d be remiss if we didn’t include a note at the end on vertical scaling: how big of a container/dyno should you use for Sidekiq?!

While it’ll be different for every platform, start simple.

On Heroku, we recommend starting with Std-1X’s for your Sidekiq processes. As we mention in our Ultimate Guide to Autoscaling Heroku, smaller containers give you finer-grained autoscaling steps! If we can use Std-1X’s, we should.

As with any Rails process, memory will almost certainly be your limiting factor here, not CPU. So use Std-1X’s with the default number of Sidekiq threads and see how memory looks after running a normal job load for a few hours.


If the dyno is operating within the memory limits Heroku sets for Std-1X’s, you’re good to go! If you’re grossly over the memory limit, you’ll need to either reduce your Sidekiq threads or go up to Std-2X’s.
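If you do need to dial back threads, concurrency is set per process, e.g. via the -c flag; the value below is just an example to tune for your own app:

  worker_5min: bundle exec sidekiq -q within_5_minutes -c 5

Lower concurrency means less memory used per dyno at the cost of throughput per dyno, and autoscaling can make up the difference by adding dynos when latency calls for it.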