Latency-based Celery Queues in Python

Jeff Morhous
If you’ve worked with Celery in production with real traffic, you’ve probably hit one of its many sharp edges. Maybe you’ve watched a simple background job silently pile up in an unmonitored queue.
Or maybe you’ve built out a tidy set of queues only to find your high-priority jobs are getting stuck behind slow (and unimportant) ones. Celery gives you powerful tools, but few guardrails.
These pain points usually stem from queue planning problems. Most teams slap labels like high_priority or emails on queues without defining what those mean.
If you plan your Python task queues around latency, you’ll have more predictable (and scalable) results. Ready to get started?
The basics of Celery Queues
Before we get into queue planning, let’s clarify some Celery terminology. If you already have a great understanding of how Celery works, feel free to skip to the next section.
Celery tasks
In Celery, a task is a single unit of work. For example, send_email_task might send a welcome email.
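A minimal task might look like this sketch (the app name and Redis broker URL are assumptions, not something your project needs to match):

```python
from celery import Celery

# A minimal Celery app; the broker URL here assumes a local Redis instance.
app = Celery("myapp", broker="redis://localhost:6379/0")

@app.task
def send_email_task(user_id):
    # In a real app, this would look up the user and send the welcome email.
    print(f"Sending welcome email to user {user_id}")
```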
Celery queues
A queue in Celery refers to a named channel on the broker (like a Redis list or RabbitMQ queue) where tasks wait to be processed. By default, Celery uses a queue named "celery" (if you don’t specify one).
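As a sketch (reusing the hypothetical send_email_task from above, assumed to live in tasks.py), you pick the queue at call time; if you don’t, the task lands on the default queue:

```python
from tasks import send_email_task  # assuming the task sketched above lives in tasks.py

# No queue specified: the task goes to the default "celery" queue.
send_email_task.delay(42)

# Explicit queue: the task goes to a named queue instead (the name is just an example).
send_email_task.apply_async(args=[42], queue="emails")
```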
Celery workers
A worker is a Celery process that runs tasks. A worker can run multiple tasks concurrently, depending on its concurrency setting.
Celery concurrency
Concurrency refers to the number of tasks a worker can process at the same time. In prefork mode, this is the number of child processes (often defaults to the number of OS-reported CPUs).
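You usually set this when starting the worker (for example, celery -A tasks worker --concurrency=4), but it can also live in configuration. A tiny sketch, assuming the app object from earlier:

```python
# Equivalent to starting the worker with: celery -A tasks worker --concurrency=4
app.conf.worker_concurrency = 4
```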
Decisions you have to make when using Celery
In a typical deployment, you must decide how many queues to use and what they are called, which tasks go to each queue, and how many worker processes will consume each queue.
You also choose how many threads/processes each worker has (concurrency) and how many total containers to run (horizontal scaling). That’s a lot of decisions!
So let’s dig into how you can make these decisions with scaling in mind.
Why Celery queues run into problems at scale
Out of the box, Celery uses a single queue (named "celery" by default). If a task doesn’t specify a queue, it goes to that default queue. If you start a worker without specifying -Q, it consumes the default queue.
Could you build an app with just one queue? Sure. But please don’t.
Not every task is created equal
For a brand-new project, one queue might work fine for a short while. But very soon, you’ll encounter scenarios that push you to create additional queues:
- You have a task that needs to run quickly (a high-priority job), so you want it processed before other tasks.
- You have a task that takes a long time to run (perhaps several seconds or minutes), and you want it to have lower priority or even separate handling so it doesn’t block faster tasks.
In response, teams might eventually create ad-hoc queues like "urgent" for high priority and "low" for slow tasks.
Ambiguous queue names
However, there’s a big problem. Those queue names are ambiguous.
How urgent is “urgent”? What does “low” mean, exactly? As your application grows, you’ll find there are varying degrees of priority. One developer might add very_urgent or critical queues; another might introduce a queue for a specific feature like reports or emails.
Before you know it, you have a sprawl of Celery queues without a clear hierarchy or expectations.
Latency-based queues
Take a step back and consider what metrics define the “health” of a task queue. Three key metrics are commonly used:
- Worker CPU: How taxed is the CPU for worker processes?
- Queue depth: How many tasks are waiting in the queue (queue length).
- Queue latency: How long a task waits in the queue before a worker starts processing it (sometimes called queue time).
Worker CPU can be used, but it doesn’t tell you much about the queue itself. It’s an indication (and often a trailing one) of how busy a worker process is while it runs an individual task. Task queues often back up without spiking CPU at all, giving a false sense of worker health.
Queue depth is easy to visualize (a simple count of jobs), so many people focus on it. But queue depth can be very misleading: the number of tasks doesn’t tell you how long they’ll take to clear.
For example, imagine two queues, each handled by one worker process:
- Queue A has 10 jobs enqueued, and each job takes ~1 second to run.
- Queue B has 10,000 jobs enqueued, but each job takes ~1 millisecond to run.
Queue B might look “backed up” at a glance, but in reality, both queues will finish their work in about 10 seconds. The latency (wait time) for jobs in both queues is the same ~10 seconds, and that’s the metric that truly matters.
✅ Tip
Queue latency tells the real story about how well a queue is doing.
So, is a 10-second wait time good or bad? It depends.
The acceptable latency for a queue is a business decision. It depends on what the tasks are doing and how quickly that work needs to begin. This brings us back to the notion of “urgency”, but now we can quantify it. Instead of calling a queue “urgent” in a vague sense, we decide what latency is acceptable for that queue’s tasks.
Latency SLA queue names
If you’re convinced queue latency is the right metric to measure performance, you should fix the ambiguity in your queue names. Naming your queues after their latency targets (SLAs) is a great way to set yourself up for success.
For example:
- “urgent” becomes within_5_seconds (tasks should start within 5 seconds)
- “default” becomes within_5_minutes (tasks should start within 5 minutes)
- “low” becomes within_5_hours (tasks should start within 5 hours)
If I push a task to the within_5_seconds queue, I’m explicitly saying I expect that job to begin processing within five seconds. The name of the queue communicates the expectation.
You can choose whatever latency thresholds make sense for your app; the specifics aren’t as important as the explicitness of the naming.
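One way to wire this up is Celery’s task_routes setting, so individual call sites don’t have to remember queue names. This is just a sketch; the task paths and broker URL are hypothetical:

```python
from celery import Celery

app = Celery("myapp", broker="redis://localhost:6379/0")

# Route tasks to latency-named queues in one place (task paths are examples).
app.conf.task_routes = {
    "myapp.tasks.send_password_reset": {"queue": "within_5_seconds"},
    "myapp.tasks.send_welcome_email": {"queue": "within_5_minutes"},
    "myapp.tasks.rebuild_search_index": {"queue": "within_5_hours"},
}

# Anything unrouted falls back to the default queue, which can also be a latency tier.
app.conf.task_default_queue = "within_5_minutes"
```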
By communicating latency expectations in the queue names, we get a few important things.
First, you’ll end up with fewer queues. You’re far less likely to create a new queue per feature or whim. Almost every new task will fit into an existing latency category. This should remove the temptation of one-off queues that don’t serve a strategic purpose.
Second, each queue now has a performance target (its name). This gives clarity for monitoring. If the within_5_minutes queue starts seeing 10-minute latencies, you have an unambiguous problem.
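How do you get that latency number? An APM or a latency-aware autoscaler can report it for you, but as a low-tech sketch of the idea, you can stamp the enqueue time when you publish a task and compare it when the worker starts. The task, queue name, and broker URL here are illustrative:

```python
import time

from celery import Celery

app = Celery("myapp", broker="redis://localhost:6379/0")

@app.task
def send_email_task(user_id, enqueued_at=None):
    if enqueued_at is not None:
        # How long the task waited in the queue before a worker picked it up.
        queue_latency = time.time() - enqueued_at
        # In a real app, report this to your metrics system instead of printing it.
        print(f"queue latency: {queue_latency:.2f}s")
    ...  # do the actual work

# At the call site, stamp the enqueue time so the worker can measure the wait.
send_email_task.apply_async(
    kwargs={"user_id": 42, "enqueued_at": time.time()},
    queue="within_5_seconds",
)
```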
Of course, naming queues “within_X” doesn’t magically make tasks start within X time – you have to ensure enough worker capacity to meet those targets. That’s where scaling comes in.
Fortunately, this strategy makes it crazy easy to decide when to spin up more (or fewer) workers to scale, but we’ll talk more about that later.
Simple ways to scale Celery queues
Typically, the goal of scaling a Celery worker pool is to avoid a queue backlog.
Now that our queue names encode latency expectations, we can define a clear scaling goal for each queue:
✅ Tip
Each queue’s latency should stay within its target (as named), without overprovisioning resources.
For most people, traffic and job volumes fluctuate too much to maintain this manually. You’ll want to autoscale your workers based on queue latency. With autoscaling in place, meeting those latency targets becomes trivial.
When jobs start waiting too long, spin up more workers; when the queues are empty, spin them down.
For example, if the within_5_seconds queue’s jobs are waiting >5 seconds, your autoscaler should add another worker process (or increase concurrency) for that queue. If the queue’s latency stays under 5 seconds, you can maybe scale down. We’ll talk about how to assign workers to queues next, which affects how you set up autoscaling triggers.
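If you’re wiring up your own scaling triggers, the core decision per queue is a comparison against the target encoded in the name. A sketch, where the current latency comes from wherever you track it (such as the timestamp approach above):

```python
# Latency targets, in seconds, implied by the queue names.
LATENCY_TARGETS = {
    "within_5_seconds": 5,
    "within_5_minutes": 5 * 60,
    "within_5_hours": 5 * 60 * 60,
}

def scaling_action(queue_name: str, current_latency_seconds: float) -> str:
    """Return a (hypothetical) scaling decision for one queue."""
    target = LATENCY_TARGETS[queue_name]
    if current_latency_seconds > target:
        return "scale_up"    # jobs are waiting longer than the queue's name promises
    if current_latency_seconds < target * 0.1:
        return "scale_down"  # plenty of headroom; consider removing a worker
    return "hold"
```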
👀 Note
Built-in autoscalers default to CPU usage for scaling. Judoscale is a great autoscaler add-on that can scale your queues based on queue latency!
Speaking of queue assignment, how should we split up queues across Celery workers? I have a few opinions!
Your options for matching workers to queues
When it comes to queue-to-worker assignment, you have a spectrum of options. At one extreme, a single set of workers pulls from all queues. At the other, each queue gets its own dedicated workers.
In between these two extremes, you might run some workers that each handle a subset of queues.
Running a single worker pool for all queues
Running a single worker pool for all queues is the simplest setup. It’s resource-efficient since any free worker can work on any task, and you don’t need to worry about balancing workers between queues.
However, the downsides are significant. You risk long-running tasks blocking high-priority tasks, plus it’s harder to autoscale effectively for all latency goals at once.
For example, suppose one Celery worker (with concurrency 4) is consuming within_5_seconds, within_5_minutes, and within_5_hours queues. If it picks up several very slow within_5_hours tasks (say tasks that each take minutes to execute) on all its worker processes, and then a bunch of new within_5_seconds tasks arrive, those fast tasks can’t start until a process is free.
All processes are busy churning on slow jobs, so even though the within_5_seconds queue is the highest priority, it’s effectively blocked. This defeats the purpose of having a fast queue!
Dedicated workers per queue
In this setup, each queue gets its own Celery worker process (or pool).
For example, you might start one set of workers with -Q within_5_seconds, another with -Q within_5_minutes, and so on. This completely isolates each latency tier.
The slow jobs in the 5-hour queue can never block the 5-second jobs, because they’re handled by different workers on possibly different machines.
Autoscaling becomes much cleaner because you can scale each worker deployment based on that queue’s latency threshold. The within_5_minutes workers only care about keeping that queue under 5 minutes latency, and if they’re idle, you can scale them down without affecting the queue time of unrelated queues.
The mental model is simpler, and each queue’s performance can be managed separately. The primary downside is the cost of running more separate processes.
The cost difference between one big worker vs. multiple smaller dedicated workers is often minor, and it’s far outweighed by the performance improvements. With dedicated per-queue workers, you also avoid starving out fast tasks with long-running ones.
A bit of both
One strategy is to try to group certain queues together on workers and isolate others. For example, maybe combine the within_5_seconds and within_5_minutes queues on one worker type, but keep the within_5_hours queue separate.
While this can work, any time you put multiple latency tiers on one worker, you reintroduce the possibility of interference. It also complicates autoscaling (which latency do you scale on for that combined worker?).
My recommendation
In summary, I recommend dedicated Celery workers per latency-based queue. It makes it straightforward to maintain each queue’s SLA.
If you’re on an autoscaling platform, set each worker deployment to scale up whenever its queue latency exceeds the target. To mitigate the potentially higher resource usage of this setup, I also recommend autoscaling your lower-priority workers (5 minutes, 5 hours, etc.) down to zero when the queues are idle. (Of course Judoscale makes this super easy 😁.)
If you’re doing this manually, you still benefit from clarity: you can monitor each queue’s wait time and add resources accordingly without guessing which queue is starved.
You should also look into other ways to effectively scale Python task queues, like fanning out large jobs.
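Fanning out means splitting one big job into many small tasks so workers can process them in parallel. A sketch using Celery’s group primitive (the task and record IDs are hypothetical):

```python
from celery import Celery, group

app = Celery("myapp", broker="redis://localhost:6379/0")

@app.task
def process_record(record_id):
    ...  # hypothetical per-record work; each invocation is small and fast

# Fan one large job out into many small tasks that workers can chew through in parallel.
record_ids = range(10_000)
job = group(process_record.s(record_id) for record_id in record_ids)
result = job.apply_async()
```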
One thing to keep in mind for Celery queues
One Celery-specific consideration that doesn’t apply to every queuing system is task acknowledgment timing. By default, Celery acknowledges a task as “received” when a worker picks it up. If the worker crashes mid-task, that task is dropped.
Setting acks_late=True (either globally or per-task) delays acknowledgment until the task completes. This means crashed tasks get redelivered, but it also means your tasks need to be idempotent, since they might run more than once.
If you’re using acks_late with Redis as your broker, pay attention to the visibility_timeout setting. This controls how long Redis waits before assuming a task was lost and redelivering it. The default is one hour. If you have tasks that need to run longer than your visibility timeout, they’ll get redelivered while still running.
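In configuration, both knobs look like this sketch (the values and broker URL are illustrative, not recommendations):

```python
from celery import Celery

app = Celery("myapp", broker="redis://localhost:6379/0")

# Acknowledge tasks only after they finish, so a crashed worker's task is redelivered.
# Tasks must then be safe to run more than once.
app.conf.task_acks_late = True

# With the Redis broker, keep the visibility timeout well above your longest task's
# runtime so in-flight tasks aren't redelivered while they're still running.
app.conf.broker_transport_options = {"visibility_timeout": 7200}  # seconds

# acks_late can also be enabled per task instead of globally:
@app.task(acks_late=True)
def rebuild_search_index():
    ...
```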
For latency-based queue planning, the practical advice is that tasks in your fast queues (like within_5_seconds, within_5_minutes) should be short enough that the visibility timeout is irrelevant. For your slow queue, make sure your longest-running tasks finish well under the visibility timeout, or increase the timeout accordingly.
Shipping performant Celery queues
This opinionated guide for setting up your Celery queues is very much inspired by the strategies we know work well in the Sidekiq world. I hope this gives you some fresh ideas and a solid game plan for taming your Celery queues.
Remember, planning your queues boils down to:
- Name queues by expected latency.
- Isolate latency tiers on separate workers to avoid cross-interference.
- Monitor and autoscale by latency.
Follow these steps, and you’ll avoid most of the common background job headaches that plague teams as they scale up.