Autoscaling Insights: What Nearly A Decade Of Autoscaling Your Apps Has Revealed To Us


Jon Sully

@jon-sully

We’ve been autoscaling apps for a long time — almost a decade! That’s long enough to see patterns repeat across Rails, Node, and Python; Redis and Postgres; Heroku, AWS, Render, and more. From those experiences we’ve put together a compendium of mistakes, misconceptions, less-intuitive ideas, and “oh wow, that does matter” moments.

We’ll organize this like a listicle, but we’ll connect the dots as we go so it reads like one story: what queue time really tells you, why shared hardware is noisy, how scheduling fits in, and how to keep your scaler from thrashing.

1) Shared hardware is noisy

Multi-tenant machines are cheaper because they’re shared, but providers overprovision their shared machines. It’s not “well, we have 10 cores so we can host 10 apps, each getting 1 core of CPU”. Depending on how aggressively the hosting provider wants to cram applications onto a single host for profit, it could be a lot more like, “well, we have 10 cores and apps tend to underutilize what’s available, so let’s stuff 30 apps onto those 10 cores” 😬

Illustration graphic showing two example servers with eight cores of compute; the first having only eight tenants across those cores, the second having tens of tenants, with the banner text on the top of the image reading ‘Shared Hardware is Overprovisioned’

When your app shares CPU with other tenants, one neighbor’s burst is another neighbor’s tail‑latency blip. We’ve talked about this several times, but the “noisy neighbor” effect is not groundbreaking or new. It’s also not a single-provider issue: every hosting platform that operates on shared hardware is subject to some degree of neighbor noise!

The takeaway: get familiar with what noisy neighbors look like in your app. Learn to read your charts so you can tell app‑wide issues from single‑dyno outliers (the noisy‑neighbor tell).
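To make “the noisy-neighbor tell” concrete, here’s a rough sketch of the kind of per-dyno comparison we mean. This is an illustration only (not Dyno Sniper’s actual implementation), and it assumes you already have a way to collect recent queue-time samples per dyno: one dyno whose p95 queue time towers over the fleet’s median, while the rest of the fleet looks healthy, is the classic signature.

```ruby
# Hypothetical sketch: flag a single-dyno queue-time outlier.
# Assumes `samples_by_dyno` maps a dyno name to an array of recent
# queue-time samples in milliseconds (collection is up to your APM/metrics).

def percentile(values, pct)
  sorted = values.sort
  sorted[((pct / 100.0) * (sorted.length - 1)).round]
end

def noisy_neighbor_suspects(samples_by_dyno, factor: 4, floor_ms: 50)
  p95_by_dyno = samples_by_dyno.transform_values { |s| percentile(s, 95) }
  fleet_median = percentile(p95_by_dyno.values, 50)

  # A dyno is suspect when its p95 queue time dwarfs the rest of the fleet
  # while the fleet as a whole looks healthy. An app-wide capacity problem
  # would push *every* dyno's queue time up together.
  p95_by_dyno.select do |_dyno, p95|
    p95 > [fleet_median * factor, floor_ms].max
  end.keys
end

samples = {
  "web.1" => [2, 3, 5, 4, 6],
  "web.2" => [3, 2, 4, 5, 3],
  "web.3" => [80, 120, 95, 140, 110] # the unlucky tenant
}
puts noisy_neighbor_suspects(samples).inspect # => ["web.3"]
```

An app-wide capacity problem, by contrast, lifts every dyno’s queue time together; that’s your cue to scale up rather than restart anything.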

The action item: enable Judoscale’s Dyno Sniper feature (by-request only as of September 2025) to automatically detect and restart services that fall prey to a noisy neighbor’s delay. It’s free. It’s magic. It works everywhere. There’s really no downside.

2) Autoscaling “for headroom” is hard (most teams miss it)

Many teams use autoscaling specifically to keep a certain level of headroom available for unknown bursts of traffic ahead. It’s not the most efficient way to run an app, but the premise can prevent downtime for highly burst-prone traffic loads. That said, actually configuring your autoscaler to do that correctly is extremely difficult. Most teams end up either constantly over-provisioned (i.e. wasting money! 💰) or without the headroom they wanted, with alerts firing when the bursts do arrive 🚨.

Most autoscalers operate on what we’ve come to call “reactive metrics”. These reactive metrics are excellent. They’ve always been excellent. When you’re using an autoscaler that’s watching them and reacts quickly (like Judoscale!), reactive metrics are absolutely the right answer for 90+% of applications. That said, if you’re in the other 10% (specifically looking for autoscaling-with-headroom), reactive metrics aren’t the right tool. If you need to maintain a proportional overhead of capacity relative to your scale as it changes, you’ll need Judoscale’s custom Utilization-based autoscaling. It allows you to say “Keep me at about 70% capacity utilization so that I always have 30% overhead”.
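For intuition, the arithmetic behind a utilization target looks something like the sketch below. This is a deliberately simplified illustration (not Judoscale’s actual algorithm, and the function name is made up): you scale the fleet proportionally so measured utilization lands near your target, which preserves the same percentage of headroom no matter how big the fleet gets.

```ruby
# Simplified sketch of utilization-targeted scaling (not Judoscale's
# actual implementation). `current_utilization` is the fraction of the
# fleet's capacity currently in use, e.g. 0.85 for 85%.

def desired_dynos(current_dynos, current_utilization, target_utilization: 0.70, min: 1, max: 20)
  # Scale capacity proportionally so utilization lands near the target,
  # which preserves a (1 - target) slice of headroom at any fleet size.
  raw = current_dynos * (current_utilization / target_utilization)
  raw.ceil.clamp(min, max)
end

desired_dynos(10, 0.85) # => 13 (85% busy on 10 dynos needs ~12.1 dynos to sit at 70%)
desired_dynos(10, 0.40) # => 6  (plenty of headroom; drift back down)
```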

The takeaway: to our knowledge, Judoscale is the only autoscaler out there that offers true proactive, headroom-prioritized autoscaling via custom “Utilization” measurement 24/7. If you need that kind of behavior, get Judoscale installed and activated. You’ll be surprised at how useful extra headroom can be in high-burst loads!

An illustrated diagram showing a statically-scaled app failing to have capacity to handle requests as traffic load rises

3) Queue time ranges for healthy apps are lower than we expected

After several years of watching both customer queue-time data and our own, we decided to lower the default queue time threshold for new apps on Judoscale. While that change could be worth an article of its own, suffice it to say that our realization was based on the stability of shared and dedicated hardware. Queue time thresholds for dedicated hardware (think Heroku’s Perf- series and/or Fir platform) can be very low. As in, “scale up if queue time hits 5ms”. And that’s ultimately a reflection of a very stable and operational stack — a request hitting Heroku, getting routed, and hitting your dyno consistently and predictably. The moment that dyno begins queuing requests for more than just a few milliseconds, we can be sure there’s a capacity problem.
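(If you’re wondering how queue time gets measured at that precision: it’s the gap between the router timestamping the request and your app process picking it up. Here’s a minimal Rack-style sketch, assuming Heroku’s X-Request-Start header carrying milliseconds since the epoch; the Judoscale adapters do this measurement for you, so treat this purely as an illustration.)

```ruby
# Minimal sketch of measuring request queue time from the router timestamp.
# Assumes the platform sets X-Request-Start as milliseconds since the epoch
# (Heroku does); other setups may use a "t=" prefix or seconds instead.

class QueueTimeMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    if (start = env["HTTP_X_REQUEST_START"])
      router_ms = start.delete("^0-9").to_i             # tolerate a "t=" prefix
      queue_time_ms = (Time.now.to_f * 1000).to_i - router_ms
      # Report this sample to your metrics pipeline; an autoscaler watching
      # it scales up when the number crosses your threshold (e.g. 5ms).
      env["app.queue_time_ms"] = [queue_time_ms, 0].max
    end
    @app.call(env)
  end
end
```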

Our realization came regarding shared hardware (Heroku’s Std- series). We’d long considered a higher default queue time for those tiers since shared hardware can experience blips of queueing even though you don’t have a capacity issue (yet) — those are the moments your neighbor’s code is running instead of yours.

What we found is that shared hardware can actually operate on the same degree of stability and low queue time threshold as long as you’re carefully and deliberately staying within the bounds of your own “Dyno Load” or overall “shared slice” lane. We’ll write about this more in the coming months, but here’s the takeaway: keep an eye on, and be careful about, how much of your slice of the shared hardware you’re really using. On Heroku, this means being careful about that “Dyno Load” metric. If you consistently push for more Dyno Load than you’re actually allocated, you’re going to have a bad time.

4) Scaling to zero is a super‑power (for workers)

Event‑driven workers don’t need to idle. If there’s no work to do, you shouldn’t be paying for workers. Sure, keep at least one of your low-latency queue workers around all the time — we do too. But if you’re following our “Opinionated Guide to Planning Your Sidekiq Queues”, you should allow both your less_than_five_minutes and especially your less_than_five_hours queues to scale to zero. Cold-start times once jobs actually hit those queues are around 30-45 seconds (YMMV), so you’ll be well within your queue time SLA… while saving free money 💰.
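As a rough illustration of the policy (hypothetical code, not Judoscale’s internals; the inputs are assumed to come from your queue metrics): anything waiting means at least one worker, and a stretch of genuine silence means zero.

```ruby
# Hypothetical sketch of a scale-to-zero policy for a background queue.
# `queue_latency_s` is how long the oldest job has been waiting (0 when empty)
# and `idle_for_s` is how long the queue has had no jobs at all.

def worker_count(current, queue_latency_s:, idle_for_s:, sla_s: 300, max: 10)
  if queue_latency_s.positive?
    # Work is waiting: make sure at least one worker exists, and add another
    # as latency approaches the queue's SLA (e.g. "less_than_five_minutes").
    [[current, 1].max + (queue_latency_s > sla_s / 2 ? 1 : 0), max].min
  elsif idle_for_s > 600
    0 # ten quiet minutes: stop paying for idle workers
  else
    current
  end
end

worker_count(0, queue_latency_s: 12, idle_for_s: 0)   # => 1 (a ~30-45s cold start keeps the SLA)
worker_count(2, queue_latency_s: 0,  idle_for_s: 900) # => 0
```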

The takeaway: scaling background job workers to zero when there’s no work in the queue is free cash in your pocket. Set up your Judoscale schedule and scaling range to allow for zero-scale and enjoy the free money you save 😎. Then, while you kick back, give our Sidekiq Queues Guide a read!

Background job dynos scaling up as more background jobs are kicked off

5) Y’all have too many job queues

We’ve seen it all: the “one queue per job class” approach, the “every feature gets its own queue” approach, and, of course, the “we’re just using this queue one time and we’ll clean it up right after” approach. All roads lead to more queues. The thing is, lots of queues create real issues:

  • More queues means more queues to watch, and that means lots of polling and overhead from your job system. Sidekiq’s own docs stress that they “don’t recommend having more than a handful of queues per Sidekiq process” and that:

    “Lots of queues makes for a more complex system and Sidekiq Pro cannot reliably handle multiple queues without polling… slamming Redis.”

  • Long term maintenance of jobs, orchestration, and priorities is terrible when you have that many queues to think about. It’s more than you can comfortably hold in your head.

  • Setting up autoscaling policies across 10, 15, or even 20+ queues is a massive headache. Trying to keep them all reasonably in-sync is worse. It’s not worth it.

You end up here:

A visual depicting lots of queues leading to worker processes that are on fire

Trust us, you don’t want to end up there.

Simply put, if you’ve got more than about five queues, you’re probably headed in the wrong direction. We recommend running only three, naming them based on an expected queue-time SLA, and then setting up your autoscaling for each queue to reflect that SLA. It’s a beautiful, job-agnostic way of handling queues!
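In Sidekiq terms that can be as simple as the sketch below: three SLA-named queues, with each job declaring which latency promise it lives under (Sidekiq 6.3+’s Sidekiq::Job include). Two of the queue names come straight from this article; the low-latency name and the job classes themselves are just placeholders.

```ruby
# Three SLA-named queues; each job opts into the latency promise it needs.
class SendPasswordResetJob
  include Sidekiq::Job
  sidekiq_options queue: :less_than_ten_seconds # placeholder name for the low-latency queue

  def perform(user_id)
    # ...the user is actively waiting on this one
  end
end

class SyncCrmContactJob
  include Sidekiq::Job
  sidekiq_options queue: :less_than_five_minutes

  def perform(contact_id)
    # ...soon-ish is fine
  end
end

class NightlyReportJob
  include Sidekiq::Job
  sidekiq_options queue: :less_than_five_hours

  def perform(report_date)
    # ...runs whenever capacity frees up; this queue can scale to zero
  end
end
```

Autoscaling then maps one-to-one: each queue gets a queue-time threshold that matches its name, and its workers scale (or zero-scale) independently.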

The takeaway: read this guide and audit your background job queues. If you’ve got more than five, determine why and what should be merged. KISS!

6) On the web side, downscale by one (almost always)

Several of our larger-app customers have run into interesting headaches and unforeseen issues by downscaling by more than one service at a time. Judoscale’s highly configurable nature does allow you to do this, but we’ve learned over time that you probably shouldn’t.

The reasoning is simple: cost vs. benefit. The benefit, in short, is that you’ll shave a little off your bill. If you’re going to shed some dynos anyway, doing it in bigger steps means slightly larger savings accumulated by the end of the month! ‘Slightly’ is the key word there. The cost, however, can be less pleasant. You can end up downscaling too far! At that point your users could experience slowness, you could trigger alerts, and you’ll inevitably upscale again soon. There’s just no need for all the thrashing.

The takeaway: downscaling by more than one at-a-time, for web services, isn’t worth the marginal gains. The risk of downscaling too far is real! Stick to downscaling by just one-at-a-time.

Our next step: we actually feel pretty compelled to remove this option altogether in the future. It’s exceedingly rare that downscaling by more than one service at a time is the right move. TBD, but this lever may disappear!

7) Understand intra-dyno concurrency

The short version of this story comes down to the interplay between request routing and how requests are handled within a single service. Many PaaS’s use simple random-based request routing: a new request can go to any active service/instance in the cluster. It’s not intelligent or load-based. So a single service could receive multiple requests in a row, even while it’s still processing the previous one, while another service may get none!

It’s important, then, that each single service is able to handle multiple requests concurrently. Otherwise those randomly-routed new requests will be queued and you’ll have a consistently higher, but sporadic, average queue time for your app. 😬

In each of the runtime languages we support autoscaling for (Ruby, Python, and Node.js), a single web process can’t execute two requests in true parallel; it can only interleave their execution (think GVL/GIL or the event loop). That’s a mouthful, but we recently wrote a post that walks through that idea with great diagrams; give it a read here! The key is that while a single process isn’t capable of true parallelism, running multiple processes within your service does yield intra-service concurrency (the ability for a single service to handle more than one request in true parallel).

The takeaway: wherever possible, run more than one process per web service instance (“dyno”, “service”, etc.). This is a particular challenge on Std-1x-style, single-CPU-core service tiers. But, if all other variables are held constant, it’s better to run a single service with two processes than two services that each have a single process!
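For a Rails app running Puma, that usually comes down to the workers setting in config/puma.rb. The sketch below assumes you have the memory headroom for two processes per dyno; the environment variable names are the common Heroku/Rails conventions, not anything Judoscale-specific.

```ruby
# config/puma.rb -- a typical multi-process setup. Two worker processes per
# dyno gives you intra-dyno concurrency even though each individual process
# can only interleave requests.

workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads threads_count, threads_count

# Load the app before forking so workers share memory via copy-on-write.
preload_app!

port ENV.fetch("PORT", 3000)
environment ENV.fetch("RACK_ENV", "production")
```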

8) Scheduling: boring, obvious, and wildly effective

Scheduled scaling is the easiest money‑saver with the least risk. If your traffic has a weekly rhythm, tell Judoscale about it. Keep your minimum scale higher on weekday business hours, then drop it during nights and weekends (for example). You can even leverage a tighter schedule to pre-scale up before large releases, big events, and other known spikes!

You can even get the best of both worlds by running a schedule and autoscaling together. Instead of scheduling hard-locked service counts, you can schedule the range of scale your autoscaler is allowed to work within at a given time. That gives you the flexibility to respond to unknown load changes while still controlling baselines in accordance with known load changes! 🎉
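Conceptually, the combination works like the sketch below (the schedule format and helpers here are hypothetical; in practice you configure the ranges in Judoscale’s UI): the schedule picks the allowed min/max for the current time, and the reactive autoscaler’s decision gets clamped into that range.

```ruby
# Hypothetical sketch of combining a schedule with reactive autoscaling.
# The schedule sets the floor/ceiling; the autoscaler moves within it.

SCHEDULE = [
  { days: 1..5, hours: 8..18, range: 4..20 }, # weekday business hours
  { days: 0..6, hours: 0..23, range: 1..10 }  # everything else (fallback)
].freeze

def scheduled_range(time = Time.now)
  SCHEDULE.find { |slot| slot[:days].cover?(time.wday) && slot[:hours].cover?(time.hour) }[:range]
end

def final_scale(autoscaler_decision, time = Time.now)
  range = scheduled_range(time)
  autoscaler_decision.clamp(range.min, range.max)
end

final_scale(2,  Time.new(2025, 9, 16, 10)) # => 4  (quiet Tuesday morning, but hold the weekday floor)
final_scale(14, Time.new(2025, 9, 21, 3))  # => 10 (a 3am Sunday burst still scales, up to the off-hours cap)
```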

Of course, if you choose to not build a schedule, autoscaling itself will ensure that your capacity grows and shrinks to meet need. But depending on how sharp your traffic changes can be at known times, you might either end up with: 1) a few minutes of slow service as autoscaling ramps up your capacity amidst a large traffic burst, or 2) wasted dollars as your baseline (minimum) service/dyno count stays higher than it needs to be on off hours!

The takeaway: take a couple of minutes to think about your app’s weekly traffic patterns by day (even by hours!). Also consider any weekly rhythms your app has in terms of events, releases, or other in-domain things which drive influxes of users! Then bake all of those into a dynamic autoscaling schedule with Judoscale ✨

A screenshot of the Judoscale configuration UI showing an example schedule operating on a live application, dynamically shifting the scale range depending on the time of day.

Wrapping up

If there’s a single throughline in all of this, it’s that autoscaling is a feedback loop living in a noisy world. The mechanics aren’t mystical: measure what users feel (queue time), add capacity when you’re near the edge, and give your system enough time to absorb changes before you decide again. Do that consistently and the chaos of shared hardware, random routing, and spiky traffic stops feeling like chaos.

Choose boring on purpose. Keep your queues few and meaningful. Let workers scale to zero when there’s nothing to do. Downscale web by one so you don’t saw off the branch you’re sitting on. Run real in‑dyno concurrency so random routing doesn’t turn little bursts into instant queue time. And if a single dyno is having a uniquely bad day, assume a noisy neighbor before you assume a rewrite (and let Dyno Sniper handle the whack‑a‑mole so your team doesn’t have to).

That’s it. No silver bullets—just a handful of defaults that make your platform feel calmer and your costs feel saner. After nearly a decade of watching this stuff in the wild, the “secret” is that stability isn’t a hero move; it’s a series of small, boring, repeatable decisions. Make those decisions once, encode them in Judoscale, and let the loop hum 🔁