Autoscaling: Proactive vs. Reactive


Jon Sully

@jon-sully

From the beginning, Judoscale has been focused on providing the fastest, most reliable queue-time-based autoscaling on the market. We believe that queue time is the metric that matters most for real applications out in the wild, and that an autoscaler ought to be extremely responsive to queue time metrics as they arrive (most Judoscale applications scale up within 10 seconds of a queue time spike!). Our pitch and goal have remained more or less constant since we began: queue-time metrics, scaled fast!

But what if queue time isn't always the best answer? 👀 What if there's another metric or style of autoscaling that's more effective in certain cases and applications? TL;DR: there is, we found it, and we built it. 🎉

When Queue Time Isn’t Best

We’ve talked (at length) about our love for queue time and how, as Nate Berkopec puts it, “Friends don’t let friends autoscale… based on CPU utilization. Or response time. Or requests per second.” But there’s some nuance to queue-time based autoscaling that’s worth exploring.

A generated sketch graphic showing a request, anthropomorphized as a sheet of paper sitting on a desk chair, waiting for a server, anthropomorphized as a file cabinet both holding a phone and feverishly typing on a keyboard at a desk, to be less busy so the request can be handled

At its most fundamental, the queue time metric measures how long requests wait before your servers can handle them. As it so happens, this metric is directly tied to server capacity: if you want your queue time to go up, remove some of your server capacity. Fewer servers handling the same number of requests yields a higher queue time (and eventually that "everything's on fire" fun). Conversely, if you want your queue time to go down, add more server capacity! Your queue time will eventually bottom out as every incoming request is routed to an available server in real time. Nice!

The beauty of the relationship between queue time and server capacity is its simplicity and reliability. The two actions we just described ("scaling down" and "scaling up") reliably produce the two effects (higher or lower queue time) every single time. It's like gravity. (Maybe we should call it the "law of scaling"?) Anyway, it's helpful because it allows us to abstract away some complexity. We can confidently define a simple autoscaling premise: when queue time rises, add more servers/dynos. When queue time is suitably low, remove some servers/dynos.

And that’s Judoscale’s service in a nutshell.
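
If you like thinking in code, here's a deliberately simplified sketch of that premise. The thresholds, step sizes, and dyno limits below are made up purely for illustration; they aren't Judoscale's actual implementation or defaults.

```python
# Simplified illustration of the queue-time autoscaling premise above.
# All numbers here are hypothetical, not Judoscale's real thresholds.

UPSCALE_THRESHOLD_MS = 75    # queue time above this: requests are waiting, add capacity
DOWNSCALE_THRESHOLD_MS = 10  # queue time below this: plenty of capacity, shed some

def desired_dyno_count(current_dynos: int, queue_time_ms: float,
                       min_dynos: int = 2, max_dynos: int = 20) -> int:
    """Return the dyno count we'd want given the latest queue-time reading."""
    if queue_time_ms > UPSCALE_THRESHOLD_MS:
        return min(current_dynos + 1, max_dynos)  # queue time rising: scale up
    if queue_time_ms < DOWNSCALE_THRESHOLD_MS:
        return max(current_dynos - 1, min_dynos)  # queue time bottomed out: scale down
    return current_dynos                          # comfortable zone: hold steady
```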

We built Judoscale to be the most responsive and dynamic autoscaler based on queue time metrics. Judoscale leads the industry in speed-to-scaling and transparent, real-time data aggregation. This style of queue-time driven autoscaling, when dialed in, gets pretty close to the (impossible) perfect case of only ever using exactly as much server capacity as your traffic demands: your scale is always enough to handle your traffic load, but not so much that you're burning cash. You want to "hug the line", as it were:

A chart showing the relationship between traffic loads (as a line) with server scaling levels as a bar chart behind that line

👀 Note

In the case above, that means you want an autoscaler that automatically adjusts your scale level to match the green bars… following the traffic levels as much as possible while staying above them rather than below. The red bars represent scale capacity that would be wasted since you’d have more capacity than you need for the given traffic at that moment.

And that’s what queue time, as a metric, allows us to do: follow the traffic load curve efficiently so that we’re always at the right level of scale for our current traffic levels. Queue time is efficient, correct autoscaling.

But what if you don’t want efficient autoscaling?

For the record, most teams and applications do. Efficient autoscaling optimizes cost according to traffic and allows you to only pay for what you use. That’s great! But it’s not actually every company’s goal. We’ve heard from several teams over the years that they need something else from an autoscaler. That they’re not optimizing for cost. That the efficiency of their autoscaling isn’t actually their goal at all.

When The Requests Come All At Once

The hallmark attribute of the teams and applications that have asked for an alternative style of autoscaling is their traffic patterns. They nearly all receive large waves of requests all at once.

A generated sketch graphic showing lots of web requests, shown as pieces of paper, in a giant tidal wave headed directly for two web servers that have frowny faces on them

This is where we run into one of the realities of queue-time based autoscaling: it's 'correct' and efficient in keeping scale matched to current traffic loads, but it's a reactive metric. Queue time spikes only as (or after) requests wait for capacity, and even the fastest autoscaling tool can only get a new server online 30-45 seconds after queue time begins rising. That's totally fine for normal applications where traffic loads jump around within 10-30% of their typical RPS. But if your traffic looks like this, it might be a different story:

A chart graphic illustrating traffic levels over time where there’s a huge, nearly vertical spike of traffic all at once

Applications that regularly expect this kind of traffic see a massive queue time spike:

A screenshot chunk showing a queue time chart where queue time spiked from 1-2ms up to 2600ms as a large wave of traffic came in all at once

And while the aforementioned 'law of scaling' remains reliable (scale up and that queue time will come back down), our queue time reading at the moment of the slowdown (~2600ms) is missing one detail: how much to scale up by.

Queue time is a fantastic metric for declaring that you should scale up. For most apps, as soon as queue time surpasses around 50ms or 75ms, we essentially know that it's time to scale up. But queue time doesn't tell you how much to scale up by; it only tells you that you should, because requests have started queueing/waiting. And given the complexity of the modern application, there's really no reliable way to guess how much to scale up by from the queue time value alone. So we (Judoscale) let you configure how many servers/dynos to scale up by as a setting in our Web UI:

Small screenshot of the “Upscale Jumps” setting from Judoscale, where you can configure how many dynos/servers get added per upscale
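
To make that limitation concrete, here's a hypothetical sketch of a fixed "upscale jump" in code. Whether queue time reads 80ms or 2600ms, the response is the same pre-configured step, because the size of the spike doesn't tell us how much capacity is actually missing. (Again, the numbers are illustrative, not our real implementation.)

```python
UPSCALE_JUMP = 2  # e.g. "add 2 dynos per upscale event", configured in the UI

def on_queue_time_reading(current_dynos: int, queue_time_ms: float,
                          threshold_ms: float = 75) -> int:
    """A fixed-step upscale: the spike's magnitude never factors in."""
    if queue_time_ms > threshold_ms:
        return current_dynos + UPSCALE_JUMP
    return current_dynos

# If a wave of traffic really needs ~10 extra dynos, a 2-at-a-time jump takes
# roughly five upscale cycles to close the gap, and each new server needs
# 30-45 seconds to come online.
```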

But even that’s just a best-effort to capture how much extra capacity your app needs during most spikes. For applications that receive massive waves of traffic all at once, queue-time autoscaling can often look like this:

A chart graphic illustrating traffic levels versus server scale as a large traffic wave hits, indicating that as autoscaling upscales 2-at-a-time, there's still a large capacity gap until those steps finally add enough capacity to meet the spike's traffic level

That is, where the traffic spike is large enough that the upscale jumps still don't add enough capacity in time to handle all the requests. The gap between those bars and the line is ultimately failed requests 😔… and probably alerts and error messages for your team ("Everything's on fire!!").

A chart graphic illustrating traffic levels versus server scale as a large traffic wave hits, indicating the gap between scale (even as it autoscales up) and the traffic line, with red skull emojis in the gap (representing failed requests)

You don’t want that.

But that's one of the realities of queue-time based autoscaling for apps that receive very large waves of traffic all at once. Since the value of the queue time spike can't hint at how much to upscale by, we're beholden to static upscaling steps, which may not accommodate the traffic spike as fast as we'd like. We're autoscaling reactively as fast as possible, but it's not perfect.

🚨 Warning

To be clear: this behavior is not representative of most applications. We're talking about traffic spikes of ~2x normal traffic loads (or more) hitting within less than a minute, when 'normal' is already close to or beyond triple-digit RPS. If this behavior doesn't match your application or team, we still recommend sticking with pure queue-time based autoscaling!

Of course, if you just prefer a simpler autoscaling paradigm and recognize that it’s not as efficient as pure queue-time, that’s fine too!

Introducing Utilization

So, with all of that preamble and context out of the way, we’re really excited to announce that Judoscale is rolling out a new form of proactive autoscaling: Utilization-based Autoscaling. And the idea here is incredibly intuitive. You simply tell us how much of your server capacity, as a percentage, you’re aiming to saturate at any given time:

I always want about a 40% overhead available for spikes, so let’s target 60% utilization as our norm.

And we continuously autoscale your servers/dynos to ensure that the utilization ratio is maintained!

A generated sketch graphic showing a server with a smiley face and arms giving a thumbs up in front of a paper chart that reads ‘utilization 60%’

This is proactive autoscaling because we can scale up your application before queue time spikes ever happen. Instead of waiting for queue time to spike and things to slow down, we continuously monitor your Utilization percentage 24/7. We can now constantly answer the questions, "How much overhead do I have?" and "How much of my hosting capacity am I actually using right now?"

👀 Note

The "utilization" nomenclature has long been associated with CPU utilization, but Judoscale's utilization-based autoscaling is not that. Friends don't let friends scale based on CPU metrics, remember!? Our utilization metric is based on process/thread saturation, and the details aren't something we'll cover in this post. Instead, the goal is to provide a stable foundation for talking about "utilization" in terms as simple as, "we're utilizing 40% of our capacity right now". We want to simplify the thought process around capacity and headroom by referring to utilization as 'how much of my capacity is actually being used'.
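
As a rough mental model only (the real metric is based on process/thread saturation, and we're not covering its exact formula here):

```python
# Hypothetical illustration, not Judoscale's actual calculation: if your web
# processes expose 50 request-handling threads in total and 20 are busy right
# now, you're "utilizing" 40% of your capacity.
busy_threads = 20
total_threads = 50
utilization = busy_threads / total_threads  # 0.40, i.e. "40% utilization"
```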

For instance, your application might slowly increase from utilizing 68% of its capacity to 79% as your daily traffic begins to come in each morning. In the queue-time world, that probably wouldn’t trigger a queue time spike — 79% load is reasonable! But if you know that your application experiences sharp load spikes throughout the day, 79% doesn’t leave you much headroom! This new Utilization metric will allow you to proactively keep your scale high to ensure you have enough headroom. In this case, you’d probably tell Judoscale to keep your Utilization down around 50% or 60%. Or maybe even lower:

A screenshot of the Judoscale UI now showing the utilization metric in action driving autoscaling for an application

The beauty here is that it's still dynamic! 79% utilization means you're using 79% of your current capacity. As Judoscale scales up your application, that percentage falls because you now have more capacity! So when we say "let's target 60% utilization", it really means, "let's dynamically upscale and downscale so that we've always got a large amount of headroom for spikes".

If/when traffic spikes happen, our already-provisioned headroom should be able to handle them. Plus, as that extra headroom absorbs the spike, our Utilization climbs closer to 100%, so Judoscale will upscale to add even more capacity (getting us back down around 60%), just in case. Once the spike is over and utilization falls below 60%, Judoscale will downscale again until we're back at that 60% utilization target.
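
If it helps to see that loop written out, here's a hypothetical sketch of how a utilization target could drive scaling decisions. The proportional formula below is a simplified model we're using for illustration, not necessarily how Judoscale computes it.

```python
import math

TARGET_UTILIZATION = 0.60  # "keep roughly 40% headroom available for spikes"

def desired_dynos(current_dynos: int, current_utilization: float,
                  min_dynos: int = 2, max_dynos: int = 50) -> int:
    """Scale so the current workload would land near the target utilization."""
    # Example: 10 dynos running at 90% utilization, targeting 60%, suggests
    # about 10 * 0.90 / 0.60 = 15 dynos.
    ideal = current_dynos * current_utilization / TARGET_UTILIZATION
    return max(min_dynos, min(max_dynos, math.ceil(ideal)))

# During a spike, utilization climbs toward 100%, desired_dynos() rises, and we
# upscale; once the spike passes and utilization drops below the target,
# desired_dynos() falls and we downscale again.
```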

That’s a whole lot of autoscaling logic simplified into one clear number!

Utilization-based autoscaling is proactive in that it accounts for headroom ahead-of-time and allows you to declare how much headroom you want to maintain. Does your app get massive traffic spikes all at once? Try targeting a very low utilization percentage, like 25%, so that you always have lots of headroom!

Utilization Is Less Efficient

We're thrilled to roll this feature out to all Judoscale subscribers in the coming weeks, but we do want to note the big caveat here. Utilization-based autoscaling is built to let you define how much extra headroom you want your capacity to maintain. Inherently, that means spending more money than your real-time traffic strictly requires; you're spending money to lessen the risk of outages when traffic spikes. That is, spending money to maintain a (configurable) degree of over-provisioning.

For most teams and companies that experience these kinds of spikes at the load levels they operate at, the extra capacity spend is completely worth it. But the reality is that utilization-based autoscaling is less efficient: it won't save you as much money as queue-time based autoscaling. The choice is yours (and you can actually use both in Judoscale; more on that to come), but keep that in mind as you plan out your autoscaling tooling.

A chart graphic illustrating scale levels with built in overhead buffer that covers sharp traffic spikes by already having the capacity provisioned

How Will You Use Utilization?

Utilization-based autoscaling will be coming to all Judoscale customers and plans at no extra charge and with no separate subscription tier. It's just an alternative metric you'll be able to use in your existing Judoscale setup once it's live! So… how will you use utilization-based autoscaling? What percentage are you shooting for? How does your traffic spike!? Let us know!