How Judoscale's Utilization-Based Autoscaling Works

Jon Sully

@jon-sully

A few weeks ago we announced that we’d been cooking up a brand new way of autoscaling — a proactive scaling approach based on maintaining a preferred level of server ‘utilization’. That announcement and post served to explain how this new form of scaling works at a high level and what you should expect from it, but we’d like to pull back the curtain and give you some insight into the gritty details. What is “utilization”? What does it have to do with computers and servers? How are processes and threads involved here? Strap on your deep-dive snorkel, we’re going in!

On ‘Utilization’

One of the most difficult components of software development is naming. Determining the best vernacular term to capture a programmatic complexity is no small task. Names themselves carry meaning! “Utilization” is no different. But the term has been used for several different well-known metrics and concepts throughout the software universe… so let’s zoom way out.

Let’s start with the plain-English definition of the word. According to Dictionary.com:

A screenshot from Dictionary.com’s definition of the word ‘utilization’, reading: “An act or instance of making practical or profitable use of something”

That is, to utilize is simply to use. Adding the -ation suffix, to us, means ‘how much we use’. So, in layman’s terms, utilization is actually simple: how much of a thing you’re using. It’s a fairly universal term, requiring only that the ‘thing’ have some upper bound — there is no “how much” of infinity! As long as you have a clear quantity of a thing, you can reason about your utilization of it.

We’re on the verge of over-explaining something that’s probably quite intuitive, but let’s take a moment to talk through just a couple of real world examples to set the stage for digital examples. First, a refrigerator! It’s easy to think about how much of the space inside we’re utilizing. But it’s only easy because the upper bound of that space is extremely apparent — it’s a large physical box in front of us that we can see and touch. It’s clear how much of that box we’re using at any given time. So, just after a grocery-store run, we might be utilizing 80-90% of our refrigerator space. After a week or two, maybe that’s down to 30% or 40%! In fact, it’s so intuitive for us to reason about utilization when the quantity is physical space, that many actually take a quick look into their refrigerators to determine when to go to the store in the first place! Now that’s autoscaling 🥗!

An AI-generated graphic in the form of a pencil-sketch drawing depicting a refrigerator with an open door and an “80%” on the door; essentially showing an 80%-full fridge

On the other hand, it’s a little less intuitive for us to reason about a quantity that’s not physically backed. Think about a car’s horsepower, for example. Maybe you drive a vehicle with 300 horsepower. Are you utilizing it? What’s your average utilization of all that power? And is that good or bad?

It’s a little trickier to answer! Your car’s current horsepower utilization is (hopefully) zero as you read this article. You’re not driving and your car is off, right? But even as we drive around, our utilization constantly varies. We stop, we go, we accelerate and brake, and we cruise on highways that themselves are varied in terrain and path. Our horsepower utilization is constantly changing as we experience each of these things!

But, without diving too deep into cars and engines, a slightly-opinionated take here is that you’re probably not utilizing all 300 horsepower often, if ever. Even as someone who likes to accelerate promptly when the light turns green (🙋🏻‍♂️), it’s exceedingly rare that I accelerate at full pedal-to-the-floor strength. We just don’t use all that power.

A gif of a monster truck jumping several cars while on fire and shooting flames
Too much power.

The more interesting question may be the latter: is having all that horsepower good or bad? The answer is tricky. Having all that extra horsepower available, even if you never use it, has a price. Fuel. More power often means higher fuel consumption, even when you’re not using all of that power 💰. So whether it’s good or bad to drive a high-power vehicle is a personal choice — a balance between long term costs and how often you might really use all that power.

👀 Note

This actually reveals an underlying truth of most things with respect to utilization: maintaining a low level of utilization of something with a large upper bound (like driving a 400 horsepower vehicle but averaging 100 horsepower of use) tends to be less efficient than maintaining a high level of utilization with a lower upper bound (like driving a 150 horsepower vehicle and averaging the same 100 horsepower of use). In this case, that efficiency is felt in fuel consumption. As we’ll see, the utilization/efficiency relationship plays out with servers, too!

Computer Utilization

If we apply our zoomed-out / high-level utilization definition,

how much of a thing you’re using… requiring only that the ‘thing’ have some upper bound

to computers and servers, we find an interesting question. What’s our upper bound? What does “using 100% of your server” mean? There are a few quantifiable pieces of hardware involved in a typical computer / server; is it related to those? Am I utilizing 100% of my server if its RAM is full? If its disk storage is full? If its CPU is constantly busy?

A computer is a complex thing and legitimately answering the question “how much of it am I using?” is actually quite difficult. But determining which quantity to consider, the upper bound of that quantity, and the utilization over time with respect to that upper bound is required if we ever want to proactively scale in any kind of useful way.

A screenshot of a Windows system resources chart showing CPU use at 100%, memory use at 100%, and GPU use at 100%, beside the “Everything’s on fire” meme with “This is fine” text bubble

Traditionally, folks have decided to answer this question based solely on the CPU. We’d call this “CPU utilization”. That is, the average amount that the CPU is in use represents the overall utilization level of your server. If the CPU is, on average, 100% busy, you’re utilizing 100% of that server’s capacity.

The thing is, CPU usage alone doesn’t actually capture a web server’s ability to handle additional requests. A multi-process, multi-threaded web server can easily be fully saturated (unable to handle any additional requests) while its CPU usage is low.

We discuss this quite a bit further in “Understanding Queue Time”, but the reality is that most web requests aren’t CPU-heavy, they’re I/O heavy. The CPU ends up waiting for database responses, cache responses, and third-party HTTP responses for far longer than it spends rendering a view. As such, a web server running 10 total threads could be fully saturated working on 10 responses, all of which are doing some form of I/O while the CPU sleeps.
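To make that concrete, here’s a minimal, contrived Ruby sketch (not Judoscale code, just an illustration): ten threads stand in for ten in-flight requests, each “waiting on I/O” via sleep, which releases the GVL. Every thread is occupied, yet the CPU is essentially idle.

```ruby
# Ten "request handler" threads, all blocked on simulated I/O.
# sleep releases Ruby's GVL, so the CPU does almost nothing here,
# yet there's no free thread left to accept an eleventh request.
THREAD_COUNT = 10

handlers = THREAD_COUNT.times.map do
  Thread.new do
    sleep 2 # stand-in for a database, cache, or third-party HTTP call
  end
end

sleep 0.1 # give the threads a moment to start
busy = handlers.count(&:alive?)
puts "#{busy}/#{THREAD_COUNT} handler threads busy; CPU nearly idle"

handlers.each(&:join)
```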

That situation would be a metric mis-match. Your CPU usage is reading low, but your actual, general “utilization” should be feeling high — you can’t accept any more requests! That’s not great. So, tl;dr: CPU utilization isn’t a good metric for measuring web servers’ actual utilization.

To really determine the utilization of a server, we need an upper bound / capacity measurement that takes into account the reality of how multi-process / multi-thread web-servers work. We can’t rely simply on CPU usage. If the end-goal of our web servers is to handle requests and we know ahead-of-time how many requests a server can handle in parallel (based on configuration), can’t we build a utilization value based on how many requests the server is handling in a given moment?

That’s exactly what we did.

Processes and Threads

The idea started fairly simple: the utilization metric of a web server should be based on how many requests it’s currently handling vs. how many requests it can handle at maximum. But for most languages and frameworks, the maximum number of requests a server can handle (concurrently) at any given moment is a product of several factors and features! Most notably, process counts and thread counts.

When it comes to asynchronicity, true concurrency, and how both of those topics relate to running multiple processes and multiple threads per process, things get a little tricky. So let’s start with a concrete example.

Let’s theorize that we have a Rails server, running Puma, with two processes (Puma workers) and three threads each. That means we have six total threads by which we can handle requests. Does that mean our utilization at any given moment is simply the number of requests currently being handled divided by six?
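(For reference, that hypothetical setup corresponds to a Puma config roughly like the sketch below; real apps usually pull these numbers from environment variables.)

```ruby
# config/puma.rb (sketch of the setup described above)
workers 2      # two Puma worker processes
threads 3, 3   # three threads per process (min, max)
# => 2 processes x 3 threads = 6 requests handled concurrently, at most
```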

The intuition is tempting and understandable — even reasonable in another language or runtime — but Ruby’s old friend the GVL would have other things to say here. We cover this topic in much more depth in “Why Did Rails’ Puma Config Change?!” (an important read we highly recommend!!) but the short story is that Ruby can actually only process one stream of execution at a time.

This isn’t a new concept for most Rubyists, but its implications for utilization are non-trivial. You might have three threads per process, but that doesn’t mean you get three independent lanes of execution. If all three are working on active Ruby code, only one of them is actually making forward progress.

So when we talk about threads in Ruby web servers, we’re really talking about potential concurrency — most notably during I/O. When a request is blocked waiting for a database or Redis or HTTP response, the GVL is released and another thread can take the stage. This is why Puma threading works at all. But once multiple threads are simultaneously trying to do actual Ruby work (rendering views, running logic), they’re back to taking turns. It’s interleaved execution, not true parallelism.
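If you’d like to see that for yourself on MRI, here’s a quick (admittedly contrived) benchmark sketch: sleeping threads overlap because sleep releases the GVL, while CPU-bound threads take turns and gain almost nothing.

```ruby
require "benchmark"

# Two "I/O-bound" threads: sleep releases the GVL, so they overlap
# and finish in roughly 1 second total, not 2.
io_seconds = Benchmark.realtime do
  2.times.map { Thread.new { sleep 1 } }.each(&:join)
end

# Two CPU-bound threads: pure Ruby work holds the GVL, so the threads
# take turns and the total is close to running them back to back.
cpu_seconds = Benchmark.realtime do
  2.times.map { Thread.new { 5_000_000.times { Math.sqrt(1234.5678) } } }.each(&:join)
end

puts format("I/O-bound threads: %.2fs (overlapped)", io_seconds)
puts format("CPU-bound threads: %.2fs (took turns)", cpu_seconds)
```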

This gets thorny when you try to reason about utilization in terms of threads. If a process has three threads and one request is active, you might say, “That’s 33% utilized.” But what about when a second request comes in? Is it really just 66% utilized, even though both requests are now competing for that single GVL? Or if both are waiting on I/O, are we back to 0% CPU usage? Is that a utilization valley or a peak? Oof. 😅

The reality is that these micro-dynamics produce very noisy data and the results aren’t reliable. Every app’s workload and specific I/O constraints are different, so every app’s instant thread-saturation and overall response-time impacts are also different. This avenue wasn’t going to work.

So we stepped back and asked: what are we really trying to measure?

… Just Processes

We realized the most useful mental model was actually at the process level. If a process is actively handling a request, it’s working. If it isn’t, it’s idle. In this model, the unit of capacity is the number of processes.

Let’s say you have 4 Puma workers (4 processes). If 3 of them are currently handling at least one request, we say you’re at 75% utilization. Even if each worker could be handling more than one request thanks to threading, we don’t assume that. We take a conservative stance: one active request per process equals 100% utilization.

That means our utilization calculation becomes:

Utilization = (number of processes actively handling at least one request) / (total number of processes)
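In code, the calculation is about as simple as it reads. Here’s a tiny sketch, assuming we already know which worker processes are currently handling at least one request (the busy/idle flags below are hypothetical stand-ins for what the adapter actually reports):

```ruby
# Utilization = busy processes / total processes.
# `process_busy_flags` is a hypothetical array of booleans, one per
# worker process, true if that worker is handling at least one request.
def utilization(process_busy_flags)
  return 0.0 if process_busy_flags.empty?

  process_busy_flags.count(true).to_f / process_busy_flags.size
end

utilization([true, true, true, false]) # => 0.75, i.e. 75% utilization
```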

Why be conservative? Because when you breach that first-request-per-process threshold, performance almost always degrades. That second request, even if it finds an open thread, is going to fight the first one for time on the CPU. If you’re using Ruby, you already know this. Things get slower.

Referring again to “Why Did Rails’ Puma Config Change?!”, a key takeaway there was that running fewer threads per process often improved overall response time. It turns out that giving each request its own dedicated CPU time (via separate processes) produces better performance than packing multiple requests into a single process and letting them duke it out.

Real-World Modeling

So when we built our proactive autoscaling engine, we leaned hard into this approach. We count utilization as the percentage of processes currently handling at least one request. If you’ve got 4 processes and all of them are working, that’s 100% utilization.

Some might ask, “But what if those processes are all handling requests efficiently? Don’t you have room for more?”

Maybe. But the risk of degraded performance is real, and we don’t want to rely on luck or hand-wavey assumptions about GVL behavior. By modeling this way, we err on the side of performance.

And that ends up aligning well with real-world application behavior. Servers at 90-100% utilization under this metric tend to have longer response times. Servers sitting around 60-70% are comfortable. The metric, while simple, maps nicely to the thing we care about most: how it feels to run your app.

We wanted to build an autoscaling system that understands GVL quirks and single-threaded-ness — one that just works! An autoscaling protocol that’s highly effective but simple enough to understand. We believe process-busyness does exactly that.

What About Queue Time?

We want to be up front here — utilization-based autoscaling as we’ve designed it is not intended to be an improvement on queue-time based autoscaling! Queue-time based autoscaling is actually, technically, better in most cases. Queue time remains the best indicator of scale health as it pertains to requests already in-flight: if requests are queuing, you need more capacity! If they’re not, you’re probably fine.

The short answer is that queue time and utilization are a bit like apples and oranges. They’re just different! The long answer is that we wrote a different post specifically comparing the two and when one might be better for you than the other: Autoscaling: Proactive vs. Reactive. We recommend giving that one a read if utilization-based autoscaling sounds interesting to you!

Back to the Car

Earlier, we noted the fuel cost of driving a vehicle with lots of horsepower, even if you never use that additional power. Let’s bring it back full circle.

Autoscaling in this model is like driving a car that can swap out its engine on the fly. But it’s not just about scaling up and down for fun. It’s about keeping a steady level of headroom — enough engine to cruise efficiently, but always with the extra torque available to pass that slow-moving truck in front of you without redlining or lagging.

A black-and-white cartoon shows a happy car driving down a highway while arrows illustrate it swapping between two anthropomorphized engines — a smaller one and a larger one — symbolizing dynamic autoscaling of server capacity on the fly.

You don’t want to burn extra fuel hauling around an overpowered V8 if you’re mostly commuting in the right lane. But you do want to know that, the moment you tap the gas to merge or accelerate, there’s enough engine under the hood to make it effortless.

That’s what Judoscale’s utilization-based autoscaling does for your application. It watches how many of your server processes are active, keeps you in that sweet middle lane of efficient cruising, and ensures you always have just enough overhead to handle sudden bursts without stalling out.

It’s like cruise control that adjusts the size of your engine instead of just your speed — one that knows the terrain ahead, anticipates the climb, and keeps your drive smooth, stable, and efficient.