Maximizing Performance with Judoscale: Target Queue Time Range

We’ve written several articles around application performance, architecture, and efficiency… which is fair — keeping an application running efficiently at a high level requires a lot of knowledge! Many of our previous articles help to prepare someone for why they ought to use autoscaling, what they should look for in an autoscaler, and what autoscaling should really do for you. These are all great things to understand, even if you don’t use Judoscale.

But while reviewing this knowledge base recently, we realized we needed to go deeper into Judoscale itself and cover the specific dials and knobs available to you once you’ve got your application up and running in Judoscale.

So buckle up. We’re going to get nerdy.

Maximizing Performance with Judoscale is a series that covers all of Judoscale’s features and options. Jump to other posts here:

Target Queue Time Range (this page!)
Scheduling your Scaling
Setting Sensitivities

👀 Note

We’re going to use the term “dyno” or “dynos” in most places here; this is Heroku’s word for an application container instance. Render calls these “Services” while Amazon ECS calls them “Tasks” wrapped up under a “Service”. All of the theory in this article applies equally to all three; “dyno” is simply our default phrasing.

Target Queue Time Range

We might as well start with the root of the theory! If you’ve read our guide on Request Queue Time, you’ll know that request queue time is the single most reliable metric for determining whether your application is currently over-scaled, under-scaled, or well-scaled (that is, doing fine). If you haven’t read that guide, take our word for it 😜. You can set up autoscaling based on overall request response time, CPU % saturation, RAM % usage, or other metrics, and that might work in some cases, but those metrics are unreliable signals for scaling and can leave you in an unexpected rut! Stick to queue time!

So, for starters, we plot queue time on a chart over time so you can quickly understand how your dynos are performing. For this entire post we’re only going to consider web dynos, but these principles apply just the same to your workers too — instead of request queue time, it’d be job queue time, but the theory is all exactly the same.

And while I haven’t given any sense of scale to the axes here (yet), we can already begin to make some high level judgements:

In a simplistic sense, when your average request queue time is high, requests are waiting for a dyno to handle them and that is bad. When your average request queue time is low, a dyno was ready to take the request the moment it came in — that is good.
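For a concrete sense of what is being measured: on Heroku, the router stamps each request with an `X-Request-Start` header (a Unix timestamp in milliseconds), and queue time is roughly the gap between that stamp and the moment your app starts handling the request. Here is a minimal sketch of that arithmetic. The helper name is hypothetical, and this is not a description of how Judoscale's own adapters are implemented.

```python
import time


def request_queue_time_ms(x_request_start, now_ms=None):
    """Milliseconds a request waited between the router and the app process.

    Assumes the header value is a Unix timestamp in milliseconds.
    """
    if now_ms is None:
        now_ms = time.time() * 1000
    started_ms = float(x_request_start)
    # Clock skew between router and app host could make this negative,
    # so floor the result at zero.
    return max(0.0, now_ms - started_ms)


print(request_queue_time_ms("1700000000000", now_ms=1700000000012))  # 12.0
```

Other routers and proxies may use different units or formats for this header, so treat the millisecond assumption as specific to this sketch.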

The Upscale Threshold

Let’s try to overlay a threshold line for where request queue time is too high and we should start to scale up:

Annotations showing queue time levels that should cause upscales

Here we’re acknowledging a couple of important truths.

First, that queue time isn’t always going to be zero. In fact, it’ll never be truly zero. Even if you’re way over-provisioned, requests still take time to travel from a load balancer to your server processes — and this is part of queue time. Beyond that, there is an acceptable level of request queue time for every application. And that specific threshold is different per application. Your app’s unique endpoints, resource capacity, resource usage, and many other factors drive this threshold.

Second, we’re acknowledging that some request queue time (micro-) spikes will resolve themselves. This chart annotates the spike on the left and the spike on the right as “Should Upscale” but it doesn’t point to the little spike in the middle. That’s important! Some applications will have spikier queue times than others and not every spike should trigger upscaling.

This really depends on your app’s endpoints and performance. Say your application tends to run only a few dynos/services (1-2) and you know you’ve got a few really slow endpoints (for example, CPU churning through huge CSV file uploads). While a thread churns through that CSV file, it prevents other threads from handling incoming requests, and that’s going to be a queue-time spike! But that doesn’t necessarily mean you should upscale. That’s a natural queue-time spike: “natural” because it’s the natural result of slow code, but also “natural” because it will likely recover on its own. As long as most of your application’s endpoints are reasonably performant and most of your traffic is on those endpoints, it’s likely that your application will recover from those hiccups naturally.

👀 Note

Have you ever taken a sip of water and it “goes down the wrong pipe”? You cough a few times, sure, but you recover after a moment and continue breathing (or drinking). Your dynos/services will do the same thing here.

On the other hand, you can have cascading queue-time increases. These are not natural queue-time spikes — these are where you simply need more capacity! Where natural spikes recover on their own thanks to being temporary and short-lived requests on low-performance endpoints, cascading queue-time increases tend to come from simply taking on more traffic to any / all of your endpoints.

This illustration (below) can help show the difference. At first we see a small queue-time spike — a natural spike that recovers on its own. Scaling up at this little spike would be premature and wasted cost. But soon after we start to see the cascading queue-time growth! This is an exponential curve on the queue-time chart — as traffic increases and current capacity can’t keep up, additional traffic only means additional wait!
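A toy model can make the difference concrete. The sketch below assumes a single worker handling requests first-in, first-out, and uses the standard Lindley recursion: each request's wait is the previous wait plus the previous service time minus the gap between arrivals, floored at zero. All numbers are made up for illustration, and a real app with multiple threads and dynos will behave less tidily.

```python
def queue_waits(service_ms, gap_ms):
    """Wait (ms) of each request at a single FIFO worker.

    service_ms[i] is how long request i takes to handle;
    gap_ms[i] is the time between the arrivals of request i and i + 1.
    """
    waits = [0.0]
    for service, gap in zip(service_ms, gap_ms):
        waits.append(max(0.0, waits[-1] + service - gap))
    return waits


# Natural spike: plenty of capacity, but request 3 is a slow CSV upload.
natural_service = [50.0] * 40
natural_service[3] = 1000.0
natural = queue_waits(natural_service, [100.0] * 39)

# Cascading increase: requests arrive every 40 ms but take 50 ms to handle.
cascading = queue_waits([50.0] * 40, [40.0] * 39)

print(max(natural), natural[-1])  # 900.0 0.0  (spikes, then recovers)
print(cascading[-1])              # 390.0      (still climbing)
```

In the first run the wait jumps after the slow request and then drains back to zero without any scale change. In the second, every new request waits longer than the one before it, and only more capacity (or less traffic) will fix that.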

So we want to carefully place our red “upscale” line above the natural spikes but low enough that upscales will kick in before cascading queue-time gets too bad:

Let’s convert this concept back over to Judoscale and its dials and knobs. In Judoscale we have the Target Queue Time Range sliders and a chart that illustrates this same concept. For now let’s just focus on the top line of the shaded green area. That top line is the Upscale Threshold line, and it’s exactly the same concept we just discussed.

This is the value that we want to tweak so that it’s above the natural queue-time spikes, but low enough that cascading queue-time growth causes upscales. This particular screenshot was taken from one of our applications where we’ve determined 50 milliseconds to be the appropriate threshold that doesn’t upscale for natural spikes but does upscale when queue-time trends upwards overall.

The key phrase there is “we’ve determined”. Every application varies in its efficiency, endpoints, and overall performance, so there’s no magic number here. The right answer is to spend a day sussing out what a good upscale threshold for your application is.

The simplest way to figure out a reasonable upscale threshold for your app is to statically over-provision for a day (by just a bit) and watch your natural spikes. That is, if your application normally scales between 2-3 dynos for the day, turn off autoscaling and set your dyno count to 4. If your service is normally between 20-24 dynos, set your dyno count to 26 or 27. Then watch the queue time chart in Judoscale throughout the day. You shouldn’t see any cascading queue-time increases since you know you’re already over-provisioned and should have enough capacity. You should see natural queue time spikes that resolve themselves automatically. That should look something like this:

👀 Note

If you’re worried about having autoscaling disabled completely, even with a higher-than-usual dyno/process count, you can do this same experiment by:

Leaving autoscaling enabled
Setting your minimum scale count (the lower end of your Scale Range) to the higher-than-usual number mentioned above so that you should already be over-provisioned all day
Setting your upscale threshold to a very high number, like 250ms

This will let you observe the natural spikes for your app while over-provisioned, but will still scale up if things get out of hand. Just make sure to write down your previous scaling settings so that you know where to reset them to once you finish your day-long experiment.
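If it helps to keep track of what you changed, the experiment can be written down as a pair of settings records. This is only a bookkeeping sketch: the `ScalingSettings` class and its field names are hypothetical and are not Judoscale's API. You would still enter these values in the Judoscale UI.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScalingSettings:
    # Hypothetical field names for illustration; not Judoscale's API.
    min_dynos: int
    max_dynos: int
    upscale_threshold_ms: int


def experiment_settings(current, overprovisioned_min, high_threshold_ms=250):
    """Return temporary settings for the day-long observation experiment."""
    return replace(
        current,
        min_dynos=overprovisioned_min,
        # Keep the range valid if the new minimum exceeds the old maximum.
        max_dynos=max(current.max_dynos, overprovisioned_min),
        upscale_threshold_ms=high_threshold_ms,
    )


previous = ScalingSettings(min_dynos=2, max_dynos=6, upscale_threshold_ms=50)
temporary = experiment_settings(previous, overprovisioned_min=4)

print(previous)   # keep this so you can restore it afterwards
print(temporary)
```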

For this application, we can then see that setting an Upscale Threshold of 45ms would sit above the natural spikes:

But in reality, we want to give a little more breathing room for natural spikes as they happen. For this app, we’d recommend 60ms instead of 45. That puts the line here:
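The same reasoning can be sketched as a small calculation: take the tallest natural spike you observed while over-provisioned, add some breathing room, and round up to a tidy number. The one-third headroom and 5ms rounding step below are assumptions chosen to mirror the 45ms to 60ms example, not a Judoscale rule, and the peak values are invented.

```python
import math


def suggest_upscale_threshold_ms(natural_peaks_ms, headroom=0.33, step_ms=5):
    """Tallest observed natural spike plus headroom, rounded up to step_ms."""
    if not natural_peaks_ms:
        raise ValueError("need at least one observed spike")
    padded = max(natural_peaks_ms) * (1 + headroom)
    return math.ceil(padded / step_ms) * step_ms


# Peaks (ms) of the natural spikes seen during the over-provisioned day.
peaks = [12, 31, 44, 27, 38]
print(suggest_upscale_threshold_ms(peaks))  # 60
```

Treat the output as a starting point to check against your own chart rather than a final answer.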

That’s a little higher above the spikes than the 45ms line, but that’s a good thing. Remember that we only want to autoscale up if we’re facing a cascading queue-time increase! The nice thing about a cascading increase is that it will… keep cascading! So the practical difference between an upscale threshold of 45ms and 60ms is minor: likely only a second or two of extra delay before autoscaling kicks in. We can visualize that, too!
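As a back-of-the-envelope check, suppose queue time is climbing at a steady rate during a cascading increase. The extra wait before the higher threshold trips is just the gap between the two thresholds divided by that rate. The growth rate below is an assumed figure for illustration; real cascades usually accelerate, which would make the delay shorter still.

```python
def extra_delay_s(low_threshold_ms, high_threshold_ms, growth_ms_per_s):
    """Seconds between crossing the lower and the higher threshold."""
    if growth_ms_per_s <= 0:
        raise ValueError("queue time must be rising")
    return (high_threshold_ms - low_threshold_ms) / growth_ms_per_s


# Assumed: queue time rising by 10 ms every second.
print(extra_delay_s(45, 60, growth_ms_per_s=10))  # 1.5
```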

Alright, that’s about it for upscale threshold. The “tuning” process is the process of learning what natural spikes vs. cascading queue-time looks like for your particular application and setting a reasonable threshold for that value in the Judoscale Target Queue Time Range.

The Downscale Threshold

Now that we’ve got upscaling theory in our minds and our Judoscale dials tuned to our app, let’s talk about the other side of scaling — downscaling! Let’s refer back to our initial queue time chart first:

And let’s recall that we have two goals working in parallel. First, to have low queue times so that our users’ requests aren’t waiting to be processed. Second, to run as few dynos / services as possible. We want to satisfy both of these at the same time! In a single sentence, we want to run as few dynos / services as possible while still making sure our queue time is low.

The logic to accomplish this is rather simple and clever: if your queue time is stably low, downscale (gently) and see if it stays low. If it does, repeat. If it doesn’t, you’ve either found the correct scale for your application at its current load, or you need to upscale again (which should happen automatically; see above).
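That loop can be sketched in a few lines. This is a simplified illustration, not Judoscale's implementation: `queue_time_at` is a hypothetical stand-in for "wait a while at this scale and see what queue time settles to", and the observed values are invented for one fixed level of traffic.

```python
def settle_scale(start_dynos, queue_time_at, downscale_threshold_ms, min_dynos=1):
    """Step down one dyno at a time while queue time stays below the threshold."""
    dynos = start_dynos
    while dynos > min_dynos and queue_time_at(dynos) < downscale_threshold_ms:
        dynos -= 1  # downscale gently: one step, then look again
    return dynos


# Made-up steady-state queue times (ms) at each dyno count.
observed = {6: 2.0, 5: 2.1, 4: 2.4, 3: 14.0, 2: 180.0}

print(settle_scale(6, observed.__getitem__, downscale_threshold_ms=5))  # 3
```

Here the loop stops at 3 dynos: queue time there is no longer stably low, so that is the right scale for this load. It never tries 2 dynos, where queue time would have climbed high enough to force an upscale.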

Let’s break that pseudo-algorithm down a bit.

“If your queue time is stably low” — What is low? When is queue time low enough to support off-loading capacity? For starters, we call this value the “Downscale Threshold” — the queue-time level that is low enough to downscale one step from the current scale. We represent this as a horizontal line in the queue time chart too! But how high that line should be in the chart becomes the question.

Here’s the same queue-time curve as shown above but with three different options for the downscale threshold level. Orange, Green, and Blue. (And for the keen-eyed among you, no, the colors don’t mean anything 😉 green is not a ‘right’ answer here):

If we had our downscale threshold set to the orange line/value, we’d, in theory, have downscaled twice (see the two orange arrows). If we’d had it set to the green line, we’d have downscaled once. And, of course, if we’d had it set to the blue line, we wouldn’t have downscaled at all.

So what should we have done? It might surprise you, but in general our recommendation is closer to the blue line than the orange or green. We’ve found over the years that a healthy application ready to be downscaled actually has a queue time line that looks more like this:

That is, extremely stable and very low while traffic continues — a nearly flat line. If your queue time looks like our hand-drawn chart above (lots of flux and roller-coasters), your application isn’t stable enough yet to downscale.

So how should we set our downscale threshold? Low enough that we don’t downscale until our queue time is a nearly-flat line as shown above? Not exactly. Setting our downscaling threshold is all about zooming in. Way in!

We want to see queue time numbers on a minute-by-minute basis. The Judoscale charts will give those numbers to you happily! For reference, let me give you the chart above with a more zoomed-in view:

Here I can see second-by-second queue time data as I hover my mouse across the chart and find that the queue time varies between 1.7 and 2.3 milliseconds. That is extremely stable. This app could downscale safely here. In general, an application can only achieve this extremely stable and flat queue time when it’s over-provisioned (has more capacity than it currently needs)!

On the other hand, this is what an application that’s at its correct scale (for the current load) looks like:

This app has a mostly-low queue time (around 6ms) but is seeing little spikes of queue time (natural spikes that resolve themselves; see above) that peak at 18-30ms. That’s great. When your application is experiencing natural queue time spikes and your queue time is low but not ultra-stable, you’re at the right scale.

This app has its downscale threshold (green line) set very well, too. The peaks of these natural spikes are above its downscale threshold (20ms), so it won’t downscale here. That’s a good thing. The data clearly shows it’s at the right scale now, so downscaling here would only cause ping-pong scaling.
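One way to express the two cases above in code is a check that every recent sample sits below the downscale threshold. This is a deliberately strict sketch with invented sample values that echo the two charts; it is not a description of Judoscale's internals.

```python
def can_downscale(recent_queue_ms, downscale_threshold_ms):
    """True only if every recent sample sits below the downscale threshold."""
    return bool(recent_queue_ms) and max(recent_queue_ms) < downscale_threshold_ms


over_provisioned = [1.7, 2.1, 1.9, 2.3, 2.0]  # flat, 1.7 to 2.3 ms
right_sized = [6, 5, 18, 6, 7, 30, 6]         # about 6 ms with natural spikes

print(can_downscale(over_provisioned, 20))  # True
print(can_downscale(right_sized, 20))       # False
```

The second app's baseline is well under 20ms, but a spike that pokes above the line is enough to hold it at its current scale.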

Ping-pong scaling is something to watch out for, and something you may have experienced: downscaling only to find that you’ve gone too far and need to scale back up to the prior scale, then doing the same thing a little while later, ad nauseam. That looks like this:

We do not want this. Luckily, the solution for ping-pong scaling is the same as above: zoom in!

If your application is already ping-ponging, zoom in to your queue time in the minutes before the downscale events. You’ll likely see some amount of variance in your queue time but with micro-spikes that peak below your current downscale threshold. Like these:

While these spikes are minor, they’re enough to signal that your app has enough capacity for its current load, but not so much extra capacity that it could downscale. So you’ll want to lower your downscale threshold until those spikes are above that line. That will prevent the next downscale event and should cease your ping-ponging!
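As a rough sketch of that adjustment: collect the peaks of the micro-spikes you see in the minutes before each unwanted downscale, then pick a threshold a little under the smallest one. The 1ms margin and floor are assumptions for illustration, and the peak values are invented.

```python
def lowered_downscale_threshold_ms(micro_spike_peaks_ms, margin_ms=1.0, floor_ms=1.0):
    """Threshold just under the smallest micro-spike seen before a downscale."""
    if not micro_spike_peaks_ms:
        raise ValueError("need at least one observed micro-spike")
    return max(floor_ms, min(micro_spike_peaks_ms) - margin_ms)


# Peaks (ms) of micro-spikes observed shortly before ping-pong downscales.
peaks_before_downscales = [14.0, 11.5, 17.0, 12.0]
print(lowered_downscale_threshold_ms(peaks_before_downscales))  # 10.5
```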

So, where the goal of the Upscale Threshold is to set it high enough to ignore natural spikes but take action when there’s an actual capacity problem, the goal of the Downscale Threshold is the inverse: to set it low enough that natural spikes are above it but true over-provisioned-stability is below it.

Putting It Together

In Judoscale we display the Target Queue Time Range as a shaded area bounded by the Upscale Threshold on the top and the Downscale Threshold on the bottom:

But, with all the theory we just covered, we can think about that chart more like this:

The key there being that natural queue-time spikes should be captured in the shaded area. And that’s going to be different for every app as platform differences, system differences, code efficiency differences, and all sorts of other variables work together.
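Put as code, the two thresholds split queue time into three zones. The function below is a conceptual sketch with hypothetical names and example thresholds of 20ms and 60ms; it classifies a single reading and says nothing about how long a reading must persist before Judoscale acts on it.

```python
def zone(queue_ms, downscale_threshold_ms, upscale_threshold_ms):
    """Classify one queue-time reading against the Target Queue Time Range."""
    if queue_ms > upscale_threshold_ms:
        return "upscale"
    if queue_ms < downscale_threshold_ms:
        return "downscale candidate"
    return "natural spike zone"


for reading in (2, 30, 120):
    print(reading, zone(reading, downscale_threshold_ms=20, upscale_threshold_ms=60))
# 2 downscale candidate
# 30 natural spike zone
# 120 upscale
```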

One piece of advice we can lend pertains to shared hardware vs. dedicated hardware (Std vs. Perf dynos on Heroku). It’s likely that you will experience more frequent, and taller, natural queue-time spikes when you’re on shared hardware and not dedicated hardware. In order to avoid needlessly upscaling on these natural spikes, you will likely have a higher Upscale Threshold on shared hardware.

On the other hand, if you’re running on dedicated hardware, your natural spikes will likely occur less often and be shorter. In order to prevent ping-ponging and correctly capture natural spikes into the ‘natural spike zone’, you’ll likely have a very low Downscale Threshold on dedicated hardware.

As an example, this application runs on dedicated hardware and maintains a downscale threshold of just 5ms. That’s very low, but it’s reasonable when on dedicated hardware. And indeed, when the application has room to downscale, its queue time is consistently below 5ms. Dedicated hardware is nice!

Conversely, this application runs on shared hardware and thus has natural spikes that frequently peak in the 800ms to 900ms range. We don’t often see applications impacted this heavily by shared hardware, but I wanted to use it as an example of how high your upscale threshold may need to be when on shared hardware:

If you’re in this boat: don’t panic. While these spikes look concerning and you may be tempted to reach for upscaling, these spikes are natural spikes. They resolve themselves without any scale changes and, outside of them, the app maintains a very low queue time (8-9ms). Upscaling to fight shared hardware issues can be futile — you’ll simply be adding another shared resource to your group! That new resource will have the same problems 😅. If you’re running on Heroku, this can also just be Heroku’s random routing giving you an unlucky hand!

The last piece of tuning your target queue time range is to simply ensure that you’re checking on it once in a while! Applications change over time — new features ship, hardware providers migrate to new machines, languages and frameworks become more efficient, etc. It’s a good idea to hop into Judoscale every couple of weeks to peek at your dynos’ performance and queue times. Judoscale exists to stay out of your way, but its charts can be extremely helpful as your application grows and changes.

And that’s it. You should now have all of the tools you need to fully assess your application’s queue time and determine a healthy range for it, based on a well-tuned upscale threshold and downscale threshold. Give it a shot and let us know how it goes!

If you made it through this whole article but haven’t actually signed up for Judoscale yet… kudos! We’d recommend you check out our forever-free plan — you can have free autoscaling forever and only need to install a simple Ruby Gem, Node Package, or Python ~~Cheese~~ Package.

P.S. We’re around to help you through this process too! Just click the “Help” button in the Judoscale UI and pick your style!