Let me set the narrative stage here just a bit before we dive in. First, Judoscale runs on Heroku and, like any good web SaaS, is itself scaled by Judoscale. Second, we’ve been experimenting with running Judoscale on Standard-1X and Standard-2X dynos — trying to profile which is better (…TBD). Third, several weeks ago we broke our still-in-beta feature, Dyno Sniper, and didn’t realize it for a few days. Nothing major, just a little bug.
Let’s rewind though. There’s some context here we need to cover — how Heroku’s architecture works, how metrics work, how neighbors can be rough, etc.
Welcome to Heroku
Heroku’s been around a long time — nearing twenty years! While they’re generally known for starting the Platform-as-a-Service (PaaS) industry by offering highly automated hosting, they actually remain the best choice for new startups, small apps, and quick operations today. Can we take a moment to consider how wild that is? A startup-oriented hosting product that’s aged twenty years with few feature changes and still remains the best choice? Heroku is both a dinosaur and a unicorn.
👀 Note
Did you know that Heroku was acquired by Salesforce in 2011, only four years after it was created? If you’re like me, you probably thought it happened much later than that, and that there were many ‘great years’ of Heroku before Salesforce came in… but not so!
Oh, and the good news here is that Salesforce appears to be putting more time and money into Heroku (finally?). The last year or two have seen lots of new feature rollouts, upgrades, and tweaks. Check out their changelog sometime. Heroku’s getting back in the game!
But where there’s history, there’s competition! Newer contenders in the PaaS space, mostly implementing a model similar to Heroku’s (e.g. Render and Fly), are actively trying to win over the market. We’ve yet to see that happen, and we still choose Heroku for our own green-field applications, but it’s neat to see competition driving innovation in the space.
And innovation we do need, particularly because one of the most fundamental premises of these platforms is resource-sharing. PaaSes rent servers from lower-level cloud providers (AWS, Google Cloud, Azure, etc.) and divvy those servers up into smaller chunks of processing power that they then rent to our applications. We get a few benefits in this exchange — namely, we don’t have to worry about what size of server to rent, and we don’t have to manually set up all the stuff that goes into a modern web application running on bare metal. But the PaaSes get a few benefits too — our money, for starters, but also the efficiency of running many different applications on the same large-capacity servers.
It’s this trade-off between ‘we pay less per server because we get the big ones’ (the Costco approach) and ‘we put many applications onto a single server so they can all run together and efficiently utilize all the horsepower of that server’ (the carpool-lane approach) that causes so much tension. If you’ve ever gotten into a carpool lane, you’ll know that it’s fast and great when there aren’t too many other vehicles in it. You’ll also know that it can be just as slow as (or slower than) the other lanes if there are too many vehicles in it.
This metaphor is an accurate depiction of what’s happening on Heroku servers when they run multiple dynos: too many applications vying for slices of a pie that’s not big enough to support all of them! So, at times, they can each get cut a little short. Now, this isn’t an all-the-time thing — the pie cuts are constantly changing as different applications spike in their resource needs! It’s a real-time cacophony of “GIVE ME MORE CPU” between every app on the server.
👀 Note
Just a sidebar here — we’re talking about PaaS options where your application runs on shared hardware. Most of these PaaS companies, Heroku included, also offer dedicated-hardware hosting options! They’re generally much (much!) more expensive, but they do work great!
But this is ultimately the game of risk that we play as developers: we want cheap(er), easy hosting for our application and are willing to have not-quite-100% performance. So we accept that the pie is constantly changing and use a tool like Judoscale (hey, it’s us! 👋) to autoscale our application instead. Autoscaling will add more dynos when performance dips, so overall everything should be okay, right?
Well… it’s a little more complicated than that. Let’s talk about metrics.
What’s a Metric, Anyway?
I want to step back a little bit (but hopefully not too far) with this question: how do we even know if our app is performing well in the first place? Sure, we could point our browser at the app (assuming it has a UI), load it up, and see how long it takes. But that’s a little like measuring the height of a single wave to determine if the entire ocean is healthy. A sample size of one, as they say 😉.
Alternatively, we could watch every single request in our real-time logs and assess how long it was queued, how long it took to process, and the various data-points within. That would give us some confidence around knowing that our app is running well, but it sounds exhausting. And if your request volume is more than about one request per second, good luck keeping up!
So, and hopefully I’m not being too reductive here, instead we use aggregate metrics: basically just algorithms that batch up all of those data-points from every request and boil that data down into (typically) a single number or value that we can digest more easily. The most common aggregate metric is the simple average. I think we’re all familiar with this concept! Here’s an example:
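Here’s a minimal sketch of the idea. The first nine response times below are hypothetical placeholders, chosen only so the math lines up with the 88ms average and the 234ms req.10 outlier discussed next:

```python
# Hypothetical response times (ms) for ten requests -- placeholder values,
# chosen only so the math matches the 88ms average discussed in the text.
# The last entry is "req.10", the 234ms outlier.
response_times_ms = [65, 70, 68, 75, 72, 71, 69, 78, 78, 234]

average_ms = sum(response_times_ms) / len(response_times_ms)
print(f"average response time: {average_ms:.0f}ms")  # => average response time: 88ms
```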
But aggregate metrics are like zooming out on a high-resolution image. You’ll still be able to make out the gist of the image, but you won’t necessarily be able to see the fine details anymore. And the more you zoom out, the more you lose sight of the little bits! For example, did you see that req.10 had an inordinately high response time of 234ms? An outlier, for sure.
But this loss of detail when zooming out isn’t a bad thing, generally. There’s real value in being able to quickly glance at an image from far away and determine that the image is of a tiger (for example) without having to see all the fine details and spend time assessing the subject. That’s the tradeoff: assessment speed versus detail.
✅ Tip
Let’s make that last bit a little more visual. Here are two cuts of the same image. On the left is a very zoomed-out view. Depending on what screen-size you’re currently reading this article on, it should be a pretty small image. But you can still make out that it’s a tiger, right?
On the right, however, we can actually see that this tiger’s whiskers are halfway to turning gray. If your job were that of a tiger age specialist, the zoomed-out picture wouldn’t have helped much, even though it was easy to identify! Details and resolution are always going to be relative to your responsibility with the system. Just like in our response-time example, the average of 88ms is the zoomed-out view, but req.10 taking 234ms is a detail you’d only see by zooming in.
Neat tiger shots aside, the metaphor holds true for application metrics, too. There is value in having a single number we can quickly observe to determine if our application is healthy and stable! But there’s a caveat: we might miss details in the process. And this is further complicated by the choice of aggregation algorithm, too.
Let’s take another example data-set. Here we’re observing queue time, which, if you’ve read our Understanding Queue Time guide, is the single value to watch when trying to determine if you need to scale up or down. There are several requests to various endpoints here across three different dynos:
We can make a couple of early observations here since we’re down in the weeds, looking at the highly detailed, request-by-request data. First, we see that the normal queue time appears to be around 30-40ish milliseconds. Most of the requests are in that band. Second, that leads us to see the few queue times in the 300-400ms range as outliers — higher than typical. Something’s going on! We have some kind of issue.
Now, if we assess this data using an average aggregate, we’d get a result of 137ms. It’s important to remember how averaging works here: every value in the collection influences the average, pulling it until it sits in perfect tension with all the other values. We can plot those requests on a histogram-style chart and see this:
If we know our normal queue time is around that 30-40ms mark, then the average is telling us that we’re far above normal. That’s good! The average is allowing the very-high values to influence the metric enough to be visible. This is correctly signifying that we currently have an issue.
Alternatively, we could use a median aggregate. A median simply takes the middle value (or the average of the two middle values) after sorting. That’d look like this:
And, in our case, the median value is 42ms, which is totally within our normal expected range for this data… but that’s not good! We are having some kind of issue, and if our metric isn’t telling us that, that’s a problem!
The flaw with using a median for this kind of data is that it gives outliers no weight in the metric value. Our illustration shows that — a median is purely just the middle line after sorting all the data. That means that half (or more) of our requests would need to be in that very-high (bad!) state before the median reports any real problem. We don’t want to get that far! We want to fix our issue sooner rather than later.
Of course, there’s always the third M in the aggregate trio we could use: the mode. Unfortunately, the mode is quite a bit less useful for understanding system health. Since the mode of any data set is simply the value that occurs most frequently in that set, the resulting metric doesn’t tell us anything about broad system health at all. If a server were handling requests with queue times of [33, 33, 418, 381, 332, 583, 427, 470], clearly the majority of the requests have a high queue time and there’s an issue! But the mode in this case would report 33ms simply because that value was repeated. Yikes!
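For the curious, here’s what all three aggregates report for that same set of queue times, as a quick sketch using Python’s statistics module. (Note that the median happens to look bad here too, since more than half of these particular requests are slow; it’s the mode that paints the falsely rosy picture.)

```python
from statistics import mean, median, mode

# Queue times (ms) from the example above -- mostly high, with one repeated low value.
queue_times_ms = [33, 33, 418, 381, 332, 583, 427, 470]

print(f"mean:   {mean(queue_times_ms):.0f}ms")    # ~335ms -- clearly signals a problem
print(f"median: {median(queue_times_ms):.0f}ms")  # ~400ms -- also high here, since most values are high
print(f"mode:   {mode(queue_times_ms)}ms")        # 33ms   -- happy-looking, purely because 33 repeats
```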
Across our queue-time example, the average is the only one of these three options (average, median, mode) that correctly produces a metric indicating we’re outside of normal. That’s great! But let’s see what happens when shared hardware gets involved and we put this into the context of autoscaling.
Averages with Shared Hardware
The tricky thing with averages is that they give every value in the collection an equal weight in the resulting metric. As in our prior example, every value gets to tug the ‘average line’ with the same strength, regardless of how far out it lies. That becomes a problem when you’re trying to determine your overall system queue time across dynos that are behaving very differently.
Imagine we have an application running two dynos. Those dynos are on separate physical servers, and dyno #2 unfortunately has a couple of noisy neighbors. Over the course of a 20-second window, the queue times for the requests to each of these dynos look like the following:
| Dyno.1 (Quiet) | Dyno.2 (Noisy) |
| --- | --- |
| 4ms | 34ms |
| 3ms | 142ms |
| 8ms | 88ms |
| 7ms | 71ms |
| 2ms | 50ms |
| 8ms | 22ms |
| 4ms | 12ms |
| 5ms | 224ms |
| 7ms | 163ms |
If we take Dyno.1 as our baseline for application health and say that, under normal circumstances, the queue time for our app should be below 10ms, then Dyno.2 is in a pretty bad state.
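Here’s a quick sketch of the math, using the values from the table above:

```python
# Queue times (ms) copied from the table above.
dyno_1 = [4, 3, 8, 7, 2, 8, 4, 5, 7]              # quiet host
dyno_2 = [34, 142, 88, 71, 50, 22, 12, 224, 163]  # noisy host

all_requests = dyno_1 + dyno_2
system_average = sum(all_requests) / len(all_requests)

print(f"Dyno.1 average: {sum(dyno_1) / len(dyno_1):.1f}ms")  # ~5.3ms  -- healthy
print(f"Dyno.2 average: {sum(dyno_2) / len(dyno_2):.1f}ms")  # ~89.6ms -- suffering
print(f"System average: {system_average:.1f}ms")             # ~47.4ms -- what the autoscaler sees
```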
As one might expect, our average queue time across the whole system is going to be pretty high and indicate that we should scale up. Indeed, doing the math, our average system queue time is about 47ms. So we scale up! Let’s look at the next 20 seconds, now with three dynos instead of two:
| Dyno.1 (Quiet) | Dyno.2 (Noisy) | Dyno.3 (Quiet) |
| --- | --- | --- |
| 8ms | 228ms | 4ms |
| 4ms | 316ms | 8ms |
| 3ms | 39ms | 5ms |
| 4ms | 301ms | 3ms |
| 3ms | 36ms | 5ms |
| 3ms | 183ms | 7ms |
| 5ms | 151ms | 9ms |
| 9ms | 72ms | 6ms |
| 5ms | 164ms | 7ms |
Alright, how’s our average doing, then? It’s now… 58ms. Wait, what? Okay, well, we can talk about that, but in the meantime our autoscaler is still seeing a high queue time and scales us up again. Now, we’re assuming that the new dynos added to the cluster are getting provisioned on non-noisy hosts (which is quite an assumption), but let’s keep going:
| Dyno.1 (Quiet) | Dyno.2 (Noisy) | Dyno.3 (Quiet) | Dyno.4 (Quiet) |
| --- | --- | --- | --- |
| 7ms | 259ms | 5ms | 6ms |
| 5ms | 184ms | 8ms | 4ms |
| 3ms | 78ms | 7ms | 7ms |
| 6ms | 245ms | 6ms | 5ms |
| 4ms | 144ms | 4ms | 8ms |
| 7ms | 312ms | 8ms | 7ms |
| 3ms | 438ms | 9ms | 6ms |
| 6ms | 132ms | 3ms | 5ms |
| 5ms | 47ms | 4ms | 4ms |
And the average queue time now is… 55ms. This isn’t good. We keep scaling up, but it’s not helping our metric! That’s not how autoscaling is supposed to work! Ugh.
Let’s jump ahead a few minutes. We’re now up to ten dynos — nine of which are on quiet, normal hosts! That looks like this:
| Dyno.1 (Quiet) | Dyno.2 (Noisy) | Dyno.3 (Quiet) | Dyno.4 (Quiet) | Dyno.5 (Quiet) | Dyno.6 (Quiet) | Dyno.7 (Quiet) | Dyno.8 (Quiet) | Dyno.9 (Quiet) | Dyno.10 (Quiet) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6ms | 289ms | 7ms | 4ms | 8ms | 6ms | 5ms | 7ms | 4ms | 5ms |
| 5ms | 314ms | 6ms | 5ms | 4ms | 8ms | 7ms | 3ms | 5ms | 7ms |
| 7ms | 238ms | 3ms | 7ms | 6ms | 4ms | 8ms | 5ms | 3ms | 6ms |
| 4ms | 372ms | 8ms | 6ms | 7ms | 5ms | 4ms | 6ms | 8ms | 4ms |
| 6ms | 158ms | 4ms | 8ms | 5ms | 7ms | 6ms | 4ms | 7ms | 3ms |
| 5ms | 427ms | 5ms | 7ms | 4ms | 6ms | 7ms | 5ms | 6ms | 8ms |
| 4ms | 341ms | 7ms | 3ms | 8ms | 5ms | 3ms | 8ms | 5ms | 6ms |
| 7ms | 192ms | 6ms | 4ms | 6ms | 8ms | 5ms | 6ms | 7ms | 4ms |
| 3ms | 305ms | 8ms | 5ms | 7ms | 3ms | 6ms | 7ms | 4ms | 7ms |
And our queue time metric? It’s 34ms now: better, but still far higher than what we consider ‘normal’ (< 10ms), even though nine of our dynos are very much within normal range! The influence of a single rogue dyno is skewing our metric!
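To see why piling on quiet dynos can never rescue the metric, here’s a rough sketch. The per-dyno averages used below (roughly 6ms for a quiet dyno and roughly 290ms for the noisy one) are approximations pulled from the example tables; the noisy dyno’s numbers vary between the windows above, so treat this as illustrative rather than an exact re-derivation:

```python
# Approximate per-dyno average queue times (ms), pulled from the example tables above.
QUIET_AVG_MS = 6
NOISY_AVG_MS = 290

def cluster_average(total_dynos: int, noisy_dynos: int = 1) -> float:
    """Average queue time across the cluster, assuming equal traffic per dyno."""
    quiet = total_dynos - noisy_dynos
    return (quiet * QUIET_AVG_MS + noisy_dynos * NOISY_AVG_MS) / total_dynos

for n in (2, 3, 4, 10, 25):
    print(f"{n:>2} dynos -> ~{cluster_average(n):.0f}ms average queue time")

# Output:
#  2 dynos -> ~148ms
#  3 dynos -> ~101ms
#  4 dynos -> ~77ms
# 10 dynos -> ~34ms
# 25 dynos -> ~17ms   <- still above our ~10ms "normal", no matter how far we scale
```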
👀 Note
It’s worth stepping back and observing that, before Dyno.2 got noisy, this application was running totally fine on two dynos. We know that because the queue time of Dyno.1 was very low and consistent — that indicates this app didn’t have a capacity issue at all! It simply encountered a noisy-neighbor issue. And yet, we now have a cluster with five times the number of dynos we started with, still haven’t brought our queue time metric back to normal, and are essentially throwing money away 💸. This sucks!
So what’s really going on here, then? It comes back to aggregate metric algorithms.
[Averages] give every value in the collection an equal weight in the resulting metric
And, in our case here, that means that the single outlier dyno having a bad day is pulling our average up even though nine (!!) of our dynos are perfectly happy!
All of this leads to a single, pointed conclusion: we need to be able to root out and halt dynos that are disproportionately impacted by noisy neighbors so that their numbers don’t skew our average metric across the whole cluster.
Luckily, this is exactly what the Dyno Sniper does. We just broke it for a short time 🙃.
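If it helps to picture the general idea, here’s a purely illustrative sketch of how an outlier dyno might be flagged by comparing its own queue-time aggregate against the cluster’s. To be clear: the function name, thresholds, and approach below are assumptions for illustration only, not Judoscale’s actual Dyno Sniper implementation or API.

```python
from statistics import median

def find_outlier_dynos(queue_times_by_dyno: dict[str, list[float]],
                       ratio_threshold: float = 5.0,
                       floor_ms: float = 10.0) -> list[str]:
    """Flag dynos whose queue times are wildly out of line with the cluster.

    Purely illustrative: the thresholds and approach here are assumptions,
    not Judoscale's actual implementation.
    """
    cluster_median = median(t for times in queue_times_by_dyno.values() for t in times)
    outliers = []
    for dyno, times in queue_times_by_dyno.items():
        dyno_median = median(times)
        # Only flag a dyno that is slow in absolute terms AND far worse than the cluster.
        if dyno_median > floor_ms and dyno_median > ratio_threshold * cluster_median:
            outliers.append(dyno)
    return outliers

# Using the two-dyno queue times from the first table above:
samples = {
    "Dyno.1": [4, 3, 8, 7, 2, 8, 4, 5, 7],
    "Dyno.2": [34, 142, 88, 71, 50, 22, 12, 224, 163],
}
print(find_outlier_dynos(samples))  # => ['Dyno.2'] -- the one worth sniping
```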
Our Experience
While we’ve known about the noisy-neighbor downsides and impacts for quite some time, it’s only after building and running the Dyno Sniper on Judoscale itself that we’ve come to understand just how bad the noisy neighbor problem on shared hardware can be. Especially when you turn it off.
As I mentioned far above, we’ve been experimenting for some time with running Standard-1X and/or Standard-2X dynos on Heroku. We have a few hunches here that we’re still working out (hope to post about those soon!), but the point is that we’re very much running on shared hardware.
Additionally, Judoscale is a fairly high-traffic app. We routinely run between 5 and 25 Standard-1X dynos to handle all of our traffic. With Dyno Sniper running, on an average day, we stick to around 15-16 dynos.
As you can see, we only stick to 14-16 dynos because the Dyno Sniper is constantly recognizing single-dyno spikes that are disproportionate to all the other dynos and sniping them! Fascinatingly, just about every single queue time spike in that chart (and subsequent drop) correlates perfectly to a snipe event. Our sniping UI is still a work-in-progress, but you get the idea! The sniper is busy!
But when sniping is disabled (or borked)? Well. That example we walked through where we spun up to ten dynos without fixing our issue? That was us. Except it was 25 dynos, and it never let up!
We didn’t actually have a capacity issue at all, but we autoscaled to our max scale limit and stayed there for days on end. For us, that’s 25 dynos 💸. That didn’t feel great. Unfortunately it’s also extremely opaque! It’s very difficult to know when you’re facing a problem only impacting a single dyno since almost all of Heroku’s stats and metrics are aggregated across all dynos. That darn average!
Snipe On, Sniper!
The good news is that we got the Dyno Sniper back up and running quickly and have since had no issues with our app or our other beta-testing apps that have opted into sniping.
To that end, if you’re a current Judoscale customer running on shared hardware and you haven’t opted in to running the Dyno Sniper, we really recommend you do. We’re actively iterating on the application UI for it, but it’s fully functional and impactful from the moment we enable it for you.
So… Shared Hardware — How Bad Can It Get? Bad 🙁. If your particular dyno instance happens to get stuck on a host with some noisy, unpleasant neighbors, there’s almost no way out. Metrics will be skewed, add-ons will be confused, and your application alerting might trigger even though most of your dynos are fine. We hope that Dyno Sniper will allow folks to fully fix and side-step this issue.