How to Roll Your Own Autoscaling


Jon Sully

@jon-sully

Alright, we understand that this may seem like a counterintuitive article for us to write since Judoscale is itself (the best!) autoscaling solution available, but there are two important factors at play here. First, Judoscale doesn’t work everywhere or with everything — we support autoscaling on Heroku, Render, and Amazon ECS based on response time or, more importantly and primarily, queue time. But that’s not everyone’s cup of tea! Second, we’re a tiny team of devs that earnestly want to help people. If you’re set on rolling your own autoscaling, we’d love to empower you to do so by sharing some of our experience and a few tips. And we’re here if you have questions. We’re happy to help, even if you’re not a customer!

Now that we’ve got that out of the way, let’s take some time to seriously look at all the components required to roll your own autoscaling.

What We’re After

Figuring out what we actually want is probably a good place to start. For the sake of this entire post, let’s presume we have some application — call it Zephyr. We’ll say that Zephyr is a typical Ruby on Rails application running on Amazon ECS and has both web processes and background job processes. Let’s also assume Zephyr is some kind of US-centric e-commerce application (perhaps selling consumer-grade blimps, zeppelins, and other dirigibles 😜).

Zephyr is currently over-provisioned: running far more tasks for both its web processes and worker processes than it actually needs to service all of the traffic it receives. Ergo, Zephyr is burning money 💸! At least for the money spent, Zephyr is stable. Better to burn money than fail to serve (potential-customer) requests, right?

The ideal state is that Zephyr only utilizes the number of tasks/resources it really needs to suitably serve all of the requests it receives, and that as traffic levels change throughout the day, more or fewer tasks are spun up or down to accommodate the need. We want to have our cake and eat it, too — a fast, reliable system at a low cost! The good news? We can do it.

What We’re Watching

We briefly linked to another article above, but it’s worth having a brief discussion about metrics here. It’s perhaps the most fundamental question of autoscaling in general — how do you know when you ought to scale up or scale down? What value or metric can we observe to base scale changes on? How do we know that this value is reliable? What math proves that this value is stable and correctly models capacity needs?

We won’t dive into all those questions here since we have a post that covers all of those details in depth — “Understanding Queue Time: The Metric that Matters”. We very much recommend giving that post a thorough read if you don’t yet have a firm grasp on how queue time is the oracle of all things scaling!

But we need to keep in mind that we’ve got two separate processes and systems in play here. We have web processes and background job processes — web requests and background jobs. So we’re going to look at different things for each: our request queue time on the web side, and our job queue time for the background side. That is, how long any given web request is waiting before being serviced by one of our web processes and how long any given background job is waiting before being serviced by one of our background job processes. That’s a mouthful!

What We’re Calling

Let’s step sideways for a moment and consider the scaling action itself. Sure, we’re going to build a system that watches Zephyr’s queue times, but then what? We shouldn’t gloss over the “scale Zephyr up or down” step, as it’s not always that simple! In fact, it’s not always possible. Unless the platform you’re hosting on has an API or other programmatic solution that we can call from our autoscaling system, we won’t have a means to tell that platform, “Hey, scale Zephyr up by +1!” And if we can’t do that, we can’t autoscale.

So before we dive any deeper, if you’re currently considering rolling your own autoscaler for your application, check with your hosting platform and make sure that you can, in some way, programmatically change the scale of your application. If you can’t… we might recommend switching providers! Otherwise you may be stuck with any autoscaling they offer directly as a platform (which won’t be queue-time based, and likely won’t be great) or no autoscaling at all (💸💸).
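Since our hypothetical Zephyr runs on Amazon ECS, that programmatic lever does exist. Here’s a minimal sketch of a scale change using the aws-sdk-ecs gem; the cluster and service names are hypothetical placeholders for wherever Zephyr actually lives:

```ruby
require "aws-sdk-ecs" # Gemfile: gem "aws-sdk-ecs"

ecs = Aws::ECS::Client.new(region: "us-east-1")

# Look up how many tasks the (hypothetical) web service is currently running...
service = ecs.describe_services(
  cluster: "zephyr-production",
  services: ["zephyr-web"]
).services.first

# ...and ask ECS to run one more.
ecs.update_service(
  cluster: "zephyr-production",
  service: "zephyr-web",
  desired_count: service.desired_count + 1
)
```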

What We’re Architecting

How we go about building this system becomes an important factor in this discussion thanks to an alluring idea: “what if we just build autoscaling into the application itself? It can scale itself!” ⚠️ Don’t do that!

While that would be an easier implementation, it can catastrophically fail in the worst moments 😬. An application that handles its own autoscaling can fall prey to a vicious feedback loop! As queue time rises and the application slows down, the logic to increase scale slows down right along with it. As the application gets slower, the up-scales it so desperately needs fail to execute, and the problem compounds.

Of course, this huge red flag means a bigger overall architecture: we have to externalize autoscaling away from Zephyr itself. It needs to be a separate system, specifically dedicated to autoscaling our app — separate processes, a separate deployment, and likely a separate code-base.

Let’s pause here and describe what we’re going to try to build, then. We need:

  • A separated, isolated system for managing Zephyr’s scale
  • A means of observing and keeping a record of Zephyr’s web-request queue time and background-job queue times
  • Some kind of setup for (carefully!) executing scale-changes for Zephyr on its hosting platform

Let’s dive into some of the nitty-gritty here. We’re going to start with the second point above.

The Client

Here at Judoscale, we call our open-source, installable packages “adapters” or “clients” (or just “packages”). Regardless of how we want to name them, they’re essentially client-packages for collecting and sending metrics back to the Judoscale servers. We’re going to need the same thing here!

In short, since we’re creating a stand-alone system that isn’t built into Zephyr directly, we’ll need some kind of code running in Zephyr that sends queue-time metrics over to our autoscaler.

But things get a little bit tricky here for performance reasons. Obviously our goal is to ensure that Zephyr is scalable for performance and traffic reasons, but if our client package slows down Zephyr in any way, we may have a paradox on our hands! Have we really gained anything if Zephyr can autoscale but each individual request is actually slower? Or do we just have a system that consumes more resources at the end of the day for the same net result? 💸

We must do everything we can to ensure that our client-code doesn’t slow down our actual web requests (or background jobs)! How do we do that? We spawn.

Judoscale’s client packages are a bit complicated since we need to support several different frameworks and libraries at the same time and automatically detect which Judoscale clients are running (e.g. the judoscale-rails and judoscale-sidekiq packages are separate but need to just work™ when both are present) — but let me pull back the curtain a bit.

At a high level, Judoscale’s clients spawn a new, separate thread which is responsible for collecting metrics and sending them back to the Judoscale server. We call this thread the ‘reporter’ and we’ve spent a lot of time optimizing it to use as few processor cycles as possible while doing its job.

[Diagram: a background ‘reporter’ thread running alongside the application threads within a single process]

We begin to add some complexity when we consider that many web/application servers these days are both multi-process and multi-threaded within each process! And, of course, we may have multiple tasks running on our hosting provider. We need to be very precise about where, when, and how we set up our reporter, then. We need a reporter for each process that’s handling web-requests, and we need each thread that handles requests to push individual metrics to that process’ reporter.

This is easier to visualize:

[Diagram: multiple tasks, each running a multi-process, multi-threaded application server]

And let’s now add that last bit of logic we discussed: a background reporter (thread) for each process that’s handling web-requests and something in each thread to push individual metrics to the reporter in the background.

[Diagram: a reporter thread in each process, with every request-handling thread pushing its metrics to that process’ reporter]

With all of that in place, the workflow ought to go something like this: when the Rails application boots (we hook in via a Railtie), we spawn the background reporter thread. This thread will live as long as its Rails process (web or background job) does! One caveat worth flagging: threads don’t survive a fork, so if your application server forks worker processes (like Puma in cluster mode), the reporter needs to be started again inside each worker after the fork.
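As a rough sketch (the ZephyrAutoscale module and Reporter class here are hypothetical, and your boot hooks may differ), the Railtie wiring might look something like this:

```ruby
module ZephyrAutoscale
  class Railtie < Rails::Railtie
    # Spawn the reporter thread once the Rails app has finished booting.
    config.after_initialize do
      ZephyrAutoscale::Reporter.start!
    end
  end
end

# For forking servers (e.g. Puma in cluster mode), start it again per worker:
#   on_worker_boot { ZephyrAutoscale::Reporter.start! }
```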

In the meantime, we wire up a small hook that fires for each web request. For Rails, this is Rack middleware. Its job is simple and efficient: right before Rails begins processing any incoming request, the middleware grabs the timestamp of the request and calculates the amount of time that request was waiting to be processed (this is its queue time!). In essence, this is the difference between the time denoted in the X-Request-Start header and now() — the moment when the middleware runs. Finally, the middleware shoves the queue time data into a small in-memory store for the reporter to later read.
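Here’s a hedged sketch of that middleware. MetricsStore is a hypothetical in-memory buffer (the “small memory store” mentioned above) that the reporter drains later, and exactly how your router formats the header will vary by platform:

```ruby
class QueueTimeMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Routers/load balancers stamp X-Request-Start when they receive the
    # request; Rack exposes it as HTTP_X_REQUEST_START, often as "t=<epoch>".
    if (raw = env["HTTP_X_REQUEST_START"])
      started_at = raw.sub("t=", "").to_f
      started_at /= 1000.0 if started_at > 1e11 # some platforms send milliseconds

      queue_time_ms = (Time.now.to_f - started_at) * 1000.0
      MetricsStore.push(queue_time_ms) if queue_time_ms.positive?
    end

    @app.call(env)
  end
end
```

You’d then insert it near the very top of the middleware stack (e.g. config.middleware.insert_before 0, QueueTimeMiddleware) so the timestamp is captured as early as possible.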

Remember, the middleware hook needs to be as minimal as possible and should use the most efficient Ruby calls and math possible. This middleware ought to fully execute in less than one millisecond for any request. Every CPU cycle counts!

The last step in the flow here is the reporter’s reporting. The reporter periodically (we use 10-ish seconds) flushes all the metrics from the in-memory store and POSTs them up to Judoscale.
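A minimal sketch of that reporter thread might look like the following. The endpoint URL and the MetricsStore buffer are hypothetical, and a real implementation would add batching limits, retries, and clean shutdown:

```ruby
require "net/http"
require "json"

module ZephyrAutoscale
  class Reporter
    REPORT_INTERVAL = 10 # seconds

    def self.start!
      @thread ||= Thread.new do
        loop do
          sleep REPORT_INTERVAL
          metrics = MetricsStore.flush # drain and clear the in-memory buffer
          next if metrics.empty?

          Net::HTTP.post(
            URI("https://zephyr-autoscaler.example.com/metrics"), # hypothetical
            { metrics: metrics }.to_json,
            "Content-Type" => "application/json"
          )
        rescue StandardError
          # Swallow reporting hiccups; the reporter must never take down the app.
        end
      end
    end
  end
end
```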

And that’s the bones of the client! A background reporter for each process, a middleware hook that extremely efficiently grabs the queue time for each request from each thread, and a periodic POST to deliver those metrics up to the autoscaler.

If you’re tackling this yourself, make sure that you write some specs around memory leaks, launch failures, and graceful crashing. Trust us, you don’t want your client package to be responsible for killing an application! An application with zero tasks running is way worse than one that can’t autoscale for a while. 😅

The Autoscaler

So you’ve got metrics coming in from each of your processes on each of your tasks, but what to do with them? There are a few things we need to cover in this section: aggregation, statistics, scanning, and storage.

Aggregation

Remember that what’s getting POSTed to our autoscaler periodically is raw queue-time metrics with timestamps. This may be as simple as:

  • 2024-08-01T14:08:31Z - 1ms
  • 2024-08-01T14:03:57Z - 3ms
  • 2024-08-01T13:57:06Z - 2ms
  • (..a few thousand of these..)

Which isn’t inherently useful to our broader goal yet. We can’t usefully scale Zephyr based on individual metrics; we need to aggregate them. But how do we aggregate them?

For our first several years at Judoscale we actually used a fully custom setup. Essentially, we created our own aggregate-level records in dedicated Postgres tables and ran a fleet of background jobs, all operating at different layers, to aggregate individual metrics into roll-up metrics, then into well-indexed time-span metrics.

We don’t recommend that 🙂. There were too many layers, too much complexity to keep in our heads at once, and it had a slight NIH (not-invented-here) smell to it. Aggregating data over time is a well-solved problem these days!

That’s why we switched to a Timescale-based backend a few years back (when we migrated from Rails Autoscale to Judoscale!). Timescale is essentially just a Postgres database plus time-series data-handling and aggregation — batteries included! At its core, Timescale essentially does all of our previous complicated Sidekiq aggregation jobs for us, but down in the DB layer directly for much better performance and stability.

Timescale allows us to throw metrics into ‘raw data’ tables and query roll-up tables directly without having to worry about any of the plumbing or machinery that does the actual aggregation. Put in individual metrics; query “okay what was the average queue time over the last 30 seconds across XYZ tasks”. It’s a neat platform.
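To make that a bit more concrete, here’s a hedged sketch of that Timescale plumbing expressed as a Rails migration. The table, view, and column names are hypothetical, and a production setup would also want compression and retention policies:

```ruby
class CreateQueueTimeRollups < ActiveRecord::Migration[7.1]
  # Continuous aggregates can't be created inside a transaction block.
  disable_ddl_transaction!

  def up
    # Raw metrics land in a Timescale hypertable, partitioned by time.
    execute "SELECT create_hypertable('queue_time_metrics', 'recorded_at')"

    # Timescale maintains the ten-second averages for us, down in the DB layer.
    execute <<~SQL
      CREATE MATERIALIZED VIEW queue_time_10s
      WITH (timescaledb.continuous) AS
      SELECT time_bucket('10 seconds', recorded_at) AS bucket,
             avg(queue_time_ms) AS avg_queue_time_ms,
             count(*)           AS sample_count
      FROM queue_time_metrics
      GROUP BY bucket
      WITH NO DATA
    SQL

    # Refresh the aggregate frequently so the scanner always has fresh buckets.
    execute <<~SQL
      SELECT add_continuous_aggregate_policy('queue_time_10s',
        start_offset      => INTERVAL '1 hour',
        end_offset        => INTERVAL '10 seconds',
        schedule_interval => INTERVAL '10 seconds')
    SQL
  end

  def down
    execute "DROP MATERIALIZED VIEW IF EXISTS queue_time_10s"
  end
end
```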

So that covers aggregation tooling, but let’s talk about statistics.

Statistics

As with any data aggregation, we must pick an algorithm for how to aggregate. Are we after average queue time? 95th percentile queue time? Maximum queue time? Each of these will present slightly different numbers. The most important thing here is considering how each of these different algorithms will represent how Zephyr is performing in real-time.

As any application handles requests over time, it’s important to remember that not all endpoints are created equal. Any typical app is going to have a wide range of fast and slow paths, controllers, and connections. While this doesn’t directly impact queue-time, it certainly can indirectly. A given web process handling one of the slower endpoints can get stuck for a short time and cause other requests headed to that same process to be queued (wait) until the slow request is completed. This is essentially a localized, short-term backup.

All that to say, there’s a natural variance with queue times in web applications. You shouldn’t expect to have a super-steady queue time unless you have an extremely optimized application or you’re very over-provisioned.

This rules out using maximum as our aggregation algorithm. This would essentially cause our application to scale up virtually any time our slower endpoints get traffic. Our queue time aggregate would be inflated according to our slowest endpoints! What’s worse, we almost certainly wouldn’t need to scale up at those points. The application will recover and continue on without issue once the slow request(s) are finished — we don’t need to spin up more machines!

That leaves the 95th percentile (p95) and average as good candidates for our aggregate algorithm. And luckily we have experience with both! We aggregated on p95 for several years then switched to average a couple years back, and we have some opinions. We need to talk about the other major component of statistics in play here first, though: bucket size.

The very premise of an aggregation algorithm is that it condenses information from a wide range of values over a large period of time to a smaller (single) value representing that period of time. When we think about queue time and autoscaling, this means aggregation condenses our mass of datapoints over a bucket-time into a single aggregate value for that bucket. E.g. “our average queue time was 12ms over the last sixty seconds.” But a critical question here is how large to make the bucket — sixty seconds? Two minutes? Thirty seconds? Five seconds!?

It’s a trade-off. Going with a smaller bucket means that your app can have more responsive autoscaling — your queue time suddenly spiking will be far more impactful to a ten-second rolling bucket than a two-minute rolling bucket — but be cautious. Using too small of a bucket, along with too sensitive an algorithm, can make for an autoscaling system that’s altogether too touchy.

To drive this point home, we’ve found over the years that you need a significantly larger bucket when running a p95 aggregate than when running an average. And that’s due to the same logic mentioned above. When we used p95, we ran a two-minute rolling bucket. p95 filters out the 5% most outlying data points and reports the value at that boundary, so its result is driven entirely by the tail of the distribution; using a bucket any smaller than two minutes simply didn’t contain enough data to give a good representation of the app’s queue time — any smaller of a bucket and it behaved a lot more like maximum!

For contrast, since we switched to using an average aggregate, we now maintain a ten-second rolling aggregate! Since the average works across every data point rather than just the tail, there’s plenty of data in even a ten-second window to get a fair representation of the app’s general queue time. And since we’re using a ten-second window, the autoscaling is super responsive! Any major change to traffic or capacity that would require an up-scale can be detected in only a few seconds!
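Here’s a tiny, made-up illustration of why bucket size matters so much for p95 (using the nearest-rank percentile method): with only ten samples in a small bucket, the 95th percentile is literally the slowest sample, while the average barely moves.

```ruby
# Ten queue-time samples (ms) from a small bucket, with one slow outlier:
samples = [1, 2, 2, 3, 2, 1, 3, 2, 2, 250]

sorted = samples.sort
p95 = sorted[(0.95 * sorted.length).ceil - 1] # => 250 (identical to the max!)
avg = samples.sum / samples.length.to_f       # => 26.8 (nudged, not dominated)
```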

So with both bucket-size and algorithm in mind, here’s our suggestion: stick with an average aggregate and use a smaller bucket size for the aggregation — ten to thirty seconds is ideal. This will grant you the benefits of both autoscale responsiveness and an autoscaler that’s not too touchy, thanks to average including all data-points. Plus, it’s cognitively simpler!

Scanning

We’re calling this section ‘scanning’, but in some sense it’s really about timing. And we’re not talking about ensuring that the raw data is ingested into a time-series data structure with the correct timestamps — we rely on Timescale for that! We’re talking about the timing of the machinery that actually ‘watches’ your application.

We just spent several paragraphs above discussing aggregation algorithms and bucket sizes, but once you make that decision, that should all run automatically in your time-series system. What doesn’t happen in that system is the autoscaler’s logic for “constantly check if Zephyr is beyond its thresholds”. The time-series system aggregates the data into a more convenient table, but it’s up to the autoscaler to actually read that table and determine if scale changes need to be made!

And read that table, you must. But we’ve got to be careful with timing here. First, we need to be acutely aware of our time-series system’s own timing. We might have a ten-second aggregation going on, but we need to consider the timing between how the data comes in, the range of possible timestamps of the incoming data, and how far back the time-series system will aggregate. If that sounds complicated, it’s unfortunately because it is. Ultimately, we need to be highly confident that whatever time range we query the time-series data for is fully aggregated with all possible data — we don’t want to query the time-series data for a range that isn’t fully processed yet or is missing data. Autoscaling with an incomplete picture of application health is a big no-no.

So, while this is a little hand-wavy, we’ll need to setup a background job system that checks the aggregate Zephyr metrics periodically to determine whether Zephyr needs to be scaled up or scaled down. In essence, that means we’ll likely setup a job that runs every fifteen seconds (forever!), whose code simply queries the time-series data and determines if the average queue time (across a ten-second window) is too high, just right, or even low. This is what we call ‘scanning’. It’s the constant checking of queue-time for an application. At Judoscale we have highly tuned queries that check the queue time of all of our client applications at once (✨), but in your autoscaler it’ll be much simpler.

This can be over-thought, but at its core it’s a simple idea: every fifteen seconds, query the time-series data and make sure Zephyr’s average queue time is within normal range. If it’s not, scale up or down accordingly!
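As a rough sketch, the scanner could be a self-rescheduling job like this. The threshold values, the cadence, and the Scaler helper are all hypothetical, and the query assumes the queue_time_10s aggregate sketched earlier:

```ruby
class ScanZephyrJob < ApplicationJob
  UPSCALE_THRESHOLD_MS   = 100 # avg queue time above this: add a task
  DOWNSCALE_THRESHOLD_MS = 10  # avg queue time below this: try removing one

  def perform
    # Read the newest bucket we trust to be fully aggregated (see the note
    # above about not scanning a window that's still filling in).
    row = ActiveRecord::Base.connection.select_one(<<~SQL)
      SELECT avg_queue_time_ms
      FROM queue_time_10s
      WHERE bucket <= now() - INTERVAL '10 seconds'
      ORDER BY bucket DESC
      LIMIT 1
    SQL

    if row
      avg = row["avg_queue_time_ms"].to_f
      if avg > UPSCALE_THRESHOLD_MS
        Scaler.scale_up!(:web)
      elsif avg < DOWNSCALE_THRESHOLD_MS
        Scaler.scale_down!(:web)
      end
    end
  ensure
    # Keep the fifteen-second cadence going (a cron-style scheduler works too).
    self.class.set(wait: 15.seconds).perform_later
  end
end
```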

Storage

The final critical component of your autoscaler is storage and monitoring. The last thing you want is a system that will ingest all of your raw data, store it into a finely tuned time-series data structure, scan it for outliers constantly, and even execute up-scales and down-scales on Zephyr’s host… but not have any means of seeing what’s going on!

There are two aspects of this section that we need to cover: storing the data and viewing the data (monitoring). Luckily, on the ‘storage’ side, we’re already storing all the queue-time data. We will want to set up a retention policy on that data so our database doesn’t grow infinitely, but the time-series data is there for the reading!

But we do need to store some other data, too — the scale events themselves! Any time the autoscaler kicks off an up-scale or down-scale for Zephyr, we’ll want to store that event, both as a historical record and because we’ll need to see those events in our monitoring to understand system health better.
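A hedged sketch of recording those events (the schema and model names are hypothetical):

```ruby
class CreateScaleEvents < ActiveRecord::Migration[7.1]
  def change
    create_table :scale_events do |t|
      t.string   :process_type, null: false # "web" or "worker"
      t.integer  :from_count,   null: false # tasks before the change
      t.integer  :to_count,     null: false # tasks after the change
      t.string   :reason                    # e.g. "avg queue time 142ms > 100ms"
      t.datetime :created_at,   null: false
    end
  end
end

# Assuming a matching ScaleEvent model, the autoscaler records one per action:
ScaleEvent.create!(process_type: "web", from_count: 3, to_count: 4,
                   reason: "avg queue time 142ms > 100ms threshold")
```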

Let’s talk monitoring and visualization, then. The reality is that, even with the best autoscaler system running seamlessly and smoothly, your human team will still want a way to open that system, see (visually) how Zephyr’s queue-time health looks over time, see the autoscaling history, and make some changes. The autoscaler needs a UI. And not just any UI, but a UI that renders time-series data charts, scale graphs, and scale-change events that correlate to both! If you just gulped, trust us, we understand. This UI took a long time to build and dial in:

[Screenshot: Judoscale’s dashboard showing queue-time charts, current scale, and scale-change events over a selectable time range]

But we submit it as a guide for the data your team will want to see over time: a selectable time range, the queue-time data for that range, and the scale (and scale changes) throughout that time. We’ll leave the implementation as an exercise for the reader, but we highly recommend spending some quality time on this interface. It’s not a step you’ll be glad you skipped if something goes wrong in the future 😬. The critical question you’ll need to be able to answer after the fact is, “Why did we autoscale here?” That’s the question this UI seeks to answer!

The Platform

We talked early on in this article about needing to run Zephyr on a host that supports programmatic scale changes — that if we can’t change Zephyr’s scale via some API call, we can’t autoscale at all. So we’ll assume here that we’ve got that figured out. What’s more important to discuss in this section is the great responsibility that comes with that great power.

It’s easy to jump to the simple conclusion of, “we scanned the data and it looks like we need to upscale, just fire the API request!” But we’d urge caution there. While idempotency and writing reentrant code are popular topics across several areas of programming, they bear special consideration when writing code that automates scale changes. What ought to happen if your API call to increase your scale count fails? Do you re-run the API call? What if that increases the scale by more than you intended? What even is the current scale of the application, and how can you be sure of that?

These are all pertinent questions that are worth thinking through on your way to building your autoscaler, but we’ve got some tips here too.

First, your platform API calls will fail. Whether it’s sporadic network failures, the platform having an API outage (which happens more often than you might think), or simply application hiccups on your end, you need to count on the fact that the API call to “add another task to the service” will fail, and will do so in varying ways. It may fail but still have increased the scale; it may fail without changing scale at all; it may respond with a success code but not actually apply the scale change. It’s a black box that you don’t control. What we recommend here is double-checking the current scale both before and after you execute a scale-change request. This ensures, first, that the scale change you’re about to execute is still valid (we didn’t already fire up a new task), and second, that the scale-change request succeeded in its intention.
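Sticking with the ECS example from earlier, a hedged sketch of that check-before-and-after pattern might look like this (the names are hypothetical; desired_count is what we asked ECS for, which is the number the autoscaler cares about):

```ruby
def current_desired_count(ecs)
  ecs.describe_services(cluster: "zephyr-production", services: ["zephyr-web"])
     .services.first.desired_count
end

def scale_web_to!(ecs, target)
  before = current_desired_count(ecs)
  return if before == target # someone (or something) already made this change

  ecs.update_service(
    cluster: "zephyr-production",
    service: "zephyr-web",
    desired_count: target
  )

  # Trust, but verify: re-read the desired count after the call.
  after = current_desired_count(ecs)
  raise "Scale change to #{target} didn't take (still at #{after})" if after != target
end
```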

Second, timing is hard, and writing reentrant code that mutates records often requires locking. You’ll want locking. We take advantage of pessimistic, row-level record locking via our database to ensure that while we’re attempting to scale an application up or down, no other resource can attempt to do the same. The last thing you want is multiple copies of a background job all trying to scale up Zephyr at the same time, leaving you with far more tasks running than you intended 😬.
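With ActiveRecord, that can be as simple as wrapping the scale change in a row-level lock. ScaleTarget is a hypothetical record tracking Zephyr’s intended scale, and scale_web_to! is the helper sketched above:

```ruby
target = ScaleTarget.find_by!(process_type: "web")

# with_lock reloads the row under SELECT ... FOR UPDATE, so any other scanner
# or job trying to scale the same process blocks here until we commit.
target.with_lock do
  new_count = target.desired_count + 1
  scale_web_to!(Aws::ECS::Client.new(region: "us-east-1"), new_count)
  target.update!(desired_count: new_count)
end
```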

Finally, the last piece of the platform discussion we want to touch on is noisy neighbors. Depending on which platform and which specific hardware tier Zephyr runs on, you may be sharing hardware with other applications. Any hosting platform that shares hardware across applications is challenged with spreading those resources around equally. It’s a very difficult problem to solve programmatically! Unfortunately, it virtually cannot be done perfectly, and some degree of noisy-neighbor syndrome is likely to occur. We’ve written about Heroku’s noisy neighbors in one of our other blog posts, “How to Fix Heroku’s Noisy Neighbors”, where we introduce the idea of “sniping”. We bring that to your attention here because you may want to consider doing the same thing. If you’re hosting on shared hardware and have the means or API access to restart individual tasks (especially onto a different container host), you may want to do that. We’ll likely write a future article on the topic, but we would stress that the impact a single misbehaving or errant task can have on your overall application health is larger than you’d expect. A noisy neighbor (resource hog) is a big deal!

Roll Your Own!

And there we have it! The three major components that we think are required to roll your own autoscaler, with some batteries-included opinions about the best way to get there. We recommend:

  • Watching application queue-time closely using an as-efficient-as-possible client adapter
  • Aggregating datapoints into useful metrics using a powerful time-series data structure
  • Visualizing and monitoring aggregate data for spikes and threshold breaches to kick off scaling
  • Layering safety, locking, and re-checking all over your platform “change scale!” calls, and keeping a lookout for noisy neighbors

Each of these pieces is a major slice of the autoscaler pie. It’s worth really spending some time to ensure that you’ve gotten each one right! Ample testing, specs, and validations are worthwhile when the risk of a system failure is either lots of money spent on over-provisioning or loss of company revenue by not being able to serve requests (lack of capacity)! Yikes!

If rolling your own autoscaling is in the cards for your team, we fully believe you can do it safely. We hope that our thoughts and advice here help guide that process and offer some tips that might otherwise be overlooked until you’re waist-deep in the autoscaling waters 🌊.

Or… Maybe Not

Of course (though we need not remind you, given that you’re reading this article on our website), if you don’t want to roll your own autoscaler, Judoscale is here. We’re coming up on 10 years of autoscaling experience, iteration, and design tweaks to perfectly dial in each of these systems (and many others!). While we won’t tell you not to build autoscaling yourself (live your best life!), we do recommend giving Judoscale a shot before you dive into that world.

Judoscale’s out-of-the-box adapter packages (for Ruby, Python, and NodeJS) enable you to fire up queue-time autoscaling as soon as they’re deployed, and we even offer response-time autoscaling out-of-the-box for all Heroku applications — no adapter required! The idea is to get you autoscaling quickly, easily, and without having to worry about all of the complexity in this article.

Whichever direction you choose, we’re here if you need a hand! Just grab some time on our calendar or give us a shout. We’d be happy to help.