Post-mortem: No upscaling for 12 hours

Adam McCrea (@adamlogic)

It’s an embarrassing day for Judoscale. Last night through this morning we had our longest and most severe production incident in our history, and we didn’t know anything was wrong for almost 12 hours. It was caused by some unexpected data and a line of code that never should have been written.
In this post I’ll air our dirty laundry and tell you exactly what happened, where we screwed up, and how we’re fixing it.
The timeline
- 00:25 UTC: Upscaling stopped working for most Judoscale customers. We were not aware of this at the time.
- 12:00 UTC: Carlos begins his day and opens our support queue to find 30 new messages (1-2 is typical). He updates our status page and begins investigating.
- 12:36 UTC: Carlos identifies and removes the data that broke upscaling.
- 12:43 UTC: Carlos confirms that upscaling is working and updates the incident.
- 13:25 UTC: I finally get to my computer and Carlos catches me up via Tuple. We make a plan.
- 13:52 UTC: I update the incident with a summary and mark it resolved.
How did upscaling fail?
Let’s start with an overview of how Judoscale works:
- As we ingest our customers’ metrics, we aggregate the data into 10-second buckets.
- Our autoscaling jobs (separate for upscaling and downscaling) run every 10 seconds.
- For each customer with autoscaling enabled, our autoscaling jobs compare their recent metrics with the thresholds they’ve configured, and may or may not trigger a separate job to execute a scale event.
The 10-second buckets are key—they’re what allow us to be far more responsive than any other autoscaler we’ve seen. But when checking the most recent metrics, we need to make sure we’re looking at the right bucket.
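To make the bucketing concrete, here’s a minimal sketch in plain Ruby (an illustration only, not our actual ingest code): a metric’s timestamp is truncated to the start of its 10-second bucket.

```ruby
# Minimal sketch of 10-second bucketing (illustration only, not our ingest code).
def bucket_for(timestamp)
  Time.at((timestamp.to_i / 10) * 10).utc
end

bucket_for(Time.utc(2025, 5, 1, 12, 0, 7))   # => 2025-05-01 12:00:00 UTC
bucket_for(Time.utc(2025, 5, 1, 12, 0, 14))  # => 2025-05-01 12:00:10 UTC
```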
Our ingest pipeline is fast, but it’s not instantaneous. Metrics data in our database might be a few seconds delayed, so we can’t assume that the latest metrics bucket is the current time bucket. We approached this by fetching the latest metric in our database at the beginning of our upscale job:
```ruby
latest_metric = metric_class.where(time: METRIC_RECENCY.ago..).order(:time).last
```
There are two problems here:
- There’s no upper bound on the time condition, so a metric inserted with a far-future timestamp would (and did) cause our autoscaling algorithm to check for metrics in the wrong bucket.
- There’s no customer isolation—one customer can (and did) send us data that impacts other customers.
It somehow slipped past me at the time, even though I can look at it now and ask “WTF WAS I THINKING?!!”
👀 Note
It’s only our query for the latest time bucket that failed to isolate customers from one another. A separate query determines whether to autoscale, and that query is properly isolated.
So here’s what happened (with a short illustration after the list):
- A customer sent us metrics data with a far-future timestamp. It appears they sent it from their laptop, and it’s still unclear to us how or why they sent us this data.
- We ingested this data into our database (which we should not do).
- Our autoscaling algorithm treated that as the “latest metric bucket” (which it should not do).
- The autoscaler found no matching metrics in that bucket, so it did not trigger upscaling.
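Here’s a rough, plain-Ruby illustration of that chain of events. The timestamps and values are made up, and the hash array stands in for our actual metrics table:

```ruby
# Illustration only: made-up timestamps, plain Ruby instead of ActiveRecord.
now = Time.utc(2025, 5, 1, 0, 25)
metrics = [
  { time: now - 5,              value: 120 }, # a normal, recent metric
  { time: Time.utc(2030, 1, 1), value: 1 },   # far-future metric from one customer
]

# Equivalent of the unbounded "latest metric" query: no upper bound on time,
# and no per-customer scoping.
latest = metrics.select { |m| m[:time] >= now - 300 }.max_by { |m| m[:time] }
latest[:time] # => 2030-01-01 00:00:00 UTC

# The upscale job then looks for metrics in the 10-second bucket around
# 2030-01-01, finds none for any customer, and never triggers upscaling.
```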
Was downscaling impacted?
Downscaling was not impacted because our downscaling algorithm works a bit differently. We have to look at a much larger time range for downscaling—it’s customizable per customer, but generally we check a span of 3-15 minutes to ensure the target metric has “settled” enough to downscale.
With a larger time span like that, we don’t have to worry about a few seconds of ingest lag, so we don’t check for recent metrics like we do for upscaling.
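For context, the downscale check is shaped roughly like the sketch below. This is only an outline: `service`, `downscale_window`, `downscale_threshold`, and `settled?` are illustrative names, not our actual schema or algorithm.

```ruby
# Sketch of the downscale-side lookup. Because the window is minutes wide,
# a few seconds of ingest lag doesn't matter, so there's no "latest bucket" step.
window = service.downscale_window # e.g. 5.minutes, configured per customer
recent = metric_class.where(service: service, time: window.ago..Time.current)

# Downscale only if the target metric has stayed below the threshold for the
# whole window, i.e. it has "settled".
downscale!(service) if settled?(recent, service.downscale_threshold)
```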
How did it take so long for us to notice?
We have a lot of checks in place to alert us when something goes wrong. We have a health-check page that runs domain-specific queries, an observability dashboard with threshold alerts, and platform-specific alerts on Heroku. All of this has served us well for catching production issues and fixing them quickly, but we didn’t have this scenario covered.
Here’s just a sample of the scenarios we have covered with our monitoring:
- One of our scheduled jobs is no longer running
- An ingest queue has backed up
- A Sidekiq queue has backed up
- One of the platform APIs we use for scaling is down
- Our app is slow to respond
But this scenario was different. All of our jobs were running as expected, and we had all the metrics data we expected. Autoscaling was still partially working—downscaling was working fine, and even upscaling was working for some customers, depending on the type of metric they use for autoscaling.
So our alerting just missed it. Our customers noticed, but we didn’t.
Also, the timing couldn’t have been worse. The incident started soon after Carlos and I had logged off for the day. Since we didn’t have alerting set up for this exact scenario, we weren’t aware of the issue until checking our support queue this morning.
What are we doing about it?
We’ve already implemented the following code changes (sketched after the list):
- Ignore future metrics during ingest.
- Never look at future data for autoscaling.
- Isolate customer data when autoscaling—one customer’s data should never impact autoscaling for another customer.
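In rough terms, the corrected lookup now looks something like this. It’s a sketch rather than our exact code: `service` is an illustrative scope, and `MAX_CLOCK_SKEW` is an illustrative constant.

```ruby
# Sketch of the fixed query: a bounded time range, scoped to the customer
# being autoscaled, so one customer's data can't affect another's.
latest_metric = metric_class
  .where(service: service)
  .where(time: METRIC_RECENCY.ago..Time.current)
  .order(:time)
  .last

# And at ingest (also a sketch): drop anything timestamped in the future,
# allowing for a little clock skew.
datapoints.reject! { |dp| dp[:time] > Time.current + MAX_CLOCK_SKEW }
```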
For alerting, we’ve set up anomaly detection based on actual scale event data. With over 1,000 teams using Judoscale to autoscale their apps, our scaling trends are very consistent. If the trend deviates, it almost certainly means something is wrong, and now we’ll know about it immediately.
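The shape of that alert is roughly the following. It’s a simplified sketch: `ScaleEvent` and `notify_team!` are illustrative names, and the thresholds are placeholders rather than our tuned values.

```ruby
# Simplified sketch: compare the last hour of scale events against a rolling
# hourly baseline from the previous week. A sharp drop means autoscaling has
# likely stalled, even if every job and queue looks healthy.
recent_count   = ScaleEvent.where(created_at: 1.hour.ago..Time.current).count
baseline_count = ScaleEvent.where(created_at: 8.days.ago..1.day.ago).count / (7 * 24.0)

if baseline_count > 0 && recent_count < baseline_count * 0.25
  notify_team!("Scale events: #{recent_count} in the last hour vs ~#{baseline_count.round} expected")
end
```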
Did anything go right?
We have a lot to be embarrassed about, but I feel good about two things: teamwork and transparency.
I’m on support this week (Carlos and I alternate weeks), and this morning he had my back (as he always does). I had an early morning appointment and didn’t get to check the support queue like I usually do first thing in the morning. Carlos checked in anyway and immediately saw there was a problem. He found the issue on his own and had it resolved before I was back at my computer. He updated our status page and was already responding to customers. Our team is indeed small but mighty!
Ultimately this is my company and I fully own this failure. I can’t ignore it, and I won’t try to sweep it under the rug. As much as it kills me to damage our customers’ trust, I owe it to them to be fully transparent about our mistakes. It’s humbling to write this post, but I also know it’s the right thing to do.
If you still have questions or concerns about what happened, please email me. I’ll be as transparent as I can.