On Saturday, May 11, we decided to try the new Heroku Router 2.0, which is still in beta. The result was 50 minutes of downtime and a completely rebuilt production application, back on the legacy router. Here’s our full story.
Enabling Router 2.0
We’d been running Heroku’s new router in our staging environment for 24 hours with no issues, so we decided to give it a shot on production. Worst case scenario, we’d just roll it back… right?
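For context, the switch itself is a single CLI command. During the beta, Router 2.0 is exposed as a Heroku labs feature; the exact flag name may differ from what's shown here, so check `heroku labs` output before running this yourself:

```bash
# Enable the Router 2.0 beta on an app (feature flag name may differ; list
# available flags with `heroku labs --app your-production-app` first).
heroku labs:enable http-routing-2-dot-0 --app your-production-app
```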
Anyway, I settled in with my coffee, looking forward to an easy Saturday morning upgrade.
I ran the command, and immediately I started seeing slow requests across the board—all dynos, all app endpoints. I assumed the app was just catching up after the switch, much like it has to do after a restart or deploy. Catch-up usually just takes a few seconds, so I waited.
But minutes later, nothing had changed. Our autoscaler (Judoscale, naturally) was scaling us up, but requests were slow no matter how many dynos we were running.
Our application response times looked great, so this wasn't an issue with Rails or our database. But overall response times were awful, and the difference was all request queue time, which was off the charts.
Reverting the change
By this point, our Slack was going crazy with alerts, and my teammate Carlos offered to help. We hopped on a call to investigate it together.
We tried restarting all of our dynos, and we tried deploying a new release, but neither helped at all. We decided to bail and revert to the old router.
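Rolling back is supposed to be just as simple as opting in, which is exactly why we thought we had a safety net (same caveat as above on the flag name):

```bash
# Opt back in to the legacy router by disabling the beta feature.
heroku labs:disable http-routing-2-dot-0 --app your-production-app
```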
Unfortunately, reverting to the legacy router didn't help at all. We wondered whether we were somehow still on the new router, but our router logs confirmed we were back on the legacy one.
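For anyone checking their own apps, the router log stream is where this shows up; tailing it is just a standard `heroku logs` filter, nothing specific to Router 2.0:

```bash
# Stream only the router's log lines (they appear as heroku[router] in the output).
# If --dyno doesn't filter them on your CLI version, grep for "heroku\[router\]" instead.
heroku logs --tail --dyno router --app your-production-app
```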
At this point we updated our status page to notify our customers about the incident. We thought we had an “undo” button if the router migration didn’t work out, but we were now in uncharted territory. We were back on the router where we started, but nothing was the same. We had no idea what was going on.
We tried restarting the app again. We tried scaling all dynos down to zero then back up. We re-examined our metrics to make sure it wasn’t an upstream database issue. Our requests were still performing great in Rails, but requests were timing out all over the place.
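None of this was exotic; it was the standard set of Heroku panic buttons (dyno counts below are illustrative, not our real formation):

```bash
# Restart every dyno in the formation.
heroku ps:restart --app your-production-app

# Scale the web formation down to zero, then back up.
heroku ps:scale web=0 --app your-production-app
heroku ps:scale web=10 --app your-production-app
```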
As a last resort, we even tried re-enabling Router 2.0, but there was no change to our response and error metrics.
Recreating our production app
Our dynos were way over-provisioned. We should have had plenty of capacity for our traffic, but requests were still queuing and timing out. It seemed like a Heroku router issue, and there was nothing we could do about it.
So we reached for the nuclear option: We created a brand new production app on Heroku.
We really had nothing to lose at this point. Our app had been effectively unavailable for 20 minutes, and there was nothing else we could do except open a ticket with Heroku. We simply couldn’t wait for that.
Our thinking was: If switching routers somehow hosed our current production app, maybe a fresh app wouldn’t have the same problem.
Fortunately, it wasn’t as daunting as it sounds. We don’t use many Heroku add-ons, and the ones we do use aren’t mission-critical:
AppOptics for monitoring our infrastructure and performance metrics.
Scout APM for performance monitoring.
Judoscale for autoscaling.
Our data stores, error tracking, and log management are all third-party (not add-ons), so all we needed to do was copy over the environment variables from the existing production app.
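Here's roughly what the rebuild looked like, with placeholder app names and our usual git-based deploy (adapt to your own pipeline):

```bash
# Create the replacement production app.
heroku apps:create your-production-app-2

# Copy config vars from the old app. --shell prints KEY=value pairs, one per line;
# values containing spaces or newlines will need manual quoting before the next step.
heroku config --shell --app your-production-app > prod-config.env
xargs heroku config:set --app your-production-app-2 < prod-config.env

# Point a git remote at the new app and deploy the same code.
heroku git:remote --app your-production-app-2 --remote heroku-new
git push heroku-new main
```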
We made sure the app worked as expected at the direct Heroku URL, then we decided to flip the switch by updating the domains.
We updated our CNAME in Cloudflare, and… OOPS!
In our stress and haste, we forgot about the SSL cert!
No problem. We created the origin certificate in Cloudflare, added it to Heroku, and we were in business.
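If you ever need to do the same cutover, it boils down to adding your domains to the new app, pointing DNS at the targets Heroku gives you, and attaching a cert (hostnames below are placeholders; we used a Cloudflare origin certificate):

```bash
# Add the custom domain to the new app and note the DNS target Heroku prints.
heroku domains:add www.example.com --app your-production-app-2
heroku domains --app your-production-app-2

# After updating the CNAME in your DNS provider to that target, upload the
# origin certificate and key so HTTPS keeps working end to end.
heroku certs:add origin-cert.pem origin-key.pem --app your-production-app-2
```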
Resolution
We watched as traffic flowed into the new app, and our response times dropped back to their normal levels.
We started breathing a little easier. We continued to monitor the app while we updated our status page and checked for support tickets from affected customers.
We were finally in the clear! The incident ran from 2024-05-11 12:53 UTC to 13:52 UTC, just under an hour, with roughly 50 minutes of degraded service. Fortunately the app was partially available for most of that time, so customer impact was minimal.
Lessons learned & next steps
In hindsight, we should have load-tested our staging app with the new router. Our staging app only sees about 5 RPS, while our production app handles 1,200–1,500 RPS. It wasn't fair to say we'd tested the new router by simply throwing it on our staging app.
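Next time we'll put production-like traffic on staging before flipping anything. A rough sketch with a load generator like `hey` (tool and numbers are illustrative; anything that can sustain our production request rate against a representative endpoint would do):

```bash
# Drive roughly 1,500 requests/second at staging for five minutes:
# 100 concurrent workers, each rate-limited to 15 requests/second.
hey -z 5m -c 100 -q 15 https://staging.example.com/some-representative-endpoint
```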
On a positive note, it was super reassuring to know that we can recreate our entire production app within a few minutes! It felt sort of outrageous to do it at the time, but I think it was the right call.
I mentioned that we didn’t open a support ticket with Heroku during the incident, but we’ve since opened one so they can help us understand what happened. If they provide some insight, I’ll be sure to update the post.