Post-Mortem: How Heroku Router 2.0 Wrecked Our App

Adam McCrea
@adamlogic

Note (2025 update): We worked extensively with Heroku to isolate the cause of high response times with Router 2.0, and they published a post describing the issue in detail. We are now running Router 2.0 for all of our applications with no issues.
On Saturday, May 11, we decided to try the new Heroku Router 2.0, which is still in beta. The result was 50 minutes of downtime and a completely rebuilt production application, back on the legacy router. Here's our full story.
Enabling Router 2.0
We'd been running Heroku's new router in our staging environment for 24 hours with no issues, so we decided to give it a shot on production. Worst case scenario, we'd just roll it back… right?
Anyway, I settled in with my coffee, looking forward to an easy Saturday morning upgrade.
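For reference, the switch itself is roughly a one-liner. Here's a sketch of what that looks like, assuming the beta was exposed as a Heroku Labs feature named http-routing-2-dot-0 (the feature name is from memory, and the app name is a placeholder):

```bash
# Enable the Router 2.0 beta on the production app
# (feature name assumed; app name is a placeholder).
heroku labs:enable http-routing-2-dot-0 -a our-production-app
```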
I ran the command, and immediately I started seeing slow requests across the board: all dynos, all app endpoints. I assumed the app was just catching up after the switch, much like it has to do after a restart or deploy. Catch-up usually just takes a few seconds, so I waited.
But minutes later, nothing had changed. Our autoscaler (Judoscale, naturally) was scaling us up, but requests were slow no matter how many dynos we were running.
Our application response times looked great; this wasn't an issue with Rails or our database. But overall response times were awful: request queue time was off the charts.
Reverting the change
By this point, our Slack was going crazy with alerts, and my teammate Carlos offered to help. We hopped on a call to investigate it together.
We tried restarting all of our dynos, and we tried deploying a new release, but neither made a difference. We decided to bail and revert to the old router.
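Rolling back is the mirror image of enabling it, something along these lines (same assumptions about the feature name, placeholder app name):

```bash
# Fall back to the legacy router by disabling the beta feature.
heroku labs:disable http-routing-2-dot-0 -a our-production-app
```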
Unfortunately, reverting to the legacy router didn't help at all. We wondered whether we were somehow still on the new router, but our router logs confirmed we were back on the legacy one:
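Router entries show up in the log stream under the heroku[router] source, so the check was roughly this (the sample output line is illustrative, not from our app):

```bash
# Tail the app's logs and filter for router entries.
heroku logs --tail -a our-production-app | grep 'heroku\[router\]'

# Illustrative legacy-router line (values made up):
# heroku[router]: at=info method=GET path="/dashboard" host=example.com
#   dyno=web.3 connect=1ms service=32ms status=200 protocol=https
```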
At this point we updated our status page to notify our customers about the incident. We thought we had an “undo” button if the router migration didn't work out, but we were now in uncharted territory. We were back on the router where we started, but nothing was the same. We had no idea what was going on.
We tried restarting the app again. We tried scaling all dynos down to zero and back up. We re-examined our metrics to make sure it wasn't an upstream database issue. Requests were still fast inside Rails, but they were timing out all over the place.
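Those were all standard Heroku CLI moves, roughly like this (app name and dyno count are placeholders):

```bash
# Restart every dyno in the app.
heroku ps:restart -a our-production-app

# Scale web dynos to zero, then back up.
heroku ps:scale web=0 -a our-production-app
heroku ps:scale web=10 -a our-production-app
```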
As a last resort, we tried re-enabling Router 2.0, but our response and error metrics didn't budge.
Recreating our production app
Our dynos were way over-provisioned. We should have had plenty of capacity for our traffic, but requests were still queuing and timing out. It seemed like a Heroku router issue, and there was nothing we could do about it.
So we reached for the nuclear option: We created a brand new production app on Heroku.
We really had nothing to lose at this point. Our app had been effectively unavailable for 20 minutes, and there was nothing else we could do except open a ticket with Heroku. We simply couldn't wait for that.
Our thinking was: If switching routers somehow hosed our current production app, maybe a fresh app wouldnāt have the same problem.
Fortunately, it wasn't as daunting as it sounds. We don't use many Heroku add-ons, and the ones we do use aren't mission-critical:
- AppOptics for monitoring our infrastructure and performance metrics.
- Scout APM for performance monitoring.
- Judoscale for autoscaling.
Our data stores, error tracking, and log management are all third-party (not add-ons), so all we needed to do was copy over the environment variables from the existing production app.
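The rebuild boiled down to creating a fresh app, copying the config vars over, and deploying the same code. Here's a rough sketch of the idea, with placeholder app names (config values containing spaces or shell metacharacters would need more careful quoting than this):

```bash
# Create the replacement production app (names are placeholders).
heroku apps:create our-production-app-2

# Dump config vars from the old app as KEY=value pairs...
heroku config --shell -a our-production-app > config.env

# ...and apply them to the new app (simple values only; quoting
# gets hairy for anything with spaces or special characters).
xargs heroku config:set -a our-production-app-2 < config.env

# Deploy the same code to the new app (branch name is a placeholder).
git push https://git.heroku.com/our-production-app-2.git main
```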
We made sure the app worked as expected at the direct Heroku URL, then we decided to flip the switch by updating the domains.
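A quick smoke test against the default herokuapp.com hostname, bypassing Cloudflare and our DNS, looked something like this (the URL is a placeholder):

```bash
# Hit the new app directly and check the response status line.
curl -sI https://our-production-app-2.herokuapp.com/ | head -n 1
```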
We updated our CNAME in Cloudflare, and… OOPS!
In our stress and haste, we forgot about the SSL cert!
No problem. We created the origin certificate in Cloudflare, added it to Heroku, and we were in business.
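On the Heroku side, that's a couple of commands, roughly (domain, file names, and app name are placeholders):

```bash
# Point the custom domain at the new app; Heroku prints the DNS
# target to use for the CNAME record in Cloudflare.
heroku domains:add www.example.com -a our-production-app-2

# Upload the Cloudflare origin certificate and its private key.
heroku certs:add origin-cert.pem origin-key.pem -a our-production-app-2
```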
Resolution
We watched as traffic flowed into the new app, and our response times dropped back to their normal levels.
We started breathing a little easier. We continued to monitor the app while we updated our status page and checked for support tickets from affected customers.
We were finally in the clear! The total time of the incident was about 50 minutes: it started at 2024-05-11 12:53 UTC and cleared at 2024-05-11 13:52 UTC. Fortunately the app was partially available for most of that time, so customer impact was minimal.
Lessons learned & next steps
In hindsight, we should have load-tested our staging app with the new router. Our staging app only sees about 5 RPS, while our production app handles 1,200–1,500 RPS. It wasn't fair to say we'd tested the new router by simply throwing it on our staging app.
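Next time we'll put staging under something closer to production traffic before flipping a flag like this. As a sketch, a load-testing tool like hey can hold a sustained request rate against a staging URL (URL, concurrency, rate, and duration are all placeholders):

```bash
# Roughly production-like load for 10 minutes:
# 50 workers, each capped at 25 requests/sec (~1,250 RPS total).
hey -z 10m -c 50 -q 25 https://staging.example.com/
```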
On a positive note, it was super reassuring to know that we can recreate our entire production app within a few minutes! It felt sort of outrageous to do it at the time, but I think it was the right call.
I mentioned that we didn't open a support ticket with Heroku during the incident, but we've since opened one so they can help us understand what happened. If they provide some insight, I'll be sure to update the post.