Help!

Troubleshooting Judoscale: Handling Queue Time Spikes and Errors

But First!

Before diving into any particular troubleshooting or debugging questions, please make sure you’re running the latest version(s) of the Judoscale gems for your app. Updating to the latest version has solved many issues for other users in the past!
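
For example, if you’re on Rails with Sidekiq, your Gemfile would typically include the adapter gems along these lines (the exact gem names depend on your framework and job backend):

    # Gemfile (example for a Rails app that also runs Sidekiq workers)
    gem "judoscale-rails"
    gem "judoscale-sidekiq"

Running bundle update judoscale-rails judoscale-sidekiq (adjusted to match your Gemfile) and deploying will pull in the latest releases.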

Why does my queue time spike whenever I deploy my app?

Unless you’re using Heroku’s Preboot feature, your app will be temporarily unavailable while it boots, such as during deploys and daily restarts. During this time, requests are still routed to your web dynos, where they wait for the app to finish booting. All of that waiting shows up in your request queue time, which will likely trigger an autoscale for your app.

This is not a bad thing! Your app autoscaling during a deploy means it’ll quickly recover from the temporary downtime during boot, and of course, it’ll autoscale back down once it catches up.

If you want true zero-downtime deployment, you’ll need to enable Preboot for your app (heroku features:enable preboot).

On platforms that don’t offer a preboot-style deployment (for example, Amazon ECS or Render), the same principle applies: new tasks need time to register with the load balancer before they can absorb traffic. If you’re seeing multiple upscales during each deploy, adjust your upscale frequency so we wait long enough to confirm the first upscale helped. You can also temporarily disable autoscaling with the Settings API at the start of a deploy and re-enable it once the new containers are healthy.
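
As a rough sketch, pausing and resuming autoscaling from a deploy script could look something like this. The endpoint URL, payload field, and JUDOSCALE_API_TOKEN variable below are placeholders rather than the documented API shape, so check the Settings API reference for the exact request:

    # judoscale_settings.rb (hypothetical sketch; the endpoint path, payload
    # field, and token variable are placeholders for the real Settings API)
    require "net/http"
    require "json"
    require "uri"

    # Toggle autoscaling on or off with a single HTTP call.
    def set_autoscaling(enabled:)
      token = ENV.fetch("JUDOSCALE_API_TOKEN")            # placeholder variable
      uri = URI("https://judoscale.example/api/settings") # placeholder URL
      request = Net::HTTP::Patch.new(uri, "Content-Type" => "application/json")
      request["Authorization"] = "Bearer #{token}"
      request.body = JSON.dump(autoscaling_enabled: enabled)

      Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
    end

Call set_autoscaling(enabled: false) right before the deploy starts and set_autoscaling(enabled: true) once the new containers pass their health checks.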

Why didn’t my scheduled scale apply while a Heroku deploy was running?

Heroku blocks formation changes on a Preboot app while a release phase or preboot cutover is still in progress. When this happens, the API returns “Cannot change formation for a preboot app until release phase and preboot complete”. If your Judoscale schedule tries to enforce a new min/max range during that window, Heroku will reject every attempt until the deploy finishes.

Judoscale automatically retries when we see that conflict, and we recently extended the retry window specifically to cover longer release phases. Even so, Heroku won’t allow the change until it has routed traffic to the new dynos, so the schedule may appear “stuck” at the previous limits until the deploy wraps up.

If you need the new range to apply immediately after deploys, keep an eye on how long your release phase runs and aim to keep it under a minute. For longer migrations, consider pausing autoscaling before the release, then re-enabling it (or manually scaling via heroku ps:scale) once Preboot reports healthy dynos. You can also re-save the schedule in the Judoscale dashboard after the release completes to trigger another enforcement pass.
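
Building on the sketch above, a deploy wrapper might look roughly like this. The script names, app name, and the heroku ps:wait step are assumptions to adapt to your own pipeline:

    # deploy.rb (hypothetical wrapper; adjust names and commands to your setup)
    require_relative "judoscale_settings" # the set_autoscaling helper sketched above

    set_autoscaling(enabled: false)                               # pause before the release phase
    system("git push heroku main", exception: true)               # deploy; release phase runs here
    system("heroku ps:wait --app your-app-name", exception: true) # wait for the Preboot cutover
    set_autoscaling(enabled: true)                                # resume autoscaling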

If you routinely see that error outside of deploy windows, drop us a note with timestamps and we’ll investigate the Heroku API responses from our side.

Why is my queue time in Judoscale different from Scout or New Relic?

Request queue time is measured the same way, but aggregated differently. Judoscale aggregates an average every 10 seconds, while other APM tools aggregate in buckets of a minute or more. This makes queue time in Judoscale look a bit more “spiky”, but it also lets us respond faster to a slowdown.
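
To see why finer-grained aggregation looks spikier, here’s a toy illustration with made-up queue-time samples (not real Judoscale or APM data):

    # One minute of per-second queue-time samples (ms), mostly quiet
    # with a 10-second spike, averaged over 10s vs. 60s windows.
    samples = Array.new(60, 5)            # ~5 ms of queue time per second...
    samples[40, 10] = Array.new(10, 300)  # ...except a 10-second spike to 300 ms

    average_over = ->(window) { samples.each_slice(window).map { |s| s.sum / s.size } }

    p average_over.call(10) # => [5, 5, 5, 5, 300, 5]  (the spike stands out immediately)
    p average_over.call(60) # => [54]                  (the same spike is diluted away)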

Why do I see API timeout errors from the Judoscale package in my logs?

The Judoscale adapter package reports metrics to our API every 10 seconds for every web and worker process you’re running at a given time. For an application running 10 web containers with two processes on each container, that’s over 170,000 requests per day, not counting worker processes. This is a lot of API activity!
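
For reference, here’s the arithmetic behind that figure:

    # 10 web containers x 2 processes each, one report every 10 seconds
    containers      = 10
    processes_each  = 2
    reports_per_day = 24 * 60 * 60 / 10                # 8,640 reports per process per day
    puts containers * processes_each * reports_per_day # => 172800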

Our typical response time for the adapter API endpoint is around 5 milliseconds, but at our volume we do see occasional outliers taking several seconds. A few of those inevitably hit the 5-second timeout we specify in our adapters. That can surface as an HTTP failure in your app’s logs, but it’s not a cause for concern: a tiny fraction of failed reports won’t affect your autoscaling behavior at all.

Why do autoscale downsizes sometimes trigger Heroku H18 “Server Request Interrupted” errors?

H18 errors occur when a dyno’s socket is closed before it finishes sending a response. During a downscale, Heroku immediately sends a SIGTERM to the dyno we’re removing and there’s no connection-draining step, so any in-flight or keep-alive requests on that dyno can be interrupted. This isn’t unique to Judoscale: we use the same autoscaling mechanism as Heroku’s own native autoscaler, so unfortunately there’s not much we can do about it.

If you see sporadic H18s during downscales, that’s expected platform behavior rather than a misconfiguration. Monitor the rate, but plan for the occasional interrupted request whenever Heroku removes web dynos under load.

In practice, H18s are rare. Even on our own app, which handles roughly 3,000 requests per second, we only see about 20 H18s per day (roughly one in 10 million requests). That’s generally an acceptable trade-off for the efficiency gains from autoscaling.
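
For the curious, the arithmetic behind that rate:

    # ~3,000 requests/second with ~20 H18s/day
    requests_per_day = 3_000 * 60 * 60 * 24 # => 259,200,000
    puts requests_per_day / 20              # => 12960000 (about one in every 10+ million requests)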