Long-Running Background Jobs
Issues with long-running jobs
Anytime a worker dyno is shut down — due to autoscaling, a deploy, or a daily restart — background jobs are at risk of being terminated.
Heroku and job backends like Sidekiq work together to gracefully handle this in most cases. When Heroku shuts down a dyno, processes are given 30 seconds to shut down cleanly. During this shutdown period, Sidekiq stops accepting new work and allows jobs 25 seconds to complete before being forcefully terminated and re-enqueued.
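Sidekiq's shutdown timeout is configurable, and the 25-second window described above is its default. If you've customized it, keep it below Heroku's 30-second limit so jobs have a chance to re-enqueue cleanly. A minimal `config/sidekiq.yml` sketch (the value shown is just the default, restated for clarity):

```yaml
# config/sidekiq.yml
# Seconds Sidekiq waits for in-flight jobs before forcefully
# terminating and re-enqueuing them. Keep this under Heroku's
# 30-second dyno shutdown window.
:timeout: 25
```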
Sidekiq also recommends that our jobs be idempotent and transactional so that if they are prematurely terminated, they can safely re-run. This is good advice for all job backends on Heroku, since Heroku can reboot your dynos at any time.
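As a rough illustration of idempotency, a job can guard against repeating a side effect it has already performed, so a re-enqueued run after a dyno restart is harmless. This is a hypothetical sketch: `ReceiptJob` and the in-memory `store` are illustrative stand-ins (a real job would `include Sidekiq::Job` and check a database record instead).

```ruby
# Hypothetical idempotent job: re-running it after a mid-flight
# termination produces the same end state, with no duplicate side effect.
class ReceiptJob
  def initialize(store)
    @store = store # stands in for a database table
  end

  def perform(order_id)
    # Idempotency guard: skip work that's already done, so a job
    # re-enqueued after a dyno shutdown doesn't double-send.
    return if @store[order_id]

    @store[order_id] = "receipt sent" # the side effect, recorded once
  end
end

store = {}
job = ReceiptJob.new(store)
job.perform(42)
job.perform(42) # safe re-run: the guard makes it a no-op
store.size # => 1
```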
If we’re following these best practices, we’ll have no issues with long-running jobs and autoscaling worker dynos. Our apps are imperfect, though, so we may find ourselves with long-running jobs that cannot be safely terminated and re-run.
Handling long-running jobs with Judoscale
Judoscale provides a mechanism to avoid downscaling your worker dynos if any jobs are currently running. To enable this option, ensure you’re on the latest adapter version, then:
Add the following option to your initializer file:
```ruby
config.sidekiq.track_busy_jobs = true
```
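In context, the initializer might look like the following. This assumes the standard `Judoscale.configure` block from the adapter's setup instructions and a file path of `config/initializers/judoscale.rb`; adjust to match your app.

```ruby
# config/initializers/judoscale.rb
Judoscale.configure do |config|
  # Report the number of busy (actively running) workers per queue,
  # in addition to the default queue latency metrics.
  config.sidekiq.track_busy_jobs = true
end
```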
This tells the Judoscale adapter to report the number of “busy” workers (actively running jobs) for each queue.
Enable Long-Running Job Support in the Dashboard
Once these metrics are being collected, you’ll see a new advanced setting in the Judoscale dashboard:
Check this option, and you’re good to go! Judoscale will suppress downscaling if there are any busy workers (running jobs) for the relevant queues.
Be careful, though. If you have fairly constant job activity, your workers will never have a chance to downscale. 😬 This feature is intended for queues with sporadic, long-running jobs.
Judoscale can't completely prevent downscaling during active jobs. The track_busy_jobs option reports this data every 10 seconds, so it's possible a long-running job starts between the time data is reported and the time autoscaling is triggered.
Also, while this optional configuration mitigates the issue of automatic downscaling killing long-running jobs, long-running jobs are still a risk: deploys and restarts can still terminate them. If you're able, break up your large jobs into batches of small jobs.
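The batching idea can be sketched roughly as follows. This is a hypothetical illustration: `FAKE_QUEUE`, `enqueue`, and `UserBatchJob` are stand-ins (a real app would call something like `UserBatchJob.perform_async(batch)` via Sidekiq), but the fan-out structure is the point.

```ruby
# Hypothetical sketch: instead of one job processing every record,
# a parent job fans out many small batch jobs. Each small job finishes
# quickly, so a dyno restart risks only one small batch at a time.
FAKE_QUEUE = []

def enqueue(job_class, args)
  FAKE_QUEUE << [job_class, args] # stand-in for Sidekiq's perform_async
end

def fan_out(user_ids, batch_size: 100)
  user_ids.each_slice(batch_size) do |batch|
    enqueue(:UserBatchJob, batch)
  end
end

fan_out((1..250).to_a)
FAKE_QUEUE.length # => 3 small jobs (batches of 100, 100, and 50)
```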