Why Did Rails' Puma Config Change?!


Jon Sully

@jon-sully

Back at the very beginning of 2024, DHH started a conversation in the Rails repository around an important performance topic: how many Puma threads a Rails application ought to run per process by default. If that sounds like a mouthful, that's because it is! As your resident inspectors of all things Rails performance, scaling, and knob-dialing, we perked up as soon as this conversation started. We followed along while it was ongoing, but now that its results are live in Rails 7.2, we think it's a topic worth exploring, explaining, and understanding more deeply.

And, just another quick bit of context: this conversation happened in public among many very smart folks in the Rails community. It's totally viewable on GitHub and will forever remain so. That's awesome! But these brainy folks got into some serious nitty-gritty details and technical concepts that are, perhaps, less approachable to the everyday dev. While we recommend checking out the original thread for all the context and data-goodness, our goal with this article is to understand the idea and the changes in plain English!

Puma: A Review

We're not going to dive too deep into the history here, but Puma is a multi-threaded web (and application) server that can also run multiple processes. For a typical deployment, you've got multiple servers/containers/dynos, each running multiple Puma processes, where each process is then running multiple threads. A picture is much easier to understand:

[Diagram: several dynos/containers, each running multiple Puma processes (workers), each worker running multiple threads]

For the sake of this article and our brains, let’s just assume we’re working with a single container/dyno. The big idea with Puma is that you have two primary knobs to control and tailor your overall application performance: the number of processes running (Puma calls these ‘workers’), and the number of threads running per process (Puma calls these ‘threads’).

The first question is obviously, what’s the correct number of workers and threads?!

And if you're a regular Judoscale reader, you'll know our answer is absolutely going to be "it depends!" But luckily in this case, there are some commonly agreed-upon guide-rails!

When it comes to workers (processes), community wisdom (and our blog) has long held that you should run as many processes as you have CPU cores… and maybe a little more if you can get away with it (that is, if you have enough memory available)! If you're running on a container/dyno with 2 CPU cores, run 2 Puma processes. Try 3 and see if that requires too much memory for your setup, but otherwise stick to 2. And so on. We won't elaborate here, as the number of processes running isn't the central point of the Rails repo conversation we want to summarize.

When it comes to threads, for many years the community essentially settled on 5 as the correct default for applications. So much so that even Heroku's docs for deploying apps on Puma recommend running 5 threads! And indeed, it's what Rails shipped out of the box until now. Where our multiple-process knob (above) is mostly limited by memory, adding more threads to a process tends to be limited by the container's CPU and your appetite for potential latency, not memory (adding threads does add a bit of memory, just much less than adding processes).
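For reference, here's roughly where those two knobs live in a typical `config/puma.rb`. This is a simplified sketch rather than your exact generated file, using the conventional `WEB_CONCURRENCY` and `RAILS_MAX_THREADS` environment variables:

```ruby
# config/puma.rb (simplified sketch, not a drop-in replacement)

# Workers = Puma processes. One per CPU core is the usual starting point.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

# Threads per worker: the knob the Rails discussion ended up changing.
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 3))
threads threads_count, threads_count

# Preload the app so forked workers can share memory via copy-on-write.
preload_app!

port ENV.fetch("PORT", 3000)
```

Whether 2 workers and 3 threads are right for your container is exactly the kind of question the rest of this article digs into.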

So let's talk about those downsides, starting with CPU saturation. On Heroku, CPU usage is aggregated and summarized as a simple "Dyno Load" metric. In short, your goal should be to never run so many threads (or processes, technically) that you exceed Heroku's Dyno Load limit for whichever type of dyno you're using (chart source):

[Chart: Heroku Dyno Load limits by dyno type]

Heroku will throttle your application if you exceed these limits and, we can confirm, there be dragons! It’s always a good idea to check on your dynos from time to time and ensure that you’re still under your Dyno Load limit.

[Screenshot: Dyno Load metrics for a running app, staying under the limit]

But, more importantly, let's talk about the second downside to increasing your thread count: latency. Indeed, latency was the primary driver of the discussion around changing Rails' default number of Puma threads! But let's back up.

Enter: Latency

Let's start with a reminder: CRuby (MRI) only ever runs one thread's Ruby code at a time, thanks to the Global VM Lock (GVL). That means that, while Ruby threads are asynchronous, they are not truly concurrent. So, while Ruby can do two things "at once", it can't do them at the same exact time:

[Diagram: two Ruby threads interleaving on one interpreter; only one runs at any instant]

Any of the things Ruby is doing "at once" are in some partial state of progress but halted while Ruby switches to work on something else. This is similar to JavaScript, human minds, and several other languages! Just for contrast, this is what a truly concurrent flow would look like:

[Diagram: a truly concurrent flow, with both threads running at the same time]
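A quick way to feel the difference is to time some purely CPU-bound work with and without threads. This is just an illustrative sketch (exact numbers will vary by machine), but on CRuby the threaded version is no faster, because the GVL only lets one thread run Ruby code at a time:

```ruby
require "benchmark"

# A purely CPU-bound chunk of work: no I/O, so there's no waiting for the
# threads to overlap.
def crunch
  1_000_000.times.reduce(0) { |sum, i| sum + i * i }
end

single = Benchmark.realtime { 2.times { crunch } }

threaded = Benchmark.realtime do
  2.times.map { Thread.new { crunch } }.each(&:join)
end

# On CRuby both timings come out roughly the same: the two threads take
# turns on the interpreter instead of running in parallel.
puts format("single-threaded: %.3fs", single)
puts format("two threads:     %.3fs", threaded)
```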

But we live in a single-threaded land! So the next best question is: how does Ruby decide when to switch work? That is a great question! The simplified answer is: once a thread starts waiting on I/O. That is, once the actively-running thread starts waiting for a response from a database query, a Redis lookup, an HTTP request, etc. Any time Ruby is no longer actively running instructions but is just waiting on some external thing, the interpreter will switch context to another thread that is ready to run some Ruby code! So, in reality, our diagram looks more like this:

[Diagram: Ruby switching to another thread whenever the current thread starts waiting on I/O]

And that feels nice, right? Ruby isn't wasting time on threads that have nothing to do; it's spreading out its code-crunching ability between threads that can actually use it! That is nice! Threads let us use Ruby's time more efficiently whenever our application code calls external services. And boy, do Rails apps call external services! Almost every Rails request is going to make several (or many) database calls and perhaps some Redis calls. All of these various wait-moments are prime candidates for Ruby to work on something else in the meantime. This is great!

For contrast, think about what it would look like if we only had a single thread running. The waste!

[Diagram: a single thread sitting idle while it waits on I/O]

So running a multi-threaded web server like Puma really does bring about some big efficiency gains.
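Here's a tiny sketch of that efficiency gain, simulating I/O waits with `sleep` (a stand-in for database or HTTP calls, not anything from the Rails thread itself). Because sleeping threads aren't running Ruby code, they can all wait at the same time:

```ruby
require "benchmark"

# Stand-in for an I/O call: ~200ms spent waiting, not running Ruby code.
def fake_io_call
  sleep 0.2
end

# One thread, one call at a time: roughly 5 x 200ms = ~1 second.
serial = Benchmark.realtime { 5.times { fake_io_call } }

# Five threads: all five waits overlap, so the total is roughly 200ms,
# even though only one thread's Ruby code ever runs at a given instant.
threaded = Benchmark.realtime do
  5.times.map { Thread.new { fake_io_call } }.each(&:join)
end

puts format("serial:   %.2fs", serial)   # ~1.00s
puts format("threaded: %.2fs", threaded) # ~0.20s
```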

But, of course, there are downsides. And they can mostly be summarized by this idea: what if the database call in our diagram actually finishes here (blue arrow)?

[Diagram: Thread 1's database call finishing (blue arrow) while Ruby is still busy running Thread 2's code]

That is, not even halfway through the chunk of time that Ruby is working on Thread 2's code? That's the bad news: Ruby won't come back to finish the work in Thread 1 until Thread 2 decides to wait on some I/O. Unfortunately, that means Thread 1 is now spending precious response time doing… nothing. 😓

[Diagram: Thread 1 sitting ready-but-waiting (blue) until Thread 2 reaches its next I/O wait]

Now, in general, this tends to be a pretty rare case. In most Ruby code (especially Rails code), the chunks of actual Ruby code processing tend to be pretty thin, so Ruby is constantly swapping between threads and it’s rare to lose more than a couple of milliseconds in the overall workflow. It looks more like this (see the tiny blue slice):

[Diagram: the typical case, where the wasted time (the tiny blue slice) is only a few milliseconds]

But nonetheless, it can happen, and it’s exacerbated by running even more threads. When Ruby switches to a new thread there’s no guarantee it will switch back to the previous thread, even if it’s ready to process. Other threads might be ready too!

To illustrate this concept, let's consider the following scenario. Here's what it would look like if we ran four threads and they were concurrent-capable (again, Ruby is not). If each request were to process straight through, it'd look like this:

[Diagram: "Example Concurrent Flow" showing four requests processed in parallel across four concurrent-capable threads]

Looks pretty straightforward… but what happens when we bring this same request flow into the Ruby / single-threaded paradigm?

[Diagram: the same four requests interleaved on a single Ruby process running four threads]

Now, I’ll be the first to say that Ruby’s scheduler is almost certainly better than my makeshift diagram-algorithm, and that my chunks of processing time certainly aren’t to scale, but you get the idea. The grand tradeoff here is that, instead of having to run four processes, each with a single thread (which would, in essence, accomplish the ‘Example Concurrent Flow’ above), we ran a single process with four threads. We spent less in overall server costs, but the response time per request rose. Not uniformly — some requests experienced more blue (wasted) time than others, but overall there will be some waiting between threads.

And this is the core premise of running a multi-threaded web server with a single-threaded language. You get to take advantage of time the Ruby interpreter would otherwise spend doing nothing in a single-threaded server, but occasionally that means one (or more) threads could be waiting for the interpreter to become available again. You save on capacity costs since you can handle more requests with fewer processes, but your overall response time will rise a bit, and your maximum response time could increase a lot!

Act Two: p95 Response Time

If you read through the Rails repo conversation in depth, one thing you'll see is that folks are consistently comparing both throughput and a percentile 'latency' metric between different thread counts. We discussed above how running multiple threads allows a single Ruby process to handle many requests at once (-ish), thus increasing the throughput capability of a single Ruby process (nice!). We can measure that fairly easily: just monitor how many requests per second a Ruby process can handle in a benchmark!
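If you want to eyeball that yourself, here's a deliberately crude sketch. It assumes your app is already running locally at `localhost:3000` (a made-up target for illustration) and just hammers it from a handful of threads; the real benchmarks in the Rails thread used proper load-testing tools, so treat this as a toy:

```ruby
require "net/http"
require "benchmark"

uri         = URI("http://localhost:3000/") # assumes a locally running app
requests    = 200
concurrency = 10

elapsed = Benchmark.realtime do
  threads = concurrency.times.map do
    Thread.new do
      (requests / concurrency).times { Net::HTTP.get_response(uri) }
    end
  end
  threads.each(&:join)
end

puts format("~%.1f requests/sec", requests / elapsed)
```

Compare that number across different `RAILS_MAX_THREADS` settings and you're running a (very) miniature version of the benchmarks from the Rails thread.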

But when it comes to the 'latency' side of the equation, we're trying to get a handle on what the overall impact to our system's average response time will be. In the final chart above, did you notice that, while the single Ruby process handled all four requests, all four took longer than in the purely-parallel model? Our average response time increased!

This is what the percentile metrics are attempting to summarize for us. When we look at the 95th percentile, or p95, response time metric, we're essentially observing the response time that all but the slowest 5% of our requests fell under. This gives us a fairly holistic picture of our system's response times without the worst offenders (the slowest requests) included. It's the "almost all of our traffic is under this number" sort of metric. And that's helpful! We'll always have a couple of rough edges. p95 helps remove those edges as distractions and keeps us focused on the rest of the system!
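If a concrete definition helps, here's a small nearest-rank sketch with made-up numbers (your APM's exact percentile math may differ slightly):

```ruby
# Nearest-rank percentile: the value below which `pct`% of samples fall.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.length).ceil
  sorted[rank - 1]
end

# 20 fake response times (ms): mostly fast, with one very slow outlier.
samples = [42, 45, 47, 50, 52, 55, 58, 60, 61, 63,
           65, 68, 70, 72, 75, 80, 85, 90, 110, 1200]

percentile(samples, 50) # => 63  (the median)
percentile(samples, 95) # => 110 (the 1200ms outlier is excluded)
```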

The lesson here: when you change your thread count, make sure you're watching your p95 closely (as well as your other typical metrics and dashboards).

Now: Optimize

So… the question remains: what’s the right number of threads to minimize latency impacts but maximize throughput / capacity gains?

Three! Or at least, that's where the Rails repo discussion landed. But the benchmarks they ran and the data they combed through to reach that conclusion were fascinating.

What's more, running five threads, a default that's been baked into the Rails community for many years, is actually not a great trade-off for most Rails apps! Running five threads tends to increase your average and p95 response times significantly more than you might think, while only allowing a few more requests per second than three or four threads would. That is, mostly slower requests for a very slight gain in how much throughput the Ruby process can handle. That's not great.

Ultimately, Nate Berkopec ran a few benchmarks and summarized the fascinating results himself (quoting him here):

  1. For 50% I/O wait apps, 3 threads in the threadpool gives us ~70% higher throughput for 1.3x the average latency (at that 1.7x throughput).
  2. Increasing threads in the threadpool does not increase average or p99 latency when the server is not heavily utilized. Effectively, in the 0-80% “low utilization” regime, perf looks pretty similar.
  3. At the same # of req/sec, increasing threadpool size doesn’t increase latency by a measurable amount.
  4. When very highly utilized (95%+ in our benchmarks), higher threadcounts “fail harder” with higher latency and higher p99 (what your benchmark showed).
  5. Higher I/O wait apps benefit from higher threadpool sizes.

With all that in mind, I think 3 threads represents an ideal compromise for the average 25-50% I/O wait Rails app on MRI.

So, the first piece of optimizing thread counts in your application is determining how much I/O waiting your app does, on average. Most Rails apps are indeed in that 25-50% band, but yours may skew higher or lower. Grab your favorite APM tool and start inspecting your requests. See how much time is spent (as a percentage), on average, in database queries, Redis lookups, or external HTTP requests.

If you find that closer to 75-85% of request time is being spent in those services, you can likely increase your thread count, since Ruby will be idle more often in those cases. Conversely, if you find that only a small fraction of your request time is spent in I/O, you may actually want to reduce your thread count to keep your response times snappy!
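If you'd rather measure this in-app than eyeball APM charts, here's a rough, hypothetical probe using Rails' built-in instrumentation. Note that `db_runtime` only covers ActiveRecord time, so Redis lookups and external HTTP calls aren't counted; treat the output as a floor, not an exact I/O-wait percentage:

```ruby
# config/initializers/io_wait_probe.rb (hypothetical example name)
#
# Logs roughly what share of each request was spent waiting on the database.
ActiveSupport::Notifications.subscribe("process_action.action_controller") do |event|
  db_ms    = event.payload[:db_runtime].to_f
  total_ms = event.duration
  next if total_ms.zero?

  pct = (db_ms / total_ms * 100).round(1)
  Rails.logger.info(
    "[io-wait] #{event.payload[:controller]}##{event.payload[:action]}: " \
    "#{pct}% of #{total_ms.round}ms in the database"
  )
end
```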

At the end of the day, a fascinating discussion full of data was laid before us in the Rails repo and we all got to benefit from the knowledge being shared! And, thanks to it, Rails now has a new Puma thread count default! 3 indeed. Three should support nearly as much throughput as five did while maintaining much lower and more stable response times across the board. That’s a win in our book!

The original thread is totally worth a deep-dive if the charts and diagrams here made sense, so we do recommend giving that a read. Otherwise, please feel free to reach out to us with any questions if you’re having trouble applying these concepts and/or simply want to boost your performance! The discussion around processes, threads, and performance DevOps is absolutely what Judoscale is here for.