Outgrowing Heroku: How TeePublic Conquered Black Friday with Amazon ECS

Adam McCrea

@adamlogic

Picture this: You're running TeePublic, a bustling online marketplace where artists sell their creations. It's Black Friday, and your website is swarming with customers eager to snag holiday deals. But there's a problem—your hosting platform, Heroku, is struggling to keep up. Pages load slowly, errors pile up, and your phone won't stop buzzing. It's a nightmare scenario for any e-commerce site.

E-commerce web servers are overwhelmed by frustrated customers

TeePublic isn't just another online store. It's a vibrant marketplace where artists share their designs, and shoppers find unique apparel and accessories. But with great popularity comes great challenges, especially during the holiday season when sales skyrocket.

Here's Matt Tarantino, dev-ops tech lead at TeePublic:

During our holiday season—which starts Black Friday and runs through Christmas—we’re in total code freeze, just focused on performance.

Recognizing the need for change, TeePublic turned to AWS, with a bit of help from Judoscale. This wasn't just about moving to a new hosting service; it was about ensuring the platform could scale dynamically, making every transaction seamless, regardless of traffic volume.

This is the story of how TeePublic conquered Black Friday by migrating from Heroku to AWS, turning their busiest sales day from a challenge into a triumph.

The Limitations of Heroku

TeePublic's Marketplace app is the heartbeat of their operation, hosting thousands of artist storefronts and a dizzying array of unique products. On Heroku, the setup was straightforward for a Ruby on Rails monolith: web dynos (Heroku's term for its containers) served the web app, while worker dynos processed background jobs via Sidekiq. Heroku also provided data services for Postgres and Redis.

TeePublic's Marketplace application architecture on Heroku

However, as TeePublic's community grew, so did the limitations of their Heroku infrastructure. Scalability, performance, and cost became significant concerns. During high-traffic events like Black Friday, the platform struggled to keep up. Heroku's dyno model, although initially a boon, became a bottleneck; there was a hard cap on scalability.

We needed more dynos to support the traffic, but we couldn't get them out of Heroku.

Heroku limits applications to 100 dynos, and TeePublic was hitting that ceiling during peak times.

Database connections were also an issue—Heroku's Postgres database has a hard limit of 500 connections. TeePublic was using PgBouncer to create a larger pool of connections, but it still wasn't enough.

We had a primary and three replicas to get around the Postgres connection limits. Our app had a bunch of code to be able to support multiple replicas. All of that was just extra overhead, and every replica adds additional cost.
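To see why connection limits bite at this scale, here is a rough back-of-the-envelope sketch. The 500-connection limit and the 100-dyno ceiling come from the figures above; the processes-per-dyno and threads-per-process values are hypothetical placeholders, not TeePublic's actual configuration, and the calculation ignores PgBouncer pooling and any connections held by worker dynos.

```python
import math

# Figures quoted in the article.
HEROKU_POSTGRES_CONNECTION_LIMIT = 500
MAX_WEB_DYNOS = 100

# Hypothetical values chosen only for illustration.
ASSUMED_PROCESSES_PER_DYNO = 2
ASSUMED_THREADS_PER_PROCESS = 5


def connections_needed(dynos: int, processes_per_dyno: int, threads_per_process: int) -> int:
    """Worst-case connection demand if every web thread holds its own connection."""
    return dynos * processes_per_dyno * threads_per_process


def databases_needed(total_connections: int, per_database_limit: int) -> int:
    """Minimum number of database servers needed to stay under the per-server limit."""
    return math.ceil(total_connections / per_database_limit)


if __name__ == "__main__":
    demand = connections_needed(
        MAX_WEB_DYNOS, ASSUMED_PROCESSES_PER_DYNO, ASSUMED_THREADS_PER_PROCESS
    )
    servers = databases_needed(demand, HEROKU_POSTGRES_CONNECTION_LIMIT)
    print(f"Worst-case connections: {demand}")
    print(f"Database servers needed at {HEROKU_POSTGRES_CONNECTION_LIMIT} each: {servers}")
```

Even with these modest assumed numbers, unpooled demand exceeds a single server's limit, which is the kind of pressure that pushes a team toward connection pooling and read replicas.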

TeePublic's Heroku site issues on Black Friday

These limitations underscored the urgent need for a more robust, scalable solution to handle peak loads without sacrificing performance or breaking the bank. It was clear: for TeePublic to continue thriving, a migration was necessary. AWS promised not just scalability and performance improvements but also a more cost-effective infrastructure suited to TeePublic's expanding needs.

The Decision to Migrate to AWS

The decision to migrate to AWS marked a pivotal moment in TeePublic's journey, signaling a commitment to scalability, performance, and operational efficiency. Leading this ambitious endeavor was the newly formed ops team, tasked with navigating the complexities of the migration and ensuring a smooth transition.

While many cloud providers can handle Ruby on Rails applications at scale, AWS emerged as the clear choice, primarily for its unmatched scalability, robust ecosystem, and the flexibility it offered for managing high-traffic loads.

Cloud hosting options for Ruby on Rails

Within AWS, the team settled on ECS with Fargate, which runs containers without requiring the team to manage the underlying servers. Its allure lay not just in this serverless nature but also in its familiarity. Several team members had prior experience with ECS, which reduced the learning curve and smoothed adoption. This prior knowledge, coupled with the desire for a solution that could automatically scale resources as needed, made ECS Fargate an obvious choice over other cloud hosting options.

When I had used ECS in the past, Fargate was not a thing yet—you had to maintain your own fleet of EC2 instances. And so when it came time to figure out where we wanted to go, serverless was just the easiest answer there.

Despite the confidence in AWS and ECS Fargate, the migration was not without its challenges and concerns. The team had to consider the potential for downtime, data migration complexities, and the learning curve associated with adopting new technologies and practices. There was also the overarching goal of maintaining the same level of performance, if not improving it, without significantly increasing costs.

Planning and Executing the Migration

The migration from Heroku to AWS was a meticulously planned operation. The dev-ops team, with Matt Tarantino at the helm, charted a course designed to keep the transition as smooth as possible.

The first step was to test the waters by migrating a small internal tool to AWS. This "guinea pig" project allowed the team to gain valuable insights and refine their migration strategy before tackling the main Marketplace app.

We started off migrating a very small internal tool. Went fine, didn't really experience any issues with that, and we felt pretty good about it. So then we went and did Marketplace. That was kind of the big one that took the most time.

For a seamless transition, TeePublic leveraged Terraform—the popular infrastructure-as-code (IaC) tool—to manage their AWS resources. By codifying their infrastructure, the team could easily version, replicate, and manage the AWS setup, significantly reducing the risk of human error and streamlining the deployment process. Additionally, they developed internal tools to facilitate the developer environment setup, ensuring that the team could hit the ground running on the new platform.

Despite the team's thorough planning, the migration process encountered several challenges. One major concern was the cost of data transfer: unlike Heroku, AWS charges for external data transfers. They chose Crunchy Data to host their Postgres databases, and even though Crunchy Data ran in the same AWS region, the two networks initially weren't VPC-peered, so database traffic was billed as external transfer. They resolved this with a private link that avoided those charges, but it was a painful and expensive lesson.

Another hurdle was ensuring minimal downtime and maintaining data integrity during the migration. The ops team meticulously planned the database migration with the involvement of Crunchy Data for their Postgres databases. They employed real-time data replication from Heroku to AWS, facilitated by Crunchy's expertise, and carefully scheduled cutover windows. This approach minimized the impact on the marketplace's operation, allowing TeePublic to transition smoothly without compromising their service or data integrity.

TeePublic's new app architecture on AWS

The planning and execution of the migration to AWS were testaments to TeePublic's commitment to providing a stable, scalable platform for their community of artists and shoppers. By addressing each challenge with thoughtful solutions and leveraging powerful tools like Terraform, the team not only navigated the complexities of the migration but also laid a solid foundation for future growth and innovation on AWS.

Autoscaling ECS with Judoscale

On Heroku, Judoscale was invaluable for TeePublic, providing dynamic scaling that kept their platform humming. However, when they migrated to AWS, they faced a challenge: Judoscale wasn't initially available for ECS. TeePublic had to rely on manual scaling and overprovisioning to avoid downtime, and their compute costs were massive.

TeePublic worked closely with the Judoscale team to develop an autoscaler tailored for ECS, addressing a critical gap in their infrastructure. This partnership allowed them to implement a queue time scaling approach, moving beyond the limitations of manual scaling and traditional, less responsive metrics like CPU or memory usage.

Before we had started talking, there was a period where we were going to have to use CPU or memory to be able to determine auto scaling. And it wasn't exactly what we needed. Could it have done the job? Maybe. But it's not looking at queue time.
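The idea behind queue time scaling can be sketched in a few lines. This is a minimal illustration, not Judoscale's actual algorithm: it assumes an upstream proxy stamps each request with an `X-Request-Start` header containing epoch milliseconds (a common convention; the header and format used in TeePublic's setup are not stated), and the thresholds and task limits are hypothetical defaults.

```python
import time
from typing import Optional


def queue_time_ms(request_start_header: str, now_ms: Optional[float] = None) -> float:
    """Milliseconds a request waited before the app began handling it.

    Assumes the header value is an epoch timestamp in milliseconds.
    Negative results (clock skew between hosts) are clamped to zero.
    """
    if now_ms is None:
        now_ms = time.time() * 1000
    started_ms = float(request_start_header)
    return max(0.0, now_ms - started_ms)


def desired_task_count(
    current: int,
    avg_queue_ms: float,
    upscale_above_ms: float = 100.0,   # hypothetical threshold
    downscale_below_ms: float = 10.0,  # hypothetical threshold
    min_tasks: int = 1,
    max_tasks: int = 50,
) -> int:
    """Step the task count up or down by one based on average queue time."""
    if avg_queue_ms > upscale_above_ms:
        target = current + 1
    elif avg_queue_ms < downscale_below_ms:
        target = current - 1
    else:
        target = current
    # Keep the result inside the configured bounds.
    return max(min_tasks, min(max_tasks, target))


if __name__ == "__main__":
    waited = queue_time_ms("1700000000000", now_ms=1700000000250)
    print(f"Queue time: {waited} ms")
    print(f"Desired tasks: {desired_task_count(current=4, avg_queue_ms=waited)}")
```

Queue time rises as soon as requests start waiting for a free process, so it reacts to saturation directly, whereas CPU and memory can look healthy while requests are already backing up. In an ECS setup, the resulting number would typically be applied as the service's desired task count.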

The impact of integrating Judoscale into their AWS environment was profound. TeePublic handled traffic surges seamlessly, improving site reliability and user experience. The autoscaler also optimized resource usage, cutting nearly $10,000/month in compute costs by eliminating the need for overprovisioning.

Judoscale scaling ECS services based on queue time

This collaboration not only solved TeePublic's scaling challenges but also paved the way for Judoscale's expansion into AWS, highlighting the power of partnership and innovation in overcoming technical obstacles.

Black Friday: A Testament to Success

The ultimate test of TeePublic's migration to AWS came during the Black Friday/Cyber Monday (BFCM) weekend of 2023. This event was not just a sales opportunity but a critical moment to validate the effectiveness of their new infrastructure. Remarkably, the weekend passed without a single incident, a stark contrast to previous years, when the team braced for issues and interruptions.

The difference between the Heroku and AWS setups was night and day. On AWS, TeePublic could scale to meet demand without running into the dyno ceiling they had hit on Heroku, ensuring smooth and fast user experiences even at peak traffic. The autoscaling provided by Judoscale played a pivotal role in this success, allowing the platform to adapt in real time to the influx of customers.

E-commerce web servers are serving happy customers

The metrics from this Black Friday spoke volumes about the migration's success. Matt Tarantino, the tech lead, hinted at the scale of improvement, noting that this Black Friday saw the highest number of orders in TeePublic's history. The seamless handling of this sales volume and traffic on AWS marked a significant milestone for TeePublic, showcasing the robustness and scalability of their new hosting environment.

The Impact of AWS and Judoscale on TeePublic

The migration to AWS, enhanced by Judoscale's auto-scaling capabilities, marked a transformative period for TeePublic, yielding significant operational improvements across the board. This transition not only addressed the immediate challenges of scalability and performance but also laid the groundwork for a more resilient, efficient, and developer-friendly infrastructure.

Feedback from the development team has been overwhelmingly positive. The move to Docker-based environments, both in production and development, standardized workflows and reduced the time spent on environment setup and troubleshooting. This shift not only improved productivity but also fostered a more collaborative and innovative engineering culture.

I definitely feel like we've come a lot further as an engineering group. Our tools have gotten better. Our just overall composition of everything, it just feels like it's tighter, it's cleaner.

From a financial perspective, the move to AWS, facilitated by Judoscale, led to notable cost savings and efficiencies. The dynamic scaling capabilities ensured that TeePublic paid only for the resources they needed when they needed them, avoiding the cost of overprovisioning while maintaining the flexibility to scale during peak times. This optimized approach to resource management translated into direct cost savings, making the platform not just more scalable and performant but also more cost-effective.

Reflecting on the migration, TeePublic's tech team viewed it as a resounding success. The journey taught them valuable lessons in planning and execution and the importance of choosing the right partners and tools for scaling in the cloud. The move to AWS and Judoscale not only addressed their immediate needs but also laid a solid foundation for future growth and innovation.

Wrapping Up: Lessons and Insights from TeePublic's Cloud Journey

TeePublic's journey to AWS is a masterclass in strategic cloud migration. The phased approach, starting with a less critical system, let the team assess risks and learn the new platform before the stakes were high. This careful planning, combined with the expertise of a dedicated dev-ops team, ensured a seamless transition and a robust foundation for future growth on AWS.

For companies contemplating a move from Heroku to AWS, TeePublic's experience is instructive. If you're pushing against Heroku's limits and can dedicate resources to a specialized dev-ops team, TeePublic's story demonstrates the potential rewards of meticulous planning and choosing the right partners.

I feel like AWS definitely gives you more flexibility in every area. Yes, there's more that you have to kind of think about. But you're in control of those areas.

Ultimately, TeePublic's migration resulted in significant gains in scalability, performance, and operational efficiency. It's a compelling testament to the strategic advantages of cloud migration, executed with precision and the proper support.