If you’re lucky enough to build out a successful and enduring SaaS product, chances are you’ll have to deal with technical scaling issues.
An advantage of being in the eCommerce fulfillment space is that large spikes in usage are usually predictable and can be planned for (with a notable exception: pandemics!), so while there are a few spikes throughout the year, there’s one everyone in the space is well aware of: Black Friday & Cyber Monday (Peak Season). From late November up until a day or two before Christmas, usage of our systems spikes by 2-5x.
For ShipHero’s use case, the real-time pressure on our systems comes from people working at the warehouses, not end-users shopping from their homes (we get hit with that as well, but it comes in at different times, so we can control the flow). That gives us an advantage over most systems: physical space limits how much can be done simultaneously. You can only fit so many more people in the same warehouse during peak season.
So, how do we avoid having our systems break when they get several times their regular traffic? Well, there are a few things we do:
We Break it Ourselves Ahead of Time
We look at spikes in usage throughout the year, use our peakiest days as a reference, and target comfortably handling 2-3x that amount of traffic. We do that by artificially sending traffic to a non-production set of services, applying pressure to them, and tracking key performance metrics for the service (primarily, ensuring the servers remain healthy and response times don’t suffer).
If something breaks, we know where we have some work to do, and if it works, we go back and apply more pressure until it fails. We want to know when it’s likely to break and how. This is a relatively common practice known as “load testing.”
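The core loop of a load test like this can be sketched in a few lines of Python. This is a minimal illustration, not ShipHero's actual tooling: the `fake_request` stub stands in for a real HTTP call to a non-production environment, and the percentiles are the kind of response-time metrics you'd watch while ramping up pressure.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request() -> float:
    """Stand-in for a real HTTP call to the non-production environment.
    Returns the observed latency in seconds."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.005))  # simulated service work
    return time.perf_counter() - start

def load_test(concurrency: int, total_requests: int) -> dict:
    """Fire total_requests requests with `concurrency` parallel workers
    and report the latency percentiles we'd track during a real test."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: fake_request(), range(total_requests)))
    return {
        "p50": statistics.median(latencies),
        "p95": latencies[int(len(latencies) * 0.95) - 1],
        "max": latencies[-1],
    }

print(load_test(concurrency=20, total_requests=200))
```

In a real test you would replace `fake_request` with calls against the staging stack and keep raising `concurrency` until response times degrade or servers become unhealthy, which tells you where and how the system fails.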
We also do something that’s often overlooked but has caught more issues than load testing: we scale up our existing production infrastructure beyond what we think we’ll need. So, for example, if we’re usually running 100 AWS EC2 instances for a service at peak hours of the day, we slowly spin up more until we reach 500 EC2 instances, all of which process real production customer requests.
If it goes well, customers get a slight performance improvement that day, and we burn through some money. However, when you add capacity, you start to hit service limits that aren’t related to load, the most common being the number of concurrent connections to databases and AWS default quotas. What we’d sometimes find is that we need to request more IP addresses or a higher Elastic Load Balancer quota, both easy things to handle ahead of time but very stressful and disruptive in the middle of heavy usage.
We’ve also found that we can keep adding servers, but at some point there are too many concurrent connections to the database, and adding more just flat-out breaks everything.
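Back-of-the-envelope arithmetic makes this failure mode easy to see. The numbers below are purely illustrative (not ShipHero's real instance counts, pool sizes, or database limits): each server holds a pool of database connections, so multiplying servers quickly exhausts the database's connection limit even though each server is individually healthy.

```python
def connection_headroom(instances: int, pool_size: int, db_max_connections: int) -> int:
    """How many more connections the database can still accept.
    A negative result means new servers will fail to connect."""
    return db_max_connections - instances * pool_size

# Hypothetical numbers: each instance keeps a pool of 20 connections
# against a database configured for at most 5,000 connections.
print(connection_headroom(100, 20, 5000))  # → 3000: plenty of room at normal scale
print(connection_headroom(500, 20, 5000))  # → -5000: 5x the servers blows past the limit
```

This is why scaling out in production catches issues load tests miss: the bottleneck isn't CPU or traffic, it's a fixed quota that only shows up once enough capacity is attached.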
No Big Changes leading up to Peak Season
What use is stress-testing a system if you make significant changes afterward? Because most of the stress during the busy season falls on the humans at the warehouses, we ensure we don’t make any substantial changes to our system after we try to break it. This means no database version upgrades, performance improvements, or new features. With complex systems, it gets tough to predict how even small changes might affect other parts, so we carefully consider any changes in the 6-8 weeks ahead of Black Friday.
We change as little as possible during Peak Season
We make very few meaningful changes ahead of the busy season, and even fewer during the most active month. For a few years, we froze our codebase entirely and didn’t allow deployments to production. It was a great way to ensure no unexpected changes, but it had the side effect of accumulating bug fixes and minor improvements for a month or two, which isn’t ideal.
So in 2022, and again this year, we’ve switched to setting an exceptionally high bar to land and roll out any code, rather than a complete freeze. That means we expect every piece of code produced to have exhaustive automated tests, manual QA, and a code review by at least two people, one of whom is the domain expert, and we make everyone involved in the process co-responsible for how it affects production. In practice, it can take a week to roll out something that usually takes hours, but we’re OK with that for a month out of the year. Productivity goes down by a ridiculous amount, but in return, our customers stay productive at a time of the year when everyone’s heads-down trying to get packages out the door.
We do other things at the organizational level, like having engineers on-call 24/7 to respond to incidents within minutes and hopefully get ahead of any issues before anyone notices.
It’s something we’re constantly improving. We’ve had years of exceptional reliability during peak season, and the only way to keep it that way is to keep making our internal processes better.
Martin Albisetti, ShipHero Vice President of Engineering