Updated: Aug 24
In a recent report by AWS, 51% of survey respondents sat at two nines (99%) availability, and just 14% met four nines (99.99%). Recent social media discussions also contemplate how the cost of availability might follow an exponential trend. This blog examines how modern technologies and proactive performance optimization can help more websites achieve higher availability without skyrocketing costs. The intended audience is engineers, executives, and investors who need to raise the bar on their company's website availability or rein in excessive costs.
For modern websites, Google defines availability as a request success rate, i.e., the number of successful requests divided by the total requests. In practice, the site is polled every minute, often by a third party (e.g., Catchpoint, CloudWatch Synthetics). Webpages are ideally measured as a whole rendered page. Critical APIs should be measured and rolled up separately from web pages. Failed requests need not be contiguous (back to back) to count against availability. So, a site or API with a 3% error rate has an availability of just 97%.
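The success-rate definition above is easy to make concrete. The sketch below assumes a site polled once a minute over a 30-day month and shows how a 3% error rate maps to 97% availability; the numbers are illustrative, not from any specific monitoring product.

```python
def availability(successful: int, total: int) -> float:
    """Availability as a request success rate: successful / total."""
    if total == 0:
        return 1.0  # no checks ran, so nothing failed
    return successful / total

# A minute-polled site over a 30-day month: 43,200 checks.
total_checks = 30 * 24 * 60           # 43,200
failed = int(total_checks * 0.03)     # a 3% error rate -> 1,296 failed checks
print(f"{availability(total_checks - failed, total_checks):.2%}")  # 97.00%
```

Note that the failures need not be one contiguous outage; 1,296 scattered one-minute failures produce the same 97% figure as a single 21.6-hour outage.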
100% availability is not a realistic goal, but that doesn’t mean we need to settle for sketchy service. Over the past 10 years, we have looked at availability measurements for thousands of public-facing websites and API services and compared the availability of static files cached on multiple Content Delivery Networks (CDNs). Although the bar has risen over the years for the entire Internet ecosystem, the order of magnitude difference in availability between static files, APIs, and webpages has been consistent.
Static files cached on CDN are the most reliable. 99.99%+ availability for cached files across a month can be expected from the top CDN providers.
Cloud-based microservice APIs are more reliable than legacy on-prem services and more than whole pages. Well-built API services can be expected to reach at least 99.9% availability.
A dynamic web page's availability depends on the availability of every item and API call on the page, which can number in the hundreds. It is also code running across an extremely wide range of user hardware and software variations. 99.5% or even 99% is too high a bar for some complex web pages.
The takeaway: the simpler and more cacheable a site is, the more reliable and available it becomes. More complexity at transaction time means more potential for errors.
Steps To Improve Availability
Consider the table shown here with tiers of availability using the familiar "9's" notation. We can identify common traits of fundamental design and operational techniques for each level of availability. While this categorization is grossly oversimplified, it shows a general pattern replicated by hundreds of development teams we have worked with.
The "Must, Should" colors and number values (for those colorblind) are there to set expectations for where data suggests we can pragmatically set the bar for availability. First, web pages, APIs, and static files are given progressively (10x) higher availability goals. For example, given the level of redundancy built into the fabric of every Content Delivery Network, static files "Must" have more than 99.5% availability, and a static file on a CDN "Should" measure above 99.99%. APIs have a lower goal than static files but are generally 10x more available than dynamic web pages, which might load and process a hundred individual files.
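One way to make these "9's" tiers tangible is to translate each percentage into its monthly downtime budget. This short sketch (assuming a 30-day month) shows why each additional nine is so much harder: the allowed downtime shrinks by roughly 10x per step.

```python
# Monthly downtime budget implied by each availability tier ("9's" notation).
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

for pct in (99.0, 99.5, 99.9, 99.95, 99.99):
    allowed = MINUTES_PER_MONTH * (1 - pct / 100)
    print(f"{pct}% -> {allowed:6.1f} minutes/month of downtime allowed")
```

For instance, 99% permits about 432 minutes (over 7 hours) of downtime per month, while 99.99% permits only about 4.3 minutes.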
At this basic level, developers deploy onto a single production instance on-prem or cloud. Some static files are cached on a CDN, and no warm backup systems exist. If the system encounters an issue, the developer will most likely fix it in the morning or after the weekend. A significant portion of legacy on-prem websites is in this sphere. This is low-cost infrastructure-wise, but downtime may impact revenue growth. Note: CDN is optional at this level if traffic is light, but offloading higher volume static downloads to CDN is cost-effective and improves reliability by reducing the load on dynamic services.
A critical factor that lifts website availability to this level is the management of errors and latency. A straightforward and cost-effective strategy at this stage is to loosely couple API calls on web pages with a timeout and a static backup or alternative. This approach enables the automatic bypass of failing APIs, reducing their impact. Application Performance Monitoring (APM) and Observability are modern monitoring tools crucial for deep, routine scrutiny of all transactions, and Machine Learning-based dashboards can expedite the identification and resolution of recurring issues. Automated on-call alerting is established, with developers taking turns on call for the worst incidents. As an expense, APM is a small percentage of what you pay for a single instance. Learn more about The Future of Observability. A quick tip on APM costs: use 'sampling' if you have a high traffic volume, and roll up data often (i.e., hourly, daily) so you don't need to keep tons of expensive raw data.
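The loose-coupling pattern above, a tight timeout plus a static alternative, can be sketched in a few lines. The endpoint, fallback payload, and function name below are all hypothetical; the point is that any network or parse failure degrades gracefully instead of breaking the page.

```python
import json
import urllib.request

# Hypothetical static backup returned when the live API is slow or failing,
# so the page can still render a reasonable default.
FALLBACK_RECOMMENDATIONS = [{"id": "best-sellers", "source": "static-backup"}]

def fetch_recommendations(url: str, timeout_s: float = 0.5) -> list:
    """Call an API with a tight timeout; fall back to a static alternative."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return json.load(resp)
    except (OSError, json.JSONDecodeError, ValueError):
        # Timeout, connection error, or bad payload: bypass the failing API.
        return FALLBACK_RECOMMENDATIONS

# A host that cannot resolve triggers the fallback path immediately.
items = fetch_recommendations("http://invalid.invalid/api/recommendations")
```

The timeout value is the key tuning knob: it should be short enough that a hung dependency cannot stall page rendering.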
This tier is about the big outages that happen, often because there isn't a second system to shift to immediately. To improve website availability above 99.5%, you usually need two sets of everything: the site hosted in multiple data centers (i.e., Availability Zones) and hot-standby database storage with an auto-failover method (e.g., MS-SQL Availability Groups, Amazon RDS, Azure geo-replication). A Web Application Firewall (WAF) is used to mitigate security attacks. The majority of large corporate websites are here or want to be here. The requirement to operate multiples of everything is the largest single increase in overall availability cost; the challenge is to balance the load and scale only as needed.
Sites need to be deployed, patched, and rotated with no downtime. Remarkably, many product teams make significant investments to get to 99.5%+, yet overlook the three minutes of downtime with each deployment. A common cause, for example, is not reading Kubernetes signals correctly and sending traffic before an app is ready. This low-cost "9" usually requires only a few days of dev time to optimize the deployment steps.
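The "sending traffic before an app is ready" failure usually comes down to a missing or misconfigured readiness probe. The fragment below is a minimal sketch of the relevant Kubernetes fields; the names, image, path, port, and timing values are all placeholders to adapt to your service.

```yaml
# Sketch: hold traffic until the app reports ready, and drain on shutdown.
# All names/paths/ports below are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: example/web:latest
          readinessProbe:          # traffic is withheld until this passes
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          lifecycle:
            preStop:               # give in-flight requests time to drain
              exec:
                command: ["sleep", "10"]
```

With a rolling update strategy, each new pod must pass its readiness probe before it receives traffic and before the next old pod is rotated out, which is what removes those few minutes of per-deployment downtime.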
Web pages in this tier use very little dynamic code, either on the client or behind a monolithic front end. Instead, client logic is refactored to be simpler, ideally served as static files cached on a CDN. Most business logic is handled by modern cloud-based single-purpose (service or microservice) APIs that tend to be 10x more available than whole websites. There is an upfront cost to fully refactoring legacy sites to be more 'static' and more 'micro'. Be careful about APIs that are really just facades for complex apps behind the scenes. Timeouts and alternatives must be standard for every request.
These sites are almost entirely made up of static cached CDN files. The low-cost redundancy of the CDN makes these the most reliable and, in most cases, cheapest to operate. That said, certain constraints on the UI and database read/writes are significant tradeoffs for these sites.
Managing website costs while maximizing availability can be challenging, but high availability doesn't have to come at an exorbitant cost. We can achieve higher availability tiers by implementing smart, proactive performance optimization and using modern technologies.
To Recap The Top Cost Savings Steps
Shift complex dynamic web pages towards being entirely static files cached on Content Delivery Networks.
Shift business logic to use cloud-based single-purpose microservice APIs.
Use a database with auto-failover.
Use APM to troubleshoot errors and latency and automate alerting.
Limit your raw APM log storage retention and use rollups for long-term trends.
Clean up deployment, node rotation, and patching outages.
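The rollup step in the list above is simple to implement: aggregate raw samples into hourly summaries, keep the summaries for long-term trends, and expire the raw data. This stdlib-only sketch uses made-up latency samples; real APM pipelines do the same thing at scale.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

# Hypothetical raw (timestamp, latency_ms) samples -- the expensive data
# you do not want to retain long-term.
raw = [
    (datetime(2024, 8, 1, 9, 5, tzinfo=timezone.utc), 120),
    (datetime(2024, 8, 1, 9, 40, tzinfo=timezone.utc), 300),
    (datetime(2024, 8, 1, 10, 2, tzinfo=timezone.utc), 90),
]

def hourly_rollup(samples):
    """Aggregate raw samples into hourly count/avg/max summaries."""
    buckets = defaultdict(list)
    for ts, latency in samples:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(latency)
    return {
        hour: {"count": len(v), "avg_ms": mean(v), "max_ms": max(v)}
        for hour, v in buckets.items()
    }

rollup = hourly_rollup(raw)  # two hourly buckets instead of every raw sample
```

Once the rollup is stored, the raw samples can be deleted on a short retention schedule, which is where the cost savings come from.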
These are major steps towards improving availability and the cost to improve it. You may find you need to compromise. Perhaps some of your sites can be entirely static and highly available, while others, more complex sites, will need hybrids.
The point is that higher availability cannot be painted with one broad brush. With a few strategic adjustments, you can enhance your user experience and boost your revenue, all without breaking the bank.
Tell us about your site experience in the comments section. Does this map to your experience? Are you inspired to drive for higher availability?
About The Author
Jim Pierson is a Practitioner with RingStone and has 30+ years of experience in Product Management, Observability, and Quality Management, with Performance and Reliability Engineering as a core focus. Jim has led the creation of Cloud-based tools, resulting in products improving from ~99% availability to 99.99%+ and improving page load speed by 200%, and has successfully pitched projects to Executive Leadership worth $100+ million. Jim has trained thousands of software engineers in the best performance and quality engineering practices. He has a military background in Special Forces and Electronic Communications. Contact Jim at Jim.Pierson@ringstonetech.com